I believe this only works when we need to drop duplicate ROWS.
Here I want to drop cols which contain only one unique value.
On 2018-05-31 11:16, Divya Gehlot wrote:
you can try the dropDuplicates function
Hi there!
I have a potentially large dataset (in terms of both rows and cols),
and I want to find the fastest way to drop the cols that are useless to me,
i.e. cols containing only a single unique value!
What do you think I could do to achieve this as fast as possible using Spark?
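A minimal sketch of one way to do this in a single pass, assuming a
DataFrame df is already loaded (approx_count_distinct is approximate;
swap in countDistinct if you need exact counts, at a higher cost):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.approx_count_distinct

    // Compute one (approximate) distinct count per column in a single
    // aggregation pass, then drop every column with at most one value.
    def dropConstantColumns(df: DataFrame): DataFrame = {
      val counts = df.agg(
        approx_count_distinct(df.columns.head).as(df.columns.head),
        df.columns.tail.map(c => approx_count_distinct(c).as(c)): _*
      ).head()
      val constantCols = df.columns.filter(c => counts.getAs[Long](c) <= 1L)
      df.drop(constantCols: _*)
    }

Note that approx_count_distinct ignores nulls, so a column holding one
value plus nulls would also be dropped by this sketch.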
Hi dear Spark community!
I want to create a lib which generates features for potentially very
large datasets, so I believe Spark could be a nice tool for that.
Let me explain what I need to do:
Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (
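Assuming each file 'F' lands as a CSV with fields like those listed, a
minimal loading sketch (the path, the CSV format and the value type are
assumptions, not from the original message):

    import org.apache.spark.sql.SparkSession

    // Hypothetical record layout matching the description above.
    case class Record(id: String, timestamp: Long, value: String)

    val spark = SparkSession.builder().appName("feature-gen").getOrCreate()
    import spark.implicits._

    val ds = spark.read
      .option("header", "true")
      .csv("/path/to/F.csv")
      .select(
        $"id".cast("string").as("id"),
        $"timestamp".cast("long").as("timestamp"),
        $"value".cast("string").as("value"))
      .as[Record]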
Ok, thanks!
That's exactly the kind of thing I was imagining with Apache Beam.
I still have a few questions:
- regarding performance, will this be efficient? Even with large
"windows" / many ids / values / timestamps...?
- my goal after all this is to store it in Cassandra and/or use the
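For the Cassandra part, a minimal write sketch using the DataStax
spark-cassandra-connector (featuresDF stands for whatever DataFrame the
pipeline produces; keyspace and table names are placeholders, and the
target table must already exist with a matching schema):

    import org.apache.spark.sql.SaveMode

    featuresDF.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "features_ks", "table" -> "features"))
      .mode(SaveMode.Append)
      .save()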
Hi there!
Let's imagine I have a large number of very small DataFrames with the
same schema (a list of DataFrames: allDFs),
and I want to build one large Dataset out of them.
I have been trying this:
-> allDFs.reduce( (a, b) => a.union(b) )
And after this one:
-> allDFs.reduce( (a,b) =>
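The second snippet is cut off, but assuming allDFs is a Seq[DataFrame]
whose elements all share one schema, here is the pairwise version next
to a variant that avoids its main pitfall:

    import org.apache.spark.sql.DataFrame

    // Pairwise union, as in the first snippet: correct, but the logical
    // plan grows one level per input, which can make query planning very
    // slow when there are thousands of small DataFrames.
    val merged: DataFrame = allDFs.reduce((a, b) => a.union(b))

    // One way to keep the plan flat: union at the RDD level, then
    // rebuild a single DataFrame with the shared schema.
    val spark = allDFs.head.sparkSession
    val flat: DataFrame = spark.createDataFrame(
      spark.sparkContext.union(allDFs.map(_.rdd)),
      allDFs.head.schema)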
Hello,
I want to create a lib which generates features for potentially very
large datasets.
Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (int or string)
I want my tool to:
- compute aggregate functions for many
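The list of aggregates is cut off above, so as an illustration only,
assuming a DataFrame events with the three columns just described:

    import org.apache.spark.sql.functions._

    // A per-id feature pass; the chosen aggregates are placeholders
    // standing in for the truncated list in the original message.
    val features = events.groupBy(col("id")).agg(
      count(lit(1)).as("n_events"),
      min(col("timestamp")).as("first_ts"),
      max(col("timestamp")).as("last_ts"),
      approx_count_distinct(col("value")).as("n_distinct_values"))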