Re: Fastest way to drop useless columns

2018-05-31 Thread julio . cesare

I believe this only works when we need to drop duplicate ROWS.

Here I want to drop columns which contain only one unique value.
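
For a quick contrast, a minimal sketch (assuming a SparkSession 'spark' in scope; the toy column names are made up):

import spark.implicits._

// dropDuplicates removes duplicate ROWS; both columns survive:
val toy = Seq((1, "a"), (1, "a"), (2, "a")).toDF("x", "constant")
toy.dropDuplicates().count()   // 2 rows left, columns "x" and "constant" both kept

// What is wanted here instead: detect that "constant" holds a single
// distinct value and drop that COLUMN (see the sketch further down).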


On 2018-05-31 at 11:16, Divya Gehlot wrote:

You can try the dropDuplicates function:

https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala

On 31 May 2018 at 16:34,  wrote:


Hi there !

I have a potentially large dataset (in terms of both rows and columns),
and I want to find the fastest way to drop the columns that are useless
to me, i.e. columns containing only a single unique value.

What do you think I could do to make this as fast as possible using
Spark?

I already have a solution using distinct().count() or
approxCountDistinct(), but these may not be the best choice, as they
require going through all the data, even when the first two values
tested for a column are already different (in which case I know I can
keep the column).

Thx for your ideas !

Julien






Fastest way to drop useless columns

2018-05-31 Thread julio . cesare

Hi there !

I have a potentially large dataset (in terms of both rows and columns),
and I want to find the fastest way to drop the columns that are useless
to me, i.e. columns containing only a single unique value.

What do you think I could do to make this as fast as possible using
Spark?

I already have a solution using distinct().count() or
approxCountDistinct(), but these may not be the best choice, as they
require going through all the data, even when the first two values
tested for a column are already different (in which case I know I can
keep the column).



Thx for your ideas !

Julien
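
A minimal sketch of a single-pass variant of the approxCountDistinct idea
above (assuming Spark 2.x or later and an input DataFrame 'df';
dropConstantColumns is just an illustrative name, not a built-in):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.approx_count_distinct

// Estimate the distinct count of every column in one pass over the data,
// then drop the columns whose estimate is at most 1.
def dropConstantColumns(df: DataFrame): DataFrame = {
  val estimates = df
    .select(df.columns.map(c => approx_count_distinct(c).alias(c)): _*)
    .head()                             // one Row of per-column estimates
  val constantCols = df.columns.filter(c => estimates.getAs[Long](c) <= 1L)
  df.drop(constantCols: _*)
}

This still scans all the data once, but it runs a single job instead of one
job per column; it does not short-circuit as soon as a second distinct
value is seen.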




Feature generation / aggregate functions / timeseries

2017-12-14 Thread julio . cesare

Hi dear spark community !

I want to create a lib which generates features for potentially very
large datasets, so I believe Spark could be a nice tool for that.

Let me explain what I need to do:

Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (generally a double)

I want my tool to:
- compute aggregate functions for many pairs 'instant + duration'
===> FOR EXAMPLE:
= compute, for the instant 't = 2001-01-01', aggregate functions over
the data between 't-1month and t' and between 't-12months and
t-9months', and this FOR EACH ID!
(aggregate functions such as min/max/count/distinct/last/mode/kurtosis...
or even user-defined ones!)

My constraints:
- I don't want to compute the aggregates for each tuple of 'F'
---> I want to provide a list of couples 'instant + duration'
(potentially a large list)
- My 'window' defined by the duration may be really large (but may
contain only a few values...)
- I may have many ids...
- I may have many timestamps...





Let me describe this with an example, to see whether SPARK (SPARK
STREAMING?) can help me do that:


Let's imagine that I have all my data in a DB or a file with the 
following columns :

id | timestamp(ms) | value
A | 100 |  100
A | 1000500 |  66
B | 100 |  100
B | 110 |  50
B | 120 |  200
B | 250 |  500

(The timestamp is a long value, so that dates can be expressed in ms
from -01-01 to today.)


I want to compute operations such as min, max, average, last on the
value column, for these couples:
-> instant = 1000500 / [-1000ms, 0] (i.e. aggregate the data between
[t-1000ms and t])
-> instant = 133 / [-5000ms, -2500ms] (i.e. aggregate the data between
[t-5000ms and t-2500ms])



And this will produce this kind of output :

id | timestamp(ms) | min_value | max_value | avg_value | last_value
---
A | 1000500| min...| max   | avg   | last
B | 1000500| min...| max   | avg   | last
A | 133| min...| max   | avg   | last
B | 133| min...| max   | avg   | last



Do you think we can do this efficiently with Spark and/or Spark
Streaming, and do you have an idea of how?

(I have tested some solutions, but I'm not really satisfied with them at
the moment...)


Thanks a lot Community :)
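
One possible shape for this in plain Spark SQL, as a hedged sketch rather
than a tested solution: assume an 'events' DataFrame with columns
(id, ts, value) and a 'specs' DataFrame listing the requested couples as
(instant, lo, hi) offsets in ms; all of these names are illustrative.

import org.apache.spark.sql.functions._

// Non-equi join: keep every (event, spec) pair whose timestamp falls in
// [instant + lo, instant + hi], then aggregate per (id, instant, window).
// broadcast() assumes the list of couples fits on every executor.
val result = events
  .join(broadcast(specs),
        events("ts").between(specs("instant") + specs("lo"),
                             specs("instant") + specs("hi")))
  .groupBy(events("id"), specs("instant"), specs("lo"), specs("hi"))
  .agg(
    min("value").as("min_value"),
    max("value").as("max_value"),
    avg("value").as("avg_value"),
    last("value").as("last_value"))   // last() is order-dependent

Caveats: the range condition is not an equi-join, so Spark will typically
run a broadcast nested loop join (fine while 'specs' is small); ids with no
data in a window simply do not appear in the output; and last() is only
meaningful if the ordering of the input is controlled.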




Union large number of DataFrames

2017-07-24 Thread julio . cesare

Hi there !

Let's imagine I have a large number of very small DataFrames with the
same schema (a list of DataFrames: allDFs), and I want to create one
large dataset from them.

I have been trying this:
-> allDFs.reduce( (a, b) => a.union(b) )

And after that, this one:
-> allDFs.reduce( (a, b) => a.union(b).repartition(200) )
to prevent the resulting DataFrame from ending up with a large number of
partitions.


Two questions:
1) Will the reduce operation be done in parallel in the previous code,
or should I perhaps replace my reduce with allDFs.par.reduce?

2) Is there a better way to concatenate them?


Thanks !
Julio
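
A hedged sketch that addresses both questions (assuming Spark 2.x
DataFrames; treeUnion is an illustrative helper, not a built-in):

import org.apache.spark.sql.DataFrame

// 1) The reduce only builds the logical plan on the driver; no data moves
//    until an action runs, so allDFs.par.reduce would not parallelize the
//    real work.
// 2) A left-to-right fold yields a very deep, lopsided plan that can slow
//    down analysis; a pairwise "tree" of unions keeps the plan depth at
//    roughly log2(n) instead of n.
def treeUnion(dfs: Seq[DataFrame]): DataFrame = {
  require(dfs.nonEmpty, "need at least one DataFrame")
  var level = dfs
  while (level.size > 1) {
    level = level.grouped(2).map {
      case Seq(a, b) => a.union(b)
      case Seq(a)    => a
    }.toSeq
  }
  level.head
}

// usage (illustrative): val big = treeUnion(allDFs).repartition(200)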




Feature Generation for Large datasets composed of many time series

2017-07-19 Thread julio . cesare

Hello,

I want to create a lib which generates features for potentially very
large datasets.

Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (int or string)

I want my tool to:
- compute aggregate functions for many couples 'instant + duration'
===> FOR EXAMPLE:
= compute, for the instant 't = 2001-01-01', aggregate functions over
the data between 't-1month and t' and between 't-12months and
t-9months', and this FOR EACH ID!
(aggregate functions such as min/max/count/distinct/last/mode or
user-defined ones)

My constraints:
- I don't want to compute the aggregates for each tuple of 'F'
---> I want to provide a list of couples 'instant + duration'
(potentially a large list)
- My 'window' defined by the duration may be really large (but may
contain only a few values...)
- I may have many ids...
- I may have many timestamps...





Let me describe this with an example, to see whether SPARK (SPARK
STREAMING?) can help me do that:


Let's imagine that I have all my data in a DB or a file with the 
following columns :

id | timestamp(ms) | value
A | 100 |  100
A | 1000500 |  66
B | 100 |  100
B | 110 |  50
B | 120 |  200
B | 250 |  500

(The timestamp is a long value, so that dates can be expressed in ms
from -01-01 to today.)


I want to compute operations such as min, max, average, last on the
value column, for these couples:
-> instant = 1000500 / [-1000ms, 0] (i.e. aggregate the data between
[t-1000ms and t])
-> instant = 133 / [-5000ms, -2500ms] (i.e. aggregate the data between
[t-5000ms and t-2500ms])



And this will produce this kind of output :

id | timestamp(ms) | min_value | max_value | avg_value | last_value
---
A | 1000500| min...| max   | avg   | last
B | 1000500| min...| max   | avg   | last
A | 133| min...| max   | avg   | last
B | 133| min...| max   | avg   | last



Do you think we can do this efficiently with Spark and/or Spark
Streaming, and do you have an idea of how?



Thanks a lot !
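
For concreteness, a sketch of how the example above could be encoded as two
DataFrames (assuming a SparkSession 'spark'; the names events/specs and the
lo/hi offset columns are illustrative), ready to be fed into a range join +
groupBy like the one sketched under "Feature generation / aggregate
functions / timeseries" above:

import spark.implicits._

// The toy data from the table above: (id, timestamp in ms, value).
val events = Seq(
  ("A", 100L, 100), ("A", 1000500L, 66),
  ("B", 100L, 100), ("B", 110L, 50),
  ("B", 120L, 200), ("B", 250L, 500)
).toDF("id", "ts", "value")

// The requested couples: an instant plus a [lo, hi] offset window in ms,
// e.g. [-1000, 0] means "the 1000 ms ending at the instant".
val specs = Seq(
  (1000500L, -1000L, 0L),
  (133L, -5000L, -2500L)
).toDF("instant", "lo", "hi")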
