Re: Fastest way to drop useless columns
I believe this only works when we need to drop duplicate ROWS. Here I want to drop columns that contain only a single unique value.

On 2018-05-31 11:16, Divya Gehlot wrote:
> You can try the dropDuplicates function:
> https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
>
> On 31 May 2018 at 16:34, wrote:
>> Hi there! I have a potentially large dataset (in both number of rows and columns), and I want to find the fastest way to drop the columns that are useless to me, i.e. columns containing only a single unique value. What do you think I could do to make this as fast as possible with Spark? I already have a solution using distinct().count() or approxCountDistinct(), but they may not be the best choice, as they require going through all the data, even if the first two values tested for a column already differ (in which case I know I can keep the column).
>> Thanks for your ideas!
>> Julien
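For illustration, a minimal sketch of the single-pass variant the original poster mentions, using the current approx_count_distinct function; the function name dropConstantColumns and the "at most one distinct value" threshold are my own, not from the thread:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.approx_count_distinct

// Estimate the number of distinct values of every column in a single pass,
// then drop the columns whose estimate is at most 1. The count is
// approximate, so borderline columns could in principle be misclassified.
def dropConstantColumns(df: DataFrame): DataFrame = {
  val counts = df
    .select(df.columns.map(c => approx_count_distinct(c).alias(c)): _*)
    .head()
  val constantCols = df.columns.filter(c => counts.getAs[Long](c) <= 1L)
  df.drop(constantCols: _*)
}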
Fastest way to drop useless columns
Hi there! I have a potentially large dataset (in both number of rows and columns), and I want to find the fastest way to drop the columns that are useless to me, i.e. columns containing only a single unique value.

What do you think I could do to make this as fast as possible with Spark? I already have a solution using distinct().count() or approxCountDistinct(), but they may not be the best choice, as they require going through all the data, even if the first two values tested for a column already differ (in which case I know I can keep the column).

Thanks for your ideas!
Julien
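For illustration, a minimal sketch of the "stop after the first two distinct values" idea described above; df and the input path are hypothetical, and whether Spark can actually stop scanning early depends on the physical plan, since distinct() still implies a shuffle:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("drop-useless-cols").getOrCreate()
val df: DataFrame = spark.read.parquet("/path/to/data") // hypothetical input

// For each column, ask for at most two distinct values; a column that yields
// fewer than two is constant (or entirely null) and can be dropped.
val uselessCols = df.columns.filter { c =>
  df.select(c).distinct().head(2).length < 2
}
val cleaned = df.drop(uselessCols: _*)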
Feature generation / aggregate functions / timeseries
Hi dear Spark community!

I want to create a lib which generates features for potentially very large datasets, so I believe Spark could be a nice tool for that. Let me explain what I need to do.

Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (generally a double)

I want my tool to compute aggregate functions for many pairs 'instant + duration'. FOR EXAMPLE: compute, for the instant t = 2001-01-01, aggregate functions over the data between 't-1month and t' and between 't-12months and t-9months', and this FOR EACH ID (aggregate functions such as min/max/count/distinct/last/mode/kurtosis... or even user defined!).

My constraints:
- I don't want to compute an aggregate for each tuple of 'F'; I want to provide a list of couples 'instant + duration' (potentially large).
- My 'window' defined by the duration may be really large (but may contain only a few values...).
- I may have many ids...
- I may have many timestamps...

Let me describe this with an example to see if SPARK (SPARK STREAMING?) may help me do that. Let's imagine that I have all my data in a DB or a file with the following columns:

id | timestamp(ms) | value
A  | 100           | 100
A  | 1000500       | 66
B  | 100           | 100
B  | 110           | 50
B  | 120           | 200
B  | 250           | 500

(The timestamp is a long value, so as to be able to express dates in ms from -01-01 to today.)

I want to compute operations such as min, max, average, last on the value column, for these couples:
-> instant = 1000500 / [-1000ms, 0]     (i.e. aggregate data between [t-1000ms and t])
-> instant = 133     / [-5000ms, -2500] (i.e. aggregate data between [t-5000ms and t-2500ms])

And this will produce this kind of output:

id | timestamp(ms) | min_value | max_value | avg_value | last_value
A  | 1000500       | min...    | max       | avg       | last
B  | 1000500       | min...    | max       | avg       | last
A  | 133           | min...    | max       | avg       | last
B  | 133           | min...    | max       | avg       | last

Do you think we can do this efficiently with Spark and/or Spark Streaming, and do you have an idea on "how"? (I have tested some solutions but I'm not really satisfied ATM...)

Thanks a lot Community :)
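For illustration, a minimal sketch (not a solution from the thread) of one join-based way to express this: join the requested (instant, window) couples against the events on a range condition, then aggregate per (id, instant). The names events and requests are made up, and a non-equi range join like this can be expensive when both sides are large, so it is only a starting point:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("feature-gen").getOrCreate()
import spark.implicits._

// Toy events matching the example table above.
val events = Seq(
  ("A", 100L, 100.0), ("A", 1000500L, 66.0),
  ("B", 100L, 100.0), ("B", 110L, 50.0),
  ("B", 120L, 200.0), ("B", 250L, 500.0)
).toDF("id", "ts", "value")

// Each request: the reference instant plus the window bounds as offsets.
val requests = Seq(
  (1000500L, -1000L, 0L),
  (133L, -5000L, -2500L)
).toDF("instant", "from_offset", "to_offset")

// Keep every event that falls inside a requested window, then aggregate
// per (id, instant).
val joined = events.join(
  requests,
  $"ts" >= $"instant" + $"from_offset" && $"ts" <= $"instant" + $"to_offset"
)

val features = joined
  .groupBy($"id", $"instant")
  .agg(
    min($"value").as("min_value"),
    max($"value").as("max_value"),
    avg($"value").as("avg_value"),
    // "last within the window" approximated by the value at the max timestamp
    max(struct($"ts", $"value")).as("last_struct")
  )
  .withColumn("last_value", $"last_struct.value")
  .drop("last_struct")

features.show()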
Union large number of DataFrames
Hi there!

Let's imagine I have a large number of very small DataFrames with the same schema (a list of DataFrames: allDFs), and I want to create one large dataset from them. I have been trying this:

-> allDFs.reduce( (a, b) => a.union(b) )

And after that, this one, to prevent the resulting DataFrame from having a large number of partitions:

-> allDFs.reduce( (a, b) => a.union(b).repartition(200) )

Two questions:
1) Will the reduce operation be done in parallel in the previous code, or should I maybe replace my reduce by allDFs.par.reduce?
2) Is there a better way to concatenate them?

Thanks!
Julio
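For illustration, a minimal runnable sketch of the pattern described above, with toy DataFrames standing in for allDFs; the repartition is applied once at the end rather than at every step, which avoids repeated shuffles:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("union-many-dfs").getOrCreate()
import spark.implicits._

// Toy stand-ins for the list of small DataFrames with the same schema.
val allDFs: Seq[DataFrame] =
  (1 to 100).map(i => Seq((i, s"row_$i")).toDF("id", "label"))

// union() is a lazy transformation: the reduce only builds the logical plan
// on the driver, so wrapping it in .par would only parallelise plan
// construction, not the actual data movement.
val combined = allDFs.reduce(_ union _).repartition(200)

combined.count()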
Feature Generation for Large datasets composed of many time series
Hello,

I want to create a lib which generates features for potentially very large datasets.

Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (int or string)

I want my tool to compute aggregate functions for many couples 'instant + duration'. FOR EXAMPLE: compute, for the instant t = 2001-01-01, aggregate functions over the data between 't-1month and t' and between 't-12months and t-9months', and this FOR EACH ID (aggregate functions such as min/max/count/distinct/last/mode or user defined).

My constraints:
- I don't want to compute an aggregate for each tuple of 'F'; I want to provide a list of couples 'instant + duration' (potentially large).
- My 'window' defined by the duration may be really large (but may contain only a few values...).
- I may have many ids...
- I may have many timestamps...

Let me describe this with an example to see if SPARK (SPARK STREAMING?) may help me do that. Let's imagine that I have all my data in a DB or a file with the following columns:

id | timestamp(ms) | value
A  | 100           | 100
A  | 1000500       | 66
B  | 100           | 100
B  | 110           | 50
B  | 120           | 200
B  | 250           | 500

(The timestamp is a long value, so as to be able to express dates in ms from -01-01 to today.)

I want to compute operations such as min, max, average, last on the value column, for these couples:
-> instant = 1000500 / [-1000ms, 0]     (i.e. aggregate data between [t-1000ms and t])
-> instant = 133     / [-5000ms, -2500] (i.e. aggregate data between [t-5000ms and t-2500ms])

And this will produce this kind of output:

id | timestamp(ms) | min_value | max_value | avg_value | last_value
A  | 1000500       | min...    | max       | avg       | last
B  | 1000500       | min...    | max       | avg       | last
A  | 133           | min...    | max       | avg       | last
B  | 133           | min...    | max       | avg       | last

Do you think we can do this efficiently with Spark and/or Spark Streaming, and do you have an idea on "how"?

Thanks a lot!
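For illustration, a variant of the range-join idea sketched earlier in this thread, under the added assumption that the list of (instant, window) couples is small enough to broadcast, so the range condition runs as a broadcast join rather than a full shuffle; events and requests are hypothetical names, not from the thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("feature-gen").getOrCreate()
import spark.implicits._

// Toy events matching the example table above.
val events = Seq(
  ("A", 100L, 100.0), ("A", 1000500L, 66.0),
  ("B", 100L, 100.0), ("B", 110L, 50.0),
  ("B", 120L, 200.0), ("B", 250L, 500.0)
).toDF("id", "ts", "value")

// instant, window start offset, window end offset
val requests = Seq(
  (1000500L, -1000L, 0L),
  (133L, -5000L, -2500L)
).toDF("instant", "from_offset", "to_offset")

// Broadcasting the small request list keeps the large event table in place;
// each event is matched against every window it falls into.
val features = events
  .join(broadcast(requests),
    $"ts" >= $"instant" + $"from_offset" && $"ts" <= $"instant" + $"to_offset")
  .groupBy($"id", $"instant")
  .agg(
    min($"value").as("min_value"),
    max($"value").as("max_value"),
    avg($"value").as("avg_value"),
    count($"value").as("count_value"),
    approx_count_distinct($"value").as("distinct_value")
  )

features.show()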