Re: Feature generation / aggregate functions / timeseries
Also, the RDD StatCounter will already compute most of your desired metrics, as will df.describe():
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html

Georg Heiler wrote on Thu., 14 Dec. 2017 at 19:40:

> Look at custom UDAF functions.
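A minimal sketch of both suggestions, assuming Spark 2.x and sample data shaped like the question's example (the object name and values are illustrative):

import org.apache.spark.sql.SparkSession

object DescribeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("describe-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample data shaped like the original question: (id, timestamp in ms, value).
    val df = Seq(
      ("A", 100L, 100.0), ("A", 1000500L, 66.0),
      ("B", 100L, 100.0), ("B", 110L, 50.0),
      ("B", 120L, 200.0), ("B", 250L, 500.0)
    ).toDF("id", "timestamp", "value")

    // DataFrame.describe() reports count/mean/stddev/min/max for the named columns.
    df.describe("value").show()

    // StatCounter computes the same summary statistics in a single pass over an RDD[Double].
    val stats = df.select("value").as[Double].rdd.stats()
    println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev} " +
      s"min=${stats.min} max=${stats.max}")

    spark.stop()
  }
}

Note that neither covers the per-(id, instant, window) grouping from the question by itself; both summarize whole columns, so they would still need to be combined with the windowing logic discussed in the original message below.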
Re: Feature generation / aggregate functions / timeseries
Look at custom UDAF functions.
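For reference, a minimal sketch of such a UDAF against the Spark 2.x UserDefinedAggregateFunction API; the class and its semantics (a timestamp-aware "last") are my own illustration, not something from the original reply:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Returns the value carrying the greatest timestamp in the group, i.e. a
// timestamp-aware "last" (ties on equal timestamps are broken arbitrarily).
class LastByTimestamp extends UserDefinedAggregateFunction {
  // Input: (timestamp, value)
  override def inputSchema: StructType =
    StructType(StructField("ts", LongType) :: StructField("value", DoubleType) :: Nil)

  // Buffer: greatest timestamp seen so far and its associated value.
  override def bufferSchema: StructType =
    StructType(StructField("bestTs", LongType) :: StructField("bestValue", DoubleType) :: Nil)

  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Long.MinValue
    buffer(1) = Double.NaN
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0) && input.getLong(0) >= buffer.getLong(0)) {
      buffer(0) = input.getLong(0)
      buffer(1) = input.getDouble(1)
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    if (buffer2.getLong(0) >= buffer1.getLong(0)) {
      buffer1(0) = buffer2.getLong(0)
      buffer1(1) = buffer2.getDouble(1)
    }
  }

  override def evaluate(buffer: Row): Any = buffer.getDouble(1)
}

It is used like any built-in aggregate, e.g. df.groupBy("id").agg(new LastByTimestamp()(col("timestamp"), col("value")).as("last_value")).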
Feature generation / aggregate functions / timeseries
Hi dear Spark community!

I want to create a lib which generates features for potentially very large datasets, so I believe Spark could be a nice tool for that.
Let me explain what I need to do:

Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (generally a double)

I want my tool to:
- compute aggregate functions for many couples 'instant + duration'
===> FOR EXAMPLE:
= compute for the instant 't = 2001-01-01' aggregate functions for data between 't-1month and t' and 't-12months and t-9months', and this FOR EACH ID!
(aggregate functions such as min/max/count/distinct/last/mode/kurtosis... or even user defined!)

My constraints:
- I don't want to compute an aggregate for each tuple of 'F'
---> I want to provide a list of couples 'instant + duration' (potentially large)
- My 'window' defined by the duration may be really large (but may contain only a few values...)
- I may have many ids...
- I may have many timestamps...

Let me describe this with some kind of example to see if SPARK (SPARK STREAMING?) may help me to do that.

Let's imagine that I have all my data in a DB or a file with the following columns:

id | timestamp(ms) | value
A  | 100           | 100
A  | 1000500       | 66
B  | 100           | 100
B  | 110           | 50
B  | 120           | 200
B  | 250           | 500

(The timestamp is a long value, so as to be able to express dates in ms from -01-01. to today.)

I want to compute operations such as min, max, average, last on the value column, for these couples:
-> instant = 1000500 / [-1000ms, 0] (i.e. aggregate data between t-1000ms and t)
-> instant = 133 / [-5000ms, -2500ms] (i.e. aggregate data between t-5000ms and t-2500ms)

And this will produce this kind of output:

id | timestamp(ms) | min_value | max_value | avg_value | last_value
---
A  | 1000500       | min...    | max       | avg       | last
B  | 1000500       | min...    | max       | avg       | last
A  | 133           | min...    | max       | avg       | last
B  | 133           | min...    | max       | avg       | last

Do you think we can do this efficiently with Spark and/or Spark Streaming, and do you have an idea on "how"?
(I have tested some solutions but I'm not really satisfied ATM...)

Thanks a lot, community :)

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
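For what it's worth, one possible shape for this (a sketch, not a definitive answer): put the couples 'instant + duration' in a small DataFrame, broadcast it, join it to the data with a non-equi condition on the timestamp, and aggregate per (id, instant). All names and the inline sample data are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FeatureGen {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("feature-gen").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq(
      ("A", 100L, 100.0), ("A", 1000500L, 66.0),
      ("B", 100L, 100.0), ("B", 110L, 50.0),
      ("B", 120L, 200.0), ("B", 250L, 500.0)
    ).toDF("id", "timestamp", "value")

    // The requested couples 'instant + duration', as offsets relative to each instant.
    val windows = Seq(
      (1000500L, -1000L, 0L),   // instant = 1000500, window [t-1000ms, t]
      (133L, -5000L, -2500L)    // instant = 133, window [t-5000ms, t-2500ms]
    ).toDF("instant", "lo", "hi")

    // Non-equi join: each data row is matched to every window containing its timestamp.
    val joined = data.join(
      broadcast(windows),
      $"timestamp" >= $"instant" + $"lo" && $"timestamp" <= $"instant" + $"hi")

    // One output row per (id, instant); "last" is made deterministic by pairing
    // each value with its timestamp in a struct and taking the max.
    val features = joined
      .groupBy($"id", $"instant")
      .agg(
        min($"value").as("min_value"),
        max($"value").as("max_value"),
        avg($"value").as("avg_value"),
        max(struct($"timestamp", $"value")).getField("value").as("last_value"))

    features.orderBy($"instant", $"id").show()
    spark.stop()
  }
}

The broadcast keeps the join cheap as long as the list of couples fits in memory; with a very large list of couples you would need something like a range-partitioned join (e.g. bucketing timestamps) instead.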