Re: RDD Moving Average

2015-01-09 Thread Mohit Jaggi
Read this:
http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
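
The linked thread addresses the partition-boundary problem raised below. A minimal sketch of the general technique (not the exact code from the link, and assuming the RDD is already sorted and each partition holds at least n - 1 elements):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: size-n sliding windows over a sorted RDD. Windows that would
// cross a partition boundary are completed by shipping the first n - 1
// elements of each partition to the partition before it.
def slidingWindows[T: ClassTag](rdd: RDD[T], n: Int): RDD[Seq[T]] = {
  // First n - 1 elements of every partition, keyed by partition index.
  // Assumes these heads are small enough to collect to the driver.
  val heads = rdd.mapPartitionsWithIndex { (i, it) =>
    Iterator((i, it.take(n - 1).toSeq))
  }.collectAsMap()

  rdd.mapPartitionsWithIndex { (i, it) =>
    // Extend each partition with the head of the next one, then window it.
    val carry = heads.getOrElse(i + 1, Seq.empty[T])
    (it ++ carry.iterator).sliding(n).withPartial(false)
  }
}

MLlib's sliding(), mentioned later in this thread, implements essentially this idea with the corner cases handled more carefully.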
 





Re: RDD Moving Average

2015-01-08 Thread Tobias Pfeiffer
Hi,

On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis asimja...@gmail.com wrote:

 One approach I was considering was to use mapPartitions. It is
 straightforward to compute the moving average over a partition, except
 near the end points. Does anyone see how to fix that?


Well, I guess this is not a perfect use case for mapPartitions, in
particular since you would have to implement the behavior near the
beginning and end of a partition yourself. I would rather go with the
high-level RDD functions that are partition-independent.

By the way, I am now also trying to implement sliding windows based on
count and embedded timestamp... seems like I should have had a look at
rdd.sliding() before...

Tobias


Re: RDD Moving Average

2015-01-06 Thread Sean Owen
So you want windows covering the same length of time, some of which will be
fuller than others? You could, for example, simply bucket the data by
minute to get this kind of effect. If you have an RDD[Ticker], where Ticker has
a timestamp in ms, you could:

tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000)

... to get an RDD[(Long,Iterable[Ticker])], where the keys are the moment
at the start of each minute, and the values are the Tickers within the
following minute. You can try variations on this to bucket in different
ways.

Just be careful because a minute with a huge number of values might cause
you to run out of memory. If you're just doing aggregations of some kind
there are more efficient methods than this most generic method, like the
aggregate methods.
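
For instance, if only the per-minute mean is needed, something like aggregateByKey keeps a running (sum, count) per bucket instead of materializing the Iterables. A sketch, assuming a hypothetical case class Ticker(timestamp: Long, price: Double):

val avgByMinute = tickerRDD
  .map(t => ((t.timestamp / 60000) * 60000, t.price))
  .aggregateByKey((0.0, 0L))(
    (acc, price) => (acc._1 + price, acc._2 + 1),  // fold one price in
    (a, b) => (a._1 + b._1, a._2 + b._2)           // merge partial sums
  )
  .mapValues { case (sum, count) => sum / count }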

On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis asimja...@gmail.com wrote:

 Thanks. Another question. I have event data with timestamps. I want to
 create a sliding window using timestamps. Some windows will have a lot of
 events in them; others won’t. Is there a way to get an RDD made of this kind
 of a variable-length window?


 On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen so...@cloudera.com wrote:

 First you'd need to sort the RDD to give it a meaningful order, but I
 assume you have some kind of timestamp in your data you can sort on.

 I think you might be after the sliding() function, a developer API in
 MLlib:


 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43

 On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis asimja...@gmail.com wrote:

 Is there an easy way to do a moving average across a single RDD (in a
 non-streaming app). Here is the use case. I have an RDD made up of stock
 prices. I want to calculate a moving average using a window size of N.

 Thanks.

 Asim






Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
Thanks. Another question. I have event data with timestamps. I want to
create a sliding window using timestamps. Some windows will have a lot of
events in them; others won’t. Is there a way to get an RDD made of this kind
of a variable-length window?


On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen so...@cloudera.com wrote:

 First you'd need to sort the RDD to give it a meaningful order, but I
 assume you have some kind of timestamp in your data you can sort on.

 I think you might be after the sliding() function, a developer API in
 MLlib:


 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43

 On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis asimja...@gmail.com wrote:

 Is there an easy way to do a moving average across a single RDD (in a
 non-streaming app). Here is the use case. I have an RDD made up of stock
 prices. I want to calculate a moving average using a window size of N.

 Thanks.

 Asim





Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
Except I want it to be a sliding window. So the same record could be in
multiple buckets.

On Tue, Jan 6, 2015 at 3:43 PM, Sean Owen so...@cloudera.com wrote:

 So you want windows covering the same length of time, some of which will
 be fuller than others? You could, for example, simply bucket the data by
 minute to get this kind of effect. If you have an RDD[Ticker], where Ticker has
 a timestamp in ms, you could:

 tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000)

 ... to get an RDD[(Long,Iterable[Ticker])], where the keys are the moment
 at the start of each minute, and the values are the Tickers within the
 following minute. You can try variations on this to bucket in different
 ways.

 Just be careful because a minute with a huge number of values might cause
 you to run out of memory. If you're just doing aggregations of some kind
 there are more efficient methods than this most generic method, like the
 aggregate methods.

 On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis asimja...@gmail.com wrote:

 Thanks. Another question. I have event data with timestamps. I want to
 create a sliding window using timestamps. Some windows will have a lot of
 events in them; others won’t. Is there a way to get an RDD made of this kind
 of a variable-length window?


 On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen so...@cloudera.com wrote:

 First you'd need to sort the RDD to give it a meaningful order, but I
 assume you have some kind of timestamp in your data you can sort on.

 I think you might be after the sliding() function, a developer API in
 MLlib:


 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43

 On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis asimja...@gmail.com wrote:

 Is there an easy way to do a moving average across a single RDD (in a
 non-streaming app). Here is the use case. I have an RDD made up of stock
 prices. I want to calculate a moving average using a window size of N.

 Thanks.

 Asim







Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
One problem with this is that we are creating a lot of iterables containing
a lot of repeated data. Is there a way to do this so that we can calculate
a moving average incrementally?

On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen so...@cloudera.com wrote:

 Yes, if you break it down to...

 tickerRDD.map(ticker =>
   (ticker.timestamp, ticker)
 ).map { case (ts, ticker) =>
   ((ts / 60000) * 60000, ticker)
 }.groupByKey

 ... as Michael alluded to, then it more naturally extends to the sliding
 window, since you can flatMap one Ticker to many (bucket, ticker) pairs,
 then group. I think this would implement 1-minute buckets, sliding by 15
 seconds:

 tickerRDD.flatMap(ticker =>
   (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts, ticker))
 ).map { case (ts, ticker) =>
   ((ts / 60000) * 60000, ticker)
 }.groupByKey
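
Asim's question above about computing the average incrementally could be handled by replacing groupByKey with reduceByKey over (sum, count) pairs, so the grouped Iterables never materialize. A sketch (not from the thread, assuming a hypothetical case class Ticker(timestamp: Long, price: Double)):

val movingAvg = tickerRDD.flatMap { ticker =>
  // each event falls in exactly four 60 s windows that slide by 15 s;
  // key each copy by its window's start timestamp
  val w0 = (ticker.timestamp / 15000) * 15000
  (0 until 4).map(k => (w0 - k * 15000L, (ticker.price, 1L)))
}.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
 .mapValues { case (sum, count) => sum / count }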

 On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis asimja...@gmail.com wrote:

 I guess I can use a similar groupBy approach. Map each event to all the
 windows that it can belong to. Then do a groupBy, etc. I was wondering if
 there was a more elegant approach.

 On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis asimja...@gmail.com wrote:

 Except I want it to be a sliding window. So the same record could be in
 multiple buckets.




Re: RDD Moving Average

2015-01-06 Thread Sean Owen
Interesting, I am not sure the order in which fold() encounters elements is
guaranteed, although from reading the code, I imagine in practice it is
first-to-last by partition and then folded first-to-last from those results
on the driver. I don't know that this would lead to a solution, though, as the
result here needs to be an RDD, not one value.

On Wed, Jan 7, 2015 at 12:10 AM, Paolo Platter paolo.plat...@agilelab.it
wrote:

  In my opinion you should use the fold pattern, obviously after a sortBy
 transformation.

 Paolo

 Sent from my Windows Phone
  --
 From: Asim Jalis asimja...@gmail.com
 Sent: 06/01/2015 23:11
 To: Sean Owen so...@cloudera.com
 Cc: user@spark.apache.org
 Subject: Re: RDD Moving Average

   One problem with this is that we are creating a lot of iterables
 containing a lot of repeated data. Is there a way to do this so that we can
 calculate a moving average incrementally?

 On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen so...@cloudera.com wrote:

 Yes, if you break it down to...

  tickerRDD.map(ticker =>
   (ticker.timestamp, ticker)
 ).map { case (ts, ticker) =>
   ((ts / 60000) * 60000, ticker)
 }.groupByKey

  ... as Michael alluded to, then it more naturally extends to the
 sliding window, since you can flatMap one Ticker to many (bucket, ticker)
 pairs, then group. I think this would implement 1-minute buckets,
 sliding by 15 seconds:

  tickerRDD.flatMap(ticker =>
   (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts, ticker))
 ).map { case (ts, ticker) =>
   ((ts / 60000) * 60000, ticker)
 }.groupByKey

 On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis asimja...@gmail.com wrote:

  I guess I can use a similar groupBy approach. Map each event to all
 the windows that it can belong to. Then do a groupBy, etc. I was wondering
 if there was a more elegant approach.

 On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis asimja...@gmail.com wrote:

  Except I want it to be a sliding window. So the same record could be
 in multiple buckets.





Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
One approach I was considering was to use mapPartitions. It is
straightforward to compute the moving average over a partition, except
near the end points. Does anyone see how to fix that?

On Tue, Jan 6, 2015 at 7:20 PM, Sean Owen so...@cloudera.com wrote:

 Interesting, I am not sure the order in which fold() encounters elements
 is guaranteed, although from reading the code, I imagine in practice it is
 first-to-last by partition and then folded first-to-last from those results
 on the driver. I don't know that this would lead to a solution, though, as the
 result here needs to be an RDD, not one value.

 On Wed, Jan 7, 2015 at 12:10 AM, Paolo Platter paolo.plat...@agilelab.it
 wrote:

  In my opinion you should use the fold pattern, obviously after a sortBy
 transformation.

 Paolo

 Sent from my Windows Phone
  --
 From: Asim Jalis asimja...@gmail.com
 Sent: 06/01/2015 23:11
 To: Sean Owen so...@cloudera.com
 Cc: user@spark.apache.org
 Subject: Re: RDD Moving Average

   One problem with this is that we are creating a lot of iterables
 containing a lot of repeated data. Is there a way to do this so that we can
 calculate a moving average incrementally?

 On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen so...@cloudera.com wrote:

 Yes, if you break it down to...

  tickerRDD.map(ticker =>
   (ticker.timestamp, ticker)
 ).map { case (ts, ticker) =>
   ((ts / 60000) * 60000, ticker)
 }.groupByKey

  ... as Michael alluded to, then it more naturally extends to the
 sliding window, since you can flatMap one Ticker to many (bucket, ticker)
 pairs, then group. I think this would implement 1-minute buckets,
 sliding by 15 seconds:

  tickerRDD.flatMap(ticker =>
   (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts, ticker))
 ).map { case (ts, ticker) =>
   ((ts / 60000) * 60000, ticker)
 }.groupByKey

 On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis asimja...@gmail.com wrote:

  I guess I can use a similar groupBy approach. Map each event to all
 the windows that it can belong to. Then do a groupBy, etc. I was wondering
 if there was a more elegant approach.

 On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis asimja...@gmail.com wrote:

  Except I want it to be a sliding window. So the same record could be
 in multiple buckets.






Re: RDD Moving Average

2015-01-06 Thread Paolo Platter
In my opinion you should use the fold pattern, obviously after a sortBy
transformation.

Paolo

Sent from my Windows Phone

From: Asim Jalis asimja...@gmail.com
Sent: 06/01/2015 23:11
To: Sean Owen so...@cloudera.com
Cc: user@spark.apache.org
Subject: Re: RDD Moving Average

One problem with this is that we are creating a lot of iterables containing a 
lot of repeated data. Is there a way to do this so that we can calculate a 
moving average incrementally?

On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen so...@cloudera.com wrote:
Yes, if you break it down to...

tickerRDD.map(ticker =>
  (ticker.timestamp, ticker)
).map { case (ts, ticker) =>
  ((ts / 60000) * 60000, ticker)
}.groupByKey

... as Michael alluded to, then it more naturally extends to the sliding
window, since you can flatMap one Ticker to many (bucket, ticker) pairs, then
group. I think this would implement 1-minute buckets, sliding by 15 seconds:

tickerRDD.flatMap(ticker =>
  (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts, ticker))
).map { case (ts, ticker) =>
  ((ts / 60000) * 60000, ticker)
}.groupByKey

On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis asimja...@gmail.com wrote:
I guess I can use a similar groupBy approach. Map each event to all the windows 
that it can belong to. Then do a groupBy, etc. I was wondering if there was a 
more elegant approach.

On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis asimja...@gmail.com wrote:
Except I want it to be a sliding window. So the same record could be in 
multiple buckets.




Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
I guess I can use a similar groupBy approach. Map each event to all the
windows that it can belong to. Then do a groupBy, etc. I was wondering if
there was a more elegant approach.

On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis asimja...@gmail.com wrote:

 Except I want it to be a sliding window. So the same record could be in
 multiple buckets.

 On Tue, Jan 6, 2015 at 3:43 PM, Sean Owen so...@cloudera.com wrote:

 So you want windows covering the same length of time, some of which will
 be fuller than others? You could, for example, simply bucket the data by
 minute to get this kind of effect. If you have an RDD[Ticker], where Ticker has
 a timestamp in ms, you could:

 tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000)

 ... to get an RDD[(Long,Iterable[Ticker])], where the keys are the moment
 at the start of each minute, and the values are the Tickers within the
 following minute. You can try variations on this to bucket in different
 ways.

 Just be careful because a minute with a huge number of values might cause
 you to run out of memory. If you're just doing aggregations of some kind
 there are more efficient methods than this most generic method, like the
 aggregate methods.

 On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis asimja...@gmail.com wrote:

 Thanks. Another question. I have event data with timestamps. I want to
 create a sliding window using timestamps. Some windows will have a lot of
 events in them; others won’t. Is there a way to get an RDD made of this kind
 of a variable-length window?


 On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen so...@cloudera.com wrote:

 First you'd need to sort the RDD to give it a meaningful order, but I
 assume you have some kind of timestamp in your data you can sort on.

 I think you might be after the sliding() function, a developer API in
 MLlib:


 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43

 On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis asimja...@gmail.com wrote:

 Is there an easy way to do a moving average across a single RDD (in a
 non-streaming app). Here is the use case. I have an RDD made up of stock
 prices. I want to calculate a moving average using a window size of N.

 Thanks.

 Asim








RDD Moving Average

2015-01-06 Thread Asim Jalis
Is there an easy way to do a moving average across a single RDD (in a
non-streaming app). Here is the use case. I have an RDD made up of stock
prices. I want to calculate a moving average using a window size of N.

Thanks.

Asim


Re: RDD Moving Average

2015-01-06 Thread Sean Owen
First you'd need to sort the RDD to give it a meaningful order, but I
assume you have some kind of timestamp in your data you can sort on.

I think you might be after the sliding() function, a developer API in MLlib:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43
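
For a count-based window over sorted prices, usage looks roughly like this (a sketch; sliding() is a developer API, so its exact shape may vary between versions):

import org.apache.spark.mllib.rdd.RDDFunctions._

val n = 5                       // window size
// prices: RDD[Double], already sorted into a meaningful order
val movingAvg = prices.sliding(n).map(w => w.sum / w.length)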

On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis asimja...@gmail.com wrote:

 Is there an easy way to do a moving average across a single RDD (in a
 non-streaming app). Here is the use case. I have an RDD made up of stock
 prices. I want to calculate a moving average using a window size of N.

 Thanks.

 Asim