Re: RDD Moving Average

2015-01-09 Thread Mohit Jaggi
Read this: http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E

Re: RDD Moving Average

2015-01-08 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis wrote: > One approach I was considering was to use mapPartitions. It is straightforward to compute the moving average over a partition, except for near the end point. Does anyone see how to fix that? Well, I guess this is not a perfect use case…
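
A sketch of the mapPartitions idea quoted above, with one common fix for the end-point problem: collect the first n - 1 rows of every partition and append them to the partition before it, so windows that straddle a boundary stay complete. The Ticker class, tickerRDD, and window size n are illustrative assumptions, not from the thread:

import org.apache.spark.rdd.RDD

case class Ticker(timestamp: Long, price: Double)  // assumed schema

val n = 3  // window size, illustrative
val prices: RDD[Double] = tickerRDD.sortBy(_.timestamp).map(_.price)

// First n - 1 elements of every partition, keyed by partition index.
val heads = prices
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.take(n - 1).toArray)))
  .collectAsMap()
val headsB = prices.sparkContext.broadcast(heads)

// Extend each partition with the head of the next one, then slide over it.
val movingAvg = prices.mapPartitionsWithIndex { (i, it) =>
  val extended = it ++ headsB.value.getOrElse(i + 1, Array.empty[Double]).iterator
  extended.sliding(n).withPartial(false).map(w => w.sum / n)
}

This assumes every partition holds at least n - 1 rows; a window spanning three or more partitions would still be lost.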

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
>> From: Asim Jalis >> Sent: 06/01/2015 23:11 >> To: Sean Owen >> Cc: user@spark.apache.org >> Subject: Re: RDD Moving Average >> >> One problem with this is that we are creating a lot of iterables containing a lot of repeated data…

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
> From: Asim Jalis > Sent: 06/01/2015 23:11 > To: Sean Owen > Cc: user@spark.apache.org > Subject: Re: RDD Moving Average > > One problem with this is that we are creating a lot of iterables containing a lot of repeated data. Is there a way to do this so that we can calculate a moving average incrementally?

Re: RDD Moving Average

2015-01-06 Thread Paolo Platter
Cc: user@spark.apache.org Subject: Re: RDD Moving Average One problem with this is that we are creating a lot of iterables containing a lot of repeated data. Is there a way to do this so that we can calculate a moving average incrementally? On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen wrote…

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
One problem with this is that we are creating a lot of iterables containing a lot of repeated data. Is there a way to do this so that we can calculate a moving average incrementally? On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen wrote: > Yes, if you break it down to... > tickerRDD.map(ticker => (ticker.timestamp, ticker))…

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
Yes, if you break it down to... tickerRDD.map(ticker => (ticker.timestamp, ticker)).map { case (ts, ticker) => ((ts / 60000) * 60000, ticker) }.groupByKey ... as Michael alluded to, then it more naturally extends to the sliding window, since you can flatMap one Ticker to many (bucket, ticker) pairs…
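
Spelled out, the flatMap variant Sean describes might look like the sketch below, assuming a hypothetical Ticker(timestamp: Long, price: Double), 1-minute buckets, and a 5-bucket trailing window (all illustrative, not from the thread):

case class Ticker(timestamp: Long, price: Double)  // assumed schema

val bucketMs = 60000L  // 1-minute buckets
val windowBuckets = 5  // each ticker lands in 5 trailing windows

// Emit each ticker once per window that should contain it; the key is the
// minute-aligned timestamp of the window's last bucket.
val windowed = tickerRDD.flatMap { ticker =>
  val bucket = (ticker.timestamp / bucketMs) * bucketMs
  (0 until windowBuckets).map(i => (bucket + i * bucketMs, ticker))
}

// Average the prices that fell into each window key.
val movingAvg = windowed.groupByKey().mapValues { ts =>
  ts.map(_.price).sum / ts.size
}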

Re: RDD Moving Average

2015-01-06 Thread Michael Malak
Asim Jalis writes: > Thanks. Another question. I have event data with timestamps. I want to create a sliding window using timestamps. Some windows will have a lot of events in them, others won’t. Is there a way to get an RDD made of this kind of a variable-length window? You should…

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
I guess I can use a similar groupBy approach. Map each event to all the windows that it can belong to. Then do a groupBy, etc. I was wondering if there was a more elegant approach. On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis wrote: > Except I want it to be a sliding window. So the same record could be in multiple buckets…

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
Except I want it to be a sliding window. So the same record could be in multiple buckets. On Tue, Jan 6, 2015 at 3:43 PM, Sean Owen wrote: > So you want windows covering the same length of time, some of which will be fuller than others? You could, for example, simply bucket the data by minute…

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
So you want windows covering the same length of time, some of which will be fuller than others? You could, for example, simply bucket the data by minute to get this kind of effect. If you have an RDD[Ticker], where Ticker has a timestamp in ms, you could: tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000)…
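
Completed into a runnable sketch, again assuming an illustrative Ticker(timestamp: Long, price: Double):

// 60000 ms = 1 minute, so keys are minute-aligned timestamps.
val byMinute = tickerRDD.groupBy(ticker => (ticker.timestamp / 60000) * 60000)
val avgPerMinute = byMinute.mapValues(ts => ts.map(_.price).sum / ts.size)

This yields tumbling (non-overlapping) windows; the flatMap variant discussed above turns them into sliding ones.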

Re: RDD Moving Average

2015-01-06 Thread Asim Jalis
Thanks. Another question. I have event data with timestamps. I want to create a sliding window using timestamps. Some windows will have a lot of events in them, others won’t. Is there a way to get an RDD made of this kind of a variable-length window? On Tue, Jan 6, 2015 at 1:03 PM, Sean Owen wrote…

Re: RDD Moving Average

2015-01-06 Thread Sean Owen
First you'd need to sort the RDD to give it a meaningful order, but I assume you have some kind of timestamp in your data you can sort on. I think you might be after the sliding() function, a developer API in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark
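
For the archive, a minimal usage sketch: sort first, then average each run of N consecutive prices. The Ticker fields and window size below are illustrative assumptions; the import reflects where sliding() lives in MLlib:

import org.apache.spark.mllib.rdd.RDDFunctions._  // brings sliding() into scope

val n = 5  // window size, illustrative
val prices = tickerRDD.sortBy(_.timestamp).map(_.price)

// sliding(n) yields each run of n consecutive elements as an Array.
val movingAvg = prices.sliding(n).map(w => w.sum / n)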

RDD Moving Average

2015-01-06 Thread Asim Jalis
Is there an easy way to do a moving average across a single RDD (in a non-streaming app)? Here is the use case. I have an RDD made up of stock prices. I want to calculate a moving average using a window size of N. Thanks. Asim