One approach I was considering was to use mapPartitions. It is straightforward to compute the moving average within a partition, except near the partition boundaries, where a window needs rows that live in the neighboring partition. Does anyone see how to fix that?
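To make that concrete, here is a rough sketch of one possible boundary patch. It assumes the RDD is already sorted and that every partition holds at least k - 1 rows; the helper name movingAverage is my own invention. (I believe MLlib's RDDFunctions.sliding uses a similar trick.)

import org.apache.spark.rdd.RDD

def movingAverage(sorted: RDD[Double], k: Int): RDD[Double] = {
  // First k - 1 rows of every partition, collected to the driver.
  val heads = sorted.mapPartitionsWithIndex { (i, it) =>
    Iterator((i, it.take(k - 1).toArray))
  }.collect().toMap
  val bc = sorted.sparkContext.broadcast(heads)
  sorted.mapPartitionsWithIndex { (i, it) =>
    // Extend each partition with the head of its successor so that
    // windows straddling the boundary can still be filled.
    val extended = it ++ bc.value.getOrElse(i + 1, Array.empty[Double]).iterator
    // Emit one average per full window of k consecutive values.
    extended.sliding(k).withPartial(false).map(_.sum / k)
  }
}

Collecting the heads costs one extra job, but only k - 1 rows per partition ever travel to the driver.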
On Tue, Jan 6, 2015 at 7:20 PM, Sean Owen <so...@cloudera.com> wrote:

> Interesting, I am not sure the order in which fold() encounters elements
> is guaranteed, although from reading the code, I imagine in practice it is
> first-to-last by partition and then folded first-to-last from those results
> on the driver. I don't know that this would lead to a solution, though, as
> the result here needs to be an RDD, not one value.
>
> On Wed, Jan 7, 2015 at 12:10 AM, Paolo Platter <paolo.plat...@agilelab.it>
> wrote:
>
>> In my opinion you should use the fold pattern, obviously after a sortBy
>> transformation.
>>
>> Paolo
>>
>> Sent from my Windows Phone
>> ------------------------------
>> From: Asim Jalis <asimja...@gmail.com>
>> Sent: 06/01/2015 23:11
>> To: Sean Owen <so...@cloudera.com>
>> Cc: user@spark.apache.org
>> Subject: Re: RDD Moving Average
>>
>> One problem with this is that we are creating a lot of iterables
>> containing a lot of repeated data. Is there a way to do this so that we
>> can calculate a moving average incrementally?
>>
>> On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Yes, if you break it down to...
>>>
>>> tickerRDD.map(ticker =>
>>>   (ticker.timestamp, ticker)
>>> ).map { case (ts, ticker) =>
>>>   ((ts / 60000) * 60000, ticker)
>>> }.groupByKey
>>>
>>> ... as Michael alluded to, then it more naturally extends to the
>>> sliding window, since you can flatMap one Ticker to many (bucket,
>>> ticker) pairs, then group. I think this would implement 1-minute
>>> buckets, sliding by 15 seconds:
>>>
>>> tickerRDD.flatMap(ticker =>
>>>   (ticker.timestamp - 60000 to ticker.timestamp by 15000)
>>>     .map(ts => (ts, ticker))
>>> ).map { case (ts, ticker) =>
>>>   ((ts / 60000) * 60000, ticker)
>>> }.groupByKey
>>>
>>> On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis <asimja...@gmail.com> wrote:
>>>
>>>> I guess I can use a similar groupBy approach. Map each event to all
>>>> the windows that it can belong to. Then do a groupBy, etc. I was
>>>> wondering if there was a more elegant approach.
>>>>
>>>> On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis <asimja...@gmail.com> wrote:
>>>>
>>>>> Except I want it to be a sliding window. So the same record could be
>>>>> in multiple buckets.
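P.S. On the incremental question quoted above: once a partition has been extended past its boundary as in the sketch at the top, a running sum over a queue of the last k values avoids materializing an Iterable per window at all. Another rough sketch, with names again my own, that could replace the sliding(k) step:

import scala.collection.mutable

def incrementalAverages(values: Iterator[Double], k: Int): Iterator[Double] = {
  val window = mutable.Queue.empty[Double] // the last k values seen
  var sum = 0.0                            // running sum of the queue
  values.flatMap { v =>
    window.enqueue(v)
    sum += v
    if (window.size > k) sum -= window.dequeue() // retire the oldest value
    if (window.size == k) Some(sum / k) else None // emit once the window is full
  }
}

Each value is added once and subtracted once, so the cost is O(1) per element; for very long partitions it may be worth recomputing the sum periodically to bound floating-point drift.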