Interesting, I am not sure the order in which fold() encounters elements is
guaranteed, although from reading the code, I imagine in practice it is
first-to-last by partition and then folded first-to-last from those results
on the driver. I don't know this would lead to a solution though as the
result here needs to be an RDD, not one value.

On Wed, Jan 7, 2015 at 12:10 AM, Paolo Platter <paolo.plat...@agilelab.it>
wrote:

>  In my opinion you should use fold pattern. Obviously after an sort by
> trasformation.
>
> Paolo
>
> Inviata dal mio Windows Phone
>  ------------------------------
> Da: Asim Jalis <asimja...@gmail.com>
> Inviato: ‎06/‎01/‎2015 23:11
> A: Sean Owen <so...@cloudera.com>
> Cc: user@spark.apache.org
> Oggetto: Re: RDD Moving Average
>
>   One problem with this is that we are creating a lot of iterables
> containing a lot of repeated data. Is there a way to do this so that we can
> calculate a moving average incrementally?
>
> On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Yes, if you break it down to...
>>
>>  tickerRDD.map(ticker =>
>>   (ticker.timestamp, ticker)
>> ).map { case(ts, ticker) =>
>>   ((ts / 60000) * 60000, ticker)
>> }.groupByKey
>>
>>  ... as Michael alluded to, then it more naturally extends to the
>> sliding window, since you can flatMap one Ticker to many (bucket, ticker)
>> pairs, then group. I think this would implementing 1 minute buckets,
>> sliding by 10 seconds:
>>
>>  tickerRDD.flatMap(ticker =>
>>   (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts,
>> ticker))
>> ).map { case(ts, ticker) =>
>>   ((ts / 60000) * 60000, ticker)
>> }.groupByKey
>>
>> On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis <asimja...@gmail.com> wrote:
>>
>>>  I guess I can use a similar groupBy approach. Map each event to all
>>> the windows that it can belong to. Then do a groupBy, etc. I was wondering
>>> if there was a more elegant approach.
>>>
>>> On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis <asimja...@gmail.com> wrote:
>>>
>>>>  Except I want it to be a sliding window. So the same record could be
>>>> in multiple buckets.
>>>>
>>>>
>

Reply via email to