And if it's a single giant timeseries that is already sorted, then Mohit's solution sounds good to me.
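For concreteness, here is an untested sketch of that two-pass approach (findGaps and maxGap are names I made up; it assumes a single already-sorted RDD[Long] of timestamps, and for a real job you would broadcast the map from pass 1):

  import org.apache.spark.rdd.RDD

  // Untested sketch: report gaps larger than maxGap between consecutive
  // timestamps in an already-sorted RDD.
  def findGaps(ts: RDD[Long], maxGap: Long): RDD[(Long, Long)] = {
    // pass 1: first element of every non-empty partition, keyed by
    // partition index
    val firsts: Map[Int, Long] = ts.mapPartitionsWithIndex { (i, it) =>
      if (it.hasNext) Iterator((i, it.next())) else Iterator.empty
    }.collect().toMap

    val n = ts.partitions.length
    // pass 2: slide over each partition, appending the first element of
    // the next non-empty partition so boundary pairs are not missed
    ts.mapPartitionsWithIndex { (i, it) =>
      val nextHead = ((i + 1) until n).flatMap(firsts.get).headOption
      (it ++ nextHead.iterator).sliding(2).collect {
        case Seq(a, b) if b - a > maxGap => (a, b)
      }
    }
  }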
On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak <michaelma...@yahoo.com> wrote:

> But isn't foldLeft() overkill for the originally stated use case of max
> diff of adjacent pairs? Isn't foldLeft() for recursive, non-commutative,
> non-associative accumulation, as opposed to an embarrassingly parallel
> operation such as this one?
>
> This use case reminds me of FIR filtering in DSP. It seems that RDDs
> could use something that serves the same purpose as
> scala.collection.Iterator.sliding.
>
> ------------------------------
> *From:* Koert Kuipers <ko...@tresata.com>
> *To:* Mohit Jaggi <mohitja...@gmail.com>
> *Cc:* Tobias Pfeiffer <t...@preferred.jp>; "Ganelin, Ilya" <ilya.gane...@capitalone.com>; derrickburns <derrickrbu...@gmail.com>; "user@spark.apache.org" <user@spark.apache.org>
> *Sent:* Friday, January 30, 2015 7:11 AM
> *Subject:* Re: spark challenge: zip with next???
>
> Assuming the data can be partitioned, then you have many timeseries for
> which you want to detect potential gaps. Also assuming the resulting gap
> info per timeseries is much smaller data than the timeseries data itself,
> this is a classic example to me of a sorted (streaming) foldLeft,
> requiring an efficient secondary sort in the Spark shuffle. I am trying
> to get that into Spark here:
> https://issues.apache.org/jira/browse/SPARK-3655
>
> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi <mohitja...@gmail.com>
> wrote:
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
>
> You can use the MLlib function, or do the following (which is what I had
> done):
>
> - In the first pass over the data, using mapPartitionsWithIndex, gather
> the first item in each partition. You can use collect (or an aggregator)
> for this. "Key" them by the partition index. At the end, you will have a
> map (partition index) --> first item.
> - In the second pass over the data, using mapPartitionsWithIndex again,
> look at two items at a time (or, in the general case, N items at a time;
> you can use Scala's sliding iterator) and check the time difference (or
> do any sliding-window computation). To this mapPartitions, pass the map
> created in the previous step. You will need it to check the last item in
> each partition.
>
> If you can tolerate a few inaccuracies, then you can do just the second
> step. You will miss the "boundaries" of the partitions, but it might be
> acceptable for your use case.
>
> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>
> Hi,
>
> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya
> <ilya.gane...@capitalone.com> wrote:
>
> Make a copy of your RDD with an extra entry in the beginning to offset
> it. Then you can zip the two RDDs and run a map to generate an RDD of
> differences.
>
> Does that work? I recently tried something similar to compute differences
> between each entry and the next, so I did
>
>   val rdd1 = ... // null element + rdd
>   val rdd2 = ... // rdd + null element
>
> but got an error message about zip requiring data sizes in each partition
> to match.
>
> Tobias
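Regarding Michael's point: MLlib already ships a sliding() for RDDs that handles partition boundaries internally, and this is presumably the "MLlib function" Mohit mentioned. An untested sketch, with ts again an assumed RDD[Long] and maxGap a made-up threshold:

  import org.apache.spark.mllib.rdd.RDDFunctions._

  // sliding(2) pairs each element with its successor, across partition
  // boundaries; keep the pairs whose difference exceeds the threshold
  val maxGap = 1000L  // hypothetical threshold
  val gaps = ts.sliding(2).collect {
    case Array(a, b) if b - a > maxGap => (a, b)
  }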
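On Tobias's zip error: zip() requires both RDDs to have the same number of elements in each partition, which the null-padded copies won't have. One workaround is to index both sides and join on shifted indices (untested, and shuffle-heavier than the sliding approach above):

  import org.apache.spark.SparkContext._  // pair RDD functions (pre-1.3)

  // pair t_i with t_{i+1} by joining on shifted global indices
  val indexed = ts.zipWithIndex().map(_.swap)               // (i, t_i)
  val shifted = indexed.map { case (i, t) => (i - 1, t) }   // (i, t_{i+1})
  val diffs   = indexed.join(shifted).mapValues { case (a, b) => b - a }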