And if it's a single giant timeseries that is already sorted, then Mohit's solution sounds good to me.
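For concreteness, here is an untested sketch of that two-pass approach (findGaps and maxGap are names I made up; it assumes a single already-sorted RDD[Long] of timestamps, and for a real job you would broadcast the map from pass 1):

  import org.apache.spark.rdd.RDD

  // Untested sketch: report gaps larger than maxGap between consecutive
  // timestamps in an already-sorted RDD.
  def findGaps(ts: RDD[Long], maxGap: Long): RDD[(Long, Long)] = {
    // pass 1: first element of every non-empty partition, keyed by
    // partition index
    val firsts: Map[Int, Long] = ts.mapPartitionsWithIndex { (i, it) =>
      if (it.hasNext) Iterator((i, it.next())) else Iterator.empty
    }.collect().toMap

    val n = ts.partitions.length
    // pass 2: slide over each partition, appending the first element of
    // the next non-empty partition so boundary pairs are not missed
    ts.mapPartitionsWithIndex { (i, it) =>
      val nextHead = ((i + 1) until n).flatMap(firsts.get).headOption
      (it ++ nextHead.iterator).sliding(2).collect {
        case Seq(a, b) if b - a > maxGap => (a, b)
      }
    }
  }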
On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak <michaelma...@yahoo.com> wrote:

> But isn't foldLeft() overkill for the originally stated use case of max
> diff of adjacent pairs? Isn't foldLeft() for recursive, non-commutative,
> non-associative accumulation, as opposed to an embarrassingly parallel
> operation such as this one?
>
> This use case reminds me of FIR filtering in DSP. It seems that RDDs
> could use something that serves the same purpose as
> scala.collection.Iterator.sliding.
>
> ------------------------------
> *From:* Koert Kuipers <ko...@tresata.com>
> *To:* Mohit Jaggi <mohitja...@gmail.com>
> *Cc:* Tobias Pfeiffer <t...@preferred.jp>; "Ganelin, Ilya" <ilya.gane...@capitalone.com>; derrickburns <derrickrbu...@gmail.com>; "user@spark.apache.org" <user@spark.apache.org>
> *Sent:* Friday, January 30, 2015 7:11 AM
> *Subject:* Re: spark challenge: zip with next???
>
> Assuming the data can be partitioned, then you have many timeseries for
> which you want to detect potential gaps. Also assuming the resulting gap
> info per timeseries is much smaller data than the timeseries data itself,
> this is a classic example to me of a sorted (streaming) foldLeft,
> requiring an efficient secondary sort in the Spark shuffle. I am trying
> to get that into Spark here:
> https://issues.apache.org/jira/browse/SPARK-3655
>
> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi <mohitja...@gmail.com>
> wrote:
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
>
> You can use the MLlib function, or do the following (which is what I had
> done):
>
> - In the first pass over the data, using mapPartitionsWithIndex, gather
> the first item in each partition. You can use collect (or an aggregator)
> for this. "Key" them by the partition index. At the end, you will have a
> map (partition index) --> first item.
> - In the second pass over the data, using mapPartitionsWithIndex again,
> look at two items at a time (or, in the general case, N items at a time;
> you can use Scala's sliding iterator) and check the time difference (or
> do any sliding-window computation). To this mapPartitions, pass the map
> created in the previous step. You will need it to check the last item in
> each partition.
>
> If you can tolerate a few inaccuracies, then you can do just the second
> step. You will miss the "boundaries" of the partitions, but it might be
> acceptable for your use case.
>
> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>
> Hi,
>
> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya
> <ilya.gane...@capitalone.com> wrote:
>
> Make a copy of your RDD with an extra entry in the beginning to offset
> it. Then you can zip the two RDDs and run a map to generate an RDD of
> differences.
>
> Does that work? I recently tried something similar to compute differences
> between each entry and the next, so I did
>
>   val rdd1 = ... // null element + rdd
>   val rdd2 = ... // rdd + null element
>
> but got an error message about zip requiring data sizes in each partition
> to match.
>
> Tobias
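Regarding Michael's point: MLlib already ships a sliding() for RDDs that handles partition boundaries internally, and this is presumably the "MLlib function" Mohit mentioned. An untested sketch, with ts again an assumed RDD[Long] and maxGap a made-up threshold:

  import org.apache.spark.mllib.rdd.RDDFunctions._

  // sliding(2) pairs each element with its successor, across partition
  // boundaries; keep the pairs whose difference exceeds the threshold
  val maxGap = 1000L  // hypothetical threshold
  val gaps = ts.sliding(2).collect {
    case Array(a, b) if b - a > maxGap => (a, b)
  }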
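On Tobias's zip error: zip() requires both RDDs to have the same number of elements in each partition, which the null-padded copies won't have. One workaround is to index both sides and join on shifted indices (untested, and shuffle-heavier than the sliding approach above):

  import org.apache.spark.SparkContext._  // pair RDD functions (pre-1.3)

  // pair t_i with t_{i+1} by joining on shifted global indices
  val indexed = ts.zipWithIndex().map(_.swap)               // (i, t_i)
  val shifted = indexed.map { case (i, t) => (i - 1, t) }   // (i, t_{i+1})
  val diffs   = indexed.join(shifted).mapValues { case (a, b) => b - a }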