Koert, thanks for the referral to your current pull request! I found it very thoughtful and thought-provoking.

On Fri, Jan 30, 2015 at 9:19 AM, Koert Kuipers <ko...@tresata.com> wrote:

> and if it's a single giant timeseries that is already sorted then Mohit's
> solution sounds good to me.
>
> On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak <michaelma...@yahoo.com>
> wrote:
>
>> But isn't foldLeft() overkill for the originally stated use case of max
>> diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative
>> non-associative accumulation as opposed to an embarrassingly parallel
>> operation such as this one?
>>
>> This use case reminds me of FIR filtering in DSP. It seems that RDDs
>> could use something that serves the same purpose as
>> scala.collection.Iterator.sliding.
>>
>> ------------------------------
>> *From:* Koert Kuipers <ko...@tresata.com>
>> *To:* Mohit Jaggi <mohitja...@gmail.com>
>> *Cc:* Tobias Pfeiffer <t...@preferred.jp>; "Ganelin, Ilya" <
>> ilya.gane...@capitalone.com>; derrickburns <derrickrbu...@gmail.com>; "
>> user@spark.apache.org" <user@spark.apache.org>
>> *Sent:* Friday, January 30, 2015 7:11 AM
>> *Subject:* Re: spark challenge: zip with next???
>>
>> assuming the data can be partitioned then you have many timeseries for
>> which you want to detect potential gaps. also assuming the resulting gaps
>> info per timeseries is much smaller data than the timeseries data itself,
>> then this is a classical example to me of a sorted (streaming) foldLeft,
>> requiring an efficient secondary sort in the spark shuffle. i am trying to
>> get that into spark here:
>> https://issues.apache.org/jira/browse/SPARK-3655
>>
>> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi <mohitja...@gmail.com>
>> wrote:
>>
>> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
>>
>> you can use the MLlib function or do the following (which is what I had
>> done):
>>
>> - in the first pass over the data, using mapPartitionsWithIndex, gather the
>> first item in each partition. you can use collect (or an aggregator) for
>> this. “key” them by the partition index. at the end, you will have a map
>> (partition index) --> first item
>> - in the second pass over the data, using mapPartitionsWithIndex again,
>> look at two items at a time (or in the general case N items at a time; you
>> can use scala’s sliding iterator) and check the time difference (or any
>> sliding window computation). To this mapPartitionsWithIndex call, pass the
>> map created in the previous step. You will need it to check the last item
>> in each partition against the first item of the next one.
>>
>> If you can tolerate a few inaccuracies then you can just do the second
>> step. You will miss the “boundaries” of the partitions but it might be
>> acceptable for your use case.
>>
>> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>
>> Hi,
>>
>> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya <
>> ilya.gane...@capitalone.com> wrote:
>>
>> Make a copy of your RDD with an extra entry in the beginning to offset.
>> Then you can zip the two RDDs and run a map to generate an RDD of
>> differences.
>>
>> Does that work? I recently tried something to compute differences between
>> each entry and the next, so I did
>> val rdd1 = ... // null element + rdd
>> val rdd2 = ... // rdd + null element
>> but got an error message about zip requiring data sizes in each partition
>> to match.
>>
>> Tobias
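
For anyone finding this thread later, here is a rough sketch of the two-pass idea Mohit describes above, as I understand it. It is plain Scala rather than Spark code: each inner Seq stands in for one partition, and the names AdjacentGaps, firstItems, adjacentDiffs and maxGap are mine, not from Spark or MLlib. In a real job the two passes would presumably map onto mapPartitionsWithIndex (plus a collect for the first pass), with the small map of first items shipped to the second pass.

```scala
// Plain-Scala sketch; each inner Seq is a stand-in for one RDD partition.
// All names here are hypothetical and only illustrate the two passes.
object AdjacentGaps {

  // Pass 1: first item of every non-empty partition, keyed by partition index.
  def firstItems(partitions: Seq[Seq[Long]]): Map[Int, Long] =
    partitions.zipWithIndex.collect {
      case (part, idx) if part.nonEmpty => idx -> part.head
    }.toMap

  // Pass 2: per partition, append the first item of the next non-empty
  // partition (if there is one), then slide a window of 2 over the items.
  def adjacentDiffs(partitions: Seq[Seq[Long]]): Seq[Long] = {
    val firsts = firstItems(partitions)
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      if (part.isEmpty) List.empty[Long]
      else {
        // Skip over empty partitions when looking for the neighbour.
        val nextFirst: Option[Long] =
          firsts.keys.filter(_ > idx).reduceOption(_ min _).map(firsts)
        (part.iterator ++ nextFirst.iterator)
          .sliding(2)
          .withPartial(false) // drop a trailing window with a single item
          .map { case Seq(a, b) => b - a }
          .toList
      }
    }
  }

  // The originally stated use case: largest difference between neighbours.
  def maxGap(partitions: Seq[Seq[Long]]): Option[Long] =
    adjacentDiffs(partitions).reduceOption(_ max _)
}
```

Dropping the nextFirst lookup gives the "second step only" variant, which misses exactly the pairs that straddle a partition boundary. The sketch also assumes the data is already sorted and that partition order matches the sort order.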