Actually, there is a sliding method implemented in
mllib.rdd.RDDFunctions. Since it is not meant for general use cases, we
didn't include it in spark-core. You can take a look at the
implementation there and see whether it fits. -Xiangrui
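
For reference, a rough, untested sketch of how that sliding helper could be
used for the gap-filling question below. It assumes Spark 1.x with spark-mllib
on the classpath and a spark-shell session where sc is already defined;
sliding(2) pairs every element with its successor across partition boundaries,
so each pair can emit the missing range:

  import org.apache.spark.mllib.rdd.RDDFunctions._   // brings sliding() into scope

  val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))

  // sliding(2) yields every consecutive pair, spanning partition boundaries,
  // so each pair (a, b) can emit the half-open range a until b.
  val filled = rdd1.sliding(2).flatMap { case Array(a, b) => a until b } ++
    sc.parallelize(Seq(rdd1.max()))   // the last element has no following pair

  filled.collect().sorted             // => 1,2,3,4,5,6,7,8,9,10,11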

On Mon, May 19, 2014 at 10:06 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> Thanks Sean. Yes, your solution works :-) I did oversimplify my real
> problem, which has other parameters that go along with the sequence.
>
>
> On Fri, May 16, 2014 at 3:03 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Not sure if this is feasible, but this literally does what I think you
>> are describing:
>>
>> sc.parallelize(rdd1.first to rdd1.last)
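
A small aside on the line above: RDD has no `last` method, so run literally in
the spark-shell this would need something like max() in its place (an untested
sketch, assuming an RDD[Int] that is sorted with distinct values):

  val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))
  val filled = sc.parallelize(rdd1.first() to rdd1.max())   // rdd1 is sorted, so first() is the minimum
  filled.collect()                                          // => 1,2,...,11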
>>
>> On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
>> > Hi,
>> > I am trying to find a way to fill in missing values in an RDD. The RDD is
>> > a sorted sequence.
>> > For example, (1, 2, 3, 5, 8, 11, ...)
>> > I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).
>> >
>> > One way to do this is to "slide and zip":
>> > rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
>> > x = rdd1.first
>> > rdd2 = rdd1 filter (_ != x)
>> > rdd3 = rdd2 zip rdd1
>> > rdd4 = rdd3 flatMap { (x, y) => generate the missing elements between x and y }
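
A rough, untested sketch of making that zip idea run in the spark-shell: a
plain zip needs both RDDs to have the same number of partitions and the same
number of elements per partition, so below consecutive elements are paired by
index instead (assumes an RDD[Int], sorted with distinct values; the names are
just for illustration):

  val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))
  val indexed = rdd1.zipWithIndex().map { case (v, i) => (i, v) }   // (index, value)
  val shifted = indexed.map { case (i, v) => (i - 1, v) }           // value at index i+1, keyed by i
  val filled = indexed.join(shifted).flatMap { case (_, (x, y)) => x until y } ++
    sc.parallelize(Seq(rdd1.max()))                                 // keep the final element
  filled.collect().sorted                                           // => 1,2,...,11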
>> >
>> > Another method, which I think is more efficient, is to use mapPartitions()
>> > on rdd1 to be able to iterate over the elements of rdd1 in each partition.
>> > However, that leaves the boundaries of the partitions "unfilled". Is there a
>> > way, within the function passed to mapPartitions, to read the first element
>> > of the next partition?
>> >
>> > The latter approach also appears to work for a general "sliding window"
>> > calculation on the RDD. The former technique requires a lot of "sliding and
>> > zipping", and I believe it is not efficient. If only I could read the next
>> > partition... I have tried passing a pointer to rdd1 into the function passed
>> > to mapPartitions, but the rdd1 pointer turns out to be null, I guess because
>> > Spark cannot deal with a mapper calling another mapper (since it happens on
>> > a worker, not the driver).
>> >
>> > Mohit.
>
>
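
On the mapPartitions question: I don't think a partition's function can reach
into the next partition directly, but one workaround is to collect the first
element of every partition on the driver and hand the result back through a
broadcast variable, so each partition can also fill the gap up to the next
partition's head. A rough, untested sketch for the spark-shell (RDD[Int],
sorted, distinct values; all names are just for illustration):

  val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11), numSlices = 3)
  val nParts = rdd1.partitions.length

  // First element of each non-empty partition, keyed by partition index.
  val heads = rdd1.mapPartitionsWithIndex { (i, it) =>
    if (it.hasNext) Iterator((i, it.next())) else Iterator.empty
  }.collect().toMap
  val bHeads = sc.broadcast(heads)

  val filled = rdd1.mapPartitionsWithIndex { (i, it) =>
    // Head of the next non-empty partition, if any; it bounds this partition's last gap.
    val next = ((i + 1) until nParts).flatMap(j => bHeads.value.get(j)).headOption
    val elems = it.toArray
    if (elems.isEmpty) Iterator.empty
    else {
      val upper = next.getOrElse(elems.last + 1)   // exclusive upper bound for the last gap
      (elems :+ upper).sliding(2).flatMap { case Array(a, b) => a until b }
    }
  }
  filled.collect()   // => 1,2,3,4,5,6,7,8,9,10,11

The first pass touches only one element per partition, so the extra job to
collect the heads should be cheap.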
