Hi,
I am trying to find a way to fill in missing values in an RDD. The RDD is a
sorted sequence.
For example, (1, 2, 3, 5, 8, 11, ...)
I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).

One way to do this is to "slide and zip":

rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
x = rdd1.first
rdd2 = rdd1 filter (_ != x)   // drop the first element (values are distinct)
rdd3 = rdd2 zip rdd1          // pair each element with its predecessor
rdd4 = rdd3 flatMap { case (next, cur) => /* generate cur and the missing elements up to next */ }

(Note: zip requires both RDDs to have the same number of elements in every partition, which a filtered copy of rdd1 will not have, so the shifted copy has to be realigned first; the sketch below does that with zipWithIndex.)
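Concretely, the realigned version would look something like this (an untested sketch; the object scaffolding, names, and local master are mine, just for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object FillGaps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fill-gaps").setMaster("local[2]"))
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))

    // Key each element by its position, and a second copy by position - 1,
    // so that joining the two yields (current, next) pairs.
    val indexed = rdd1.zipWithIndex().map { case (v, i) => (i, v) }
    val shifted = indexed.map { case (i, v) => (i - 1, v) }
    val pairs   = indexed.join(shifted).values   // (current, next)

    // Expand each pair to current..next-1, then append the final element.
    val last   = rdd1.max()
    val filled = pairs.flatMap { case (cur, next) => cur until next } ++ sc.parallelize(Seq(last))

    println(filled.sortBy(identity).collect().mkString(", "))
    // prints: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
    sc.stop()
  }
}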

Another method, which I think is more efficient, is to use mapPartitions()
on rdd1, so that I can iterate over the elements of rdd1 within each partition.
However, that leaves the partition boundaries "unfilled". Is there a way,
within the function passed to mapPartitions, to read the first element of
the next partition?
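The best workaround I can think of is an extra pass: collect the first element of every partition to the driver, broadcast that map, and let each partition fill up to the head of the next one. Something like this (untested; the names are mine, and it assumes rdd1 is globally sorted and, for simplicity, that no partition is empty):

// 1) Ship the head of each partition to the driver and broadcast it.
val heads: Map[Int, Int] = rdd1
  .mapPartitionsWithIndex((idx, it) =>
    if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty)
  .collect()
  .toMap
val headsB = sc.broadcast(heads)

// 2) Fill gaps within each partition, and bridge the boundary by filling
//    up to (but not including) the head of the next partition.
val filled = rdd1.mapPartitionsWithIndex { (idx, it) =>
  val elems = it.toVector
  if (elems.isEmpty) Iterator.empty
  else {
    val within = elems.sliding(2).collect { case Seq(a, b) => a until b }.flatten
    val boundary = headsB.value.get(idx + 1) match {
      case Some(next) => elems.last until next
      case None       => Seq(elems.last)  // last partition: keep the final element
    }
    within ++ boundary.iterator
  }
}

Only one element per partition crosses the network in the first pass, so it should stay cheap.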

The latter approach also seems to generalize to any "sliding window"
calculation on the RDD, whereas the former requires a lot of sliding and
zipping, which I believe is inefficient. If only I could read the next
partition... I have tried passing a reference to rdd1 into the function
passed to mapPartitions, but the reference turns out to be null inside the
task. I guess this is because Spark cannot handle one mapper calling into
another RDD: RDD operations can only be invoked on the driver, not inside a
task running on a worker.
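For the general sliding-window case, MLlib seems to ship a helper that does exactly this shift-and-zip across partition boundaries: org.apache.spark.mllib.rdd.RDDFunctions.sliding (a developer API, so it may depend on the Spark version). Roughly:

import org.apache.spark.mllib.rdd.RDDFunctions._

// sliding(2) emits Array(current, next) pairs that span partition
// boundaries, so gap-filling becomes one flatMap plus the final element.
val last   = rdd1.max()
val filled = rdd1.sliding(2).flatMap { case Array(cur, next) => cur until next } ++
             sc.parallelize(Seq(last))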

Mohit.
