Mohit, if you want to end up with (1 .. N) , why don't you skip the logic
for finding missing values, and generate it directly?
val max = myCollection.reduce(math.max)
sc.parallelize((0 until max))
In either case, you don't need to call cache, which will force it into
memory - you can do
Austin,
I made up a mock example...my real use case is more complex. I used
foreach() instead of collect/cache..that forces the accumulable to be
evaluated. On another thread Xiangrui pointed me to a sliding window rdd in
mlllib that is a great alternative (although I did not switch to using it)
I don't think there's a direct way of bleeding elements across partitions.
But you could write it yourself relatively succinctly:
A) Sort the RDD
B) Look at the sorted RDD's partitions with the .mapParititionsWithIndex( )
method. Map each partition to its partition ID, and its maximum element.
Hi,
I am trying to find a way to fill in missing values in an RDD. The RDD is a
sorted sequence.
For example, (1, 2, 3, 5, 8, 11, ...)
I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11)
One way to do this is to slide and zip
rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11,