merge elements in a Spark RDD under custom condition

Pengcheng YIN Mon, 01 Dec 2014 00:08:21 -0800

Hi Pro,
I want to merge elements in a Spark RDD whenever two elements satisfy a 
certain condition.


Suppose there is an RDD[Seq[Int]] in which some of the Seq[Int] share 
elements with one another. The task is to merge all overlapping Seq[Int] in 
this RDD and store the result in a new RDD.

For example, suppose RDD[Seq[Int]] = [[1,2,3], [2,4,5], [1,2], [7,8,9]], the 
result should be [[1,2,3,4,5], [7,8,9]].
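
To pin down the intended semantics: merging overlapping sequences until none 
overlap amounts to finding connected components, where two values are 
connected if they appear in the same Seq. Below is a minimal single-machine 
sketch in Python that reproduces the example above. It is not Spark code; 
merge_overlapping is a hypothetical helper name, and union-find is just one 
possible way to compute the result.

```python
def merge_overlapping(seqs):
    """Merge sequences that share at least one element (transitively).

    Hypothetical local helper: values are nodes, and every sequence links
    all of its values together. The merged result is the set of connected
    components, returned as sorted lists.
    """
    parent = {}  # union-find forest: value -> parent value

    def find(x):
        # Register unseen values as their own root, then walk to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every value of a sequence to that sequence's first value.
    for seq in seqs:
        for value in seq:
            union(seq[0], value)

    # Collect values by their root to form the merged groups.
    groups = {}
    for value in list(parent):
        groups.setdefault(find(value), set()).add(value)
    return sorted(sorted(group) for group in groups.values())


merged = merge_overlapping([[1, 2, 3], [2, 4, 5], [1, 2], [7, 8, 9]])
print(merged)
```

This only serves as a reference for the expected output on small inputs; it 
does not address the distributed case asked about below.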

Since the RDD[Seq[Int]] is very large, I cannot do this in the driver 
program. Is it possible to get it done with distributed operations such as 
groupBy/map/reduce?

Thanks in advance,

Pengcheng
