Hi Pro, I want to merge elements in a Spark RDD when the two elements satisfy certain condition
Suppose there is a RDD[Seq[Int]], where some Seq[Int] in this RDD contain overlapping elements. The task is to merge all overlapping Seq[Int] in this RDD, and store the result into a new RDD. For example, suppose RDD[Seq[Int]] = [[1,2,3], [2,4,5], [1,2], [7,8,9]], the result should be [[1,2,3,4,5], [7,8,9]]. Since RDD[Seq[Int]] is very large, I cannot do it in driver program. Is it possible to get it done using distributed groupBy/map/reduce, etc? Thanks in advance, Pengcheng