We tried that but couldn't figure out a way to efficiently filter it. Lets take two RDDs.
rdd1: (1,2) (1,5) (2,3) (3,20) (3,16) rdd2: (1,2) (3,30) (3,16) (5,12) rdd1.leftOuterJoin(rdd2) and get rdd1.subtract(rdd2): (1,(2,Some(2))) (1,(5,Some(2))) (2,(3,None)) (3,(20,Some(30))) (3,(20,Some(16))) (3,(16,Some(30))) (3,(16,Some(16))) case (x, (y, z)) => Apart from allowing z == None and filtering on y == z, we also should filter out (3, (16, Some(30))). How can we do that efficiently without resorting to broadcast of any elements of rdd2? Regards, Raghava. On Mon, May 9, 2016 at 6:27 AM, ayan guha <guha.a...@gmail.com> wrote: > How about outer join? > On 9 May 2016 13:18, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> > wrote: > >> Hello All, >> >> We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key >> (number of partitions are same for both the RDDs). We would like to >> subtract rdd2 from rdd1. >> >> The subtract code at >> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala >> seems to group the elements of both the RDDs using (x, null) where x is the >> element of the RDD and partition them. Then it makes use of >> subtractByKey(). This way, RDDs have to be repartitioned on x (which in our >> case, is both key and value combined). In our case, both the RDDs are >> already hash partitioned on the key of x. Can we take advantage of this by >> having a PairRDD/HashPartitioner-aware subtract? Is there a way to use >> mapPartitions() for this? >> >> We tried to broadcast rdd2 and use mapPartitions. But this turns out to >> be memory consuming and inefficient. We tried to do a local set difference >> between rdd1 and the broadcasted rdd2 (in mapPartitions of rdd1). We did >> use destroy() on the broadcasted value, but it does not help. >> >> The current subtract method is slow for us. rdd1 and rdd2 are around >> 700MB each and the subtract takes around 14 seconds. >> >> Any ideas on this issue is highly appreciated. >> >> Regards, >> Raghava. >> > -- Regards, Raghava http://raghavam.github.io