Re: partitioner aware subtract

2016-05-10 Thread Raghava Mutharaju
Thank you for the response. This does not work on the test case that I mentioned in the previous email.

val data1 = Seq((1 -> 2), (1 -> 5), (2 -> 3), (3 -> 20), (3 -> 16))
val data2 = Seq((1 -> 2), (3 -> 30), (3 -> 16), (5 -> 12))
val rdd1 = sc.parallelize(data1, 8)
val rdd2 =

Re: partitioner aware subtract

2016-05-10 Thread Rishi Mishra
Since you have the same partitioner and the same number of partitions, you can probably use zipPartitions and provide a user-defined function to subtract. A very primitive example:

val data1 = Seq(1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7)
val data2 = Seq(1->1, 2->2, 3->3, 4->4, 5->5, 6->6)
val rdd1 =
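A sketch of what this zipPartitions-based subtract could look like, using the rdd1/rdd2 data from earlier in the thread (assumptions: `sc` is an existing SparkContext, and `numParts` is an illustrative partition count; this is not code from the thread itself):

```scala
import org.apache.spark.HashPartitioner

val numParts = 8
val rdd1 = sc.parallelize(Seq(1 -> 2, 1 -> 5, 2 -> 3, 3 -> 20, 3 -> 16))
  .partitionBy(new HashPartitioner(numParts))
val rdd2 = sc.parallelize(Seq(1 -> 2, 3 -> 30, 3 -> 16, 5 -> 12))
  .partitionBy(new HashPartitioner(numParts))

// Because both sides use the same partitioner, any (k, v) pair of rdd2
// that could cancel a pair of rdd1 lives in the matching partition, so
// a per-partition set difference is a correct subtract with no shuffle.
val diff = rdd1.zipPartitions(rdd2, preservesPartitioning = true) {
  (iter1, iter2) =>
    val toRemove = iter2.toSet
    iter1.filterNot(toRemove.contains)
}
// diff should contain (1,5), (2,3), (3,20): the pairs of rdd1
// that do not also appear in rdd2.
```

Note that `toSet` materializes one partition of rdd2 in memory at a time, which is usually acceptable when partitions are small relative to executor memory.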

Re: partitioner aware subtract

2016-05-09 Thread Raghava Mutharaju
We tried that but couldn't figure out a way to efficiently filter it. Let's take two RDDs.

rdd1: (1,2) (1,5) (2,3) (3,20) (3,16)
rdd2: (1,2) (3,30) (3,16) (5,12)

Doing rdd1.leftOuterJoin(rdd2), when what we want is rdd1.subtract(rdd2), gives: (1,(2,Some(2))) (1,(5,Some(2))) (2,(3,None)) (3,(20,Some(30)))
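One way to finish this filtering (a sketch, not from the thread): after a leftOuterJoin, (1,5) still pairs with Some(2), so a simple "is None" filter computes subtractByKey rather than subtract. A cogroup keeps all of rdd2's values for a key together, so each left value can be tested against the full set; with matching partitioners on both sides, cogroup is also a narrow dependency (no shuffle):

```scala
// Sketch: partitioner-friendly subtract via cogroup instead of
// leftOuterJoin, using the rdd1/rdd2 pairs shown above.
val diff = rdd1.cogroup(rdd2).flatMap { case (k, (leftVals, rightVals)) =>
  val toRemove = rightVals.toSet
  leftVals.filterNot(toRemove.contains).map(v => (k, v))
}
// For the data above, diff contains (1,5), (2,3), (3,20).
```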

Re: partitioner aware subtract

2016-05-09 Thread ayan guha
How about outer join?

On 9 May 2016 13:18, "Raghava Mutharaju" wrote:
> Hello All,
>
> We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key
> (number of partitions are same for both the RDDs). We would like to
> subtract rdd2 from rdd1.
>
> The subtract

partitioner aware subtract

2016-05-08 Thread Raghava Mutharaju
Hello All,

We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key (the number of partitions is the same for both RDDs). We would like to subtract rdd2 from rdd1.

The subtract code at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala seems to
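For context (an observation, not part of the thread): RDD.subtract also has overloads taking a target partition count or an explicit Partitioner, so the result can at least be produced with the original partitioning. A sketch, assuming `sc` is an existing SparkContext:

```scala
import org.apache.spark.HashPartitioner

val part = new HashPartitioner(8)
val rdd1 = sc.parallelize(Seq(1 -> 2, 1 -> 5, 2 -> 3)).partitionBy(part)
val rdd2 = sc.parallelize(Seq(1 -> 2, 5 -> 12)).partitionBy(part)

// Passing the partitioner keeps the output partitioned the same way,
// though subtract itself still wraps elements internally and may
// shuffle, which is what motivates the zipPartitions/cogroup ideas
// discussed above.
val diff = rdd1.subtract(rdd2, part)
```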