Thank you for the response.
This does not work on the test case that I mentioned in the previous email.
val data1 = Seq((1 -> 2), (1 -> 5), (2 -> 3), (3 -> 20), (3 -> 16))
val data2 = Seq((1 -> 2), (3 -> 30), (3 -> 16), (5 -> 12))
val rdd1 = sc.parallelize(data1, 8)
val rdd2 = sc.parallelize(data2, 8)
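A possible diagnosis, not stated explicitly in the thread: sc.parallelize splits a Seq by position, not by key, and the two datasets have different lengths, so equal pairs such as (3,16) can land at different partition indexes, where a per-partition subtract never sees them together. A sketch of the fix, simulated here without Spark (the partitionBy lines are an assumption, not code from the thread):

```scala
// Fix sketch (assumption): hash-partition both RDDs on key before a
// per-partition subtract, so equal (k, v) pairs share a partition index.
//   val p = new org.apache.spark.HashPartitioner(8)
//   val rdd1 = sc.parallelize(data1, 8).partitionBy(p)
//   val rdd2 = sc.parallelize(data2, 8).partitionBy(p)

// Simulation without Spark: Spark's HashPartitioner assigns a key to
// the non-negative remainder of its hashCode modulo the partition count.
def part(key: Int, n: Int): Int = ((key.hashCode % n) + n) % n

// Per-partition set difference after hash partitioning both sides.
def subtractByPartition(d1: Seq[(Int, Int)], d2: Seq[(Int, Int)], n: Int): Seq[(Int, Int)] =
  (0 until n).flatMap { i =>
    val p2 = d2.filter(kv => part(kv._1, n) == i).toSet
    d1.filter(kv => part(kv._1, n) == i).filterNot(p2)
  }

val data1 = Seq((1, 2), (1, 5), (2, 3), (3, 20), (3, 16))
val data2 = Seq((1, 2), (3, 30), (3, 16), (5, 12))
assert(subtractByPartition(data1, data2, 8) == Seq((1, 5), (2, 3), (3, 20)))
```

With equal keys co-located, the per-partition difference matches rdd1.subtract(rdd2) on this test case.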
As you have the same partitioner and number of partitions, you can probably use
zipPartitions and provide a user-defined function to subtract.
A very primitive example:
val data1 = Seq(1->1,2->2,3->3,4->4,5->5,6->6,7->7)
val data2 = Seq(1->1,2->2,3->3,4->4,5->5,6->6)
val rdd1 =
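The snippet above appears to be cut off; a completed sketch of the idea (hypothetical names) factors the per-partition subtract into a plain function, runnable without Spark, plus the zipPartitions call it would plug into. It is only correct when equal (k, v) pairs occupy the same partition index in both RDDs, which, as the follow-up above notes, plain parallelize does not guarantee.

```scala
// Per-partition subtract: keep rdd1 pairs not present in the
// corresponding rdd2 partition.
def subtractPart(it1: Iterator[(Int, Int)],
                 it2: Iterator[(Int, Int)]): Iterator[(Int, Int)] = {
  val toDrop = it2.toSet    // pairs present in this rdd2 partition
  it1.filterNot(toDrop)     // keep rdd1 pairs not seen in rdd2
}

// In Spark (sketch, not run here):
//   val rdd1 = sc.parallelize(data1)
//   val rdd2 = sc.parallelize(data2)
//   val diff = rdd1.zipPartitions(rdd2)(subtractPart)

// One partition's worth of data:
assert(subtractPart(Iterator(1 -> 1, 2 -> 2), Iterator(1 -> 1)).toList == List(2 -> 2))
```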
We tried that but couldn't figure out a way to efficiently filter it. Let's
take two RDDs.
rdd1:
(1,2)
(1,5)
(2,3)
(3,20)
(3,16)
rdd2:
(1,2)
(3,30)
(3,16)
(5,12)
How can we filter the output of rdd1.leftOuterJoin(rdd2) to get rdd1.subtract(rdd2)? The join gives:
(1,(2,Some(2)))
(1,(5,Some(2)))
(2,(3,None))
(3,(20,Some(30)))
(3,(20,Some(16)))
(3,(16,Some(30)))
(3,(16,Some(16)))
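The join is awkward to filter because a key like 3 pairs each rdd1 value against every rdd2 value. One alternative, sketched here under the assumption that both RDDs keep a common hash partitioner, is cogroup: it yields all values per key from both sides at once, so the subtract rule becomes a simple per-key filter.

```scala
// Per-key rule, runnable without Spark: keep the rdd1 values for a key
// that do not occur among the rdd2 values for that key.
def keepValues(vs: Seq[Int], ws: Seq[Int]): Seq[Int] =
  vs.filterNot(ws.toSet)

// In Spark (sketch): cogroup preserves a common partitioner, so
// co-partitioned inputs need no extra shuffle.
//   val diff = rdd1.cogroup(rdd2).flatMap { case (k, (vs, ws)) =>
//     keepValues(vs.toSeq, ws.toSeq).map(v => (k, v))
//   }

// Per-key behaviour on the data above:
assert(keepValues(Seq(2, 5), Seq(2)) == Seq(5))         // key 1 -> (1,5) survives
assert(keepValues(Seq(3), Seq()) == Seq(3))             // key 2 -> (2,3) survives
assert(keepValues(Seq(20, 16), Seq(30, 16)) == Seq(20)) // key 3 -> (3,20) survives
```

On the example RDDs this reproduces rdd1.subtract(rdd2): (1,5), (2,3), (3,20).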
How about outer join?
On 9 May 2016 13:18, "Raghava Mutharaju" wrote:
> Hello All,
>
> We have two PairRDDs (rdd1, rdd2) which are hash-partitioned on key
> (the number of partitions is the same for both RDDs). We would like to
> subtract rdd2 from rdd1.
Hello All,
We have two PairRDDs (rdd1, rdd2) which are hash-partitioned on key (the
number of partitions is the same for both RDDs). We would like to subtract
rdd2 from rdd1.
The subtract code at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
seems to