Hi, I have an RDD of pairs of strings like the ones below:
(A,B) (B,C) (C,D) (A,D) (E,F) (B,F)

I need to transform/filter this into an RDD of pairs in which no string appears again once it has been used. So the result would be something like:

(A,B) (C,D) (E,F)

(B,C) is out because B has already been used in (A,B), (A,D) is out because A (and D) have already been used, and so on.

I was thinking of using a shared variable to keep track of what has already been used, but that may only work within a single partition and would not scale to a larger dataset. Is there another efficient way to accomplish this?

--
Thanks & Regards
Himanish
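
P.S. To make the expected result concrete, here is a minimal single-machine sketch in plain Python (no Spark) of the behaviour I am after. The function name filter_unused_pairs is made up for illustration, and I am assuming that the pairs are processed in the order listed and that a string only counts as used once it appears in a pair that was kept. It is not meant as a distributed solution, only as a reference for what the output should look like.

```python
def filter_unused_pairs(pairs):
    """Keep a pair only if neither of its strings is in an already kept pair.

    Hypothetical helper for illustration; runs sequentially on a local list.
    """
    used = set()   # strings that appear in a pair we have kept
    kept = []      # pairs that survive the filter, in input order
    for left, right in pairs:
        if left in used or right in used:
            continue  # one of the strings was already used, so drop the pair
        used.add(left)
        used.add(right)
        kept.append((left, right))
    return kept


pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("E", "F"), ("B", "F")]
print(filter_unused_pairs(pairs))
```

Note that the result depends on the order in which the pairs are visited, which is part of why I am unsure how to do this across partitions.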