It will only give (A,B). I am generating the pairs from combinations of the
strings A, B, C and D, so the pairs (ignoring order) would be

(A,B), (A,C), (A,D), (B,C), (B,D), (C,D)

After filtering with the original condition, this becomes (A,B) and (C,D).

On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld <[email protected]> wrote:

> What would it do with the following dataset?
>
> (A, B)
> (A, C)
> (B, D)
>
>
> On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have an RDD of pairs of strings like below:
>>
>> (A,B)
>> (B,C)
>> (C,D)
>> (A,D)
>> (E,F)
>> (B,F)
>>
>> I need to transform/filter this into an RDD of pairs that does not repeat
>> a string once it has been used. So something like:
>>
>> (A,B)
>> (C,D)
>> (E,F)
>>
>> (B,C) is out because B has already been used in (A,B), (A,D) is out
>> because A (and D) have been used, etc.
>>
>> I was thinking of using a shared variable to keep track of what has
>> already been used, but that may only work for a single partition and
>> would not scale to a larger dataset.
>>
>> Is there any other efficient way to accomplish this?
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>

--
Thanks & Regards
Himanish
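
P.S. To make the filtering rule concrete, here is a minimal single-machine sketch in plain Python. It is not a Spark solution and does not address the partitioning concern raised above. The function name filter_unused_pairs is hypothetical, and the sketch assumes the pairs are processed one at a time in the order given, so the result depends on that order.

```python
from typing import Hashable, Iterable, List, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def filter_unused_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep a pair only if neither of its strings appeared in an earlier kept pair.

    Hypothetical helper for illustration: a sequential, order-dependent
    greedy pass, assuming all pairs fit on one machine.
    """
    used: Set[Hashable] = set()   # strings consumed by pairs kept so far
    kept: List[Pair] = []
    for left, right in pairs:
        if left in used or right in used:
            continue              # one side was already used, so drop the pair
        kept.append((left, right))
        used.update((left, right))
    return kept


# The three datasets discussed in this thread.
combinations = [("A", "B"), ("A", "C"), ("A", "D"),
                ("B", "C"), ("B", "D"), ("C", "D")]
nathan = [("A", "B"), ("A", "C"), ("B", "D")]
original = [("A", "B"), ("B", "C"), ("C", "D"),
            ("A", "D"), ("E", "F"), ("B", "F")]

print(filter_unused_pairs(combinations))
print(filter_unused_pairs(nathan))
print(filter_unused_pairs(original))
```

Because the pass is greedy, feeding the same pairs in a different order can keep a different set of pairs, which is part of why a per-partition shared variable would not give a consistent answer.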
