Actually, they may not be sequentially generated, and the list (RDD)
could also come from a different component.

For example, from this RDD:

(105,918)
(105,757)
(502,516)
(105,137)
(516,816)
(350,502)

I would like to separate it into two RDDs:

1) (105,918)
   (502,516)

2) (105,757)
   (105,137)
   (516,816)
   (350,502)
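
The split above can be reproduced with a minimal sketch on plain Scala
collections (no Spark needed to run it). The helper name splitGreedy is
hypothetical and only for illustration; it makes one sequential pass in
input order and assumes the values are strings:

```scala
import scala.collection.mutable

// Hypothetical helper (not from the thread): one sequential greedy pass that
// returns (selected pairs, remaining pairs), preserving input order.
def splitGreedy(
    pairs: Seq[(String, String)]
): (Seq[(String, String)], Seq[(String, String)]) = {
  // Strings that already appear in a selected pair.
  val used = mutable.Set.empty[String]
  pairs.partition { case (a, b) =>
    val fresh = !used(a) && !used(b)
    if (fresh) { used += a; used += b }
    fresh
  }
}

// The sample data from the message above.
val sample = Seq(
  ("105", "918"), ("105", "757"), ("502", "516"),
  ("105", "137"), ("516", "816"), ("350", "502")
)

val (selected, rest) = splitGreedy(sample)
```

Here selected holds the pairs of group 1 and rest holds the pairs of
group 2. The outcome depends on the order in which the pairs are visited.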

Right now I am using a mutable Set variable to track the elements already
selected. After coalescing the RDD to a single partition I am doing
something like :

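// Strings that have already been used by a selected pair.
val evalCombinations = collection.mutable.Set.empty[String]

// allCombinations is the RDD after coalescing to a single partition.
val currentValidCombinations = allCombinations.filter { p =>
  if (!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
    // Neither string has been used yet: mark both as used and keep the pair.
    evalCombinations += p._1
    evalCombinations += p._2
    true
  } else {
    false
  }
}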

This approach is limited by the memory of the executor it runs on.
I would appreciate any better, more scalable solution.
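
One possible direction, as a sketch only and not a tested Spark job: run
the same greedy filter inside each partition first, then make one final
pass over the survivors. Plain Scala collections stand in for RDD
partitions here so the example runs without Spark; with an RDD the first
stage would correspond to mapPartitions. The helper name greedy and the way
the six pairs are split into two partitions are assumptions. The result
never repeats a string, but it can differ from, and be smaller than, what a
single sequential pass selects, and the final pass still keeps every
selected string in memory:

```scala
import scala.collection.mutable

// Hypothetical helper: greedy pass that keeps a pair only if neither of its
// strings has been kept before. The set is local to each call.
def greedy(pairs: Iterator[(String, String)]): Iterator[(String, String)] = {
  val used = mutable.Set.empty[String]
  pairs.filter { case (a, b) =>
    val fresh = !used(a) && !used(b)
    if (fresh) { used += a; used += b }
    fresh
  }
}

// Stand-in for RDD partitions (assumed split of the pairs from the thread).
val partitions = Seq(
  Seq(("A", "B"), ("B", "C"), ("C", "D")),
  Seq(("A", "D"), ("E", "F"), ("B", "F"))
)

// Stage 1: greedy filter inside each partition, independently.
val survivors = partitions.flatMap(p => greedy(p.iterator).toList)

// Stage 2: one final greedy pass over the (smaller) set of survivors.
val result = greedy(survivors.iterator).toList
```

The first stage only shrinks the input to the single-partition step; it
does not remove the memory bound on the set of used strings.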

Thanks



On Wed, Mar 25, 2015 at 3:13 PM, Nathan Kronenfeld <
[email protected]> wrote:

> You're generating all possible pairs?
>
> In that case, why not just generate the sequential pairs you want from the
> start?
>
> On Wed, Mar 25, 2015 at 3:11 PM, Himanish Kushary <[email protected]>
> wrote:
>
>> It will only give (A,B). I am generating the pairs from combinations of
>> the strings A,B,C and D, so the pairs (ignoring order) would be
>>
>> (A,B),(A,C),(A,D),(B,C),(B,D),(C,D)
>>
>> On successful filtering using the original condition it will transform to
>> (A,B) and (C,D)
>>
>> On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld <
>> [email protected]> wrote:
>>
>>> What would it do with the following dataset?
>>>
>>> (A, B)
>>> (A, C)
>>> (B, D)
>>>
>>>
>>> On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have an RDD of pairs of strings like below:
>>>>
>>>> (A,B)
>>>> (B,C)
>>>> (C,D)
>>>> (A,D)
>>>> (E,F)
>>>> (B,F)
>>>>
>>>> I need to transform/filter this into an RDD of pairs that does not
>>>> repeat a string once it has been used. So something like:
>>>>
>>>> (A,B)
>>>> (C,D)
>>>> (E,F)
>>>>
>>>> (B,C) is out because B has already been used in (A,B), (A,D) is out
>>>> because A (and D) has been used, etc.
>>>>
>>>> I was thinking of using a shared variable to keep track of
>>>> what has already been used, but that may only work for a single partition
>>>> and would not scale to a larger dataset.
>>>>
>>>> Is there any other efficient way to accomplish this?
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Himanish
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish


