Hello Imran,

Thanks for your response. I noticed the "intersection" and "subtract"
methods on an RDD. Do they work based on a hash of all the fields in an RDD
record?
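
For reference, here is how I read the join suggestion quoted below: a
minimal sketch, assuming rdd1 and rdd2 are RDD[String] holding the CSV
lines from my original mail. The names FilterByKeyJoin, firstColumn and
filterByFirstColumn are placeholders made up for illustration, and the
sample data in main is invented.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark < 1.3
import org.apache.spark.rdd.RDD

object FilterByKeyJoin {
  // Hypothetical helper: the first CSV column is treated as the key.
  def firstColumn(line: String): String = line.split(",")(0)

  // Keep only the rdd2 lines whose first column also occurs in rdd1.
  def filterByFirstColumn(rdd1: RDD[String], rdd2: RDD[String]): RDD[String] = {
    // Distinct keys from rdd1, so a key repeated in rdd1 does not duplicate rdd2 rows.
    val keys: RDD[(String, Unit)] = rdd1.map(line => (firstColumn(line), ())).distinct()
    // Key rdd2 by the same column, keeping the whole line as the value.
    val keyed: RDD[(String, String)] = rdd2.map(line => (firstColumn(line), line))
    // The inner join drops rdd2 lines with no matching key; both sides are shuffled.
    keyed.join(keys).map { case (_, (line, _)) => line }
  }

  def main(args: Array[String]): Unit = {
    // Local master and toy data, for trying the sketch out only.
    val sc = new SparkContext(
      new SparkConf().setAppName("filter-by-key").setMaster("local[*]"))
    val rdd1 = sc.parallelize(Seq("a,1", "b,2", "a,3"))
    val rdd2 = sc.parallelize(Seq("a,x", "c,y", "b,z"))
    filterByFirstColumn(rdd1, rdd2).collect().foreach(println)
    sc.stop()
  }
}
```

This avoids collecting the keys to the driver, at the cost of the shuffle
of both RDDs that you mentioned.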

- Himanish

On Thu, Feb 19, 2015 at 6:11 PM, Imran Rashid <iras...@cloudera.com> wrote:

> the more scalable alternative is to do a join (or a variant like cogroup,
> leftOuterJoin, subtractByKey etc. found in PairRDDFunctions)
>
> the downside is this requires a shuffle of both your RDDs
>
> On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary <himan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have two RDDs with CSV data as below:
>>
>> RDD-1
>>
>> 101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
>> 101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
>> 101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
>> 101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d8,791191603
>> 101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb42,19229261643
>> 101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,290542385
>> 101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,922926164
>>
>> RDD-2
>>
>> 101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d9,7911160
>> 101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,2954238
>> 101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9226164
>> 101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,92292164
>> 101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,9226164
>>
>> 101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb40,929164
>> 101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb39,26164
>>
>> I need to filter RDD-2 to include only those records where the first
>> column value in RDD-2 matches any of the first column values in RDD-1
>>
>> Currently, I am broadcasting the first column values from RDD-1 as a
>> set and then filtering RDD-2 based on that set.
>>
>> val rdd1broadcast = sc.broadcast(
>>   rdd1.map { uu => uu.split(",")(0) }.collect().toSet)
>>
>> val rdd2filtered = rdd2.filter { h =>
>>   rdd1broadcast.value.contains(h.split(",")(0)) }
>>
>> As a result, the records whose first column is "101970_5854301838" (the
>> last two records) are filtered out of RDD-2.
>>
>> Is this the best way to accomplish this? I am worried that for large
>> data volumes the broadcast step may become an issue. I would appreciate
>> any other suggestions.
>>
>> -----------
>> Thanks
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish
