Re: Filter data from one RDD based on data from another RDD

2015-02-25 Thread Himanish Kushary
Hello Imran,

Thanks for your response. I noticed the intersection and subtract
methods on RDD. Do they work based on a hash of all the fields in an RDD
record?
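
A small local experiment along these lines should show which it is. My
understanding is that both methods compare whole elements (through equals
and hashCode), so for an RDD of CSV strings the entire line has to match,
while subtractByKey looks only at the key. The data, the object name and the
keyByFirstColumn helper below are made up for illustration, and the local
master is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; needed on Spark 1.2 and earlier
import org.apache.spark.rdd.RDD

object WholeRecordCheck {
  // Hypothetical helper: key each CSV line by its first column.
  def keyByFirstColumn(rdd: RDD[String]): RDD[(String, String)] =
    rdd.map(line => (line.split(",")(0), line))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("whole-record-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up data: both RDDs share the keys k1 and k2, but only the k1 line is identical.
      val a = sc.parallelize(Seq("k1,x,1", "k2,y,2"))
      val b = sc.parallelize(Seq("k1,x,1", "k2,z,9"))

      // Whole-line comparison: only the identical line is common to both.
      println(a.intersection(b).collect().toList) // List(k1,x,1)

      // Whole-line comparison: "k2,y,2" does not appear verbatim in b, so it stays.
      println(a.subtract(b).collect().toList) // List(k2,y,2)

      // Key-only comparison: both keys of a also occur in b, so nothing is left.
      val leftByKey = keyByFirstColumn(a).subtractByKey(keyByFirstColumn(b))
      println(leftByKey.values.collect().toList) // List()
    } finally {
      sc.stop()
    }
  }
}
```

So to match on the first column only, the records would have to be keyed by
that column first.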

- Himanish

On Thu, Feb 19, 2015 at 6:11 PM, Imran Rashid <iras...@cloudera.com> wrote:

> The more scalable alternative is to do a join (or a variant such as cogroup,
> leftOuterJoin or subtractByKey, all found in PairRDDFunctions).
>
> The downside is that this requires a shuffle of both your RDDs.

-- 
Thanks & Regards
Himanish


Filter data from one RDD based on data from another RDD

2015-02-19 Thread Himanish Kushary
Hi,

I have two RDDs with CSV data, as below:

RDD-1

101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d8,791191603
101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb42,19229261643
101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,290542385
101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,922926164

RDD-2

101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d9,7911160
101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,2954238
101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9226164
101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,92292164
101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,9226164

101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb40,929164
101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb39,26164

I need to filter RDD-2 to include only those records where the first column
value in RDD-2 matches any of the first column values in RDD-1.

Currently, I am broadcasting the first column values from RDD-1 as a set
and then filtering RDD-2 based on that set.

val rdd1broadcast = sc.broadcast(rdd1.map { uu => uu.split(",")(0) }.collect().toSet)

val rdd2filtered = rdd2.filter { h => rdd1broadcast.value.contains(h.split(",")(0)) }

This filters the records with first column 101970_5854301838 (the last
two records) out of RDD-2.

Is this the best way to accomplish this? I am worried that for large
data volumes the broadcast step may become an issue. I would appreciate
any other suggestions.

---
Thanks
Himanish


Re: Filter data from one RDD based on data from another RDD

2015-02-19 Thread Imran Rashid
The more scalable alternative is to do a join (or a variant such as cogroup,
leftOuterJoin or subtractByKey, all found in PairRDDFunctions).

The downside is that this requires a shuffle of both your RDDs.
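
A rough sketch of what the join variant could look like, assuming both RDDs
hold CSV lines as in your example. The object name and the
filterByFirstColumn helper are made up for illustration, the local master is
an assumption, and the sample lines are a subset of yours. Taking distinct
keys from RDD-1 first keeps the join from emitting an RDD-2 line once per
matching RDD-1 line.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; needed on Spark 1.2 and earlier
import org.apache.spark.rdd.RDD

object FilterByJoin {
  // Hypothetical helper: keep the lines of rdd2 whose first CSV column also occurs in rdd1.
  def filterByFirstColumn(rdd1: RDD[String], rdd2: RDD[String]): RDD[String] = {
    // Distinct first-column keys of rdd1, so the join cannot duplicate rdd2 lines.
    val keys1 = rdd1.map(_.split(",")(0)).distinct().map(k => (k, ()))
    // rdd2 keyed by its first column, with the full line kept as the value.
    val keyed2 = rdd2.map(line => (line.split(",")(0), line))
    // Inner join keeps only keys present on both sides; this shuffles both RDDs.
    keyed2.join(keys1).map { case (_, (line, _)) => line }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("filter-by-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq(
        "101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645",
        "101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d8,791191603"))
      val rdd2 = sc.parallelize(Seq(
        "101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d9,7911160",
        "101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb40,929164"))
      // Only the 101970_17038953 line of rdd2 has a key that also appears in rdd1.
      filterByFirstColumn(rdd1, rdd2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Nothing is collected to the driver until the final result, so the set of keys
does not have to fit in memory on one machine. The order of the output lines
is not guaranteed after the shuffle.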
