Re: Filter data from one RDD based on data from another RDD

2015-02-25 Thread Himanish Kushary
Hello Imran, Thanks for your response. I noticed the intersection and subtract methods for an RDD. Do they work based on a hash of all the fields in an RDD record? - Himanish On Thu, Feb 19, 2015 at 6:11 PM, Imran Rashid iras...@cloudera.com wrote: the more scalable alternative is to do a
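For reference on the question above: as far as I know, RDD.subtract and RDD.intersection compare whole elements, so two records match only if every field is equal. The sketch below models that behaviour in plain Python (no Spark involved). The helper functions and the `other` data set are hypothetical and exist only for illustration.

```python
# Plain-Python model of whole-record comparison; this is not Spark code.

def subtract(left, right):
    """Keep every record of `left` that does not appear in `right`."""
    right_set = set(right)
    return [rec for rec in left if rec not in right_set]

def intersection(left, right):
    """Return the distinct records present in both inputs."""
    return sorted(set(left) & set(right))

records = [
    "101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643",
    "101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645",
    "101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647",
]
# Hypothetical second data set: an exact copy of the second record only.
other = [records[1]]

remaining = subtract(records, other)
common = intersection(records, other)
```

Note that the third record shares its first field with the second one but is still kept by `subtract`, because the comparison covers the whole record rather than a single field.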

Filter data from one RDD based on data from another RDD

2015-02-19 Thread Himanish Kushary
Hi, I have two RDDs with CSV data as below: RDD-1 101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643 101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645 101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647

Re: Filter data from one RDD based on data from another RDD

2015-02-19 Thread Imran Rashid
The more scalable alternative is to do a join (or a variant like cogroup, leftOuterJoin, subtractByKey, etc., found in PairRDDFunctions). The downside is that this requires a shuffle of both your RDDs. On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have two RDD's
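A minimal sketch of the key-based approach suggested above, written as a plain-Python model so it runs without Spark. It assumes the first CSV field is the key to match on, which the thread does not confirm. The helper functions mimic what keyBy, subtractByKey and join do on pair RDDs, and the second data set is hypothetical because the original RDD-2 is not shown here.

```python
# Plain-Python model of key-based filtering; this is not Spark code.
# Assumption: the first CSV field is the key shared by both data sets.

def key_by_first_field(lines):
    """Turn CSV lines into (key, line) pairs."""
    return [(line.split(",", 1)[0], line) for line in lines]

def subtract_by_key(left, right):
    """Keep pairs from `left` whose key does not occur in `right`."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]

def join(left, right):
    """Inner join: one (key, (left_value, right_value)) per matching pair."""
    by_key = {}
    for key, value in right:
        by_key.setdefault(key, []).append(value)
    return [(key, (value, other))
            for key, value in left
            for other in by_key.get(key, [])]

rdd1 = key_by_first_field([
    "101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643",
    "101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645",
    "101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647",
])
# Hypothetical second data set with a single key.
rdd2 = key_by_first_field(["101970_5854301839,example-value"])

only_in_rdd1 = subtract_by_key(rdd1, rdd2)  # rows whose key is absent from rdd2
matched = join(rdd1, rdd2)                  # rows whose key is present in rdd2
```

In Spark itself, both data sets would be repartitioned by key before matching, which is the shuffle cost mentioned above.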