If you are looking to eliminate duplicate rows (or rows that count as duplicates under some criterion), you can derive a key from the data and then call reduceByKey on that key.
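Concretely, here is a minimal sketch of that key-and-reduce pattern in plain Python (no Spark needed), so the semantics are easy to follow; the record fields and the choice of `email` as the dedup key are made up for illustration:

```python
# Hypothetical records; two share the same email and count as duplicates.
records = [
    {"name": "Alice", "email": "a@x.com", "source": "crm"},
    {"name": "alice", "email": "a@x.com", "source": "web"},  # duplicate by email
    {"name": "Bob",   "email": "b@x.com", "source": "crm"},
]

# 1. Key each record by the dedup criterion (email here, an assumption).
keyed = [(r["email"], r) for r in records]

# 2. reduceByKey semantics: combine the values for each key pairwise.
#    Returning the first argument unchanged is fine for dedup -- no copy
#    is needed, because the reduce function never mutates its arguments.
def keep_first(a, b):
    return a

deduped = {}
for k, v in keyed:
    deduped[k] = keep_first(deduped[k], v) if k in deduped else v

print(sorted(deduped))  # ['a@x.com', 'b@x.com']
```

In Spark itself this would look like `rdd.map(lambda r: (dedup_key(r), r)).reduceByKey(lambda a, b: a)`, where `dedup_key` is whatever function you define to extract the criterion. One caveat: reduceByKey gives no ordering guarantee across partitions, so "keep the first argument" means "keep an arbitrary one of the duplicates", which is fine only if you don't care which representative survives.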
Thanks
Best Regards

On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> What is your data like? Are you looking at exact matching, or are you
> interested in nearly identical records? Do you need to merge similar
> records to get a canonical value?
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
>> Maybe you could implement something like this (I don't know if something
>> similar already exists in Spark):
>>
>> http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
>>
>> Best,
>> Flavio
>>
>> On Oct 8, 2014 9:58 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:
>>
>>> Multiple values may be different yet still be considered duplicates,
>>> depending on how the dedup criteria are selected. Is that correct? Do you
>>> care in that case which value you select for a given key?
>>>
>>> On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
>>>
>>>> I need to do deduplication processing in Spark. The current plan is
>>>> to generate a tuple where the key is the dedup criterion and the value
>>>> is the original input. I am thinking of using reduceByKey to discard
>>>> duplicate values. If I do that, can I simply return the first argument,
>>>> or should I return a copy of the first argument? Is there a better way
>>>> to do dedup in Spark?
>>>>
>>>> -Yao