RE: Dedup

2016-01-12 Thread gpmacalalad
h Sean! -Yao -Original Message- From: Sean Owen [mailto:sowen@] Sent: Thursday, October 09, 2014 3:04 AM To: Ge, Yao (Y.) Cc: user@.apache Subject: Re: Dedup I think t

Re: Dedup

2014-10-09 Thread Akhil Das
If you are looking to eliminate duplicate rows (or similar), you can define a key from the data and call reduceByKey on that key. Thanks Best Regards On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal sonalgoy...@gmail.com wrote: What is your data like? Are you looking at exact matching or
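A minimal sketch of the key-then-reduce idea described above, written in plain Python so it runs without a cluster. The `reduce_by_key` helper is a local stand-in that only imitates the merge semantics of Spark's reduceByKey; the sample records and the choice of `email` as the dedup key are assumptions made for illustration, not part of the thread.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: merge all values that share a key."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical input records; "email" is assumed to identify a duplicate.
records = [
    {"id": 1, "email": "a@example.com", "name": "Ann"},
    {"id": 2, "email": "b@example.com", "name": "Bob"},
    {"id": 3, "email": "a@example.com", "name": "Ann"},
]


def dedup_key(record):
    """Derive the dedup key from a record."""
    return record["email"]


# Pair every record with its key, then keep the first record seen per key.
keyed = [(dedup_key(r), r) for r in records]
deduped = reduce_by_key(keyed, lambda first, second: first)
unique_records = list(deduped.values())
```

On an actual PySpark RDD the same shape would presumably be `rdd.map(lambda r: (dedup_key(r), r)).reduceByKey(lambda a, b: a).values()`, with the caveat that which record counts as "first" is not guaranteed across partitions.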

Re: Dedup

2014-10-09 Thread Sean Owen
I think the question is about copying the argument. If it's an immutable value like String, yes just return the first argument and ignore the second. If you're dealing with a notoriously mutable value like a Hadoop Writable, you need to copy the value you return. This works fine although you will
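To illustrate why a mutable value needs copying, here is a small plain-Python sketch. `MutableRecord` is a hypothetical stand-in for a container that a reader reuses between records (the behavior Hadoop Writables are known for); none of these names come from the thread or from Hadoop itself.

```python
import copy


class MutableRecord:
    """Hypothetical stand-in for a reused mutable container such as a Writable."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


def read_pairs(raw):
    """Yield (key, record) pairs while reusing ONE record object for every row."""
    buffer = MutableRecord()
    for key, value in raw:
        buffer.set(value)
        yield key, buffer


def keep_first(pairs, copy_value):
    """Keep the first record per key, optionally copying it before storing."""
    kept = {}
    for key, record in pairs:
        if key not in kept:
            kept[key] = copy.copy(record) if copy_value else record
    return {key: record.value for key, record in kept.items()}


raw = [("a", "first-a"), ("b", "first-b"), ("a", "second-a")]

# Without a copy, every stored reference points at the same reused buffer.
without_copy = keep_first(read_pairs(raw), copy_value=False)

# With a copy, each key keeps the value it had when it was first seen.
with_copy = keep_first(read_pairs(raw), copy_value=True)
```

In this sketch `without_copy` ends up holding the last value read for every key, while `with_copy` preserves the first value per key. For an immutable value such as a string, returning the first argument directly is safe, as the message says.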

RE: Dedup

2014-10-09 Thread Ge, Yao (Y.)
much Sean! -Yao -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, October 09, 2014 3:04 AM To: Ge, Yao (Y.) Cc: user@spark.apache.org Subject: Re: Dedup I think the question is about copying the argument. If it's an immutable value like String, yes just

RE: Dedup

2014-10-09 Thread Sean Owen
, October 09, 2014 3:04 AM To: Ge, Yao (Y.) Cc: user@spark.apache.org Subject: Re: Dedup I think the question is about copying the argument. If it's an immutable value like String, yes just return the first argument and ignore the second. If you're dealing with a notoriously mutable value like

Re: Dedup

2014-10-08 Thread Nicholas Chammas
Multiple values may be different, yet still be considered duplicates depending on how the dedup criteria are selected. Is that correct? Do you care in that case what value you select for a given key? On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote: I need to do deduplication

Re: Dedup

2014-10-08 Thread Flavio Pompermaier
Maybe you could implement something like this (I don't know if something similar already exists in Spark): http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf Best, Flavio On Oct 8, 2014 9:58 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Multiple values may be different, yet

Re: Dedup

2014-10-08 Thread Sonal Goyal
What is your data like? Are you looking at exact matching or are you interested in nearly identical records? Do you need to merge similar records to get a canonical value? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Oct 9, 2014 at 2:31