If you are looking to eliminate duplicate rows (or similar), you can
define a key from the data and call reduceByKey on that key.
Thanks
Best Regards
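
As a rough illustration of the key-then-reduceByKey idea above, here is a minimal sketch in plain Python. The reduce_by_key helper is a local stand-in written for this example (it is not the Spark API), and the sample records are made up. For exact-duplicate rows the whole record serves as the key, and the reduce function just keeps the first value it sees.

```python
from typing import Callable, Hashable, Iterable, List, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def reduce_by_key(pairs: Iterable[Tuple[K, V]],
                  func: Callable[[V, V], V]) -> List[Tuple[K, V]]:
    """Local stand-in for reduceByKey: fold all values sharing a key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


# Hypothetical records: (name, email, city).
records = [
    ("Ann", "ann@example.com", "Detroit"),
    ("Bob", "bob@example.com", "Dearborn"),
    ("Ann", "ann@example.com", "Detroit"),
]

# For exact duplicates, the record itself is the key.
keyed = [(record, record) for record in records]

# Keep the first value for each key and ignore the second.
deduped = [value for _, value in reduce_by_key(keyed, lambda first, second: first)]
```

On an actual RDD, the same shape would presumably be rdd.map(lambda r: (r, r)).reduceByKey(lambda a, b: a).values(), with the key function swapped for whatever defines a duplicate in your data.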
On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
What is your data like? Are you looking at exact matching or
I think the question is about copying the argument. If it's an
immutable value like String, yes just return the first argument and
ignore the second. If you're dealing with a notoriously mutable value
like a Hadoop Writable, you need to copy the value you return.
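
To make the copying point concrete, here is a small sketch in plain Python. MutableText and read_records are hypothetical stand-ins for a reused mutable container (such as a Hadoop Writable) and a reader that recycles one object per record; neither is a real Hadoop or Spark class. It only illustrates why holding on to a reference without copying can go wrong.

```python
import copy


class MutableText:
    """Hypothetical stand-in for a mutable, reused container such as a Writable."""

    def __init__(self, value=""):
        self.value = value

    def set(self, value):
        self.value = value


def read_records(raw):
    """Yield (key, holder) pairs while reusing ONE holder object for every record."""
    holder = MutableText()
    for key, text in raw:
        holder.set(text)
        yield key, holder


def keep_first(pairs, copy_value):
    """Keep the first value per key, optionally copying it before storing it."""
    kept = {}
    for key, value in pairs:
        if key not in kept:
            kept[key] = copy.copy(value) if copy_value else value
    return {key: value.value for key, value in kept.items()}


raw = [("a", "first-a"), ("b", "first-b"), ("a", "second-a")]

# Without a copy, every stored entry aliases the same reused object,
# so all of them end up showing whatever was written last.
unsafe = keep_first(read_records(raw), copy_value=False)

# With a copy, each stored entry is an independent snapshot.
safe = keep_first(read_records(raw), copy_value=True)
```

With an immutable value such as a string there is nothing to alias, which is why returning the first argument directly is fine in that case.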
This works fine although you will
much Sean!
-Yao
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, October 09, 2014 3:04 AM
To: Ge, Yao (Y.)
Cc: user@spark.apache.org
Subject: Re: Dedup
I think the question is about copying the argument. If it's an immutable value
like String, yes just
Multiple values may be different, yet still be considered duplicates,
depending on how the dedup criteria are selected. Is that correct? Do you
care in that case what value you select for a given key?
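
If it does matter which value survives, the choice can live in the reduce function. Below is a minimal sketch in plain Python under made-up assumptions: records count as duplicates when their lower-cased email matches, and the record with the largest "updated" field wins. The field names, dedup_key, and pick_latest are all hypothetical.

```python
# Hypothetical records; the first two differ but share a dedup key.
records = [
    {"email": "Ann@Example.com", "city": "Detroit", "updated": 1},
    {"email": "ann@example.com", "city": "Dearborn", "updated": 3},
    {"email": "bob@example.com", "city": "Chicago", "updated": 2},
]


def dedup_key(record):
    """Assumed dedup criterion: case-insensitive email."""
    return record["email"].lower()


def pick_latest(a, b):
    """Assumed selection rule: keep the most recently updated record."""
    return a if a["updated"] >= b["updated"] else b


chosen = {}
for record in records:
    key = dedup_key(record)
    chosen[key] = pick_latest(chosen[key], record) if key in chosen else record
```

A function like pick_latest is associative and commutative apart from ties, which is what a reduceByKey-style merge needs in order to give the same answer regardless of the order in which values are combined.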
On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
I need to do deduplication
Maybe you could implement something like this (I don't know if something
similar already exists in Spark):
http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf
Best,
Flavio
On Oct 8, 2014 9:58 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
Multiple values may be different, yet
What is your data like? Are you looking at exact matching or are you
interested in nearly same records? Do you need to merge similar records to
get a canonical value?
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Thu, Oct 9, 2014 at 2:31