Yes. I was using String arrays as the arguments in reduceByKey. I think the String 
arrays are effectively immutable in this usage, so simply returning the first 
argument without cloning it should work. I will look into mapPartitions, as we can 
have up to 40% duplicates. Will follow up on this if necessary. Thanks very much, Sean!

-Yao  

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, October 09, 2014 3:04 AM
To: Ge, Yao (Y.)
Cc: user@spark.apache.org
Subject: Re: Dedup

I think the question is about copying the argument. If it's an immutable value 
like String, yes, just return the first argument and ignore the second. If 
you're dealing with a notoriously mutable value like a Hadoop Writable, you 
need to copy the value you return.
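For example, a quick sketch in Scala (the RDD names stringPairs and writablePairs
here are just placeholders):

    import org.apache.hadoop.io.Text

    // Immutable values (e.g. String): returning the first argument as-is is safe.
    val dedupedStrings = stringPairs.reduceByKey((a, b) => a)

    // Mutable values (e.g. Text): copy before returning, since Hadoop may reuse
    // the same Writable instance across records.
    val dedupedWritables = writablePairs.reduceByKey((a, b) => new Text(a))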

This works fine although you will spend a fair bit of time marshaling all of 
those duplicates together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but be faster, to 
do something like this: mapPartitions over the input, retain one value per unique 
dedup criterion within each partition, output those pairs, and then reduceByKey 
the result.
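A rough sketch of that (assuming the input is pairs: RDD[(String, Array[String])]
keyed by the dedup criteria; the names are placeholders):

    // Dedup within each partition first so fewer duplicate records get shuffled,
    // then reduceByKey to resolve duplicates that span partitions.
    val deduped = pairs
      .mapPartitions { iter =>
        val seen = scala.collection.mutable.LinkedHashMap.empty[String, Array[String]]
        iter.foreach { case (k, v) => if (!seen.contains(k)) seen(k) = v }
        seen.iterator
      }
      .reduceByKey((a, b) => a)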

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) <y...@ford.com> wrote:
> I need to do deduplication processing in Spark. The current plan is to 
> generate tuples where the key is the dedup criteria and the value is the 
> original input. I am thinking of using reduceByKey to discard duplicate 
> values. If I do that, can I simply return the first argument, or should 
> I return a copy of the first argument? Is there a better way to do dedup 
> in Spark?
>
>
>
> -Yao
