sowen wrote:

> Arrays are not immutable and do not have the equals semantics you need
> to use them as a key. Use a Scala immutable List.
>
> On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)" <yge@> wrote:
>
>> Yes. I was using String arrays as arguments in the reduceByKey. I think
>> a String array is actually immutable, and simply returning the first
>> argument without cloning it should work. I will look into mapPartitions,
>> as we can have up to 40% duplicates. Will follow up on this if
>> necessary. Thanks very much, Sean!
>>
>> -Yao
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:sowen@]
>> Sent: Thursday, October 09, 2014 3:04 AM
>> To: Ge, Yao (Y.)
>> Cc: user@.apache
>> Subject: Re: Dedup
>>
>> I think the question is about copying the argument. If it's an immutable
>> value like String, yes, just return the first argument and ignore the
>> second. If you're dealing with a notoriously mutable value like a Hadoop
>> Writable, you need to copy the value you return.
>>
>> This works fine, although you will spend a fair bit of time marshaling
>> all of those duplicates together just to discard all but one.
>>
>> If there are lots of duplicates, it would take a bit more work, but
>> would be faster, to do something like this: mapPartitions and retain one
>> input value for each unique dedup criterion, then output those pairs,
>> and then reduceByKey the result.
>>
>> On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) <yge@> wrote:
>>> I need to do deduplication processing in Spark. The current plan is to
>>> generate a tuple where the key is the dedup criteria and the value is
>>> the original input. I am thinking of using reduceByKey to discard
>>> duplicate values. If I do that, can I simply return the first argument,
>>> or should I return a copy of the first argument? Is there a better way
>>> to do dedup in Spark?
>>>
>>> -Yao

Hi, I'm a bit new at Scala/Spark. We are doing data deduplication. So far
I can handle exact matching for 3M lines of data, but I'm in a dilemma
over fuzzy matching using cosine similarity and Jaro-Winkler. My biggest
problem is how to optimize my method for finding matches that score 90%
or above.
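To make the Array-vs-List key point from Sean's replies concrete, here is a minimal Scala sketch. It uses plain collections instead of an RDD so it runs without Spark, and the record strings and the lowercase/split normalization are made up for illustration; on a real RDD the dedup step would just be `pairs.reduceByKey((first, _) => first)`.

```scala
object DedupSketch {
  // Keep the first value seen for each key, preserving input order.
  // On a real RDD this is: pairs.reduceByKey((first, _) => first)
  def dedupFirst[K, V](pairs: Seq[(K, V)]): Seq[V] =
    pairs
      .foldLeft(Vector.empty[(K, V)] -> Set.empty[K]) {
        case ((kept, seen), (k, v)) =>
          if (seen(k)) (kept, seen) else (kept :+ (k -> v), seen + k)
      }
      ._1
      .map(_._2)

  def main(args: Array[String]): Unit = {
    // Arrays compare by reference: equal contents, but different keys.
    println(Array("a") == Array("a"))   // false
    // Immutable Lists compare by value, so they are safe as keys.
    println(List("a") == List("a"))     // true

    // Hypothetical records, keyed by a normalized List of fields.
    val records = Seq("Alice,NY", "ALICE,ny", "Bob,LA")
    val pairs   = records.map(r => (r.toLowerCase.split(",").toList, r))
    println(dedupFirst(pairs))          // Vector(Alice,NY, Bob,LA)
  }
}
```

Because `List` (unlike `Array`) has structural `equals` and `hashCode`, equal-looking keys actually collide in the shuffle, which is exactly what reduceByKey-based dedup relies on.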
I am planning to group first before matching, but this may result in
missing out on some important matches. Can someone help me? Much
appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dedup-tp15967p25951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
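The "group first before matching" idea in the last question is usually called blocking, and the worry is legitimate: records that land in different blocks are never compared. A rough, self-contained Scala sketch of the idea (no Spark; the first-letter blocking key, the bigram-cosine similarity, and the names are illustrative stand-ins — a real pipeline might use a library Jaro-Winkler such as Apache Commons Text's `JaroWinklerSimilarity`, and run several passes with different blocking keys to reduce missed matches):

```scala
object FuzzyBlockSketch {
  // Character-bigram counts of a string, e.g. "abc" -> {ab: 1, bc: 1}.
  def bigrams(s: String): Map[String, Int] =
    s.toLowerCase.sliding(2).toSeq.groupBy(identity).map { case (b, g) => b -> g.size }

  // Cosine similarity over bigram counts: 1.0 for identical strings,
  // near 0.0 for unrelated ones.
  def cosine(a: String, b: String): Double = {
    val (va, vb) = (bigrams(a), bigrams(b))
    val dot  = va.map { case (k, n) => n.toDouble * vb.getOrElse(k, 0) }.sum
    val norm = (m: Map[String, Int]) => math.sqrt(m.values.map(n => n.toDouble * n).sum)
    if (norm(va) == 0 || norm(vb) == 0) 0.0 else dot / (norm(va) * norm(vb))
  }

  // Blocking: only compare records that share a cheap key (here, the
  // first letter), then run the expensive pairwise similarity per block.
  def fuzzyPairs(records: Seq[String], threshold: Double): Seq[(String, String)] =
    records
      .groupBy(_.toLowerCase.headOption.getOrElse(' '))
      .values
      .flatMap { block =>
        for {
          i <- block.indices
          j <- (i + 1) until block.size
          if cosine(block(i), block(j)) >= threshold
        } yield (block(i), block(j))
      }
      .toSeq
}
```

In Spark, the blocking key would become the key of a `groupByKey`, so the quadratic all-pairs comparison only happens within each (hopefully small) block rather than across all 3M records.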