sowen wrote:

> Arrays are not immutable and do not have the equals semantics you need
> to use them as a key. Use a Scala immutable List.
>
> On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)" <yge@> wrote:
>
>> Yes. I was using String arrays as arguments in the reduceByKey. I think
>> a String array is actually immutable, and simply returning the first
>> argument without cloning it should work. I will look into mapPartitions,
>> as we can have up to 40% duplicates. Will follow up on this if
>> necessary. Thanks very much, Sean!
>>
>> -Yao
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:sowen@]
>> Sent: Thursday, October 09, 2014 3:04 AM
>> To: Ge, Yao (Y.)
>> Cc: user@.apache
>> Subject: Re: Dedup
>>
>> I think the question is about copying the argument. If it's an immutable
>> value like String, yes, just return the first argument and ignore the
>> second. If you're dealing with a notoriously mutable value like a Hadoop
>> Writable, you need to copy the value you return.
>>
>> This works fine, although you will spend a fair bit of time marshaling
>> all of those duplicates together just to discard all but one.
>>
>> If there are lots of duplicates, it would take a bit more work, but
>> would be faster, to do something like this: mapPartitions and retain one
>> input value for each unique dedup criterion, then output those pairs,
>> and then reduceByKey the result.
>>
>> On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) <yge@> wrote:
>>> I need to do deduplication processing in Spark. The current plan is to
>>> generate a tuple where the key is the dedup criteria and the value is
>>> the original input. I am thinking of using reduceByKey to discard
>>> duplicate values. If I do that, can I simply return the first argument,
>>> or should I return a copy of the first argument? Is there a better way
>>> to do dedup in Spark?
>>>
>>> -Yao

Hi, I'm a bit new at Scala/Spark. We are doing data deduplication. So far
I can handle exact matching for 3M lines of data, but I'm in a dilemma
over fuzzy matching using cosine similarity and Jaro-Winkler. My biggest
problem is how to optimize my method for finding matches that score 90%
or above.
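To make the Array-vs-List key point from Sean's replies concrete, here is a minimal Scala sketch. It uses plain collections instead of an RDD so it runs without Spark, and the record strings and the lowercase/split normalization are made up for illustration; on a real RDD the dedup step would just be `pairs.reduceByKey((first, _) => first)`.

```scala
object DedupSketch {
  // Keep the first value seen for each key, preserving input order.
  // On a real RDD this is: pairs.reduceByKey((first, _) => first)
  def dedupFirst[K, V](pairs: Seq[(K, V)]): Seq[V] =
    pairs
      .foldLeft(Vector.empty[(K, V)] -> Set.empty[K]) {
        case ((kept, seen), (k, v)) =>
          if (seen(k)) (kept, seen) else (kept :+ (k -> v), seen + k)
      }
      ._1
      .map(_._2)

  def main(args: Array[String]): Unit = {
    // Arrays compare by reference: equal contents, but different keys.
    println(Array("a") == Array("a"))   // false
    // Immutable Lists compare by value, so they are safe as keys.
    println(List("a") == List("a"))     // true

    // Hypothetical records, keyed by a normalized List of fields.
    val records = Seq("Alice,NY", "ALICE,ny", "Bob,LA")
    val pairs   = records.map(r => (r.toLowerCase.split(",").toList, r))
    println(dedupFirst(pairs))          // Vector(Alice,NY, Bob,LA)
  }
}
```

Because `List` (unlike `Array`) has structural `equals` and `hashCode`, equal-looking keys actually collide in the shuffle, which is exactly what reduceByKey-based dedup relies on.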
I am planning to group first before matching, but this may result in
missing out on some important matches. Can someone help me? Much
appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dedup-tp15967p25951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
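The "group first before matching" idea in the last question is usually called blocking, and the worry is legitimate: records that land in different blocks are never compared. A rough, self-contained Scala sketch of the idea (no Spark; the first-letter blocking key, the bigram-cosine similarity, and the names are illustrative stand-ins — a real pipeline might use a library Jaro-Winkler such as Apache Commons Text's `JaroWinklerSimilarity`, and run several passes with different blocking keys to reduce missed matches):

```scala
object FuzzyBlockSketch {
  // Character-bigram counts of a string, e.g. "abc" -> {ab: 1, bc: 1}.
  def bigrams(s: String): Map[String, Int] =
    s.toLowerCase.sliding(2).toSeq.groupBy(identity).map { case (b, g) => b -> g.size }

  // Cosine similarity over bigram counts: 1.0 for identical strings,
  // near 0.0 for unrelated ones.
  def cosine(a: String, b: String): Double = {
    val (va, vb) = (bigrams(a), bigrams(b))
    val dot  = va.map { case (k, n) => n.toDouble * vb.getOrElse(k, 0) }.sum
    val norm = (m: Map[String, Int]) => math.sqrt(m.values.map(n => n.toDouble * n).sum)
    if (norm(va) == 0 || norm(vb) == 0) 0.0 else dot / (norm(va) * norm(vb))
  }

  // Blocking: only compare records that share a cheap key (here, the
  // first letter), then run the expensive pairwise similarity per block.
  def fuzzyPairs(records: Seq[String], threshold: Double): Seq[(String, String)] =
    records
      .groupBy(_.toLowerCase.headOption.getOrElse(' '))
      .values
      .flatMap { block =>
        for {
          i <- block.indices
          j <- (i + 1) until block.size
          if cosine(block(i), block(j)) >= threshold
        } yield (block(i), block(j))
      }
      .toSeq
}
```

In Spark, the blocking key would become the key of a `groupByKey`, so the quadratic all-pairs comparison only happens within each (hopefully small) block rather than across all 3M records.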