RE: Dedup
Hi, I'm a bit new at Scala/Spark. We are doing data deduplication; so far I can handle exact matching for 3M lines of data, but I'm in a dilemma over fuzzy matching using cosine and Jaro-Winkler similarity. My biggest problem is how to optimize my method for finding matches that score 90% and above. I am planning to group the records first before matching, but this may result in missing out some important matches. Can someone help me? Much appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dedup-tp15967p25951.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
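A common way to cut down the all-pairs comparison for fuzzy matching is blocking: derive a cheap blocking key, group records by it, and run the expensive similarity only within each block (accepting the trade-off mentioned above, that cross-block matches are lost). The sketch below uses plain Scala collections in place of RDDs, and a character-bigram Jaccard similarity as a stand-in for Jaro-Winkler or cosine; `blockKey` and the 0.75 threshold are illustrative assumptions, not recommendations, and the right threshold depends on the measure used.

```scala
// Blocking sketch for fuzzy dedup. Plain collections stand in for RDDs;
// bigram Jaccard stands in for Jaro-Winkler/cosine similarity.

// Cheap blocking key: lowercased first character plus a length bucket.
def blockKey(s: String): String =
  s"${s.toLowerCase.headOption.getOrElse('_')}-${s.length / 3}"

// Character-bigram Jaccard similarity in [0, 1].
def bigrams(s: String): Set[String] = s.toLowerCase.sliding(2).toSet
def similarity(a: String, b: String): Double = {
  val (x, y) = (bigrams(a), bigrams(b))
  if (x.isEmpty && y.isEmpty) 1.0
  else (x intersect y).size.toDouble / (x union y).size
}

val names = List("Jonathan Smith", "Jonathon Smith", "Mary Jones")

// Compare pairs only within a block, never across blocks.
val matches = names.groupBy(blockKey).values.flatMap { block =>
  for {
    a <- block
    b <- block
    if a < b && similarity(a, b) >= 0.75
  } yield (a, b)
}.toList
```

With Spark this shape becomes something like `rdd.map(r => (blockKey(r), r)).groupByKey()` followed by within-block comparison, which keeps the number of comparisons near-linear as long as blocks stay small.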
Re: Dedup
If you are looking to eliminate duplicate rows (or similar), then you can define a key from the data and do reduceByKey on that key.

Thanks
Best Regards

On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
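The recipe above (derive a key from the fields that define "sameness", pair each record with it, then keep one value per key) can be sketched as follows. Plain Scala collections stand in for an RDD here, and `Row` and `dedupKey` are made-up illustrations; with Spark the last step would be `rows.map(r => (dedupKey(r), r)).reduceByKey((a, _) => a).values`.

```scala
// Sketch: dedup by a derived key, keeping the first value seen per key.

case class Row(name: String, email: String, score: Int)

// The dedup criterion: normalize the fields that define "same record".
def dedupKey(r: Row): (String, String) =
  (r.name.trim.toLowerCase, r.email.trim.toLowerCase)

val rows = List(
  Row("Ann Lee", "ann@example.com", 10),
  Row("ann lee ", "ANN@example.com", 20), // duplicate under the key
  Row("Bob Roy", "bob@example.com", 30)
)

// foldLeft keeps the first occurrence per key, like reduceByKey((a, _) => a).
val deduped = rows
  .foldLeft(Map.empty[(String, String), Row]) { (acc, r) =>
    val k = dedupKey(r)
    if (acc.contains(k)) acc else acc + (k -> r)
  }
  .values.toList
```

Note that in a real reduceByKey the merge order is not deterministic, so "first" just means an arbitrary surviving representative unless the reduce function itself picks deterministically.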
Re: Dedup
I think the question is about copying the argument. If it's an immutable value like String, yes, just return the first argument and ignore the second. If you're dealing with a notoriously mutable value like a Hadoop Writable, you need to copy the value you return.

This works fine, although you will spend a fair bit of time marshaling all of those duplicates together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but would be faster, to do something like this: mapPartitions and retain one input value for each unique dedup criterion, then output those pairs, and then reduceByKey the result.

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) y...@ford.com wrote:
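The two-stage suggestion above (pre-deduplicate inside each partition with mapPartitions, then reduceByKey across partitions) can be sketched like this. Plain collections simulate the partitions, since no live SparkContext is assumed; with Spark the first stage would be a mapPartitions call emitting (key, value) pairs.

```scala
// Sketch: dedup within each partition first, then reduce across partitions.

def dedupKey(s: String): String = s.toLowerCase

// Keep one value per key within a partition (first wins).
def dedupPartition(part: List[String]): List[(String, String)] =
  part.foldLeft(Vector.empty[(String, String)]) { (acc, v) =>
    val k = dedupKey(v)
    if (acc.exists(_._1 == k)) acc else acc :+ (k -> v)
  }.toList

val partitions = List(
  List("Apple", "apple", "Pear"), // partition 0
  List("APPLE", "Plum", "plum")   // partition 1
)

// Per-partition dedup shrinks what the shuffle must move...
val preDeduped = partitions.flatMap(dedupPartition)

// ...then a final pass keeps one value per key globally,
// like reduceByKey((a, _) => a).
val deduped = preDeduped.foldLeft(Map.empty[String, String]) {
  case (acc, (k, v)) => if (acc.contains(k)) acc else acc + (k -> v)
}
```

The point of the first stage is that with many duplicates (40% in this thread), much of the data is dropped before the shuffle instead of being marshaled across the network and discarded afterwards.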
RE: Dedup
Yes, I was using a String array as the arguments in reduceByKey. I think a String array is actually immutable, and simply returning the first argument without cloning it should work. I will look into mapPartitions, as we can have up to 40% duplicates. Will follow up on this if necessary. Thanks very much, Sean!

-Yao

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, October 09, 2014 3:04 AM
To: Ge, Yao (Y.)
Cc: user@spark.apache.org
Subject: Re: Dedup
RE: Dedup
Arrays are not immutable, and they do not have the equals semantics you would want in order to use them as a key. Use a Scala immutable List.

On Oct 9, 2014 12:32 PM, Ge, Yao (Y.) y...@ford.com wrote:
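The point about array keys can be seen directly: Scala's Array is a JVM array, so its == and hashCode use object identity rather than element values, which breaks any key-based grouping. A small demonstration with plain collections (the same applies to RDD keys):

```scala
// Arrays compare by reference identity, so equal contents != equal keys.
val a1 = Array("a", "b")
val a2 = Array("a", "b")

val equalArrays = a1.equals(a2)           // false: different objects
val equalLists  = a1.toList == a2.toList  // true: structural equality

// Grouping by the Array key leaves the "duplicates" unmerged...
val byArray = List(a1 -> 1, a2 -> 2).groupBy(_._1).size
// ...while converting to an immutable List first merges them.
val byList = List(a1 -> 1, a2 -> 2).groupBy(_._1.toList).size
```

This is why keying an RDD by `Array[String]` silently fails to collapse duplicates, while keying by `arr.toList` works.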
Re: Dedup
Multiple values may be different, yet still be considered duplicates, depending on how the dedup criteria are selected. Is that correct? Do you care in that case which value you select for a given key?

On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote:

> I need to do deduplication processing in Spark. The current plan is to generate a tuple where the key is the dedup criteria and the value is the original input. I am thinking of using reduceByKey to discard duplicate values. If I do that, can I simply return the first argument, or should I return a copy of the first argument? Is there a better way to do dedup in Spark?
>
> -Yao
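When differing values share a dedup key, the reduce function decides which one survives, so it pays to make that choice deterministic. A sketch with collections in place of an RDD (`pick` and the sample data are invented): a rule such as "keep the longest value, tie-break lexicographically" is commutative and associative, which reduceByKey requires.

```scala
// A deterministic choice of surviving value, rather than "whichever came
// first" (reduceByKey's argument order is not guaranteed).
def pick(a: String, b: String): String =
  if (a.length != b.length) { if (a.length > b.length) a else b }
  else if (a <= b) a else b

val pairs = List(
  ("j.smith", "Jon Smith"),
  ("j.smith", "Jonathan Smith"), // same key, richer value
  ("m.jones", "Mary Jones")
)

// Equivalent of pairs.reduceByKey(pick) on plain collections:
val canonical = pairs.groupBy(_._1).map { case (k, vs) =>
  k -> vs.map(_._2).reduce(pick)
}
```

Because `pick` gives the same answer regardless of argument order and grouping, the result is stable across partitionings.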
Re: Dedup
Maybe you could implement something like this (I don't know if something similar already exists in Spark): http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf

Best,
Flavio

On Oct 8, 2014 9:58 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Re: Dedup
What is your data like? Are you looking at exact matching, or are you interested in nearly-same records? Do you need to merge similar records to get a canonical value?

Best Regards,
Sonal
Nube Technologies
http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier pomperma...@okkam.it wrote: