RE: Dedup

2016-01-12 Thread gpmacalalad
sowen wrote
> Arrays are not immutable and do not have the equals semantics you need to
> use them as a key. Use a Scala immutable List.
> On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)" yge@ wrote:
>
>> Yes. I was using a String array as the arguments in the reduceByKey. I
>> think a String array is actually immutable, and simply returning the first
>> argument without cloning should work. I will look into mapPartitions as we
>> can have up to 40% duplicates. Will follow up on this if necessary. Thanks
>> very much, Sean!
>>
>> -Yao
>>
>> -----Original Message-----
>> From: Sean Owen [mailto:sowen@]
>> Sent: Thursday, October 09, 2014 3:04 AM
>> To: Ge, Yao (Y.)
>> Cc: user@spark.apache.org
>> Subject: Re: Dedup
>>
>> I think the question is about copying the argument. If it's an immutable
>> value like String, yes just return the first argument and ignore the
>> second. If you're dealing with a notoriously mutable value like a Hadoop
>> Writable, you need to copy the value you return.
>>
>> This works fine although you will spend a fair bit of time marshaling all
>> of those duplicates together just to discard all but one.
>>
>> If there are lots of duplicates, it would take a bit more work, but it
>> would be faster, to do something like this: use mapPartitions to retain
>> one input value per unique dedup criterion, output those pairs, and then
>> reduceByKey the result.
>>
>> On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) yge@ wrote:
>> > I need to do deduplication processing in Spark. The current plan is to
>> > generate a tuple where the key is the dedup criteria and the value is
>> > the original input. I am thinking of using reduceByKey to discard
>> > duplicate values. If I do that, can I simply return the first argument
>> > or should I return a copy of the first argument? Is there a better way
>> > to do dedup in Spark?
>> >
>> > -Yao

Hi, I'm a bit new to Scala/Spark. We are doing data deduplication; so far I
can handle exact matching for 3M lines of data, but I'm in a dilemma over
fuzzy matching using cosine and Jaro-Winkler similarity. My biggest problem
is how to optimize my method for finding matches at 90% similarity and above.
I am planning to group records first before matching, but this may result in
missing out some important matches. Can someone help me? Much appreciated.
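
For what it's worth, here is a minimal Scala sketch of the group-then-match
idea, assuming Apache Commons Text's JaroWinklerSimilarity is on the
classpath; the three-character prefix blocking key and the 0.9 threshold are
illustrative assumptions, not recommendations:

  import org.apache.commons.text.similarity.JaroWinklerSimilarity
  import org.apache.spark.rdd.RDD

  // Block records by a cheap key so the quadratic fuzzy comparison runs
  // only within each block instead of across all 3M records.
  def fuzzyDedup(records: RDD[String], threshold: Double = 0.9): RDD[String] =
    records
      .groupBy(_.toLowerCase.take(3))        // hypothetical blocking key
      .flatMap { case (_, block) =>
        val jw = new JaroWinklerSimilarity() // created inside the closure, per task
        // Keep a record only if nothing kept so far matches it above the threshold.
        block.foldLeft(List.empty[String]) { (kept, rec) =>
          if (kept.exists(k => jw.apply(k, rec) >= threshold)) kept else rec :: kept
        }
      }

Any single blocking key can miss true matches that land in different blocks,
which is exactly the concern above; running several passes with different
blocking keys (prefix, phonetic code, sorted tokens) reduces that risk at the
cost of extra work.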






Re: Dedup

2014-10-09 Thread Akhil Das
If you are looking to eliminate duplicate rows (or similar), you can
define a key from the data and then do reduceByKey on that key.
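
For example, a minimal sketch of that, assuming the input is a text file
whose first comma-separated field serves as the dedup key (both assumptions
are illustrative):

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark

  // Exact dedup: derive a key from each record, then keep one value per key.
  def dedupExact(sc: SparkContext, path: String) =
    sc.textFile(path)
      .map(line => (line.split(",")(0), line)) // assumed key: first field
      .reduceByKey((first, _) => first)        // discard all but one value
      .values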

Thanks
Best Regards

On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal sonalgoy...@gmail.com wrote:

 What is your data like? Are you looking at exact matching or are you
 interested in nearly same records? Do you need to merge similar records to
 get a canonical value?

 Best Regards,
 Sonal
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal



 On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier pomperma...@okkam.it
 wrote:

 Maybe you could implement something like this (I don't know if something
 similar already exists in Spark):

 http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf

 Best,
 Flavio
 On Oct 8, 2014 9:58 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Multiple values may be different, yet still be considered duplicates
 depending on how the dedup criteria are selected. Is that correct? Do you
 care in that case what value you select for a given key?

 On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote:

  I need to do deduplication processing in Spark. The current plan is to
 generate a tuple where the key is the dedup criteria and the value is the
 original input. I am thinking of using reduceByKey to discard duplicate
 values. If I do that, can I simply return the first argument or should I
 return a copy of the first argument? Is there a better way to do dedup in
 Spark?



 -Yao






Re: Dedup

2014-10-09 Thread Sean Owen
I think the question is about copying the argument. If it's an
immutable value like String, yes just return the first argument and
ignore the second. If you're dealing with a notoriously mutable value
like a Hadoop Writable, you need to copy the value you return.

This works fine although you will spend a fair bit of time marshaling
all of those duplicates together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but it
would be faster, to do something like this: use mapPartitions to retain
one input value per unique dedup criterion, output those pairs, and then
reduceByKey the result.
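
In code, that shape might look like the following minimal sketch (the String
key type and the "keep the first value" policy are assumptions for
illustration):

  import scala.collection.mutable
  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD
  import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark

  // Dedup within each partition first, so at most one value per key per
  // partition is shuffled; reduceByKey then resolves duplicates across
  // partitions.
  def dedup[V: ClassTag](keyed: RDD[(String, V)]): RDD[(String, V)] =
    keyed
      .mapPartitions { it =>
        val seen = mutable.LinkedHashMap.empty[String, V]
        it.foreach { case (k, v) => if (!seen.contains(k)) seen(k) = v }
        seen.iterator
      }
      .reduceByKey((first, _) => first)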

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) y...@ford.com wrote:
 I need to do deduplication processing in Spark. The current plan is to
 generate a tuple where the key is the dedup criteria and the value is the
 original input. I am thinking of using reduceByKey to discard duplicate
 values. If I do that, can I simply return the first argument or should I
 return a copy of the first argument? Is there a better way to do dedup in Spark?



 -Yao




RE: Dedup

2014-10-09 Thread Ge, Yao (Y.)
Yes. I was using a String array as the arguments in the reduceByKey. I think a
String array is actually immutable, and simply returning the first argument
without cloning should work. I will look into mapPartitions as we can have up to
40% duplicates. Will follow up on this if necessary. Thanks very much, Sean!

-Yao  

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, October 09, 2014 3:04 AM
To: Ge, Yao (Y.)
Cc: user@spark.apache.org
Subject: Re: Dedup

I think the question is about copying the argument. If it's an immutable value 
like String, yes just return the first argument and ignore the second. If 
you're dealing with a notoriously mutable value like a Hadoop Writable, you 
need to copy the value you return.

This works fine although you will spend a fair bit of time marshaling all of 
those duplicates together just to discard all but one.

If there are lots of duplicates, it would take a bit more work, but it would be
faster, to do something like this: use mapPartitions to retain one input value
per unique dedup criterion, output those pairs, and then reduceByKey
the result.

On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) y...@ford.com wrote:
 I need to do deduplication processing in Spark. The current plan is to
 generate a tuple where the key is the dedup criteria and the value is the
 original input. I am thinking of using reduceByKey to discard duplicate
 values. If I do that, can I simply return the first argument or should
 I return a copy of the first argument? Is there a better way to do dedup in
 Spark?



 -Yao


RE: Dedup

2014-10-09 Thread Sean Owen
Arrays are not immutable and do not have the equals semantics you need to
use them as a key. Use a Scala immutable List.
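
To illustrate the point (a hypothetical snippet, not from the original thread):

  // Scala arrays compare by reference, so equal contents still make
  // distinct keys:
  Array("a", "b") == Array("a", "b") // false
  // An immutable List compares by contents, so it behaves as a dedup key
  // should:
  List("a", "b") == List("a", "b")   // true
  // e.g. if each record's fields arrive as an Array[String]:
  // records.map(fields => (fields.toList, fields)).reduceByKey((first, _) => first)
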
On Oct 9, 2014 12:32 PM, Ge, Yao (Y.) y...@ford.com wrote:

 Yes. I was using a String array as the arguments in the reduceByKey. I think
 a String array is actually immutable, and simply returning the first argument
 without cloning should work. I will look into mapPartitions as we can
 have up to 40% duplicates. Will follow up on this if necessary. Thanks very
 much, Sean!

 -Yao

 -Original Message-
 From: Sean Owen [mailto:so...@cloudera.com]
 Sent: Thursday, October 09, 2014 3:04 AM
 To: Ge, Yao (Y.)
 Cc: user@spark.apache.org
 Subject: Re: Dedup

 I think the question is about copying the argument. If it's an immutable
 value like String, yes just return the first argument and ignore the
 second. If you're dealing with a notoriously mutable value like a Hadoop
 Writable, you need to copy the value you return.

 This works fine although you will spend a fair bit of time marshaling all
 of those duplicates together just to discard all but one.

 If there are lots of duplicates, it would take a bit more work, but it would
 be faster, to do something like this: use mapPartitions to retain one input
 value per unique dedup criterion, output those pairs, and then
 reduceByKey the result.

 On Wed, Oct 8, 2014 at 8:37 PM, Ge, Yao (Y.) y...@ford.com wrote:
  I need to do deduplication processing in Spark. The current plan is to
  generate a tuple where the key is the dedup criteria and the value is the
  original input. I am thinking of using reduceByKey to discard duplicate
  values. If I do that, can I simply return the first argument or should
  I return a copy of the first argument? Is there a better way to do
 dedup in Spark?
 
 
 
  -Yao



Re: Dedup

2014-10-08 Thread Nicholas Chammas
Multiple values may be different, yet still be considered duplicates
depending on how the dedup criteria are selected. Is that correct? Do you
care in that case what value you select for a given key?

On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote:

  I need to do deduplication processing in Spark. The current plan is to
 generate a tuple where the key is the dedup criteria and the value is the
 original input. I am thinking of using reduceByKey to discard duplicate
 values. If I do that, can I simply return the first argument or should I
 return a copy of the first argument? Is there a better way to do dedup in Spark?



 -Yao



Re: Dedup

2014-10-08 Thread Flavio Pompermaier
Maybe you could implement something like this (I don't know if something
similar already exists in Spark):

http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf

Best,
Flavio
On Oct 8, 2014 9:58 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 Multiple values may be different, yet still be considered duplicates
 depending on how the dedup criteria are selected. Is that correct? Do you
 care in that case what value you select for a given key?

 On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote:

  I need to do deduplication processing in Spark. The current plan is to
 generate a tuple where the key is the dedup criteria and the value is the
 original input. I am thinking of using reduceByKey to discard duplicate
 values. If I do that, can I simply return the first argument or should I
 return a copy of the first argument? Is there a better way to do dedup in Spark?



 -Yao





Re: Dedup

2014-10-08 Thread Sonal Goyal
What is your data like? Are you looking at exact matching or are you
interested in nearly same records? Do you need to merge similar records to
get a canonical value?

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal



On Thu, Oct 9, 2014 at 2:31 AM, Flavio Pompermaier pomperma...@okkam.it
wrote:

 Maybe you could implement something like this (I don't know if something
 similar already exists in Spark):

 http://www.cs.berkeley.edu/~jnwang/papers/icde14_massjoin.pdf

 Best,
 Flavio
 On Oct 8, 2014 9:58 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Multiple values may be different, yet still be considered duplicates
 depending on how the dedup criteria are selected. Is that correct? Do you
 care in that case what value you select for a given key?

 On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote:

  I need to do deduplication processing in Spark. The current plan is to
 generate a tuple where the key is the dedup criteria and the value is the
 original input. I am thinking of using reduceByKey to discard duplicate
 values. If I do that, can I simply return the first argument or should I
 return a copy of the first argument? Is there a better way to do dedup in Spark?



 -Yao