RE: Dedup

2016-01-12 Thread gpmacalalad
h Sean! >> >> -Yao >> >> -Original Message- >> From: Sean Owen [mailto: > sowen@ > ] >> Sent: Thursday, October 09, 2014 3:04 AM >> To: Ge, Yao (Y.) >> Cc: > user@.apache >> Subject: Re: Dedup >> >> I think t

Re: spark as a lookup engine for dedup

2015-07-27 Thread Shushant Arora
its for 1 day events in range of 1 billions and processing is in streaming application of ~10-15 sec interval so lookup should be fast. RDD need to be updated with new events and old events of current time-24 hours back should be removed at each processing. So is spark RDD not fit for this

Re: spark as a lookup engine for dedup

2015-07-27 Thread Romi Kuntsman
RDD is immutable, it cannot be changed, you can only create a new one from data or from transformation. It sounds inefficient to create one each 15 sec for the last 24 hours. I think a key-value store will be much more fitted for this purpose. On Mon, Jul 27, 2015 at 11:21 AM Shushant Arora

Re: spark as a lookup engine for dedup

2015-07-27 Thread Romi Kuntsman
What the throughput of processing and for how long do you need to remember duplicates? You can take all the events, put them in an RDD, group by the key, and then process each key only once. But if you have a long running application where you want to check that you didn't see the same value

spark as a lookup engine for dedup

2015-07-26 Thread Shushant Arora
Hi I have a requirement for processing large events but ignoring duplicate at the same time. Events are consumed from kafka and each event has a eventid. It may happen that an event is already processed and came again at some other offset. 1.Can I use Spark RDD to persist processed events and

Re: Dedup

2014-10-09 Thread Akhil Das
, yet still be considered duplicates depending on how the dedup criteria is selected. Is that correct? Do you care in that case what value you select for a given key? On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote: I need to do deduplication processing in Spark. The current

Re: Dedup

2014-10-09 Thread Sean Owen
will spend a fair bit of time marshaling all of those duplicates together just to discard all but one. If there are lots of duplicates, It would take a bit more work, but would be faster, to do something like this: mapPartitions and retain one input value each unique dedup criteria, and then output

RE: Dedup

2014-10-09 Thread Ge, Yao (Y.)
much Sean! -Yao -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, October 09, 2014 3:04 AM To: Ge, Yao (Y.) Cc: user@spark.apache.org Subject: Re: Dedup I think the question is about copying the argument. If it's an immutable value like String, yes just

RE: Dedup

2014-10-09 Thread Sean Owen
, October 09, 2014 3:04 AM To: Ge, Yao (Y.) Cc: user@spark.apache.org Subject: Re: Dedup I think the question is about copying the argument. If it's an immutable value like String, yes just return the first argument and ignore the second. If you're dealing with a notoriously mutable value like

Dedup

2014-10-08 Thread Ge, Yao (Y.)
I need to do deduplication processing in Spark. The current plan is to generate a tuple where key is the dedup criteria and value is the original input. I am thinking to use reduceByKey to discard duplicate values. If I do that, can I simply return the first argument or should I return a copy

Re: Dedup

2014-10-08 Thread Nicholas Chammas
Multiple values may be different, yet still be considered duplicates depending on how the dedup criteria is selected. Is that correct? Do you care in that case what value you select for a given key? On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote: I need to do deduplication

Re: Dedup

2014-10-08 Thread Flavio Pompermaier
still be considered duplicates depending on how the dedup criteria is selected. Is that correct? Do you care in that case what value you select for a given key? On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote: I need to do deduplication processing in Spark. The current plan

Re: Dedup

2014-10-08 Thread Sonal Goyal
...@gmail.com wrote: Multiple values may be different, yet still be considered duplicates depending on how the dedup criteria is selected. Is that correct? Do you care in that case what value you select for a given key? On Wed, Oct 8, 2014 at 3:37 PM, Ge, Yao (Y.) y...@ford.com wrote: I need