As Cody said, Spark is not going to help you here. There are two issues to look at: the same message being processed by two (or more) different processes, and the failure of any component (including the message broker). Keep in mind that duplicates can show up even weeks later; from experience, a restarted message broker may resend a message weeks after the original. As said, a DHT can help, but implementing one yourself is a lot of error-prone effort. You may want to look at (dedicated) Redis nodes instead. Redis supports partitioning, is very fast (but please create only one connection per node, not one per lookup), and provides a variety of data structures that can solve your problem (e.g. atomic counters).
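For illustration, a Redis-based filter could look roughly like the sketch below (Scala, using the Jedis client). This is only a sketch: the Message case class, the "dedup:" key prefix, the host name, and the 30-day TTL are made up, and you would tune them to your setup. The important parts are one connection per partition and the atomic SETNX check per message id.

    import org.apache.spark.streaming.dstream.DStream
    import redis.clients.jedis.Jedis

    case class Message(id: String, payload: String)

    def dedup(stream: DStream[Message]): DStream[Message] =
      stream.mapPartitions { messages =>
        // One connection per partition, not one per lookup.
        val jedis = new Jedis("redis-host", 6379)
        val unique = messages.filter { m =>
          // SETNX is atomic: it returns 1 only for the first writer of this id.
          val firstSeen = jedis.setnx("dedup:" + m.id, "1") == 1L
          if (firstSeen) jedis.expire("dedup:" + m.id, 30 * 24 * 3600) // keep seen ids for ~30 days
          firstSeen
        }.toList // materialise before closing the connection (filter is lazy)
        jedis.close()
        unique.iterator
      }

You would then continue processing the returned DStream as usual; duplicates arriving later (within the TTL window) are dropped because their id already exists in Redis.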
> On 24 Sep 2016, at 08:49, kant kodali <kanth...@gmail.com> wrote:
>
> Hi Guys,
>
> I have a bunch of data coming into my Spark Streaming cluster from a message
> queue (not Kafka). This message queue guarantees at-least-once delivery only,
> so some of the messages that arrive in the Spark Streaming cluster may be
> duplicates, and I am trying to figure out the best way to filter them. I was
> thinking of using a hashmap as a broadcast variable, but then I saw that
> broadcast variables are read-only. Also, instead of a global hashmap variable
> on every worker node, I am thinking a distributed hash table would be a
> better idea. Any suggestions on how best I could approach this problem by
> leveraging the existing functionality?
>
> Thanks,
> kant