As Cody said, Spark is not going to help you here. There are two issues to look at: the same message being processed by two (or more) different processes, and the failure of any component (including the message broker). Keep in mind that duplicates can show up even weeks later; from experience, a restarted message broker may resend a message weeks after the original. As said, a DHT can help, but implementing one yourself is a lot of error-prone effort. You may want to look at (dedicated) Redis nodes instead. Redis supports partitioning, is very fast (but please create only one connection per node, not one per lookup), and provides a variety of data structures that can solve your problem (e.g. atomic counters).
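For illustration, a Redis-based filter could look roughly like the sketch below (Scala, using the Jedis client). This is only a sketch: the Message case class, the "dedup:" key prefix, the host name, and the 30-day TTL are made up, and you would tune them to your setup. The important parts are one connection per partition and the atomic SETNX check per message id.

    import org.apache.spark.streaming.dstream.DStream
    import redis.clients.jedis.Jedis

    case class Message(id: String, payload: String)

    def dedup(stream: DStream[Message]): DStream[Message] =
      stream.mapPartitions { messages =>
        // One connection per partition, not one per lookup.
        val jedis = new Jedis("redis-host", 6379)
        val unique = messages.filter { m =>
          // SETNX is atomic: it returns 1 only for the first writer of this id.
          val firstSeen = jedis.setnx("dedup:" + m.id, "1") == 1L
          if (firstSeen) jedis.expire("dedup:" + m.id, 30 * 24 * 3600) // keep seen ids for ~30 days
          firstSeen
        }.toList // materialise before closing the connection (filter is lazy)
        jedis.close()
        unique.iterator
      }

You would then continue processing the returned DStream as usual; duplicates arriving later (within the TTL window) are dropped because their id already exists in Redis.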
> On 24 Sep 2016, at 08:49, kant kodali <kanth...@gmail.com> wrote:
>
> Hi Guys,
>
> I have a bunch of data coming into my Spark Streaming cluster from a message
> queue (not Kafka). This message queue guarantees at-least-once delivery only,
> so some of the messages that arrive in the Spark Streaming cluster may be
> duplicates, and I am trying to figure out the best way to filter them. I was
> thinking of using a hashmap as a broadcast variable, but then I saw that
> broadcast variables are read-only. Also, instead of a global hashmap variable
> on every worker node, I am thinking a distributed hash table would be a
> better idea. Any suggestions on how best I could approach this problem by
> leveraging the existing functionality?
>
> Thanks,
> kant