Hi Guys,
I have a bunch of data coming into my Spark Streaming cluster from a message
queue (not Kafka). This message queue only guarantees at-least-once delivery,
so some of the messages that arrive at the Spark Streaming cluster may actually
be duplicates, and I am trying to figure out the best way to filter them out.
I was thinking of using a hashmap as a broadcast variable, but then I saw that
broadcast variables are read-only. Also, instead of having a global hashmap
variable on every worker node, I am thinking a distributed hash table might be
a better idea. Any suggestions on how best I could approach this problem by
leveraging existing functionality?
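For example, I was wondering whether a stateful dedup with mapWithState would
be a reasonable fit. Below is a rough sketch of what I have in mind; the socket
source, the (id, payload) format, the checkpoint path, and the one-hour timeout
are just placeholders for illustration, not my real setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object DedupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DedupSketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Checkpointing is required for stateful operations like mapWithState.
    ssc.checkpoint("/tmp/dedup-checkpoint")

    // Placeholder source: in the real job this would be the receiver for the
    // message queue, producing (messageId, payload) pairs.
    val messages = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(id, payload) = line.split(",", 2)
      (id, payload)
    }

    // Keep a per-key "seen" flag; emit the payload only the first time an id
    // shows up, and return None for any duplicate deliveries.
    val dedup = (id: String, payload: Option[String], state: State[Boolean]) => {
      if (state.exists()) {
        None
      } else {
        state.update(true)
        payload
      }
    }

    val deduped = messages
      .mapWithState(StateSpec.function(dedup).timeout(Seconds(3600)))
      .flatMap(opt => opt.toSeq) // drop the Nones (duplicates)

    deduped.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Not sure if keeping the seen-ids state on the cluster like this scales better
or worse than an external store, which is why I am asking.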
Thanks,
kant
