You'll have to handle de-duplication either upstream or downstream. It might technically be possible to do it in Spark, but you'll probably have an easier time handling duplicates in the service that reads from Kafka.
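For example, here is a minimal consumer-side sketch in Scala using the plain Kafka consumer API. The broker address, topic name, group id, and the assumption that each record's key carries a unique event ID are all placeholders, not anything from your setup:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.mutable
import scala.jdk.CollectionConverters._

object DedupConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
    props.put("group.id", "dedup-service")             // placeholder group id
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events"))  // placeholder topic

    // IDs we've already processed. In-memory only, so it is lost on restart
    // and grows without bound; see the note below.
    val seen = mutable.Set.empty[String]

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        val eventId = record.key()   // assumes the producer sets a unique ID as the key
        if (eventId != null && seen.add(eventId)) {
          // First time we've seen this ID: hand it off to the real processing.
          println(s"processing $eventId -> ${record.value()}")
        }
        // Records whose ID we've already seen are silently dropped.
      }
    }
  }
}

An in-memory set obviously doesn't survive restarts, so in practice you'd back it with something like a TTL cache or an external store keyed by event ID.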
On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
> Hi,
> we are trying to build a spark streaming solution that subscribes and pushes to kafka.
>
> But we are running into the problem of duplicate events.
>
> Right now, I am doing a “foreachRDD”, looping over the messages of each partition and sending those messages to kafka.
>
> Is there any good way of solving that issue?
>
> thanks

--
Regards,
Matt
Data Engineer
https://www.linkedin.com/in/mdeaver
http://mattdeav.pythonanywhere.com/
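For reference, a rough sketch of the foreachRDD pattern described above, with a deterministic event ID attached as the Kafka key so a downstream consumer can recognize re-sent duplicates. The input source, broker address, topic name, and the way the ID is derived are all placeholders, not your actual setup:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PushToKafka {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("push-to-kafka").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder input stream; in practice this would be your real source.
    val events = ssc.socketTextStream("localhost", 9999)

    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One producer per partition/task, created on the executor.
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        partition.foreach { event =>
          // Derive a deterministic ID from the event and use it as the Kafka key,
          // so a downstream consumer can drop re-sent duplicates.
          val eventId = java.util.UUID.nameUUIDFromBytes(event.getBytes("UTF-8")).toString
          producer.send(new ProducerRecord[String, String]("events", eventId, event))
        }
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Creating the producer inside foreachPartition (rather than on the driver) avoids serializing it to the executors; reusing a single producer per executor would be the next refinement. Note that keying alone doesn't remove duplicates, it just makes them detectable downstream.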