You have to handle de-duplication upstream or downstream. Output from
foreachRDD is at-least-once, so a retried task can re-send messages it has
already sent. It might technically be possible to handle this in Spark, but
you'll probably have a better time handling duplicates in the service that
reads from Kafka.
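
For what it's worth, here is a rough sketch of what downstream de-duplication
could look like. It assumes every message carries a unique id; the `event_id`
field, the `Deduplicator` class and the `process` helper are all hypothetical
names for illustration, not part of Spark or Kafka. It only remembers a bounded
window of recent ids, so it catches duplicates that arrive close together (the
usual case after a task retry), not duplicates that arrive arbitrarily late.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent `capacity` event ids and flags repeats.

    Hypothetical helper: assumes each message has a unique, hashable id.
    """

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self._seen = OrderedDict()  # event_id -> True, oldest first

    def is_new(self, event_id):
        if event_id in self._seen:
            # Already seen: refresh its position and report a duplicate.
            self._seen.move_to_end(event_id)
            return False
        self._seen[event_id] = True
        if len(self._seen) > self.capacity:
            # Evict the least recently seen id to bound memory use.
            self._seen.popitem(last=False)
        return True


def process(messages, dedup):
    """Yield only messages whose event_id has not been seen recently."""
    for msg in messages:
        if dedup.is_new(msg["event_id"]):
            yield msg
```

In the consuming service you would wrap whatever iterator your Kafka client
gives you with `process(...)`. If the window has to survive restarts, the
in-memory dict would need to be swapped for an external store.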

On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart <mau...@cuberonlabs.com>
wrote:

> Hi,
> we are trying to build a Spark Streaming solution that subscribes to and
> pushes to Kafka.
>
> But we are running into the problem of duplicate events.
>
> Right now, I am doing a “foreachRDD”, looping over the messages of each
> partition and sending those messages to Kafka.
>
> Is there any good way of solving that issue?
>
> thanks
>



-- 
Regards,

Matt
Data Engineer
https://www.linkedin.com/in/mdeaver
http://mattdeav.pythonanywhere.com/
