Ok, thanks for your answers.

On 3/22/17, 1:34 PM, "Cody Koeninger" <c...@koeninger.org> wrote:

If you're talking about reading the same message multiple times in a
failure situation, see
https://github.com/koeninger/kafka-exactly-once

If you're talking about producing the same message multiple times in a
failure situation, keep an eye on
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging

If you're talking about producers just misbehaving and sending
different copies of what is essentially the same message from a domain
perspective, you have to dedupe that with your own logic.

On Wed, Mar 22, 2017 at 2:52 PM, Matt Deaver <mattrdea...@gmail.com> wrote:
> You have to handle de-duplication upstream or downstream. It might
> technically be possible to handle this in Spark, but you'll probably have a
> better time handling duplicates in the service that reads from Kafka.
>
> On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart <mau...@cuberonlabs.com> wrote:
>>
>> Hi,
>> We are trying to build a Spark Streaming solution that subscribes and
>> pushes to Kafka.
>>
>> But we are running into the problem of duplicate events.
>>
>> Right now, I am doing a "forEachRdd" and looping over the messages of
>> each partition to send those messages to Kafka.
>>
>> Is there any good way of solving that issue?
>>
>> Thanks
>
> --
> Regards,
>
> Matt
> Data Engineer
> https://www.linkedin.com/in/mdeaver
> http://mattdeav.pythonanywhere.com/
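For Cody's third case, where producers emit domain-level duplicates and you have to "dedupe that with your own logic", a minimal plain-Python sketch of per-batch deduplication is below. It assumes each message carries a unique `event_id` field; that field name, and the `dedupe_batch` helper itself, are hypothetical placeholders for whatever identifies a message in your domain.

```python
import json

def dedupe_batch(raw_messages, seen_ids=None):
    """Return parsed messages, dropping any whose event_id was already seen.

    `event_id` is a hypothetical unique key; substitute whatever field
    identifies a message in your domain. Passing the same `seen_ids` set
    across calls extends deduplication across batches.
    """
    if seen_ids is None:
        seen_ids = set()
    unique = []
    for raw in raw_messages:
        msg = json.loads(raw)
        key = msg["event_id"]
        if key in seen_ids:
            # Duplicate copy of an already-processed event; skip it.
            continue
        seen_ids.add(key)
        unique.append(msg)
    return unique

batch = [
    '{"event_id": "a1", "value": 1}',
    '{"event_id": "a2", "value": 2}',
    '{"event_id": "a1", "value": 1}',  # duplicate from a misbehaving producer
]
print([m["event_id"] for m in dedupe_batch(batch)])  # ['a1', 'a2']
```

Note that an in-memory set only deduplicates within one process and its lifetime; to deduplicate across restarts or across the executors of a Spark job, the seen-ID state would have to live in external shared storage (for example a key-value store) rather than a local set.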