Thanks, Matthias. Will read the doc you referenced.

The duplicates are on the consumer side. We've been trying to curtail this
by increasing the consumer session timeout. Would that potentially help?
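
For reference, this is the knob I mean, in rough Java form (a minimal
illustrative sketch; the broker, group, topic, and timeout values are
placeholders rather than our actual settings):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");   // placeholder
    props.put("group.id", "our-consumer-group");      // placeholder
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    // The knob we've been raising: a longer session timeout so that slow
    // processing doesn't get the consumer kicked out of the group (a
    // rebalance redelivers anything read but not yet committed).
    props.put("session.timeout.ms", "30000");          // placeholder value
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("our-topic"));  // placeholder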

Basically, we're grappling with the causes of the behavior. Why would
messages ever be delivered multiple times?

If we have to roll with a lookup table of "already seen" messages, it
significantly complicates the architecture of our application. In the
distributed case, we'll have to add something like Redis or memcached, plus
the logic for maintaining a "distributed hashset" of seen messages. We'll
also need a policy for periodically purging that hashset.
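
Just to illustrate the extra moving parts, the dedup check alone would look
roughly like this (a sketch using Jedis against Redis; the key scheme and the
one-week TTL are arbitrary illustration, and a real version would need error
handling and connection pooling):

    import redis.clients.jedis.Jedis;

    // Returns true if this (topic, partition, offset) was already seen.
    static boolean isDuplicate(Jedis redis, String topic, int partition, long offset) {
        String key = "seen:" + topic + ":" + partition + ":" + offset;
        // SETNX stores the key only if it doesn't exist yet; 1 means "first time".
        boolean firstTime = redis.setnx(key, "1") == 1L;
        // The TTL acts as the purge policy, so the set of seen keys can't grow forever.
        redis.expire(key, 7 * 24 * 3600);
        return !firstTime;
    }

And every consumer poll loop would then have to call this and skip records it
has already seen.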

I would think that "exactly once" would have to be exactly that. Consumer
gets a given message just once.

Basically, we're developing what is a mission-critical application, and
having data loss due to "at most once" or data duplication due to "at least
once" is pretty much unacceptable. We can't "sell" data loss to application
stakeholders, and data duplication breaks our internal bookkeeping of the
data processing flows.

In other words, we would ideally like to see message queueing capabilities
in Kafka with very high exactly-once delivery guarantees...

Thanks,
- Dmitry


On Thu, Apr 13, 2017 at 7:00 PM, Matthias J. Sax <matth...@confluent.io>
wrote:

> Hi,
>
> the first question to ask would be whether you get duplicate writes at the
> producer or duplicate reads at the consumer...
>
> For exactly-once: it's work in progress and we aim for the 0.11 release
> (which might still be a beta version).
>
> In short, there will be an idempotent producer that will avoid duplicate
> writes. Furthermore, there will be "transactions" that allow for
> exactly-once "read-process-write" scenarios -- Kafka Streams will
> leverage this feature.
>
> For reads, exactly-once will allow consuming only committed messages.
> But it does not help with duplicate reads.
>
> For duplicate reads, you cannot assume that "Kafka just does the right
> thing" -- however, you can heavily influence the potential number of
> duplicates. For example, you can reduce the commit interval or even commit
> manually (in the extreme case after each message). But even if you
> commit after each message, your application needs to "track" the
> progress of the currently processed message -- if you are in the middle
> of processing and fail, Kafka cannot know what progress your application
> made for the current message -- thus, it's up to you to decide, on
> restart, whether you want to receive the message again or not... Kafka
> cannot know this.
>
> If you want to get the full details about exactly-once, you can have a
> look into the KIP:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
>
> Hope this helps.
>
>
> -Matthias
>
>
> On 4/13/17 9:35 AM, Dmitry Goldenberg wrote:
> > Thanks, Jayesh and Vincent.
> >
> > It seems rather extreme that one has to implement a cache of already seen
> > messages using Redis, memcached or some such.  I would expect Kafka to "do
> > the right thing".  The data loss is a worse problem, especially for
> > mission critical applications.  So what is the current "stance" on the
> > exactly-once delivery semantic?
> >
> > - Dmitry
> >
> > On Thu, Apr 13, 2017 at 12:07 PM, Thakrar, Jayesh
> > <jthak...@conversantmedia.com> wrote:
> >
> >> Hi Dmitri,
> >>
> >> This presentation might help you understand and take appropriate actions
> >> to deal with data duplication (and data loss)
> >>
> >> https://www.slideshare.net/JayeshThakrar/kafka-68540012
> >>
> >> Regards,
> >> Jayesh
> >>
> >> On 4/13/17, 10:05 AM, "Vincent Dautremont"
> >> <vincent.dautrem...@olamobile.com.INVALID> wrote:
> >>
> >>     One of the cases where you would get a message more than once is if
> >>     you get disconnected / kicked off the consumer group / etc. if you
> >>     fail to commit offsets for messages you have already read.
> >>
> >>     What I do is insert the message into an in-memory Redis cache
> >>     database. If it fails to insert because of primary key duplication,
> >>     well, that means I've already received that message in the past.
> >>
> >>     You could even insert the topic+partition+offset of the message
> >>     (instead of the full message payload) if you know for sure that your
> >>     message payload would not be duplicated in the Kafka topic.
> >>
> >>     Vincent.
> >>
> >>     On Thu, Apr 13, 2017 at 4:52 PM, Dmitry Goldenberg
> >>     <dgoldenb...@hexastax.com> wrote:
> >>
> >>     > Hi all,
> >>     >
> >>     > I was wondering if someone could list some of the causes which may
> >>     > lead to Kafka delivering the same messages more than once.
> >>     >
> >>     > We've looked around and we see no errors to notice, yet
> >>     > intermittently, we see messages being delivered more than once.
> >>     >
> >>     > Kafka documentation talks about the below delivery modes:
> >>     >
> >>     >    - *At most once*—Messages may be lost but are never redelivered.
> >>     >    - *At least once*—Messages are never lost but may be redelivered.
> >>     >    - *Exactly once*—this is what people actually want, each message
> >>     >      is delivered once and only once.
> >>     >
> >>     > So the default is 'at least once' and that is what we're running
> >>     > with (we don't want to do "at most once" as that appears to yield
> >>     > some potential for message loss).
> >>     >
> >>     > We had not seen duplicated deliveries for a while previously but
> >>     > just started seeing them quite frequently in our test cluster.
> >>     >
> >>     > What are some of the possible causes for this?  What are some of
> >>     > the available tools for troubleshooting this issue? What are some
> >>     > of the possible fixes folks have developed or instrumented for
> >>     > this issue?
> >>     >
> >>     > Also, is there an effort underway on Kafka side to provide support
> >>     > for the "exactly once" semantic?  That is exactly the semantic we
> >>     > want and we're wondering how that may be achieved.
> >>     >
> >>     > Thanks,
> >>     > - Dmitry
> >>     >
> >>
> >
>
>
