The out-of-the-box support for this in Kafka isn't great right now.

Exactly-once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.

There are two approaches to getting exactly-once semantics during data
production:

1. Use a single writer per partition and, every time you get a network
error, check the last message in that partition to see if your last write
succeeded.
2. Include a primary key (a UUID or similar) in the message and deduplicate
on the consumer (sketched below).
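
For the second approach, here is a minimal sketch of consumer-side
deduplication. It assumes the producer puts a UUID in the message key, and
it uses the newer Java consumer client rather than the high-level consumer
(an assumption on my part, as are the topic name, group id, and the
in-memory "seen" set, which a real loader would replace with a bounded or
persistent store):

import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DedupingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dedupe-example");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // In-memory for illustration only; a real loader needs a bounded or
        // persistent record of the ids it has already processed.
        Set<String> seen = new HashSet<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String id = record.key(); // producer sets this to a UUID per logical message
                    if (seen.add(id)) {       // add() returns false if this id was already processed
                        System.out.println("processing " + record.value());
                    }
                }
            }
        }
    }
}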

If you do one of these things the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer periodically
checkpoints its position, then when it fails and restarts it will resume
from the checkpointed position. So if the data output and the checkpoint
are not written atomically, it is possible to get duplicates here as well.
How you solve this is particular to your storage system. For example, if
you are using a database you could commit the data and the offset together
in a single transaction. The HDFS loader Camus that LinkedIn wrote does
something like this for Hadoop loads. The other alternative, which doesn't
require a transaction, is to store the offset with the data loaded and
deduplicate using the topic/partition/offset combination (sketched below).
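
Here is a sketch of that second alternative, assuming a relational store
with a primary key on (topic, partition, offset); the table name, columns,
and the PostgreSQL ON CONFLICT syntax are illustrative. A replayed record
becomes a no-op insert, and the commit covers both the payload and the
offset it carries:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OffsetWithDataWriter {
    private final Connection conn;

    public OffsetWithDataWriter(String jdbcUrl) throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl);
        conn.setAutoCommit(false); // the row and the offset inside it commit together
    }

    // The table's primary key is (topic, kafka_partition, msg_offset), so
    // replaying records after a consumer restart produces no duplicate rows.
    public void write(String topic, int partition, long offset, String payload) throws SQLException {
        String sql = "INSERT INTO events (topic, kafka_partition, msg_offset, payload) "
                   + "VALUES (?, ?, ?, ?) ON CONFLICT DO NOTHING"; // PostgreSQL syntax
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, topic);
            ps.setInt(2, partition);
            ps.setLong(3, offset);
            ps.setString(4, payload);
            ps.executeUpdate();
        }
        conn.commit();
    }
}

On restart, the loader can read back the maximum msg_offset per partition
from this table and resume from there, rather than trusting the consumer's
own checkpoint.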

I think there are two improvements that would make this a lot easier:
1. Producer idempotence is something that could be done automatically, and
much more cheaply, by optionally integrating support for it on the server.
2. The existing high-level consumer doesn't expose much of the finer-grained
control over offsets (e.g. to reset your position). We will be working on
that soon; a sketch of what that control can look like is below.
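
For reference, here is a minimal sketch of what that finer-grained control
can look like, written against the newer Java consumer (an assumption on my
part, not the high-level consumer discussed here); the topic, partition,
and offset values are illustrative:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ResetPosition {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "reset-example");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        long lastGoodOffset = 42L; // e.g. read back from the offset-with-data table above
        TopicPartition tp = new TopicPartition("events", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment, no group rebalancing
            consumer.seek(tp, lastGoodOffset + 1);          // resume just past the last record written
            // ... poll() from here as usual ...
        }
    }
}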

-Jay

On Mon, Feb 10, 2014 at 9:11 AM, Garry Turkington <
g.turking...@improvedigital.com> wrote:

> Hi,
>
> I've been doing some prototyping on Kafka for a few months now and like
> what I see. It's a good fit for some of my use cases in the areas of data
> distribution but also for processing - liking a lot of what I see in Samza.
> I'm now working through some of the operational issues and have a question
> to the community.
>
> I have several data sources that I want to push into Kafka but some of the
> most important arrive as a stream of files being dropped either into
> an SFTP location or S3. Conceptually the data is really a stream, but it's
> being chunked and made more batch-like by the deployment model of the
> operational servers. So pulling the data into Kafka and seeing it as a
> stream again is a big plus.
>
> But I really don't want duplicate messages. I know Kafka provides
> at-least-once semantics and that's fine; I'm happy to have the de-dupe
> logic external to Kafka. And if I look at my producer I can build up a
> protocol around adding record metadata and using Zookeeper to give me
> pretty high confidence that my clients will know whether they are reading
> from a file that was fully published into Kafka or not.
>
> I had assumed that this wouldn't be a unique use case, but after a bunch
> of searches I really don't find much in terms of either tools that help or
> even just best-practice patterns for handling this kind of need for
> exactly-once message processing.
>
> So now I'm thinking that either I just need better web search skills, or
> this isn't something many others are doing, and if so then there's likely
> a reason for that.
>
> Any thoughts?
>
> Thanks
> Garry
>
>
