Hi,

I've been prototyping with Kafka for a few months now and like what I see. It's 
a good fit for some of my use cases around data distribution, but also for 
processing - I like a lot of what I see in Samza. I'm now working through some 
of the operational issues and have a question for the community.

I have several data sources that I want to push into Kafka, but some of the 
most important arrive as a stream of files dropped either into an SFTP location 
or into S3. Conceptually the data is really a stream, but it's being chunked 
and made batch-like by the deployment model of the operational servers. So 
pulling the data into Kafka and treating it as a stream again is a big plus.

But I really don't want duplicate messages. I know Kafka provides at-least-once 
semantics and that's fine; I'm happy to have the de-dupe logic external to 
Kafka. And on the producer side I can build up a protocol around adding record 
metadata and using ZooKeeper to give me pretty high confidence that my 
consumers will know whether they are reading from a file that was fully 
published into Kafka or not.
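To make that concrete, here's a rough, untested sketch of the kind of protocol 
I have in mind: each record's key carries the source file id plus a sequence 
number (so consumers can de-dupe), and only after every record is acked does 
the producer drop a completion marker in ZooKeeper. The topic name, znode path 
and file-id scheme are just placeholders, and the parent znodes are assumed to 
already exist:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.CountDownLatch;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class FilePublisher {

        public static void main(String[] args) throws Exception {
            Path file = Paths.get(args[0]);
            String fileId = file.getFileName().toString(); // assumes file names are unique

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("acks", "all");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
                for (int i = 0; i < lines.size(); i++) {
                    // Key carries the per-record metadata: source file plus sequence number,
                    // so a downstream consumer can de-dupe replays of the same file.
                    String key = fileId + ":" + i;
                    producer.send(new ProducerRecord<>("file-ingest", key, lines.get(i)));
                }
                producer.flush(); // make sure everything is acked before declaring success
            }

            // Only after every record is acked do we drop a completion marker in ZooKeeper.
            // Consumers treat records from a file without this marker as possibly partial.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            String marker = "/file-ingest/completed/" + fileId;
            zk.create(marker,
                      String.valueOf(Files.size(file)).getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
            zk.close();
        }
    }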

I had assumed this wouldn't be a unique use case, but after a bunch of 
searching I really haven't found much in the way of tools that help, or even 
best-practice patterns, for handling this kind of need to support exactly-once 
message processing.

So now I'm thinking that either I just need better web search skills, or this 
isn't something many others are doing - and if so, there's likely a reason for 
that.

Any thoughts?

Thanks
Garry
