Actually, most of the duplicates I was seeing were due to a bug in the old Hive version I'm using (0.9). I am still seeing some duplicates, but far fewer: instead of 3-13%, I'm now seeing less than 1%. This appears to hold for each batch my consumer reads, which is currently set to 1,000,000 messages per batch. Does that seem more reasonable?
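FWIW, here is a minimal sketch of how one could measure the per-batch duplicate rate, assuming each message carries some unique ID. How that ID is extracted (a field in the payload, a hash of the message bytes, etc.) depends entirely on your schema, so treat it as a placeholder:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Counts duplicates within one consumed batch by tracking unique message IDs.
public class BatchDupCounter {
    public static double duplicateRate(List<String> messageIds) {
        Set<String> seen = new HashSet<>();
        long dups = 0;
        for (String id : messageIds) {
            if (!seen.add(id)) {  // add() returns false if the ID was already present
                dups++;
            }
        }
        return messageIds.isEmpty() ? 0.0 : (double) dups / messageIds.size();
    }
}

With ~1,000,000 IDs per batch this is just one HashSet lookup per message, so the overhead should be modest.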
-----Original Message-----
From: Joel Koshy [mailto:jjkosh...@gmail.com]
Sent: Thursday, January 09, 2014 7:07 AM
To: users@kafka.apache.org
Subject: Re: Duplicate records in Kafka 0.7

You mean duplicate records on the consumer side? Duplicates are possible
if there are consumer failures and another consumer instance resumes from
an earlier offset. It is also possible if there are producer retries due
to exceptions while producing. Do you see any of these errors in your
logs? Besides these scenarios, though, you shouldn't be seeing duplicates.

Thanks,

Joel

On Wed, Jan 8, 2014 at 5:21 PM, Xuyen On <x...@ancestry.com> wrote:
> Hi,
>
> I would like to check whether other people are seeing duplicate records
> with Kafka 0.7. I read the JIRAs and I believe that duplicates are still
> possible when using message compression on Kafka 0.7. I'm seeing
> duplicate records in the range of 6-13%. Is this normal?
>
> If you're using Kafka 0.7 with message compression enabled, can you
> please let me know whether you see duplicate records and, if so, what %?
>
> Also, please let me know what sort of deduplication strategy you're using.
>
> Thanks!
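One way to handle the replay scenario Joel describes (a consumer instance resuming from an earlier offset) is to remember the highest offset processed per partition and drop anything at or below it. A minimal sketch follows; how you obtain each message's offset and a per-partition key depends on your consumer, so both are placeholders here, not a specific Kafka 0.7 API:

import java.util.HashMap;
import java.util.Map;

// Drops messages already processed after a consumer restart by remembering
// the highest offset handled per (topic, partition) key.
public class OffsetDeduper {
    private final Map<String, Long> lastProcessed = new HashMap<>();

    // Returns true if this offset has not yet been processed for the partition.
    public boolean shouldProcess(String partitionKey, long offset) {
        Long last = lastProcessed.get(partitionKey);
        if (last != null && offset <= last) {
            return false;  // replayed message from an earlier offset
        }
        lastProcessed.put(partitionKey, offset);
        return true;
    }
}

Note that this only catches replays from consumer restarts; duplicates introduced by producer retries would still need an ID-based check like the batch-level one above.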