When you publish, is acks set to 0, 1, or all (-1)? What is max.in.flight.requests.per.connection set to (the default is 5)?
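In case it helps to check, here is a minimal sketch of how the effective values could be read from a Java producer's configuration. It assumes the producer is configured through a plain java.util.Properties object. The class name, the helper methods, and the localhost:9092 address are hypothetical, and the fallback values are what I believe the 0.10.x defaults to be, so verify them against the documentation for your client version.

```java
import java.util.Properties;

public class ProducerSettingsCheck {
    // Assumed defaults for the 0.10.x Java producer; check your client's docs.
    static final String DEFAULT_ACKS = "1";
    static final String DEFAULT_MAX_IN_FLIGHT = "5";

    // Returns the acks value the producer would use, normalising -1 to "all".
    static String effectiveAcks(Properties props) {
        String acks = props.getProperty("acks", DEFAULT_ACKS);
        return acks.equals("-1") ? "all" : acks;
    }

    // Returns the in-flight request limit, falling back to the assumed default.
    static int effectiveMaxInFlight(Properties props) {
        return Integer.parseInt(props.getProperty(
                "max.in.flight.requests.per.connection", DEFAULT_MAX_IN_FLIGHT));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        props.setProperty("acks", "-1");                          // same meaning as "all"

        System.out.println("acks=" + effectiveAcks(props));
        System.out.println("max.in.flight.requests.per.connection="
                + effectiveMaxInFlight(props));
    }
}
```

Any key you never set explicitly falls back to the client default, which is why the second value here comes out as 5.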
It sounds to me like your publishers are using acks=0, so they are not actually succeeding in publishing (i.e. you are getting no acks), but they retry over and over and have up to 5 retries in flight. When the broker comes back up, you get 4 or 5 copies of the same message.

Try setting max.in.flight.requests.per.connection=1 to get rid of the duplicates.

Try setting acks=all to ensure the messages are persisted by the leader and all the available replicas in the Kafka cluster.

-hans

/**
 * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
 * h...@confluent.io (650)924-2670
 */

On Tue, Apr 18, 2017 at 4:10 PM, Shrikant Patel <spa...@pdxinc.com> wrote:

> Hi All,
>
> I am seeing strange behavior between ZK and Kafka. We have 5 nodes each
> in the ZK and Kafka clusters. Kafka version - 2.11-0.10.1.1
>
> min.insync.replicas is 3, replication.factor is 5 for all topics, and
> unclean.leader.election.enable is false. We have 15 partitions for each
> topic.
>
> The steps we are following in our testing:
>
> * My understanding is that ZK needs at least 3 out of 5 servers to be
>   functional, and that Kafka cannot be functional without ZooKeeper. In
>   our testing, we bring down 3 ZK nodes and don't touch the Kafka nodes.
>   Kafka is still functional; consumers/producers can still
>   consume/publish from the Kafka cluster. We then bring down all ZK
>   nodes, and the Kafka consumers/producers are still functional. I am
>   not able to understand why the Kafka cluster is not failing as soon as
>   a majority of ZK nodes are down. I do see errors in Kafka saying that
>   it cannot connect to the ZK cluster.
>
> * With all or a majority of ZK nodes down, we bring down 1 Kafka node
>   (out of 5, so 4 are running). At that point the consumer and producer
>   start failing. My guess is that a new leader election cannot happen
>   without ZK.
>
> * Then we bring the majority of ZK nodes back up.
>   (The 1st Kafka node is still down.) Now the Kafka cluster becomes
>   functional, and the consumer and producer start working again. But
>   the consumer sees a big chunk of messages from Kafka, and many of them
>   are duplicates. It's like these messages were held up somewhere.
>   Where and why, I don't know. And why the duplicates? I can understand
>   a few duplicates for messages that the consumer could not commit
>   before the 1st node went down. But why so many duplicates, and around
>   4 copies of each message? I cannot understand this behavior.
>
> I would appreciate some insight into our issues. Also, if there are blogs
> that describe ZK and Kafka failover scenario behaviors, that would be
> extremely helpful.
>
> Thanks,
> Shri