Re: ZK and Kafka failover testing

Hans Jespersen Tue, 18 Apr 2017 17:58:27 -0700

When you publish, is acks set to 0, 1, or all (-1)?
What is max.in.flight.requests.per.connection (default is 5)?


It sounds to me like your publishers are using acks=0 and so they are not
actually succeeding in publishing (i.e. you are getting no acks) but they
will retry over and over and will have up to 5 retries in flight, so when
the broker comes back up, you are getting 4 or 5 copies of the same
message.

Try setting max.in.flight.requests.per.connection=1 to get rid of duplicates.
Try setting acks=all to ensure the messages are being persisted by the
leader and all the in-sync replicas in the Kafka cluster.
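
As a rough sketch of those two settings, assuming a Java producer: the
class name, the helper method, the localhost:9092 address, and the String
serializers below are placeholders for illustration, not taken from your
setup. Only the two property names and values come from the suggestions
above.

```java
import java.util.Properties;

// Hypothetical helper class; the name is an assumption for illustration.
class ProducerFailoverConfig {

    // Builds producer properties with the two settings suggested above.
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        // Assumed broker address; replace with your own broker list.
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Wait for the leader and the in-sync replicas to acknowledge.
        props.setProperty("acks", "all");
        // Allow only one unacknowledged request per connection.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        // Assumed serializers; pick the ones that match your message types.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092");
        // Print the resulting configuration for inspection.
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting Properties object would then be passed to the KafkaProducer
constructor from the kafka-clients library, which is not included here so
that the sketch compiles with the JDK alone.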

-hans

/**
 * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
 * h...@confluent.io (650)924-2670
 */

On Tue, Apr 18, 2017 at 4:10 PM, Shrikant Patel <spa...@pdxinc.com> wrote:

> Hi All,
>
> I am seeing strange behavior between ZK and Kafka. We have 5 nodes each in
> the ZK and Kafka clusters. Kafka version: 2.11-0.10.1.1
>
> The min.insync.replicas is 3, replication.factor is 5 for all topics,
> unclean.leader.election.enable is false. We have 15 partitions for each
> topic.
>
> The steps we are following in our testing:
>
>
> * My understanding is that ZK needs at least 3 out of 5 servers to be
> functional, and that Kafka cannot function without ZooKeeper. In our
> testing, we bring down 3 ZK nodes and don't touch the Kafka nodes. Kafka is
> still functional; consumers\producers can still consume\publish from the
> Kafka cluster. We then bring down all ZK nodes, and the Kafka
> consumers\producers are still functional. I am not able to understand why
> the Kafka cluster is not failing as soon as a majority of ZK nodes are
> down. I do see errors in Kafka saying that it cannot connect to the ZK
> cluster.
>
>
>
> * With all or a majority of ZK nodes down, we bring down 1 Kafka node
> (out of 5, so 4 are running). At that point the consumers and producers
> start failing. My guess is that a new leader election cannot happen
> without ZK.
>
>
>
> * Then we bring a majority of the ZK nodes back up (the 1st Kafka node is
> still down). Now the Kafka cluster becomes functional, and consumers and
> producers start working again. But the consumer sees a big chunk of
> messages from Kafka, and many of them are duplicates. It's like these
> messages were held up somewhere. Where and why, I don't know. And why the
> duplicates? I can understand a few duplicates for messages that the
> consumer could not commit before the 1st node went down. But why so many
> duplicates, like 4 copies of each message? I cannot understand this
> behavior.
>
> I would appreciate some insight into our issues. Also, if there are blogs
> that describe ZK and Kafka failover behavior, that would be extremely
> helpful.
>
> Thanks,
> Shri
