Hello,
I've been testing Kafka availability in ZooKeeper mode during single-broker shutdowns in a Kubernetes setup, and I've come across something I'd like to run by you.

When a partition leader goes down, messages are not delivered until a new leader is elected. We expect that, but one part of it still doesn't add up. The downtime, i.e. the time it takes for a new leader to step up, is about a minute. Yet when we increase the producer-side retries to just 1, all of our messages are delivered successfully.

This seems odd to me because, in theory, increasing the retries should only resend the message, giving it an extra 10 seconds before it times out, while the first few messages should still have around 40 seconds to wait for the new leader.

Do you have any insights, or could you point me in the right direction to understand why this is happening? Any help or guidance would be greatly appreciated.

Below is a log snippet from one of the test runs.

Partition leader shutdown and observation of a new partition leader being automatically elected, in a setup with 1 partition and a replication factor of 3:

Thu Oct 26 21:59:51 CEST 2023 - Partition leader has been shutdown
Thu Oct 26 22:01:06 CEST 2023 - Change in partition leader detected

Error messages from the producer client during the window when no partition leader is elected:

Failed to send message: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}. Message content: Message #39 from 2023-10-26 19:59:52
Failed to send message: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}. Message content: Message #40 from 2023-10-26 19:59:53
.
Failed to send message: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}. Message content: Message #97 from 2023-10-26 20:00:50
Failed to send message: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}. Message content: Message #98 from 2023-10-26 20:00:51

The container clocks are a little out of sync, but both unavailability windows come to around one minute.

Thanks a lot for your time, and looking forward to hearing from you.
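
P.S. For reference, the two unavailability windows can be checked with a few lines of Python. This is only a rough sketch that uses the timestamps from the log above. The variable names are made up for illustration, and the count of failed messages assumes that every message between #39 and #98 failed, which the shortened log does not actually show.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Broker-side observations, taken from the shutdown/election log lines.
leader_shutdown = datetime.strptime("2023-10-26 21:59:51", FMT)
leader_change_detected = datetime.strptime("2023-10-26 22:01:06", FMT)

# Producer-side observations: timestamps embedded in the first and last
# failed messages (#39 and #98), as printed by the producer container.
first_failed = datetime.strptime("2023-10-26 19:59:52", FMT)
last_failed = datetime.strptime("2023-10-26 20:00:51", FMT)

# Durations only depend on differences within one clock, so the offset
# between the two containers' clocks does not matter here.
broker_window_s = int((leader_change_detected - leader_shutdown).total_seconds())
producer_window_s = int((last_failed - first_failed).total_seconds())

# Assumption: all messages from #39 through #98 failed (the log is shortened).
assumed_failed_count = 98 - 39 + 1

print(f"Leader unavailable (broker view): {broker_window_s} s")
print(f"Failed sends span (producer view): {producer_window_s} s")
print(f"Assumed number of failed messages: {assumed_failed_count}")
```

So the broker-side window is a bit over a minute and the producer-side window a bit under, which is what I meant by "around one minute".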