Hello,

I've been testing Kafka availability in ZooKeeper mode during
single-broker shutdowns in a Kubernetes setup, and I've come across
something interesting that I wanted to run by you.

We've noticed that when a partition leader goes down, messages are not
delivered until a new leader is elected. We expect that, but one part of
it still doesn't add up. The downtime, i.e. the time it takes for the new
leader to step up, is about a minute. Yet when we raise the producer-side
retries setting to just 1, all of our messages get delivered successfully.

This seems odd to me because, in theory, increasing the retries should
only resend the message and give it an extra 10 seconds before it times
out, whereas the first few messages would still have around 40 seconds
left to wait for the new leader. So this behavior is a bit of a
head-scratcher.


I was wondering if you might have any insights or could point me in the
right direction to understand why this is happening. Any help or guidance
would be greatly appreciated.


Below is a log snippet from one of the test runs:

Partition leader shutdown, followed by the automatic election of a new
partition leader, in a setup with 1 partition and a replication factor of 3:
Thu Oct 26 21:59:51 CEST 2023 - Partition leader has been shutdown
Thu Oct 26 22:01:06 CEST 2023 - Change in partition leader detected

Error messages from the producer client during the window in which the
partition has no elected leader:
Failed to send message: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}. Message content: Message #39 from 2023-10-26 19:59:52

Failed to send message: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}. Message content: Message #40 from 2023-10-26 19:59:53
.
Failed to send message: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}. Message content: Message #97 from 2023-10-26 20:00:50

Failed to send message: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}. Message content: Message #98 from 2023-10-26 20:00:51


The container clocks are slightly out of sync, but both unavailability
windows come to roughly one minute.
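
For reference, the two windows can be worked out from the timestamps in the
snippet. This is only a rough check: it uses the first and last failed
messages shown (#39 and #98), drops the time zone token because each pair of
stamps comes from the same clock, and the helper names are made up for this
example.

```python
from datetime import datetime


def parse_date_stamp(stamp: str) -> datetime:
    """Parse a `date`-style stamp such as 'Thu Oct 26 21:59:51 CEST 2023'.

    The zone token (fifth field) is dropped; the two stamps being compared
    share the same zone, so the difference is unaffected.
    """
    parts = stamp.split()
    return datetime.strptime(" ".join(parts[:4] + parts[5:]),
                             "%a %b %d %H:%M:%S %Y")


def seconds_between(start: datetime, end: datetime) -> int:
    """Whole seconds elapsed from start to end."""
    return int((end - start).total_seconds())


# Leader-side window: shutdown until the leader change was detected.
leader_gap_s = seconds_between(
    parse_date_stamp("Thu Oct 26 21:59:51 CEST 2023"),
    parse_date_stamp("Thu Oct 26 22:01:06 CEST 2023"),
)

# Producer-side window: first to last failed message shown in the snippet.
producer_gap_s = seconds_between(
    datetime.strptime("2023-10-26 19:59:52", "%Y-%m-%d %H:%M:%S"),
    datetime.strptime("2023-10-26 20:00:51", "%Y-%m-%d %H:%M:%S"),
)

if __name__ == "__main__":
    print(leader_gap_s, producer_gap_s)
```

That gives 75 seconds on the leader side and 59 seconds between the first
and last failed message shown, so "around one minute" is approximate.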

Thanks a lot for your time, and looking forward to hearing from you.
