hey folks,

I have been testing a 10-node Kafka cluster on AWS, running version
2.8.0-0.8.0, and I am seeing some failover behavior that I want to make
sure I understand.

Initially, I have a topic X with 30 partitions and a replication factor
of 3. Looking at partition 0:
partition: 0 - leader: 5 preferred leader: 5 brokers: [5, 3, 4] in-sync:
[5, 3, 4]

When I kill broker 5, the controller immediately grabs the next replica
in the ISR and assigns it as the leader:
partition: 0 - leader: 3 preferred leader: 5 brokers: [5, 3, 4] in-sync:
[3, 4]
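
For reference, per the documented ZooKeeper layout, the current leader
and ISR live in the per-partition state znode, so after the failover I'd
expect it to look roughly like this (epoch values elided):

get /brokers/topics/X/partitions/0/state
{"controller_epoch": ..., "leader": 3, "version": 1, "leader_epoch": ..., "isr": [3, 4]}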

There are a couple of things at this point that I would like to clarify:

(1) Why is broker 5 still in the brokers array for partition 0? Note that
this brokers array comes from a get on the ZooKeeper path
/brokers/topics/[topic], as documented.
(2) Partition 0 is now under-replicated and the controller does not seem
to do anything about it. Is this expected?

Regarding (1), I am assuming the expectation is that a broker that goes
down will be brought back up soon, at which point it will catch up from
the current leader and rejoin the ISR. Am I right?
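
Part of why I'm assuming this: as far as I can tell from the documented
layout, /brokers/topics/[topic] only holds the statically assigned replica
set (written at topic creation and only changed by a reassignment), while
the live leader/ISR sits in the state znode above, so a dead broker staying
in the brokers array would make sense. For topic X that znode should look
roughly like:

get /brokers/topics/X
{"version": 1, "partitions": {"0": [5, 3, 4], ...}}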

Regarding (2), I eventually kicked off a reassign_partitions admin task
adding broker 7 to the replicas list for partition 0, which fixed the
under-replication:

partition: 0 - leader: 3  expected_leader: 3  brokers: [3, 4, 7]  in-sync:
[3, 4, 7]
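
For the record, the reassignment I submitted for partition 0 boiled down
to something like the JSON below, run through the reassign-partitions
tooling (I'm paraphrasing; the exact wrapper and flags seem to differ
across 0.8.x releases, so don't take this as the precise payload):

{"partitions":
  [{"topic": "X", "partition": 0, "replicas": [3, 4, 7]}]
}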

Is it therefore expected that the user fixes up the under-replication
themselves? Or, again, is the expectation that broker 5 will come back
soon, making this whole thing a non-issue, given that decommissioning
brokers is not supported as of the Kafka version I am using?

One more thing I'd like to clarify: for another topic Y, broker 5 was
never removed from the ISR array. Note that Y is an unused topic, so I am
guessing that technically broker 5 is not out of sync... though it is
still dead. Is this the expected behavior?

I'd really appreciate it if somebody could confirm my understanding.

Thanks,
