Justin Chen created KAFKA-19354:
-----------------------------------

             Summary: KRaft observer unable to recover after re-bootstrapping 
to follower
                 Key: KAFKA-19354
                 URL: https://issues.apache.org/jira/browse/KAFKA-19354
             Project: Kafka
          Issue Type: Bug
          Components: kraft
    Affects Versions: 4.0.0
            Reporter: Justin Chen


[Original dev mail 
thread|https://lists.apache.org/thread/ws3390khsxhdg2b8cnv2mzv8slz5xq7q]

If an observer's FETCH request to the quorum leader fails or times out, the 
observer re-bootstraps and, because the target node is selected at random, may 
connect to a follower node. The observer then continually sends FETCH requests 
to that follower and in return receives a response with a "partitionError" 
errorCode=6 (NOT_LEADER_OR_FOLLOWER), which does not trigger another 
re-bootstrap.
Thus, the observer is stuck sending FETCH requests to the follower instead of 
the leader, halting metadata replication and causing it to fall out of sync.

To recover from this state, a re-bootstrap has to be forced by restarting the 
affected observer or follower, repeating until the observer connects to the 
correct leader.
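
The stuck state described above can be illustrated with a small toy model. This is only a sketch and not Kafka's actual KRaft client code: the `Controller` and `Observer` classes, `fetch_once`, and `restart_until_leader` are hypothetical names invented for illustration, and the model assumes that only the leader answers FETCH successfully. The error code 6 (NOT_LEADER_OR_FOLLOWER) and the random node selection are taken from the description.

```python
import random
from dataclasses import dataclass

NONE = 0
NOT_LEADER_OR_FOLLOWER = 6  # partition-level errorCode quoted in the report


@dataclass
class Controller:
    # Hypothetical stand-in for a KRaft controller node.
    name: str
    is_leader: bool

    def handle_fetch(self) -> int:
        # Assumption of this toy model: only the leader serves FETCH.
        return NONE if self.is_leader else NOT_LEADER_OR_FOLLOWER


class Observer:
    # Hypothetical stand-in for an observer broker such as kafka-0.
    def __init__(self, controllers, rng):
        self.controllers = controllers
        self.rng = rng
        self.target = None
        self.records_fetched = 0
        self.bootstrap()

    def bootstrap(self):
        # Random selection among the controllers, as the report describes.
        self.target = self.rng.choice(self.controllers)

    def fetch_once(self) -> int:
        error = self.target.handle_fetch()
        if error == NONE:
            self.records_fetched += 1
        # Reported behaviour: NOT_LEADER_OR_FOLLOWER does not cause a
        # re-bootstrap, so self.target is left unchanged here.
        return error


def restart_until_leader(observer, max_restarts=1000):
    """Model the manual workaround: re-bootstrap until the leader is chosen."""
    restarts = 0
    while not observer.target.is_leader and restarts < max_restarts:
        observer.bootstrap()
        restarts += 1
    return restarts


if __name__ == "__main__":
    nodes = [Controller(f"kraftcontroller-{i}", i == 3) for i in range(1, 6)]
    observer = Observer(nodes, random.Random())
    observer.target = nodes[3]  # force the bad case: kraftcontroller-4, a follower
    print([observer.fetch_once() for _ in range(5)], observer.records_fetched)
    print("restarts needed:", restart_until_leader(observer))
```

In this model every fetch against the follower returns 6 and `records_fetched` never advances, which mirrors the halted metadata replication; only the explicit restart loop moves the observer back to the leader.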

*Steps to reproduce:*
1. Spin up a Kafka cluster with 3 or 5 controllers (ideally 5, to increase the 
likelihood of bootstrapping to a follower instead of the leader).
2. Enable a network delay on a particular observer broker (e.g. `tc qdisc add 
dev eth0 root netem delay 2500ms`). I picked 2500ms since the default timeout 
is 2s for 
`controller.quorum.fetch.timeout.ms`/`controller.quorum.request.timeout.ms`. 
After a few seconds, disable the network delay (e.g. `tc qdisc del dev eth0 
root netem`).
3. The observer node will re-bootstrap, potentially to a follower instead of 
the leader. If so, the observer will continuously send fetch requests to the 
follower node, receive `NOT_LEADER_OR_FOLLOWER` in response, and will no longer 
replicate metadata.
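
Step 2 above could be scripted roughly as follows. This is a sketch under assumptions: the helper names `netem_commands` and `inject_delay` are hypothetical, the 5 second hold time is only one reading of "a few seconds", and the interface `eth0` and the 2500ms delay are the example values from the step. The `tc` invocations are the ones quoted above, need root on the observer host, and are not executed unless `dry_run=False` is passed.

```python
import subprocess
import time


def netem_commands(interface="eth0", delay_ms=2500):
    # Build the two tc invocations quoted in step 2 as argument lists.
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "delay", f"{delay_ms}ms"]
    delete = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    return add, delete


def inject_delay(interface="eth0", delay_ms=2500, hold_seconds=5.0, dry_run=True):
    """Add the delay, hold it for hold_seconds, then remove it.

    With dry_run=True (the default) nothing is executed and the commands
    are only returned as strings.
    """
    add, delete = netem_commands(interface, delay_ms)
    planned = [" ".join(add), " ".join(delete)]
    if dry_run:
        return planned
    subprocess.run(add, check=True)
    try:
        # Assumed hold time; it should exceed the 2s fetch/request timeouts.
        time.sleep(hold_seconds)
    finally:
        # Always remove the delay, even if the sleep is interrupted.
        subprocess.run(delete, check=True)
    return planned


if __name__ == "__main__":
    for command in inject_delay():
        print(command)
```

Removing the qdisc in a `finally` block is a deliberate choice so that an interrupted run does not leave the delay in place on the broker.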

*Debug logs demonstrating this scenario:*
- https://gist.github.com/justin-chen/1f3eee79d9a5066a467818a0b1bc006f
- kraftcontroller-3 (leader), kraftcontroller-4 (follower), kafka-0 (observer)




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
