Jonah Hooper created KAFKA-19541:
------------------------------------

             Summary: KRaft should handle snapshot fetches under slow networks
                 Key: KAFKA-19541
                 URL: https://issues.apache.org/jira/browse/KAFKA-19541
             Project: Kafka
          Issue Type: Improvement
          Components: kraft
            Reporter: Jonah Hooper

If a "new" controller has no metadata log stored and joins a quorum, it will send FETCH_SNAPSHOT requests to the active controller to obtain an up-to-date log. It does this from FollowerState.

By default, KRaft allows 2s for a request to complete before it considers the active controller (leader) unavailable. If a request (including FETCH_SNAPSHOT) takes longer than 2s, it times out and the controller, if in FollowerState, transitions to CandidateState.

A controller that has not fetched any log from the active controller can never become leader, since it has no data. As such, it will eventually transition back to FollowerState.

If the snapshot on the active controller is too large (in size on disk) to download within the timeout under the network conditions between the active controller and the new controller, then it is possible that the "new" controller will get stuck in a loop. In this state it will transition as follows:
{code:java}
Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate -> ... -> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate ...{code}
Consider a snapshot `xxxx.checkpoint` of 20 MB and a connection between the active controller and the "new" controller of 2 MB/s. It would then take 10s to complete FETCH_SNAPSHOT of `xxxx.checkpoint`, well over the 2s timeout. In this case, unless network conditions improve, the "new" controller will be stuck in this loop forever.
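
The arithmetic in the example above can be checked with a small sketch. It is not Kafka code: the class and method names are hypothetical, and it assumes the figures are read as a 20 MB snapshot over a link sustaining 2 MB/s (the reading that yields 10s), with the whole snapshot having to arrive within the 2s timeout described in this issue.

```java
// Hypothetical helper for illustration only; none of these names exist in Kafka.
class SnapshotFetchEstimate {

    // Seconds needed to move snapshotBytes over a link sustaining bytesPerSecond.
    static double transferSeconds(long snapshotBytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("bytesPerSecond must be positive");
        }
        return (double) snapshotBytes / bytesPerSecond;
    }

    // True if the estimated transfer time is longer than the timeout.
    static boolean exceedsTimeout(long snapshotBytes, long bytesPerSecond, long timeoutMs) {
        return transferSeconds(snapshotBytes, bytesPerSecond) * 1000.0 > timeoutMs;
    }

    public static void main(String[] args) {
        long snapshotBytes = 20L * 1_000_000;  // assumed: 20 MB snapshot
        long bytesPerSecond = 2L * 1_000_000;  // assumed: 2 MB/s link
        long timeoutMs = 2_000;                // 2s timeout from the description

        System.out.println("Estimated transfer seconds: "
                + transferSeconds(snapshotBytes, bytesPerSecond));
        System.out.println("Exceeds timeout: "
                + exceedsTimeout(snapshotBytes, bytesPerSecond, timeoutMs));
    }
}
```

Under these assumptions the estimate is five times the timeout, so every attempt would fail in the same way unless the link gets faster or the timeout is raised.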