Jonah Hooper created KAFKA-19541:
------------------------------------

             Summary: KRaft should handle snapshot fetches under slow networks
                 Key: KAFKA-19541
                 URL: https://issues.apache.org/jira/browse/KAFKA-19541
             Project: Kafka
          Issue Type: Improvement
          Components: kraft
            Reporter: Jonah Hooper


If a "new" controller has no metadata log stored and joins a quorum, it sends
FETCH_SNAPSHOT requests to the active controller to obtain an up-to-date log.
It does this while in FollowerState.
By default, KRaft allows 2s for a request to complete before it considers
the active controller (leader) unavailable. If a request (including
FETCH_SNAPSHOT) exceeds 2s, it times out and the controller, if in
FollowerState, transitions to CandidateState. A controller that has not yet
fetched any log from the active controller can never become leader, since it
has no data, so it eventually transitions back to FollowerState.
If the snapshot on the active controller is large enough (in size on disk)
that downloading it, given network conditions between the active controller
and the new controller, takes longer than this timeout, then it is possible
that the "new" controller will get stuck in a loop.
In this state it will transition as follows:
{code:java}
Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate -> ... -> 
Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate ...{code}
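The loop above can be sketched as a toy model. This is only an illustration of the transitions described in this ticket, not Kafka's actual KRaft state machine; the class name `SnapshotLoopSketch` and the parameters `fetchMillis`, `timeoutMillis` and `rounds` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the loop described above; not Kafka's real KRaft state machine. */
final class SnapshotLoopSketch {
    enum State { UNATTACHED, FOLLOWER, CANDIDATE }

    /**
     * Returns the states visited over at most `rounds` attempts.
     * fetchMillis   - hypothetical time the snapshot fetch needs
     * timeoutMillis - hypothetical request timeout (2000 in the ticket's example)
     */
    static List<State> simulate(long fetchMillis, long timeoutMillis, int rounds) {
        List<State> visited = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            visited.add(State.UNATTACHED);
            visited.add(State.FOLLOWER);
            if (fetchMillis <= timeoutMillis) {
                // The fetch finished in time, so the controller stays a follower.
                return visited;
            }
            // The fetch timed out: the follower becomes a candidate, cannot win
            // an election without data, and starts over on the next iteration.
            visited.add(State.CANDIDATE);
        }
        return visited;
    }

    public static void main(String[] args) {
        // A fetch needing 10s against a 2s timeout: the same three states repeat.
        System.out.println(simulate(10_000, 2_000, 2));
        // A fetch needing 1.5s against a 2s timeout: the controller settles as follower.
        System.out.println(simulate(1_500, 2_000, 2));
    }
}
```

In this model nothing changes between rounds, so once the fetch takes longer than the timeout every round ends in Candidate again.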
Consider a snapshot `xxxx.checkpoint` of 20 MB and a connection between the
active controller and the "new" controller of 2 MB/s. It would then take 10s
to complete the FETCH_SNAPSHOT of `xxxx.checkpoint`, well beyond the 2s
timeout.
In this case, unless network conditions improve, the "new" controller will be
stuck in the loop indefinitely.
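The arithmetic in this example can be checked with a short sketch. It reads the link speed as 2 megabytes per second, which is what the 10s figure implies, and it ignores protocol overhead and latency. The class and method names are hypothetical and are not part of Kafka.

```java
/** Back-of-the-envelope check of the example above; names are illustrative only. */
final class SnapshotFetchEstimate {

    /** Seconds needed to move sizeBytes over a link carrying bytesPerSecond. */
    static double transferSeconds(long sizeBytes, long bytesPerSecond) {
        return (double) sizeBytes / bytesPerSecond;
    }

    /** True if the whole transfer would finish within timeoutMillis. */
    static boolean fitsInTimeout(long sizeBytes, long bytesPerSecond, long timeoutMillis) {
        return transferSeconds(sizeBytes, bytesPerSecond) * 1000.0 <= timeoutMillis;
    }

    public static void main(String[] args) {
        long snapshotBytes = 20L * 1_000_000;   // 20 MB snapshot from the example
        long linkBytesPerSec = 2L * 1_000_000;  // assumed: 2 MB/s link
        long timeoutMillis = 2_000;             // 2s timeout stated in the ticket

        System.out.println("transfer seconds: "
                + transferSeconds(snapshotBytes, linkBytesPerSec));
        System.out.println("fits in timeout: "
                + fitsInTimeout(snapshotBytes, linkBytesPerSec, timeoutMillis));
    }
}
```

Under these assumptions the transfer needs five times the allowed 2s, so no retry can succeed unless the link gets faster or the timeout is raised.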



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
