[
https://issues.apache.org/jira/browse/KAFKA-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christo Lolov updated KAFKA-19541:
----------------------------------
Fix Version/s: 4.3.0
> KRaft should handle snapshot fetches under slow networks
> --------------------------------------------------------
>
> Key: KAFKA-19541
> URL: https://issues.apache.org/jira/browse/KAFKA-19541
> Project: Kafka
> Issue Type: Improvement
> Components: kraft
> Reporter: Jonah Hooper
> Assignee: Jonah Hooper
> Priority: Major
> Fix For: 4.3.0
>
>
> If a "new" controller has no metadata log stored and joins a quorum, it will
> send a FETCH_SNAPSHOT request to the active controller to obtain an
> up-to-date log. It does this from FollowerState.
> By default, KRaft allows 2s for each request to complete before it considers
> the active controller (leader) unavailable. If a request (including
> FETCH_SNAPSHOT) exceeds 2s, it times out and the controller, if in
> FollowerState, transitions to CandidateState. A controller that has not
> fetched the log from the active controller can never become leader, since it
> has no data. As such, it will eventually transition back to FollowerState.
> If the snapshot on the active controller is too large (in size on disk) to
> download within that timeout under the network conditions between the active
> controller and the new controller, then it's possible that the "new"
> controller will get stuck in a loop.
> In this state it will cycle through the following transitions:
> {code:java}
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate -> ... ->
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate ...{code}
> Consider a snapshot `xxxx.checkpoint` of 20 MB and a connection between the
> active controller and the "new" controller of 2 MB/s; it would then take 10s
> to complete the FETCH_SNAPSHOT of `xxxx.checkpoint`.
> In this case, unless network conditions improve, the "new" controller will
> be stuck in this loop forever.
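
The arithmetic in the example above can be checked with a small sketch. It
follows the description's simplification that the whole FETCH_SNAPSHOT is one
transfer measured against the 2s timeout, and it assumes decimal megabytes
and a steady transfer rate. The class and method names are hypothetical and
are not part of Kafka.

```java
import java.time.Duration;

/** Back-of-the-envelope estimate; hypothetical helper, not a Kafka API. */
final class SnapshotFetchEstimate {
    // Assumption: decimal megabytes (1 MB = 1,000,000 bytes).
    static final long BYTES_PER_MB = 1_000_000L;

    /** Time to move snapshotBytes at a steady rate, rounded up to a whole millisecond. */
    static Duration transferTime(long snapshotBytes, long bytesPerSecond) {
        if (snapshotBytes < 0 || bytesPerSecond <= 0) {
            throw new IllegalArgumentException("size must be >= 0 and rate must be > 0");
        }
        long scaled = Math.multiplyExact(snapshotBytes, 1000L);
        long millis = (scaled + bytesPerSecond - 1) / bytesPerSecond; // ceiling division
        return Duration.ofMillis(millis);
    }

    /** True if the estimated transfer finishes within the given timeout. */
    static boolean completesWithin(Duration timeout, long snapshotBytes, long bytesPerSecond) {
        return transferTime(snapshotBytes, bytesPerSecond).compareTo(timeout) <= 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(2);   // the 2s default from the description
        long snapshotBytes = 20 * BYTES_PER_MB;     // 20 MB snapshot
        long bytesPerSecond = 2 * BYTES_PER_MB;     // 2 MB/s link

        Duration needed = transferTime(snapshotBytes, bytesPerSecond);
        System.out.println("needed=" + needed.toMillis() + "ms, fits="
                + completesWithin(timeout, snapshotBytes, bytesPerSecond));
        // prints "needed=10000ms, fits=false"
    }
}
```

Under these assumptions, the largest snapshot that fits in the timeout at
2 MB/s is 4 MB; anything larger would time out on every attempt, which is
the loop described above.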
--
This message was sent by Atlassian Jira
(v8.20.10#820010)