[
https://issues.apache.org/jira/browse/KAFKA-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18041322#comment-18041322
]
Christo Lolov commented on KAFKA-19541:
---------------------------------------
Heya [~jhooper]! I noticed that you got the necessary votes in
https://lists.apache.org/thread/yglpr9nnvrcxk0knf5xvzhqrdfdzg3f7 but the vote
wasn't closed. I will put the target version of this JIRA as 4.3 since we are
past both the KIP freeze and feature freeze dates, but let me know if I have
misunderstood something!
> KRaft should handle snapshot fetches under slow networks
> --------------------------------------------------------
>
> Key: KAFKA-19541
> URL: https://issues.apache.org/jira/browse/KAFKA-19541
> Project: Kafka
> Issue Type: Improvement
> Components: kraft
> Reporter: Jonah Hooper
> Assignee: Jonah Hooper
> Priority: Major
>
> If a "new" controller that has no metadata log stored joins a quorum, it
> will send FETCH_SNAPSHOT to the active controller to obtain an up-to-date
> log. It does this from FollowerState.
> By default, KRaft allows 2s for a request to complete before it considers
> the active controller (leader) unavailable. If a request (including
> FETCH_SNAPSHOT) exceeds 2s, it times out and the controller, if in
> FollowerState, transitions to CandidateState. A controller that has not
> fetched the log from the active controller can never become leader, since it
> has no data. As such, it will eventually transition back to follower state.
> If the snapshot on the active controller is too large (in size on disk) to
> download within that timeout under the network conditions between the active
> controller and the new controller, then it's possible that the "new"
> controller will get stuck in a loop.
> In this state it will cycle through:
> {code:java}
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate -> ... ->
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate ...{code}
> Consider a snapshot `xxxx.checkpoint` of 20 MB and a 2 MB/s connection between
> the active controller and the "new" controller: it would take 10s to complete
> FETCH_SNAPSHOT of `xxxx.checkpoint`.
> In this case, unless network conditions improve, the "new" controller will be
> stuck in the loop forever.
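The arithmetic in the example above can be checked with a small sketch. It assumes the link speed is meant as 2 MB/s (megabytes per second), which is what makes the 10s figure come out, and it follows the description's simplified model in which the whole snapshot transfer has to fit inside the 2s timeout. The class and method names are hypothetical and are not part of the KRaft code base.

```java
import java.time.Duration;

public class SnapshotFetchEstimate {

    // Time needed to move sizeBytes over a link that sustains bytesPerSecond,
    // rounded up to a whole millisecond.
    static Duration transferTime(long sizeBytes, long bytesPerSecond) {
        long millis = (sizeBytes * 1000 + bytesPerSecond - 1) / bytesPerSecond;
        return Duration.ofMillis(millis);
    }

    // True if the transfer finishes within the timeout.
    static boolean fitsWithin(Duration transfer, Duration timeout) {
        return transfer.compareTo(timeout) <= 0;
    }

    public static void main(String[] args) {
        long snapshotBytes = 20L * 1_000_000;     // 20 MB snapshot (from the example)
        long linkBytesPerSecond = 2L * 1_000_000; // assumed reading: 2 MB/s
        Duration timeout = Duration.ofSeconds(2); // the 2s default named in the description

        Duration needed = transferTime(snapshotBytes, linkBytesPerSecond);
        System.out.println("transfer takes " + needed.toMillis() + " ms");
        System.out.println("fits within timeout: " + fitsWithin(needed, timeout));
    }
}
```

Under these assumptions the transfer needs five times the timeout, so every attempt fails the same way unless the link gets faster or the timeout is raised.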