[jira] [Commented] (KAFKA-19541) KRaft should handle snapshot fetches under slow networks

Christo Lolov (Jira) Fri, 28 Nov 2025 07:02:03 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18041322#comment-18041322
 ]


Christo Lolov commented on KAFKA-19541:
---------------------------------------

Heya [~jhooper]! I noticed that you got the necessary votes in 
https://lists.apache.org/thread/yglpr9nnvrcxk0knf5xvzhqrdfdzg3f7 but the vote 
wasn't closed. I will set the target version of this Jira to 4.3, since we are 
past both the KIP freeze and feature freeze dates, but let me know if I have 
misunderstood something!

> KRaft should handle snapshot fetches under slow networks
> --------------------------------------------------------
>
>                 Key: KAFKA-19541
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19541
>             Project: Kafka
>          Issue Type: Improvement
>          Components: kraft
>            Reporter: Jonah Hooper
>            Assignee: Jonah Hooper
>            Priority: Major
>
> If a "new" controller has no metadata log stored and joins a quorum, it will 
> send a FETCH_SNAPSHOT request to the active controller to obtain an 
> up-to-date log. It does this from FollowerState. 
> By default, KRaft allows 2s for each request to complete before it considers 
> the active controller (leader) unavailable. If a request (including 
> FETCH_SNAPSHOT) exceeds 2s, it times out and the controller, if in 
> FollowerState, transitions to CandidateState. A controller that has not 
> fetched any log from the active controller can never become leader, since it 
> has no data. As such, it will eventually transition back to FollowerState. 
> If the snapshot on the active controller takes longer to download than the 
> timeout allows, given its size on disk and the network conditions between the 
> active controller and the new controller, then it's possible that the "new" 
> controller will get stuck in a loop. 
> In this state it will transition from:
> {code:java}
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate -> ... -> 
> Unattached -> Follower -> (Fail to FETCH_SNAPSHOT) -> Candidate ...{code}
> Consider a snapshot `xxxx.checkpoint` of 20 MB and a connection of 2 MB/s 
> between the active controller and the "new" controller: it would take 10s to 
> complete the FETCH_SNAPSHOT of `xxxx.checkpoint`. 
> In this case, unless network conditions improve, the "new" controller will be 
> stuck in the loop forever. 
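
The arithmetic in the description above can be sketched as a quick check. This is only an illustration under the description's own simplification that the whole snapshot has to arrive within one 2s timeout window. `SnapshotFetchEstimate` and its methods are hypothetical names, not part of Kafka, and the link speed is read as 2 megabytes per second so that the 10s figure holds.

```java
public final class SnapshotFetchEstimate {

    private SnapshotFetchEstimate() {
    }

    /**
     * Hypothetical helper (not a Kafka API): seconds needed to move a
     * snapshot of the given size over a link with the given throughput.
     */
    public static double transferSeconds(double sizeMegabytes, double megabytesPerSecond) {
        if (megabytesPerSecond <= 0) {
            throw new IllegalArgumentException("throughput must be positive");
        }
        return sizeMegabytes / megabytesPerSecond;
    }

    /**
     * Assumption from the issue description: the transfer must finish
     * within a single timeout window, otherwise the follower gives up.
     */
    public static boolean fitsWithinTimeout(double sizeMegabytes,
                                            double megabytesPerSecond,
                                            double timeoutSeconds) {
        return transferSeconds(sizeMegabytes, megabytesPerSecond) <= timeoutSeconds;
    }

    public static void main(String[] args) {
        double sizeMegabytes = 20.0;      // snapshot size from the description
        double megabytesPerSecond = 2.0;  // link speed from the description
        double timeoutSeconds = 2.0;      // default timeout from the description

        double seconds = transferSeconds(sizeMegabytes, megabytesPerSecond);
        boolean fits = fitsWithinTimeout(sizeMegabytes, megabytesPerSecond, timeoutSeconds);

        System.out.println("Estimated transfer time (s): " + seconds);
        System.out.println("Completes within timeout: " + fits);
    }
}
```

With the numbers from the description, the estimated transfer time is five times the timeout, which is why the new controller keeps falling out of FollowerState before the snapshot completes.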



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
