showuon commented on PR #20859: URL: https://github.com/apache/kafka/pull/20859#issuecomment-3580486460
> Consider the following case, which IMO results in equally confusing UX:
> We have a voter set of (A,B,C) with leader A, KIP-853, and auto-join enabled. We remove node C via RemoveRaftVoter, and the new (A,B) VotersRecord is committed (i.e. A and B both persist it). However, C could be partitioned and not replicate the (A,B) VotersRecord, meaning its partitionState.lastVoterSet() == (A,B,C). Node C restarts and enters initialize(), and it will set hasAutoJoined = true. In this case, the user "expects" C to auto-join because when they look at the quorum status, the voter set is (A,B) and C just restarted with auto-join enabled, but C will not attempt to auto-join. The point here is that A, B, and the user believe the voter set is (A,B), whereas C believes the voter set is (A,B,C) and changes its state based on it.

This is a really good point @kevin-wu24! Yes, the state in `KafkaRaftClient#initialize` is just the VotersRecord with the highest offset in the local log, so in the network-partition case above it will not be up to date. And it is not only the network-partition case: suppose node C is down and, while it is down, C is removed from the voter set. After node C starts up, its initial voter set is also [A, B, C]. So I think we should fix this issue.

> I'm not sure how to fix this case at the moment, and need to think about how to set hasAutoJoined properly. Maybe this value starts as true and the local node can set it to false after successfully fetching and finding out it is not a voter?

This will not work if the node lags a lot, because the 1st fetch might not be enough to catch up with the leader. I think what we need to do is make sure the node has caught up with the leader controller. Once it has caught up, the network-partition case above is resolved, because at that point we know the latest voter set. We can tell whether the node has caught up by checking the high watermark in the fetch response from the leader.

The flow is like this (using the situation above):

1. Node C starts up during a network partition (voters = [A, B, C]).
2. The network partition is resolved.
3. Node C sends the 1st fetch request to the leader.
4. The leader responds with highWatermark=100.
5. After appending the fetch response data, node C's LEO is 80 (not caught up yet).
6. Node C sends the 2nd fetch request.
7. Node C's LEO moves to 102 (suppose the leader has appended new data), which is >= 100, so it is now caught up.
8. Node C re-evaluates `hasAutoJoined` against the latest voter set at this moment (i.e. voters = [A, B]); since C is not in it, `hasAutoJoined=false` (see the rough sketch at the end of this comment).
9. Node C auto-joins the voter set because `hasAutoJoined=false`.

@TaiJuWu @kevin-wu24, does this make sense to you?
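
To make steps 7 and 8 concrete, here is a minimal sketch of the catch-up check. This is illustrative only: the class, method, and parameter names below are hypothetical and do not correspond to the actual `KafkaRaftClient` code.

```java
import java.util.Set;

// Sketch only: AutoJoinTracker and onFetchResponse are illustrative names,
// not part of the real KafkaRaftClient implementation.
final class AutoJoinTracker {
    private final int localNodeId;
    private boolean caughtUp = false;      // becomes true once LEO >= leader HW
    private boolean hasAutoJoined = false; // only evaluated after catch-up

    AutoJoinTracker(int localNodeId) {
        this.localNodeId = localNodeId;
    }

    /**
     * Called after appending the records from a fetch response.
     *
     * @param leaderHighWatermark the high watermark returned by the leader
     * @param logEndOffset        the local log end offset after the append
     * @param latestVoters        the voter set from the highest-offset VotersRecord
     *                            currently in the local log
     * @return true if the node should now attempt to auto-join (AddRaftVoter)
     */
    boolean onFetchResponse(long leaderHighWatermark, long logEndOffset, Set<Integer> latestVoters) {
        if (!caughtUp && logEndOffset >= leaderHighWatermark) {
            // The replica has caught up, so its latest VotersRecord reflects
            // the leader's committed voter set; only now is it safe to decide.
            caughtUp = true;
            hasAutoJoined = latestVoters.contains(localNodeId);
        }
        // Auto-join only once caught up and only if we are not already a voter.
        return caughtUp && !hasAutoJoined;
    }
}
```

In the flow above, the 1st fetch response (LEO 80 < HW 100) would return false, while the 2nd one (LEO 102 >= 100) would evaluate the latest voter set [A, B], find C missing, and return true, triggering the auto-join.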
