showuon commented on PR #20859:
URL: https://github.com/apache/kafka/pull/20859#issuecomment-3580486460

   > Consider the following case, which IMO results in equally confusing UX:
   > We have a voter set of (A,B,C) with leader A, KIP-853, and auto-join 
enabled. We remove node C via RemoveRaftVoter, and the new (A,B) VotersRecord 
is committed (i.e. A and B both persist it). However, C could be partitioned 
and not replicate the (A,B) VotersRecord, meaning its 
partitionState.lastVoterSet() == (A,B,C). Node C restarts and enters 
initialize(), and it will set hasAutoJoined = true. In this case, the user 
"expects" C to auto-join because when they look at the quorum status, the voter 
set is (A,B) and C just restarted with auto-join enabled, but C will not 
attempt to auto-join. The point here is that A, B, and the user believe the 
voter set is (A,B), whereas C believes the voter set is (A,B,C) and changes its 
state based on it.
   
   This is a really good point @kevin-wu24 ! Yes, the state in `KafkaRaftClient#initialize` is derived from the VotersRecord with the highest offset on the local node, so in the network partition case above that state will not be up to date. And it is not only the network partition case: suppose node C is down and, during that period, C is removed from the voters set. After node C starts up, its initial voters set is still [A,B,C]. So I think we should fix this issue.
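   
   To make this concrete, here is a minimal sketch of the startup decision described above (illustrative only, not the actual `KafkaRaftClient` code; `decideHasAutoJoinedAtStartup` and the node ids are made up for this example):
   
   ```java
   import java.util.Set;
   
   public class AutoJoinStartupSketch {
   
       // Illustrative: decide hasAutoJoined purely from the VotersRecord with
       // the highest offset that the local node has persisted, as described above.
       static boolean decideHasAutoJoinedAtStartup(Set<Integer> localVoterSet, int localNodeId) {
           return localVoterSet.contains(localNodeId);
       }
   
       public static void main(String[] args) {
           int nodeC = 3;
   
           // What A, B, and the user see after RemoveRaftVoter(C) is committed.
           Set<Integer> committedVoters = Set.of(1, 2);
   
           // What C persisted before it was partitioned / shut down.
           Set<Integer> staleVotersOnC = Set.of(1, 2, 3);
   
           // C restarts and decides from its stale local state: prints true,
           // so C never attempts to auto-join even though the committed voter
           // set no longer contains it.
           System.out.println("hasAutoJoined = "
                   + decideHasAutoJoinedAtStartup(staleVotersOnC, nodeC)
                   + ", committed voters = " + committedVoters);
       }
   }
   ```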
   
   > I'm not sure how to fix this case at the moment, and need to think about 
how to set hasAutoJoined properly. Maybe this value starts as true and the 
local node can set it to false after successfully fetching and finding out it 
is not a voter?
   
   This will not work, because if the node lags a lot, the 1st fetch might not be enough to catch up with the leader.
   
   I think what we need to do is to make sure the node is caught up with the leader controller. Once it has caught up, the network partition case above is resolved, because after catching up with the leader we know the latest voters set rather than only the one persisted at startup. We can tell whether the node has caught up by checking the high-watermark in the fetch response from the leader. The flow is like this (using the situation above), with a rough code sketch after the list:
   
   1. node C starts up with a network partition (voters = [A,B,C])
   2. the network partition is resolved
   3. node C sends the 1st fetch request to the leader
   4. the leader responds with highwatermark=100
   5. after appending the fetch response data, the LEO on node C is 80 (not caught up yet)
   6. node C sends the 2nd fetch request
   7. the LEO moves to 102 (suppose the leader has appended new data), which is >= 100
   8. node C updates `hasAutoJoined` using the latest voters set at this moment (i.e. voters = [A, B]), so `hasAutoJoined=false`
   9. node C auto-joins the voters set because `hasAutoJoined=false`
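   
   Roughly, I am imagining something like the sketch below (illustrative only; `onFetchResponse`, `sendAddRaftVoter`, and the field names are assumptions, not the real `KafkaRaftClient` API). The point is simply to defer the `hasAutoJoined` decision until the local log end offset has reached the high-watermark reported by the leader, and only then compare the replicated voters set against the local node id:
   
   ```java
   import java.util.Set;
   
   public class DeferredAutoJoinSketch {
   
       private final int localNodeId;
       // Undecided (null) until the local log has caught up with the leader.
       private Boolean hasAutoJoined = null;
   
       DeferredAutoJoinSketch(int localNodeId) {
           this.localNodeId = localNodeId;
       }
   
       // Called after appending the data from each fetch response.
       void onFetchResponse(long leaderHighWatermark, long localLogEndOffset, Set<Integer> latestVoterSet) {
           if (hasAutoJoined != null) {
               return; // already decided after catching up
           }
           if (localLogEndOffset < leaderHighWatermark) {
               return; // step 5 above: not caught up yet, keep fetching
           }
           // Caught up: the replicated voters set now reflects the committed one,
           // so the decision is no longer based on a stale VotersRecord.
           hasAutoJoined = latestVoterSet.contains(localNodeId);
           if (!hasAutoJoined) {
               sendAddRaftVoter(); // step 9 above: trigger auto-join
           }
       }
   
       private void sendAddRaftVoter() {
           // Placeholder for sending the AddRaftVoter RPC to the leader.
           System.out.println("node " + localNodeId + " sends AddRaftVoter");
       }
   
       public static void main(String[] args) {
           DeferredAutoJoinSketch nodeC = new DeferredAutoJoinSketch(3);
   
           // 1st fetch: leader HWM = 100, local LEO = 80 -> still catching up.
           nodeC.onFetchResponse(100, 80, Set.of(1, 2, 3));
   
           // 2nd fetch: local LEO = 102 >= HWM 100 -> decide from the now
           // up-to-date voters set [A=1, B=2], which does not contain C=3.
           nodeC.onFetchResponse(100, 102, Set.of(1, 2));
       }
   }
   ```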
   
   @TaiJuWu @kevin-wu24 , does this make sense to you?
   

