[jira] [Commented] (KAFKA-16281) Possible IllegalState with KIP-996

2024-02-20 Thread Calvin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819007#comment-17819007
 ] 

Calvin Liu commented on KAFKA-16281:


[~alivshits] Jack has clarified that the issue is with KIP-996, not KIP-966.

> Possible IllegalState with KIP-996
> --
>
> Key: KAFKA-16281
> URL: https://issues.apache.org/jira/browse/KAFKA-16281
> Project: Kafka
> Issue Type: Task
> Components: kraft
> Reporter: Jack Vanlightly
> Priority: Major
>
> I have a TLA+ model of KIP-996 (pre-vote) and I have identified an 
> IllegalState exception that would occur with the existing 
> MaybeHandleCommonResponse behavior.
> The issue stems from the fact that a leader, let's call it r1, can resign 
> (either due to a restart or check quorum) and then later initiate a pre-vote 
> where it ends up in the same epoch as before. When r1 receives a response 
> from r2 who believes that r1 is still the leader, the logic in 
> MaybeHandleCommonResponse tries to transition r1 to follower of itself, 
> causing an IllegalState exception to be raised.
> This is an example history:
>  # r1 is the leader in epoch 1.
>  # r1 resigns due to check quorum, or restarts and resigns.
>  # r1 experiences an election timeout and transitions to Prospective.
>  # r1 sends a pre-vote request to its peers.
>  # r2 thinks r1 is still the leader, sends a vote response, not granting its 
> vote and setting leaderId=r1 and epoch=1.
>  # r1 receives the vote response and executes MaybeHandleCommonResponse which 
> tries to transition r1 to Follower of itself and an illegal state occurs.
> The relevant else if statement in MaybeHandleCommonResponse is here: 
> [https://github.com/apache/kafka/blob/a26a1d847f1884a519561e7a4fb4cd13e051c824/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L1538]
> In the TLA+ specification, I fixed this issue by adding a fourth condition to 
> this statement, that replica must not be in the Prospective state. 
> [https://github.com/Vanlightly/kafka-tlaplus/blob/9b2600d1cd5c65930d666b12792d47362b64c015/kraft/kip_996/kraft_kip_996_functions.tla#L336|https://github.com/Vanlightly/kafka-tlaplus/blob/421f170ba4bd8c5eceb36b88b47901ee3d9c3d2a/kraft/kip_996/kraft_kip_996_functions.tla#L336]
>  
> Note that I also had to implement the sending of the BeginQuorumEpoch 
> request by the leader to prevent a replica getting stuck in Prospective. If 
> replica r2 has an election timeout due to a transient connectivity 
> issue with the leader, and has also fallen behind slightly, then r2 will 
> remain stuck as a Prospective because none of its peers, who have 
> connectivity to the leader, will grant it a pre-vote. To enable r2 to become 
> a functional member again, the leader must give it a nudge with a 
> BeginQuorumEpoch request. The alternative (which I have also modeled) is for 
> a Prospective to transition to Follower when it receives a negative pre-vote 
> response with a non-null leaderId. This comes with a separate liveness issue 
> which I can discuss if this "transition to Follower" approach is interesting. 
> Either way, a stuck Prospective needs a way to transition to follower 
> eventually, if all other members have a stable leader.
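
The failure in the example history quoted above, and the fourth condition used to avoid it, can be sketched in Java as follows. This is a simplified, hypothetical model and not the actual KafkaRaftClient code: the class PreVoteGuardSketch, its fields, and the guardProspective flag are assumptions made for illustration, and the three pre-existing conditions are an assumed paraphrase rather than a copy of the linked source.

```java
import java.util.OptionalInt;

// Hypothetical, simplified replica model; all names are illustrative only.
final class PreVoteGuardSketch {
    enum Role { LEADER, RESIGNED, PROSPECTIVE, FOLLOWER }

    final int localId;
    int epoch;
    Role role;
    OptionalInt leaderId = OptionalInt.empty();

    PreVoteGuardSketch(int localId, int epoch, Role role) {
        this.localId = localId;
        this.epoch = epoch;
        this.role = role;
    }

    // Models the invariant from the report: a replica cannot follow itself.
    void transitionToFollower(int newEpoch, int newLeaderId) {
        if (newLeaderId == localId) {
            throw new IllegalStateException(
                "Replica " + localId + " cannot become a follower of itself in epoch " + newEpoch);
        }
        epoch = newEpoch;
        leaderId = OptionalInt.of(newLeaderId);
        role = Role.FOLLOWER;
    }

    // Returns true if the response caused a transition to Follower.
    boolean handleCommonResponse(int responseEpoch,
                                 OptionalInt responseLeaderId,
                                 boolean guardProspective) {
        boolean learnedLeaderInSameEpoch =
            responseEpoch == epoch              // assumed condition 1: same epoch
                && responseLeaderId.isPresent() // assumed condition 2: response names a leader
                && !leaderId.isPresent();       // assumed condition 3: no leader known locally
        // Fourth condition from the TLA+ fix: skip the transition while Prospective.
        boolean blockedByGuard = guardProspective && role == Role.PROSPECTIVE;
        if (learnedLeaderInSameEpoch && !blockedByGuard) {
            transitionToFollower(responseEpoch, responseLeaderId.getAsInt());
            return true;
        }
        return false;
    }
}
```

With r1 modeled as new PreVoteGuardSketch(1, 1, Role.PROSPECTIVE), calling handleCommonResponse(1, OptionalInt.of(1), false) throws the IllegalStateException from step 6 of the history, while the same call with guardProspective set to true returns false and leaves r1 in Prospective.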
>  
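
The two ways out of a stuck Prospective state mentioned in the note above can be compared with a second small Java sketch. Everything here is hypothetical: the class StuckProspectiveSketch, both method names, and the epoch comparison are assumptions for illustration and are not taken from the KIP or from the TLA+ specification. The liveness issue mentioned for the second option is not modeled.

```java
import java.util.OptionalInt;

// Hypothetical sketch of two exits from Prospective; names are illustrative only.
final class StuckProspectiveSketch {
    enum Role { PROSPECTIVE, FOLLOWER }

    // Option 1: the leader nudges the replica with a BeginQuorumEpoch request.
    // Assumption: the request is accepted if it comes from another replica
    // and its epoch is not older than the local epoch.
    static Role onBeginQuorumEpoch(Role current, int localId, int localEpoch,
                                   int leaderId, int leaderEpoch) {
        boolean fromOtherReplica = leaderId != localId;
        boolean notStale = leaderEpoch >= localEpoch;
        return (fromOtherReplica && notStale) ? Role.FOLLOWER : current;
    }

    // Option 2: a Prospective becomes Follower when a negative pre-vote
    // response carries a non-null leaderId. Assumption: the replica ignores
    // responses that name itself as leader, so it never follows itself.
    static Role onRejectedPreVote(Role current, int localId, OptionalInt reportedLeaderId) {
        if (current == Role.PROSPECTIVE
                && reportedLeaderId.isPresent()
                && reportedLeaderId.getAsInt() != localId) {
            return Role.FOLLOWER;
        }
        return current;
    }
}
```

In this model a lagging r2 (id 2) leaves Prospective either when r1 sends it a BeginQuorumEpoch request for the current epoch or when a peer rejects its pre-vote while reporting r1 as leader. A rejected pre-vote that reports no leader leaves r2 in Prospective.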



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16281) Possible IllegalState with KIP-996

2024-02-20 Thread Artem Livshits (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819004#comment-17819004
 ] 

Artem Livshits commented on KAFKA-16281:


Is this a problem with KIP-966 itself, or did a model built to validate 
KIP-966 find an issue in the KRaft protocol? I don't think KIP-966 
changes the voting protocol for KRaft.



