[ 
https://issues.apache.org/jira/browse/IGNITE-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953423#comment-17953423
 ] 

Alexander Lapin edited comment on IGNITE-25452 at 5/26/25 12:45 PM:
--------------------------------------------------------------------

 The following happens in the test:
 # Set raft electionTimeout to 3 seconds and electionHeartbeatFactor to 2.
 # Start 3 raft nodes.
 # Await leader election.
 # Prepare logic to block the third and all subsequent AppendEntriesRequests, 
which eventually causes the leader to step down.
 # Await the election of a new leader for up to 10 seconds.

Let's assume that awaiting starts at t0.
 * Usually it takes 2 seconds for the AppendEntriesRequest to be sent. 
After that moment the raft followers detect the loss of the leader and move to 
the state in which they initiate the election of a new leader.
 * After 3 more seconds (electionTimeout) one of the two remaining nodes 
usually (19/20 on my machine) becomes the leader, meaning that after 2 + 3 = 5 
seconds the await for the new leader election completes successfully.
 * Sometimes both remaining nodes propose themselves as candidates at almost 
the same time (which is perfectly fine according to the raft protocol) - in 
that case one more leader election round starts after electionTimeout, meaning 
that the leader will be elected after 2 + 3 + 3 = 8 seconds, which is still 
less than the 10-second await timeout.
 * However, in very rare cases even two leader election rounds are not enough. 
In that case the leader will be elected after 2 + 3 + 3 + 3 = 11 seconds, which 
is greater than the 10-second await timeout - the test fails.
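
The timeline above reduces to simple arithmetic: the delay before the followers 
detect leader loss plus one electionTimeout per election round. Below is a 
minimal sketch of that calculation. The class and method names are hypothetical 
(they are not part of ItNodeTest or JRaft), and the figures are the approximate 
ones quoted above.

```java
/** Hypothetical helper, not part of ItNodeTest: models the timeline described above. */
public final class ElectionTimeline {
    /** Seconds from t0 until a leader is elected after the given number of election rounds. */
    public static int electedAfterSeconds(int detectionDelaySec, int electionTimeoutSec, int rounds) {
        return detectionDelaySec + electionTimeoutSec * rounds;
    }

    /** True if the election finishes strictly before the await timeout expires. */
    public static boolean fitsInAwait(int electedAfterSec, int awaitTimeoutSec) {
        return electedAfterSec < awaitTimeoutSec;
    }

    public static void main(String[] args) {
        int detectionDelaySec = 2;   // approximate figure from the comment
        int electionTimeoutSec = 3;  // value configured in the test
        int awaitTimeoutSec = 10;    // await timeout used by the test

        // Print the timeline for one, two and three election rounds.
        for (int rounds = 1; rounds <= 3; rounds++) {
            int t = electedAfterSeconds(detectionDelaySec, electionTimeoutSec, rounds);
            System.out.println(rounds + " round(s): leader after ~" + t
                    + " s, within await: " + fitsInAwait(t, awaitTimeoutSec));
        }
    }
}
```

With these inputs the third round lands at 11 seconds, which is the failing case 
described in the last bullet.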

It is worth mentioning that this test doesn't really need a 3-second 
electionTimeout. Setting the value to 1 second will not only make the test 
more robust but also reduce its duration from 10-15 seconds to 4-6 seconds.
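
To illustrate why a shorter electionTimeout gives more headroom, the sketch 
below counts how many full election rounds fit into the 10-second await. It 
assumes, as a simplification, that the detection delay stays at roughly 2 
seconds regardless of the electionTimeout; the class and method names are 
hypothetical and not part of the test.

```java
/** Hypothetical helper, not part of ItNodeTest: counts election rounds that fit into the await. */
public final class ElectionHeadroom {
    /**
     * Returns the largest number of election rounds such that
     * detectionDelaySec + rounds * electionTimeoutSec is strictly less than awaitTimeoutSec.
     */
    public static int maxRoundsWithinAwait(int detectionDelaySec, int electionTimeoutSec, int awaitTimeoutSec) {
        if (electionTimeoutSec <= 0) {
            throw new IllegalArgumentException("electionTimeoutSec must be positive");
        }
        int rounds = 0;
        // Add rounds while the next one would still complete before the await expires.
        while (detectionDelaySec + (rounds + 1) * electionTimeoutSec < awaitTimeoutSec) {
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        int detectionDelaySec = 2; // assumption: kept constant for both cases
        int awaitTimeoutSec = 10;  // await timeout used by the test

        System.out.println("electionTimeout = 3 s: "
                + maxRoundsWithinAwait(detectionDelaySec, 3, awaitTimeoutSec) + " round(s) fit");
        System.out.println("electionTimeout = 1 s: "
                + maxRoundsWithinAwait(detectionDelaySec, 1, awaitTimeoutSec) + " round(s) fit");
    }
}
```

Under that assumption a 3-second timeout leaves room for only two rounds, while 
a 1-second timeout leaves room for several more, so a few consecutive split 
votes no longer exhaust the await.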

 

Besides the above, it's also possible that the old leader is elected again as 
the new one. In that case, however, since heartbeats are still being blocked, 
it will soon lose its leadership again, which leads to one more leader election 
round.


was (Author: alapin):
 The following happens in the test:
 # Set raft electionTimeout to 3 seconds and electionHeartbeatFactor to 2.
 # Start 3 raft nodes.
 # Await leader election.
 # Prepare logic to block 2 AppendEntriesRequests, which eventually leads to 
cluster segmentation.
 # Await the election of a new leader for up to 10 seconds.

Let's assume that awaiting starts at t0.
 * Usually it takes 2 seconds for the AppendEntriesRequest to be sent. 
After that moment the raft followers detect the loss of the leader and move to 
the state in which they initiate the election of a new leader.
 * Currently we do not automatically recover a node after segmentation; thus, 
even though AppendEntriesRequests are no longer blocked, the old leader will 
never communicate with the two other nodes in the cluster.
 * After 3 more seconds (electionTimeout) one of the two remaining nodes 
usually (19/20 on my machine) becomes the leader, meaning that after 2 + 3 = 5 
seconds the await for the new leader election completes successfully.
 * Sometimes both remaining nodes propose themselves as candidates at almost 
the same time (which is perfectly fine according to the raft protocol) - in 
that case one more leader election round starts after electionTimeout, meaning 
that the leader will be elected after 2 + 3 + 3 = 8 seconds, which is still 
less than the 10-second await timeout.
 * However, in very rare cases even two leader election rounds are not enough. 
In that case the leader will be elected after 2 + 3 + 3 + 3 = 11 seconds, which 
is greater than the 10-second await timeout - the test fails.

It is worth mentioning that this test doesn't really need a 3-second 
electionTimeout. Setting the value to 1 second will not only make the test 
more robust but also reduce its duration from 10-15 seconds to 4-6 seconds.

> ItNodeTest#testLeaseReadAfterSegmentation may fail with AssertionFailedError
> ----------------------------------------------------------------------------
>
>                 Key: IGNITE-25452
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25452
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexander Lapin
>            Assignee: Alexander Lapin
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain, ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
>  
> New leader election sometimes times out:
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
>     at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>     at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>     at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
>     at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
>     at app//org.apache.ignite.raft.jraft.core.ItNodeTest.testLeaseReadAfterSegmentation(ItNodeTest.java:4287)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
