[
https://issues.apache.org/jira/browse/IGNITE-25452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953423#comment-17953423
]
Alexander Lapin edited comment on IGNITE-25452 at 5/26/25 12:45 PM:
--------------------------------------------------------------------
The following happens in the test:
# Set raft electionTimeout to 3 seconds and electionHeartbeatFactor to 2.
# Start 3 raft nodes.
# Await leader election.
# Prepare logic to block the third and all subsequent AppendEntriesRequests,
which eventually causes the leader to step down.
# Await the election of a new leader for up to 10 seconds.
Let's assume that the await starts at t0.
* It usually takes 2 seconds for the AppendEntriesRequest to be sent. After
that, the raft followers detect the loss of the leader and transition to a
state in which they initiate the election of a new leader.
* After 3 more seconds (electionTimeout), one of the two remaining nodes
usually (19/20 on my machine) becomes the leader, meaning that the await for a
new leader completes successfully after 2 + 3 = 5 seconds.
* Sometimes both remaining nodes propose themselves as candidates at almost the
same time (which is perfectly fine according to the raft protocol) - in that
case one more leader election round takes place after another electionTimeout,
meaning that the leader will be elected after 2 + 3 + 3 = 8 seconds, which is
still less than the 10-second await timeout.
* However, in very rare cases even two leader election rounds are not enough.
In that case the leader will be elected after 2 + 3 + 3 + 3 = 11 seconds, which
is greater than the 10-second await timeout - the test fails.
It is worth mentioning that this test does not really need a 3-second
electionTimeout. Setting the value to 1 second will not only make the test
more robust but also reduce the test duration from 10-15 seconds to 4-6 seconds.
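The arithmetic above can be sketched in Java as follows. This is only an
illustrative model, not code from the test: the class and method names are
hypothetical, and for the 1-second case the 2-second detection delay is kept
unchanged purely as an assumption.

```java
/**
 * Back-of-the-envelope model of the election timeline described in this comment.
 * All names are hypothetical; nothing here is part of ItNodeTest or JRaft.
 */
final class ElectionTimingSketch {
    /**
     * Seconds after t0 at which a leader is elected.
     * One election round always has to succeed; every failed round
     * (for example a split vote) costs one more electionTimeout.
     */
    static int electedAtSeconds(int detectionDelaySec, int electionTimeoutSec, int failedRounds) {
        return detectionDelaySec + electionTimeoutSec * (failedRounds + 1);
    }

    /**
     * Largest number of failed rounds that still lets the election finish
     * within the await timeout. A result of -1 means that even the first
     * round does not fit.
     */
    static int toleratedFailedRounds(int detectionDelaySec, int electionTimeoutSec, int awaitTimeoutSec) {
        int budgetSec = awaitTimeoutSec - detectionDelaySec;
        return budgetSec / electionTimeoutSec - 1;
    }

    public static void main(String[] args) {
        int awaitTimeoutSec = 10;
        int detectionDelaySec = 2; // "usually 2 seconds" from the analysis above

        // electionTimeout = 3 seconds, as currently configured in the test.
        for (int failedRounds = 0; failedRounds <= 2; failedRounds++) {
            int electedAt = electedAtSeconds(detectionDelaySec, 3, failedRounds);
            System.out.println("failed rounds: " + failedRounds
                    + ", leader elected at: " + electedAt + "s"
                    + ", fits into await: " + (electedAt <= awaitTimeoutSec));
        }

        // Assumption: the detection delay stays at 2 seconds for electionTimeout = 1 second.
        System.out.println("tolerated failed rounds, electionTimeout=3s: "
                + toleratedFailedRounds(detectionDelaySec, 3, awaitTimeoutSec));
        System.out.println("tolerated failed rounds, electionTimeout=1s: "
                + toleratedFailedRounds(detectionDelaySec, 1, awaitTimeoutSec));
    }
}
```

Under these assumptions a 3-second electionTimeout tolerates only one failed
round within the 10-second await, while a 1-second electionTimeout tolerates
seven.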
Besides the above, it is also possible that the old leader will be elected
again as the new one. In that case, however, since heartbeats are still being
blocked, it will lose its leadership again soon enough, which brings us to one
more leader election round.
was (Author: alapin):
The following happens in the test:
# Set raft electionTimeout to 3 seconds and electionHeartbeatFactor to 2.
# Start 3 raft nodes.
# Await leader election.
# Prepare logic to block 2 AppendEntriesRequests, which eventually leads to
cluster segmentation.
# Await the election of a new leader for up to 10 seconds.
Let's assume that the await starts at t0.
* It usually takes 2 seconds for the AppendEntriesRequest to be sent. After
that, the raft followers detect the loss of the leader and transition to a
state in which they initiate the election of a new leader.
* Currently we do not automatically recover a node after segmentation, thus
despite the fact that AppendEntriesRequests are no longer blocked, the old
leader will never communicate with the two other nodes in the cluster.
* After 3 more seconds (electionTimeout), one of the two remaining nodes
usually (19/20 on my machine) becomes the leader, meaning that the await for a
new leader completes successfully after 2 + 3 = 5 seconds.
* Sometimes both remaining nodes propose themselves as candidates at almost the
same time (which is perfectly fine according to the raft protocol) - in that
case one more leader election round takes place after another electionTimeout,
meaning that the leader will be elected after 2 + 3 + 3 = 8 seconds, which is
still less than the 10-second await timeout.
* However, in very rare cases even two leader election rounds are not enough.
In that case the leader will be elected after 2 + 3 + 3 + 3 = 11 seconds, which
is greater than the 10-second await timeout - the test fails.
It is worth mentioning that this test does not really need a 3-second
electionTimeout. Setting the value to 1 second will not only make the test
more robust but also reduce the test duration from 10-15 seconds to 4-6 seconds.
> ItNodeTest#testLeaseReadAfterSegmentation may fail with AssertionFailedError
> ----------------------------------------------------------------------------
>
> Key: IGNITE-25452
> URL: https://issues.apache.org/jira/browse/IGNITE-25452
> Project: Ignite
> Issue Type: Bug
> Reporter: Alexander Lapin
> Assignee: Alexander Lapin
> Priority: Major
> Labels: MakeTeamcityGreenAgain, ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h
>
>
> New leader election sometimes times out:
> {code:java}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
> at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
> at app//org.apache.ignite.raft.jraft.core.ItNodeTest.testLeaseReadAfterSegmentation(ItNodeTest.java:4287)
> {code}
>