[jira] [Updated] (IGNITE-21381) ActiveActorTest#testChangeLeaderForce is flaky

Mirza Aliev (Jira) Wed, 31 Jan 2024 00:53:04 -0800


     [ https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Mirza Aliev updated IGNITE-21381:
---------------------------------
    Description: 
{{ActiveActorTest#testChangeLeaderForce}} has started to be flaky on TC with 

{noformat}
[05:19:12]F:                     
[org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)]
 org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
        at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
        at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
        at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
        at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
        at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
        at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
        at 
app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370)
{noformat}




From the log we can see that the leadership transfer, which was supposed to 
succeed, does not happen. The behaviour is as follows:
1) Current leader is {{Leader: ClusterNodeImpl 
[id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}}
2) We want to transfer leadership to {{Peer to transfer leader: Peer 
[consistentId=aat_tclf_1234, idx=0]}}
3) Process of transfer is started
4) We receive a warning about an error during {{GetLeaderRequestImpl}}:

{noformat}
[2024-01-29T05:19:08,855][WARN 
][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error 
during the request occurred (will be retried on the randomly selected node) 
[request=GetLeaderRequestImpl [groupId=TestReplicationGroup, 
peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], 
newPeer=Peer [consistentId=aat_tclf_1234, idx=0]].
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at 
java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) 
~[?:?]
        at 
java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
 ~[?:?]
        at 
java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
 ~[?:?]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
[?:?]
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 [?:?]
        at 
java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792) 
[?:?]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 [?:?]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.util.concurrent.TimeoutException
        ... 7 more
{noformat}

5) After that we see that node {{aat_tclf_1236}} sends an invalid 
{{RequestVoteResponse}} because it thinks that it is the leader:

{noformat}
[2024-01-29T05:19:11,370][WARN 
][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node 
<TestReplicationGroup/aat_tclf_1234> received invalid RequestVoteResponse from 
aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER.
{noformat}
 
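The stack trace in step 4 passes through {{CompletableFuture$Timeout}}, the 
JDK class behind {{CompletableFuture#orTimeout}}, and the exception arrives 
wrapped in a {{CompletionException}}. The sketch below reproduces only that 
wrapping with plain JDK classes. The class {{TimeoutUnwrapSketch}} and its 
methods are hypothetical names for illustration; this is not how 
{{RaftGroupServiceImpl}} is implemented.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration class, not part of Ignite.
final class TimeoutUnwrapSketch {
    /** Strips CompletionException wrappers to reach the underlying failure. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /** Simulates a request that never gets a response and is bounded by a timeout. */
    static Throwable failureOfTimedOutRequest(long timeoutMillis) {
        // Nobody ever completes this future, so only the timeout can finish it.
        CompletableFuture<String> response = new CompletableFuture<String>()
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);

        // A dependent stage sees the failure relayed from the stage above.
        CompletableFuture<String> dependent = response.thenApply(String::trim);

        try {
            dependent.join();
            return null; // Not reached in this sketch.
        } catch (CompletionException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        Throwable raw = failureOfTimedOutRequest(50);
        System.out.println(raw.getClass().getName());         // java.util.concurrent.CompletionException
        System.out.println(unwrap(raw).getClass().getName()); // java.util.concurrent.TimeoutException
    }
}
```

Retry logic that decides whether an error is recoverable would normally look 
at the unwrapped cause rather than at the outer {{CompletionException}}.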
Tests {{ActiveActorTest#testChangeLeaderForce}} and 
{{TopologyAwareRaftGroupServiceTest#testChangeLeaderForce}} were muted.


There are also other problems with these tests: they do not clean up 
resources correctly in case of failure. The cluster is stopped in the test 
body itself, so if an assertion fails, the rest of the test is not executed 
and the cluster is never stopped.
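The difference between stopping the cluster in the test body and stopping it 
in a teardown hook can be sketched without JUnit. In the sketch below, 
{{FakeCluster}}, {{stopInsideTestBody}} and {{stopInTeardown}} are 
hypothetical names, and the {{finally}} block only stands in for what a 
JUnit 5 {{@AfterEach}} method provides, namely that it runs even when the 
test method throws.

```java
// Hypothetical illustration class, not part of Ignite.
final class CleanupSketch {
    /** Stand-in for a test cluster; it only records whether it was stopped. */
    static final class FakeCluster implements AutoCloseable {
        boolean stopped;

        @Override
        public void close() {
            stopped = true;
        }
    }

    /** Minimal assertion helper that fails the same way assertTrue does. */
    static void check(boolean condition) {
        if (!condition) {
            throw new AssertionError("expected: <true> but was: <false>");
        }
    }

    /** Cleanup is the last statement of the test body, as in the current tests. */
    static boolean stopInsideTestBody(boolean assertionPasses) {
        FakeCluster cluster = new FakeCluster();
        try {
            check(assertionPasses);
            cluster.close(); // Skipped when check(...) throws.
        } catch (AssertionError reported) {
            // A test framework would report the failure here.
        }
        return cluster.stopped;
    }

    /** Cleanup runs in a teardown step that executes regardless of the outcome. */
    static boolean stopInTeardown(boolean assertionPasses) {
        FakeCluster cluster = new FakeCluster();
        try {
            try {
                check(assertionPasses);
            } finally {
                cluster.close(); // Plays the role of an @AfterEach method.
            }
        } catch (AssertionError reported) {
            // A test framework would report the failure here.
        }
        return cluster.stopped;
    }

    public static void main(String[] args) {
        System.out.println(stopInsideTestBody(false)); // false: the cluster is left running
        System.out.println(stopInTeardown(false));     // true: the cluster is stopped
    }
}
```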

The next problem is that if we run these tests several times, even if they 
pass successfully, at some point a new test cannot be started because of 


{noformat}
 java.lang.OutOfMemoryError: unable to create native thread: possibly out of 
memory or process/resource limits reached
{noformat}

From VisualVM we can see that {{Raft-Group-Client}} threads have leaked:

 !screenshot-1.png! 
 !screenshot-2.png! 
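A leak like this can also be checked programmatically by counting live 
threads with a given name prefix before and after a run. The sketch below 
assumes, purely for illustration, that the threads come from an executor 
that is never shut down; the actual cause of the leak is still to be 
investigated. {{ThreadLeakSketch}} and its methods are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration class, not part of Ignite.
final class ThreadLeakSketch {
    /** Prefix taken from the thread names visible in the screenshots. */
    static final String PREFIX = "Raft-Group-Client";

    /** Counts live threads whose name starts with the given prefix. */
    static long liveThreadsWithPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    /** Creates a two-thread pool with daemon threads so that the sketch can exit. */
    static ExecutorService newClientExecutor() {
        AtomicInteger counter = new AtomicInteger();
        return Executors.newFixedThreadPool(2, r -> {
            Thread t = new Thread(r, PREFIX + "-sketch-" + counter.getAndIncrement());
            t.setDaemon(true);
            return t;
        });
    }

    /** Runs a fake test and returns how many prefixed threads it left behind. */
    static long leakedAfterRun(boolean shutDown) {
        long before = liveThreadsWithPrefix(PREFIX);
        ExecutorService executor = newClientExecutor();
        executor.execute(() -> { }); // Each call starts one worker thread,
        executor.execute(() -> { }); // up to the pool size of two.

        if (shutDown) {
            executor.shutdown();
            try {
                executor.awaitTermination(5, TimeUnit.SECONDS);
                // Workers may need a moment to exit after the pool terminates.
                long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
                while (liveThreadsWithPrefix(PREFIX) > before && System.nanoTime() < deadline) {
                    Thread.sleep(10);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return liveThreadsWithPrefix(PREFIX) - before;
    }

    public static void main(String[] args) {
        System.out.println(leakedAfterRun(true));  // 0: the executor was shut down
        System.out.println(leakedAfterRun(false)); // 2: both worker threads stay alive
    }
}
```

A check of this kind in the teardown of each test would make a leak fail 
the test that causes it instead of a later, unrelated one.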

h4. Definition of done
1) Investigate and fix the problem with the failed transferLeadership
2) Correctly clean up resources if a test fails. Move all cleanup logic to 
the {{AfterEach}} section of both {{ActiveActorTest}} and 
{{TopologyAwareRaftGroupServiceTest}}
3) Refactor {{ActiveActorTest}} and {{TopologyAwareRaftGroupServiceTest}}; 
their code is just copy-pasted
4) Investigate the problem with leaked {{Raft-Group-Client}} threads 




> ActiveActorTest#testChangeLeaderForce is flaky 
> -----------------------------------------------
>
>                 Key: IGNITE-21381
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21381
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>         Attachments: screenshot-1.png, screenshot-2.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
