[ https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Chudov updated IGNITE-21381:
----------------------------------
    Summary: ActiveActorTest#testChangeLeaderForce has problems with resource cleanup  (was: ActiveActorTest#testChangeLeaderForce is flaky)

> ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
> ------------------------------------------------------------------------
>
>                 Key: IGNITE-21381
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21381
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>         Attachments: screenshot-1.png, screenshot-2.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{ActiveActorTest#testChangeLeaderForce}} has started to be flaky on TC with
> {noformat}
> [05:19:12]F: [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
>     at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>     at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>     at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
>     at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
>     at app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370)
> {noformat}
> From the log we can see that the leadership transfer, which was supposed to succeed, did not happen.
> The behaviour is the following:
> 1) The current leader is {{Leader: ClusterNodeImpl [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}}
> 2) We want to transfer leadership to {{Peer to transfer leader: Peer [consistentId=aat_tclf_1234, idx=0]}}
> 3) The transfer process is started
> 4) We receive a warning about an error during {{GetLeaderRequestImpl}}:
> {noformat}
> [2024-01-29T05:19:08,855][WARN ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], newPeer=Peer [consistentId=aat_tclf_1234, idx=0]].
> java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
>     at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019) ~[?:?]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) [?:?]
>     at java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792) [?:?]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.util.concurrent.TimeoutException
>     ... 7 more
> {noformat}
> 5) After that we see that node {{aat_tclf_1236}} sends an invalid {{RequestVoteResponse}} because it thinks that it is the leader:
> {noformat}
> [2024-01-29T05:19:11,370][WARN ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node <TestReplicationGroup/aat_tclf_1234> received invalid RequestVoteResponse from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER.
> {noformat}
>
> Tests {{ActiveActorTest#testChangeLeaderForce}} and {{TopologyAwareRaftGroupServiceTest#testChangeLeaderForce}} were muted.
> There are also other problems with these tests: they clean up resources incorrectly in case of failure. The cluster is stopped in the test itself, meaning that if an assertion fails, the rest of the test won't be executed, hence the cluster won't be stopped.
> The next problem is that if we run this test several times, even if the runs pass successfully, at some point a new test cannot be run because of
> {noformat}
> java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
> {noformat}
> From VisualVM we can see that {{Raft-Group-Client}} threads leaked:
> !screenshot-1.png!
> !screenshot-2.png!
> h4. Definition of done
> 1) Investigate and fix the problem with the failed transferLeadership
> 2) Correctly clean up resources if a test fails. Move all cleanup logic to the {{AfterEach}} section of the tests for both {{ActiveActorTest}} and {{TopologyAwareRaftGroupServiceTest}}
> 3) Refactor {{ActiveActorTest}} and {{TopologyAwareRaftGroupServiceTest}}, since the code is just copy-pasted
> 4) Investigate the problem with leaked {{Raft-Group-Client}} threads

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
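
The cleanup problem described in the issue can be illustrated with a minimal, framework-free sketch. It is not the Ignite test code: {{FakeCluster}}, {{runWithTeardown}} and the thread name are hypothetical stand-ins. The point is only that teardown runs after the test body whether or not an assertion failed, which is the guarantee JUnit 5 gives for {{@AfterEach}} methods, and that shutting down the executor in teardown stops its threads from leaking.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CleanupSketch {
    /** Hypothetical stand-in for a test cluster that owns a client thread pool. */
    static final class FakeCluster implements AutoCloseable {
        // Assumption: the thread name only mimics the leaked threads seen in the issue.
        private final ExecutorService clientExecutor =
                Executors.newFixedThreadPool(2, r -> new Thread(r, "Raft-Group-Client-sketch"));

        /** Submits a no-op task so that at least one pool thread is actually created. */
        void start() {
            clientExecutor.submit(() -> { });
        }

        boolean isStopped() {
            return clientExecutor.isTerminated();
        }

        /** Stops the pool and waits for its threads to exit. */
        @Override
        public void close() throws InterruptedException {
            clientExecutor.shutdownNow();
            clientExecutor.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    /**
     * Runs the test body, then always runs the teardown, and returns the
     * failure (or null). This mimics a teardown hook that runs after each test.
     */
    static Throwable runWithTeardown(Runnable body, AutoCloseable teardown) {
        Throwable failure = null;
        try {
            body.run();
        } catch (Throwable t) {
            failure = t;
        } finally {
            try {
                teardown.close();
            } catch (Exception e) {
                if (failure == null) {
                    failure = e;
                } else {
                    failure.addSuppressed(e);
                }
            }
        }
        return failure;
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        cluster.start();

        // The body fails before reaching any in-test cleanup, as in the issue.
        Throwable failure = runWithTeardown(() -> {
            throw new AssertionError("expected: <true> but was: <false>");
        }, cluster);

        System.out.println("test failed: " + (failure != null));
        System.out.println("cluster stopped: " + cluster.isStopped());
    }
}
```

In a real JUnit 5 test the same effect would come from creating the cluster in a {{@BeforeEach}} method, keeping it in a field, and stopping it in an {{@AfterEach}} method instead of at the end of the test body.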