[ 
https://issues.apache.org/jira/browse/IGNITE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18005204#comment-18005204
 ] 

Mirza Aliev edited comment on IGNITE-25848 at 7/16/25 10:56 AM:
----------------------------------------------------------------

The root cause of this ticket is the following:

In the test scenario we kill the single metastorage node, and after that check 
that the rest nodes could be stopped without error.

Right after we've killed single MS node, other nodes receive CMG topology event 
and try to update logical topology in MS, but sometimes we got error in TC 
because failure handler is called after unsuccessful MS invoke (because single 
MS is stopped). The reason of flakiness is the fact that in 
{{ItIgniteStartTest}} we specifically set 
{{org.apache.ignite.internal.ItIgniteStartTest#RAFT_RETRY_TIMEOUT_MILLIS}} to 
5000 for other tests purposes, but this affects our 
{{testNodesCouldBeStoppedEvenIfMetastorageIsUnavailable}} in the way that 
sometimes MS invoke for topology change timeouts before test is finished.

To fix the issue, we have moved the test to the different test class and set 
{{RAFT_RETRY_TIMEOUT_MILLIS}} to 15000

 

Additionally, I resolved similar issues in {{ItIgniteNodeRestartTest}} by 
modifying the stop sequence to stop the single MS node last instead of first.


was (Author: maliev):
The root cause of this ticket is the following:

In the test scenario we kill the single metastorage node, and after that check 
that the rest nodes could be stopped without error.

Right after we've killed single MS node, other nodes receive CMG topology event 
and try to update logical topology in MS, but sometimes we got error in TC 
because failure handler is called after unsuccessful MS invoke (because single 
MS is stopped). The reason of flakiness is the fact that in 
{{ItIgniteStartTest}} we specifically set 
{{org.apache.ignite.internal.ItIgniteStartTest#RAFT_RETRY_TIMEOUT_MILLIS}} to 
5000 for other tests purposes, but this affects our 
{{testNodesCouldBeStoppedEvenIfMetastorageIsUnavailable}} in the way that 
sometimes MS invoke for topology change timeouts before test is finished.

To fix the issue, we have moved the test to the different test class and set 
{{RAFT_RETRY_TIMEOUT_MILLIS}} to 15000

> Fix several Critical system error caused by Meta Storage unavailability
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-25848
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25848
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexander Lapin
>            Assignee: Mirza Aliev
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain, ignite-3
>         Attachments: _Integration_Tests_Integration_Runner_17450.log.zip
>
>
> {code:java}
> 03:25:14     
> [2025-07-05T19:25:14,346][ERROR][%iist_tncbseimiu_3346%Raft-Group-Client-4][FailureManager]
>  Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler 
> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=CRITICAL_ERROR]
> 03:25:14     org.apache.ignite.internal.failure.StackTraceCapturingException: 
> Failed to update distribution zones' logical topology and version keys 
> [topology = [{id=c6292384-572e-4e36-ba73-cb3ea193bdca, 
> name=iist_tncbseimiu_3345, address=127.0.1.1:3345}, 
> {id=a37ce25e-0f98-4842-baa7-c0fa16ec26a3, name=iist_tncbseimiu_3346, 
> address=127.0.1.1:3346}], version = 4]
> 03:25:14       at 
> org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:183)
> 03:25:14       at 
> org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:160)
> 03:25:14       at 
> org.apache.ignite.internal.distributionzones.DistributionZoneManager.lambda$updateLogicalTopologyInMetaStorage$3(DistributionZoneManager.java:444)
> 03:25:14       at 
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
> 03:25:14       at 
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
> 03:25:14       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> 03:25:14       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
> 03:25:14       at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:678)
> 03:25:14       at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:652)
> 03:25:14       at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$scheduleRetry$51(RaftGroupServiceImpl.java:902)
> 03:25:14       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> 03:25:14       at 
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> 03:25:14       at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> 03:25:14       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> 03:25:14       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> 03:25:14       at java.base/java.lang.Thread.run(Thread.java:833)
> 03:25:14     Caused by: java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException: Send with retry timed out [retryCount 
> = 25, groupId = metastorage_group, traceId = null, request = 
> org.apache.ignite.raft.jraft.rpc.WriteActionRequestImpl(org.apache.ignite.internal.metastorage.command.MultiInvokeCommandImpl),
>  originCommand = null, retryReasons = [[time=1751761509328, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; attemptWaitDuration=0, 
> attemptDuration=0, attemptStartTime=2025-07-05T19:25:09,328], 
> [time=1751761509529, msg=Peer iist_tncbseimiu_3344:0 threw 
> PeerUnavailableException; attemptWaitDuration=196, attemptDuration=5, 
> attemptStartTime=2025-07-05T19:25:09,529], [time=1751761509729, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=195, attemptDuration=5, 
> attemptStartTime=2025-07-05T19:25:09,729], [time=1751761509930, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=195, attemptDuration=6, 
> attemptStartTime=2025-07-05T19:25:09,930], [time=1751761510131, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=194, attemptDuration=7, 
> attemptStartTime=2025-07-05T19:25:10,131], [time=1751761510331, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=193, attemptDuration=7, 
> attemptStartTime=2025-07-05T19:25:10,331], [time=1751761510532, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=193, attemptDuration=8, 
> attemptStartTime=2025-07-05T19:25:10,532], [time=1751761510732, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=192, attemptDuration=8, 
> attemptStartTime=2025-07-05T19:25:10,732], [time=1751761510933, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=192, attemptDuration=9, 
> attemptStartTime=2025-07-05T19:25:10,933], [time=1751761511133, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=191, attemptDuration=9, 
> attemptStartTime=2025-07-05T19:25:11,133], [time=1751761511334, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=191, attemptDuration=10, 
> attemptStartTime=2025-07-05T19:25:11,334], [time=1751761511534, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=200, attemptDuration=0, 
> attemptStartTime=2025-07-05T19:25:11,534], [time=1751761511735, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=200, attemptDuration=1, 
> attemptStartTime=2025-07-05T19:25:11,735], [time=1751761511935, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=199, attemptDuration=1, 
> attemptStartTime=2025-07-05T19:25:11,935], [time=1751761512136, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=199, attemptDuration=2, 
> attemptStartTime=2025-07-05T19:25:12,136], [time=1751761512336, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=198, attemptDuration=2, 
> attemptStartTime=2025-07-05T19:25:12,336], [time=1751761512537, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=198, attemptDuration=3, 
> attemptStartTime=2025-07-05T19:25:12,537], [time=1751761512737, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=197, attemptDuration=3, 
> attemptStartTime=2025-07-05T19:25:12,737], [time=1751761512938, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=197, attemptDuration=4, 
> attemptStartTime=2025-07-05T19:25:12,938], [time=1751761513139, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=196, attemptDuration=5, 
> attemptStartTime=2025-07-05T19:25:13,139], [time=1751761513339, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=195, attemptDuration=5, 
> attemptStartTime=2025-07-05T19:25:13,339], [time=1751761513540, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=195, attemptDuration=6, 
> attemptStartTime=2025-07-05T19:25:13,540], [time=1751761513741, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=194, attemptDuration=7, 
> attemptStartTime=2025-07-05T19:25:13,741], [time=1751761513941, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=193, attemptDuration=7, 
> attemptStartTime=2025-07-05T19:25:13,941], [time=1751761514142, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=193, attemptDuration=8, 
> attemptStartTime=2025-07-05T19:25:14,142]], stopTime = 1751761514327, 
> currentTime = 1751761514343, startTime = 1751761509328, duration = 5015].
> 03:25:14       at 
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
> 03:25:14       at 
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
> 03:25:14       at 
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)
> 03:25:14       ... 11 more
> 03:25:14     Caused by: java.util.concurrent.TimeoutException: Send with 
> retry timed out [retryCount = 25, groupId = metastorage_group, traceId = 
> null, request = 
> org.apache.ignite.raft.jraft.rpc.WriteActionRequestImpl(org.apache.ignite.internal.metastorage.command.MultiInvokeCommandImpl),
>  originCommand = null, retryReasons = [[time=1751761509328, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; attemptWaitDuration=0, 
> attemptDuration=0, attemptStartTime=2025-07-05T19:25:09,328], 
> [time=1751761509529, msg=Peer iist_tncbseimiu_3344:0 threw 
> PeerUnavailableException; attemptWaitDuration=196, attemptDuration=5, 
> attemptStartTime=2025-07-05T19:25:09,529], [time=1751761509729, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=195, attemptDuration=5, 
> attemptStartTime=2025-07-05T19:25:09,729], [time=1751761509930, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=195, attemptDuration=6, 
> attemptStartTime=2025-07-05T19:25:09,930], [time=1751761510131, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=194, attemptDuration=7, 
> attemptStartTime=2025-07-05T19:25:10,131], [time=1751761510331, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=193, attemptDuration=7, 
> attemptStartTime=2025-07-05T19:25:10,331], [time=1751761510532, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=193, attemptDuration=8, 
> attemptStartTime=2025-07-05T19:25:10,532], [time=1751761510732, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=192, attemptDuration=8, 
> attemptStartTime=2025-07-05T19:25:10,732], [time=1751761510933, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=192, attemptDuration=9, 
> attemptStartTime=2025-07-05T19:25:10,933], [time=1751761511133, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=191, attemptDuration=9, 
> attemptStartTime=2025-07-05T19:25:11,133], [time=1751761511334, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=191, attemptDuration=10, 
> attemptStartTime=2025-07-05T19:25:11,334], [time=1751761511534, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=200, attemptDuration=0, 
> attemptStartTime=2025-07-05T19:25:11,534], [time=1751761511735, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=200, attemptDuration=1, 
> attemptStartTime=2025-07-05T19:25:11,735], [time=1751761511935, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=199, attemptDuration=1, 
> attemptStartTime=2025-07-05T19:25:11,935], [time=1751761512136, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=199, attemptDuration=2, 
> attemptStartTime=2025-07-05T19:25:12,136], [time=1751761512336, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=198, attemptDuration=2, 
> attemptStartTime=2025-07-05T19:25:12,336], [time=1751761512537, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=198, attemptDuration=3, 
> attemptStartTime=2025-07-05T19:25:12,537], [time=1751761512737, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=197, attemptDuration=3, 
> attemptStartTime=2025-07-05T19:25:12,737], [time=1751761512938, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=197, attemptDuration=4, 
> attemptStartTime=2025-07-05T19:25:12,938], [time=1751761513139, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=196, attemptDuration=5, 
> attemptStartTime=2025-07-05T19:25:13,139], [time=1751761513339, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=195, attemptDuration=5, 
> attemptStartTime=2025-07-05T19:25:13,339], [time=1751761513540, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=195, attemptDuration=6, 
> attemptStartTime=2025-07-05T19:25:13,540], [time=1751761513741, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=194, attemptDuration=7, 
> attemptStartTime=2025-07-05T19:25:13,741], [time=1751761513941, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=193, attemptDuration=7, 
> attemptStartTime=2025-07-05T19:25:13,941], [time=1751761514142, msg=Peer 
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; 
> attemptWaitDuration=193, attemptDuration=8, 
> attemptStartTime=2025-07-05T19:25:14,142]], stopTime = 1751761514327, 
> currentTime = 1751761514343, startTime = 1751761509328, duration = 5015].
> 03:25:14       at 
> org.apache.ignite.internal.raft.RetryContext.createTimeoutException(RetryContext.java:198)
> 03:25:14       ... 9 more {code}
> [TC 
> link|https://ggtc.gridgain.com/buildConfiguration/GridGain9Gradle_Test_RunAllTests/14979636?showLog=14979383_45078_511.1976&logFilter=debug&logView=flowAware]
> Full logs in attachments



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to