[
https://issues.apache.org/jira/browse/IGNITE-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18005204#comment-18005204
]
Mirza Aliev edited comment on IGNITE-25848 at 7/16/25 10:56 AM:
----------------------------------------------------------------
The root cause of this ticket is the following:
In the test scenario we kill the single metastorage node, and after that check
that the rest nodes could be stopped without error.
Right after we've killed single MS node, other nodes receive CMG topology event
and try to update logical topology in MS, but sometimes we got error in TC
because failure handler is called after unsuccessful MS invoke (because single
MS is stopped). The reason of flakiness is the fact that in
{{ItIgniteStartTest}} we specifically set
{{org.apache.ignite.internal.ItIgniteStartTest#RAFT_RETRY_TIMEOUT_MILLIS}} to
5000 for other tests purposes, but this affects our
{{testNodesCouldBeStoppedEvenIfMetastorageIsUnavailable}} in the way that
sometimes MS invoke for topology change timeouts before test is finished.
To fix the issue, we have moved the test to the different test class and set
{{RAFT_RETRY_TIMEOUT_MILLIS}} to 15000
Additionally, I resolved similar issues in {{ItIgniteNodeRestartTest}} by
modifying the stop sequence to stop the single MS node last instead of first.
was (Author: maliev):
The root cause of this ticket is the following:
In the test scenario we kill the single metastorage node, and after that check
that the rest nodes could be stopped without error.
Right after we've killed single MS node, other nodes receive CMG topology event
and try to update logical topology in MS, but sometimes we got error in TC
because failure handler is called after unsuccessful MS invoke (because single
MS is stopped). The reason of flakiness is the fact that in
{{ItIgniteStartTest}} we specifically set
{{org.apache.ignite.internal.ItIgniteStartTest#RAFT_RETRY_TIMEOUT_MILLIS}} to
5000 for other tests purposes, but this affects our
{{testNodesCouldBeStoppedEvenIfMetastorageIsUnavailable}} in the way that
sometimes MS invoke for topology change timeouts before test is finished.
To fix the issue, we have moved the test to the different test class and set
{{RAFT_RETRY_TIMEOUT_MILLIS}} to 15000
> Fix several Critical system error caused by Meta Storage unavailability
> -----------------------------------------------------------------------
>
> Key: IGNITE-25848
> URL: https://issues.apache.org/jira/browse/IGNITE-25848
> Project: Ignite
> Issue Type: Bug
> Reporter: Alexander Lapin
> Assignee: Mirza Aliev
> Priority: Major
> Labels: MakeTeamcityGreenAgain, ignite-3
> Attachments: _Integration_Tests_Integration_Runner_17450.log.zip
>
>
> {code:java}
> 03:25:14
> [2025-07-05T19:25:14,346][ERROR][%iist_tncbseimiu_3346%Raft-Group-Client-4][FailureManager]
> Critical system error detected. Will be handled accordingly to configured
> handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler
> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=CRITICAL_ERROR]
> 03:25:14 org.apache.ignite.internal.failure.StackTraceCapturingException:
> Failed to update distribution zones' logical topology and version keys
> [topology = [{id=c6292384-572e-4e36-ba73-cb3ea193bdca,
> name=iist_tncbseimiu_3345, address=127.0.1.1:3345},
> {id=a37ce25e-0f98-4842-baa7-c0fa16ec26a3, name=iist_tncbseimiu_3346,
> address=127.0.1.1:3346}], version = 4]
> 03:25:14 at
> org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:183)
> 03:25:14 at
> org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:160)
> 03:25:14 at
> org.apache.ignite.internal.distributionzones.DistributionZoneManager.lambda$updateLogicalTopologyInMetaStorage$3(DistributionZoneManager.java:444)
> 03:25:14 at
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
> 03:25:14 at
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
> 03:25:14 at
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> 03:25:14 at
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
> 03:25:14 at
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:678)
> 03:25:14 at
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:652)
> 03:25:14 at
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$scheduleRetry$51(RaftGroupServiceImpl.java:902)
> 03:25:14 at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> 03:25:14 at
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> 03:25:14 at
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> 03:25:14 at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> 03:25:14 at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> 03:25:14 at java.base/java.lang.Thread.run(Thread.java:833)
> 03:25:14 Caused by: java.util.concurrent.CompletionException:
> java.util.concurrent.TimeoutException: Send with retry timed out [retryCount
> = 25, groupId = metastorage_group, traceId = null, request =
> org.apache.ignite.raft.jraft.rpc.WriteActionRequestImpl(org.apache.ignite.internal.metastorage.command.MultiInvokeCommandImpl),
> originCommand = null, retryReasons = [[time=1751761509328, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; attemptWaitDuration=0,
> attemptDuration=0, attemptStartTime=2025-07-05T19:25:09,328],
> [time=1751761509529, msg=Peer iist_tncbseimiu_3344:0 threw
> PeerUnavailableException; attemptWaitDuration=196, attemptDuration=5,
> attemptStartTime=2025-07-05T19:25:09,529], [time=1751761509729, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=195, attemptDuration=5,
> attemptStartTime=2025-07-05T19:25:09,729], [time=1751761509930, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=195, attemptDuration=6,
> attemptStartTime=2025-07-05T19:25:09,930], [time=1751761510131, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=194, attemptDuration=7,
> attemptStartTime=2025-07-05T19:25:10,131], [time=1751761510331, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=193, attemptDuration=7,
> attemptStartTime=2025-07-05T19:25:10,331], [time=1751761510532, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=193, attemptDuration=8,
> attemptStartTime=2025-07-05T19:25:10,532], [time=1751761510732, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=192, attemptDuration=8,
> attemptStartTime=2025-07-05T19:25:10,732], [time=1751761510933, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=192, attemptDuration=9,
> attemptStartTime=2025-07-05T19:25:10,933], [time=1751761511133, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=191, attemptDuration=9,
> attemptStartTime=2025-07-05T19:25:11,133], [time=1751761511334, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=191, attemptDuration=10,
> attemptStartTime=2025-07-05T19:25:11,334], [time=1751761511534, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=200, attemptDuration=0,
> attemptStartTime=2025-07-05T19:25:11,534], [time=1751761511735, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=200, attemptDuration=1,
> attemptStartTime=2025-07-05T19:25:11,735], [time=1751761511935, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=199, attemptDuration=1,
> attemptStartTime=2025-07-05T19:25:11,935], [time=1751761512136, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=199, attemptDuration=2,
> attemptStartTime=2025-07-05T19:25:12,136], [time=1751761512336, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=198, attemptDuration=2,
> attemptStartTime=2025-07-05T19:25:12,336], [time=1751761512537, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=198, attemptDuration=3,
> attemptStartTime=2025-07-05T19:25:12,537], [time=1751761512737, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=197, attemptDuration=3,
> attemptStartTime=2025-07-05T19:25:12,737], [time=1751761512938, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=197, attemptDuration=4,
> attemptStartTime=2025-07-05T19:25:12,938], [time=1751761513139, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=196, attemptDuration=5,
> attemptStartTime=2025-07-05T19:25:13,139], [time=1751761513339, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=195, attemptDuration=5,
> attemptStartTime=2025-07-05T19:25:13,339], [time=1751761513540, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=195, attemptDuration=6,
> attemptStartTime=2025-07-05T19:25:13,540], [time=1751761513741, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=194, attemptDuration=7,
> attemptStartTime=2025-07-05T19:25:13,741], [time=1751761513941, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=193, attemptDuration=7,
> attemptStartTime=2025-07-05T19:25:13,941], [time=1751761514142, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=193, attemptDuration=8,
> attemptStartTime=2025-07-05T19:25:14,142]], stopTime = 1751761514327,
> currentTime = 1751761514343, startTime = 1751761509328, duration = 5015].
> 03:25:14 at
> java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
> 03:25:14 at
> java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
> 03:25:14 at
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)
> 03:25:14 ... 11 more
> 03:25:14 Caused by: java.util.concurrent.TimeoutException: Send with
> retry timed out [retryCount = 25, groupId = metastorage_group, traceId =
> null, request =
> org.apache.ignite.raft.jraft.rpc.WriteActionRequestImpl(org.apache.ignite.internal.metastorage.command.MultiInvokeCommandImpl),
> originCommand = null, retryReasons = [[time=1751761509328, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException; attemptWaitDuration=0,
> attemptDuration=0, attemptStartTime=2025-07-05T19:25:09,328],
> [time=1751761509529, msg=Peer iist_tncbseimiu_3344:0 threw
> PeerUnavailableException; attemptWaitDuration=196, attemptDuration=5,
> attemptStartTime=2025-07-05T19:25:09,529], [time=1751761509729, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=195, attemptDuration=5,
> attemptStartTime=2025-07-05T19:25:09,729], [time=1751761509930, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=195, attemptDuration=6,
> attemptStartTime=2025-07-05T19:25:09,930], [time=1751761510131, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=194, attemptDuration=7,
> attemptStartTime=2025-07-05T19:25:10,131], [time=1751761510331, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=193, attemptDuration=7,
> attemptStartTime=2025-07-05T19:25:10,331], [time=1751761510532, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=193, attemptDuration=8,
> attemptStartTime=2025-07-05T19:25:10,532], [time=1751761510732, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=192, attemptDuration=8,
> attemptStartTime=2025-07-05T19:25:10,732], [time=1751761510933, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=192, attemptDuration=9,
> attemptStartTime=2025-07-05T19:25:10,933], [time=1751761511133, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=191, attemptDuration=9,
> attemptStartTime=2025-07-05T19:25:11,133], [time=1751761511334, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=191, attemptDuration=10,
> attemptStartTime=2025-07-05T19:25:11,334], [time=1751761511534, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=200, attemptDuration=0,
> attemptStartTime=2025-07-05T19:25:11,534], [time=1751761511735, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=200, attemptDuration=1,
> attemptStartTime=2025-07-05T19:25:11,735], [time=1751761511935, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=199, attemptDuration=1,
> attemptStartTime=2025-07-05T19:25:11,935], [time=1751761512136, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=199, attemptDuration=2,
> attemptStartTime=2025-07-05T19:25:12,136], [time=1751761512336, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=198, attemptDuration=2,
> attemptStartTime=2025-07-05T19:25:12,336], [time=1751761512537, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=198, attemptDuration=3,
> attemptStartTime=2025-07-05T19:25:12,537], [time=1751761512737, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=197, attemptDuration=3,
> attemptStartTime=2025-07-05T19:25:12,737], [time=1751761512938, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=197, attemptDuration=4,
> attemptStartTime=2025-07-05T19:25:12,938], [time=1751761513139, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=196, attemptDuration=5,
> attemptStartTime=2025-07-05T19:25:13,139], [time=1751761513339, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=195, attemptDuration=5,
> attemptStartTime=2025-07-05T19:25:13,339], [time=1751761513540, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=195, attemptDuration=6,
> attemptStartTime=2025-07-05T19:25:13,540], [time=1751761513741, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=194, attemptDuration=7,
> attemptStartTime=2025-07-05T19:25:13,741], [time=1751761513941, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=193, attemptDuration=7,
> attemptStartTime=2025-07-05T19:25:13,941], [time=1751761514142, msg=Peer
> iist_tncbseimiu_3344:0 threw PeerUnavailableException;
> attemptWaitDuration=193, attemptDuration=8,
> attemptStartTime=2025-07-05T19:25:14,142]], stopTime = 1751761514327,
> currentTime = 1751761514343, startTime = 1751761509328, duration = 5015].
> 03:25:14 at
> org.apache.ignite.internal.raft.RetryContext.createTimeoutException(RetryContext.java:198)
> 03:25:14 ... 9 more {code}
> [TC
> link|https://ggtc.gridgain.com/buildConfiguration/GridGain9Gradle_Test_RunAllTests/14979636?showLog=14979383_45078_511.1976&logFilter=debug&logView=flowAware]
> Full logs in attachments
--
This message was sent by Atlassian Jira
(v8.20.10#820010)