Kirill Tkalenko created IGNITE-25673:
----------------------------------------

             Summary: Add debug information for SnapshotExecutorImpl#runningJobs to help detect node stop hangs
                 Key: IGNITE-25673
                 URL: https://issues.apache.org/jira/browse/IGNITE-25673
             Project: Ignite
          Issue Type: Improvement
            Reporter: Kirill Tkalenko
            Assignee: Kirill Tkalenko


During cluster operation, it was found that a node stop can hang while 
stopping a Raft node, because some snapshot operation could not complete, 
most likely due to an exception. To help identify the cause, it is proposed 
to add 
*org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl#runningJobs*
 to the output that reports operations which, for some reason, did not 
complete in time on node stop.
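
As a rough illustration of the proposal, the sketch below keeps a description 
for every running job and, when waiting for the jobs to finish exceeds a 
timeout, reports which ones are still outstanding instead of blocking 
silently. It is not the actual SnapshotExecutorImpl code: the class 
RunningJobs, its methods start, finish and await, and the job descriptions 
are all hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical tracker of running jobs; a stand-in, not the real SnapshotExecutorImpl. */
final class RunningJobs {
    /** Job id -> human-readable description, in start order. Guarded by "this". */
    private final Map<Long, String> jobs = new LinkedHashMap<>();

    private long nextId;

    /** Registers a job and returns the id to pass to {@link #finish(long)}. */
    synchronized long start(String description) {
        long id = nextId++;
        jobs.put(id, description);
        return id;
    }

    /** Marks a job as finished and wakes up any thread blocked in await. */
    synchronized void finish(long id) {
        jobs.remove(id);
        notifyAll();
    }

    /**
     * Waits up to the timeout for all jobs to finish.
     * On timeout, reports the unfinished jobs to the log and returns false.
     */
    synchronized boolean await(long timeoutMillis, Consumer<String> log) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);

        while (!jobs.isEmpty()) {
            long remainingNanos = deadline - System.nanoTime();

            if (remainingNanos <= 0) {
                log.accept("Still waiting for " + jobs.size() + " job(s): " + jobs.values());
                return false;
            }

            // Releases the monitor while waiting; finish() notifies on completion.
            TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
        }

        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        RunningJobs jobs = new RunningJobs();

        long saved = jobs.start("snapshot save, group=A");
        jobs.start("snapshot load, group=B"); // Never finished: simulates the hang.
        jobs.finish(saved);

        boolean stopped = jobs.await(100, System.out::println);
        System.out.println("Stopped cleanly: " + stopped);
    }
}
```

A caller that must still wait for completion could invoke await in a loop, so 
that the list of unfinished jobs is printed periodically until the node 
actually stops.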

The following thread dump shows *SnapshotExecutorImpl* waiting on a latch 
for a Raft snapshot to finish:
{noformat}
Thread [name="SIGTERM handler", id=7184, state=WAITING, blockCnt=3, waitCnt=26]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6f346369, ownerName=null, ownerId=-1]
        at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
        at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
        at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:506)
        at [email protected]/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3465)
        at [email protected]/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3436)
        at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1630)
        at app//org.apache.ignite.raft.jraft.util.CountDownEvent.await(CountDownEvent.java:59)
        at app//org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.join(SnapshotExecutorImpl.java:704)
        at app//org.apache.ignite.raft.jraft.core.NodeImpl.join(NodeImpl.java:3259)
        - locked org.apache.ignite.raft.jraft.core.NodeImpl@17da08e8
        at app//org.apache.ignite.raft.jraft.RaftGroupService.shutdown(RaftGroupService.java:127)
        - locked org.apache.ignite.raft.jraft.RaftGroupService@762ce012
        at app//org.apache.ignite.internal.raft.server.impl.JraftServerImpl.stopRaftNodes(JraftServerImpl.java:591)
        at app//org.apache.ignite.internal.raft.Loza.stopRaftNodes(Loza.java:505)
        at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.lambda$stopAsync$31(MetaStorageManagerImpl.java:771)
        at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl$$Lambda$3787/0x00007ffac0d4eab8.close(Unknown Source)
        at app//org.apache.ignite.internal.util.IgniteUtils.lambda$closeAllManually$1(IgniteUtils.java:611)
        at app//org.apache.ignite.internal.util.IgniteUtils$$Lambda$3677/0x00007ffac0d36c48.accept(Unknown Source)
        at [email protected]/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at [email protected]/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
        at [email protected]/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:992)
        at [email protected]/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at [email protected]/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at [email protected]/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at [email protected]/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at [email protected]/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at [email protected]/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
        at app//org.apache.ignite.internal.util.IgniteUtils.closeAllManually(IgniteUtils.java:609)
        at app//org.apache.ignite.internal.util.IgniteUtils.closeAllManually(IgniteUtils.java:643)
        at app//org.apache.ignite.internal.metastorage.impl.MetaStorageManagerImpl.stopAsync(MetaStorageManagerImpl.java:767)
        at app//org.apache.ignite.internal.util.IgniteUtils.lambda$stopAsync$6(IgniteUtils.java:1256)
        at app//org.apache.ignite.internal.util.IgniteUtils$$Lambda$3739/0x00007ffac0d451f8.apply(Unknown Source)
        at [email protected]/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
        at [email protected]/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
        at [email protected]/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
        at [email protected]/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at [email protected]/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at [email protected]/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:575)
        at [email protected]/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
        at [email protected]/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:616)
        at app//org.apache.ignite.internal.util.IgniteUtils.stopAsync(IgniteUtils.java:1262)
        at app//org.apache.ignite.internal.util.IgniteUtils.stopAsync(IgniteUtils.java:1304)
        at app//org.apache.ignite.internal.app.LifecycleManager.initiateAllComponentsStop(LifecycleManager.java:170)
        - locked org.apache.ignite.internal.app.LifecycleManager@26304515
        at app//org.apache.ignite.internal.app.LifecycleManager.stopNode(LifecycleManager.java:144)
        at app//org.apache.ignite.internal.app.IgniteImpl.stopAsync(IgniteImpl.java:2066)
        at app//org.apache.ignite.internal.app.IgniteServerImpl.doShutdownAsync(IgniteServerImpl.java:342)
        at app//org.apache.ignite.internal.app.IgniteServerImpl$$Lambda$3644/0x00007ffac0d2fd68.get(Unknown Source)
        at app//org.apache.ignite.internal.app.IgniteServerImpl.lambda$chainRestartOrShutdownAction$6(IgniteServerImpl.java:281)
        at app//org.apache.ignite.internal.app.IgniteServerImpl$$Lambda$3646/0x00007ffac0d30230.apply(Unknown Source)
        at [email protected]/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1187)
        at [email protected]/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2309)
        at app//org.apache.ignite.internal.app.IgniteServerImpl.chainRestartOrShutdownAction(IgniteServerImpl.java:281)
        at app//org.apache.ignite.internal.app.IgniteServerImpl.shutdownAsync(IgniteServerImpl.java:318)
        - locked java.lang.Object@30400aca
        at app//org.apache.ignite.internal.app.IgniteServerImpl.shutdown(IgniteServerImpl.java:358)
        at app//org.apache.ignite.internal.app.IgniteRunner.lambda$main$0(IgniteRunner.java:73)
        at app//org.apache.ignite.internal.app.IgniteRunner$$Lambda$1540/0x00007ffac08ff7e0.handle(Unknown Source)
        at [email protected]/jdk.internal.misc.Signal$1.run(Signal.java:219)
        at [email protected]/java.lang.Thread.run(Thread.java:840)
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
