We are running a cache in a 4 node cluster with atomicityMode set to ATOMIC and 
have persistence enabled. We repeatedly get a SYSTEM_WORKER_BLOCKED error on 
one node which is disabling the entire cluster. We were seeing a lot of sockets 
in TIME_WAIT state which was blocking clients from connecting so we did the 
following on all the nodes:

# ignore TIME_WAIT state on sockets
echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle

This made that issue go away, but may play a part in this new issue. First 
question is what is the root cause of the error? The second question is why 
does this bring down the entire cluster?

Here is the error message:

[2019-11-07 16:36:22,037][ERROR][tcp-disco-msg-worker-#2][root] Critical system 
error detected. Will be handled accordingly to configured handler 
[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=partition-exchanger, igniteInstanceName=null, finished=false, 
heartbeatTs=1573090509428]]]
class org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, 
igniteInstanceName=null, finished=false, heartbeatTs=1573090509428]
        at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
        at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
        at 
org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
        at 
org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
        at 
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)
        at 
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)
        at 
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
        at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at 
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

This is followed by a warning and a thread dump:

[2019-11-07 16:36:22,038][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] No 
deadlocked threads detected.
[2019-11-07 16:36:22,328][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] 
Thread dump at 2019/11/07 16:36:22 GMT

For the particular thread in the error and warning messages here is the thread 
dump:

Thread [name="tcp-disco-msg-worker-#2", id=113, state=RUNNABLE, blockCnt=211, 
waitCnt=4745368]
        at sun.management.ThreadImpl.dumpThreads0(Native Method)
        at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:454)
        at o.a.i.i.util.IgniteUtils.dumpThreads(IgniteUtils.java:1368)
        at 
o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:128)
        - locked o.a.i.i.processors.failure.FailureProcessor@7e65ceba
        at 
o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:104)
        at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1829)
        at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
        at o.a.i.i.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
        at o.a.i.i.util.worker.GridWorker.onIdle(GridWorker.java:297)
        at 
o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)
        at 
o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker$$Lambda$47/1047515321.run(Unknown
 Source)
        at 
o.a.i.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)
        at 
o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
        at 
o.a.i.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
        at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

In addition all the system threads are in TIMED_WAITING state:

Thread [name="sys-#7099", id=9252, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock 
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@677ec573,
 ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at 
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

and all the query threads are in WAITING state:

Thread [name="query-#1062", id=3027, state=WAITING, blockCnt=11, waitCnt=3365]
    Lock 
[object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2061baf, 
ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at 
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1500)
        at 
o.a.i.i.processors.cache.GridCacheMapEntry.onExpired(GridCacheMapEntry.java:4049)
        at 
o.a.i.i.processors.cache.GridCacheMapEntry.unswap(GridCacheMapEntry.java:522)
        at 
o.a.i.i.processors.cache.GridCacheMapEntry.unswap(GridCacheMapEntry.java:472)
        at 
o.a.i.i.processors.cache.query.GridCacheQueryManager$ScanQueryIterator.advance(GridCacheQueryManager.java:3010)
        at 
o.a.i.i.processors.cache.query.GridCacheQueryManager$ScanQueryIterator.onHasNext(GridCacheQueryManager.java:2965)
        at 
o.a.i.i.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53)
        at 
o.a.i.i.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
        at 
o.a.i.i.processors.cache.query.GridCacheQueryManager.runQuery(GridCacheQueryManager.java:1266)
        at 
o.a.i.i.processors.cache.query.GridCacheDistributedQueryManager.processQueryRequest(GridCacheDistributedQueryManager.java:234)
       at 
o.a.i.i.processors.cache.query.GridCacheDistributedQueryManager$2.apply(GridCacheDistributedQueryManager.java:109)
        at 
o.a.i.i.processors.cache.query.GridCacheDistributedQueryManager$2.apply(GridCacheDistributedQueryManager.java:107)
        at 
o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
        at 
o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
        at 
o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
        at 
o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
        at 
o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
        at 
o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
        at 
o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
        at 
o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
        at 
o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
        at 
o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

    Locked synchronizers:
        java.util.concurrent.ThreadPoolExecutor$Worker@c1349cf
        
java.util.concurrent.locks.ReentrantLock$NonfairSync@7bce13d5<mailto:java.util.concurrent.locks.ReentrantLock$NonfairSync@7bce13d5>

Many other threads are in WAITING state, in particular this one may be of 
interest:

Thread [name="srvc-deploy-#91", id=164, state=WAITING, blockCnt=0, waitCnt=5182]
    Lock 
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7310e7da,
 ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

As well as all the checkpoint-runner threads:

Thread [name="checkpoint-runner-#82", id=152, state=WAITING, blockCnt=2, 
waitCnt=3282307]
    Lock 
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5c8ae71a,
 ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

and the db-checkpoint-thread:

Thread [name="db-checkpoint-thread-#78", id=148, state=WAITING, blockCnt=297, 
waitCnt=1337]
    Lock 
[object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2061baf, 
ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
       at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
        at 
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:3690)
        at 
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:3300)
        at 
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:3199)
        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
        at java.lang.Thread.run(Thread.java:748)

Thanks in advance,
Conrad Mukai

Reply via email to