We are running a cache in a 4-node cluster with atomicityMode set to ATOMIC and persistence enabled. We repeatedly get a SYSTEM_WORKER_BLOCKED error on one node, which brings down the entire cluster. We were seeing a lot of sockets in the TIME_WAIT state, which was blocking clients from connecting, so we did the following on all the nodes:
# ignore TIME_WAIT state on sockets
echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle

This made that issue go away, but it may play a part in this new issue. The first question is: what is the root cause of the error? The second question is: why does this bring down the entire cluster?

Here is the error message:

[2019-11-07 16:36:22,037][ERROR][tcp-disco-msg-worker-#2][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=null, finished=false, heartbeatTs=1573090509428]]]
class org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=null, finished=false, heartbeatTs=1573090509428]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
    at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
    at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

This is followed by a warning and a thread dump:

[2019-11-07 16:36:22,038][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] No deadlocked threads detected.
[2019-11-07 16:36:22,328][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] Thread dump at 2019/11/07 16:36:22 GMT

For the particular thread named in the error and warning messages, here is the thread dump:

Thread [name="tcp-disco-msg-worker-#2", id=113, state=RUNNABLE, blockCnt=211, waitCnt=4745368]
    at sun.management.ThreadImpl.dumpThreads0(Native Method)
    at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:454)
    at o.a.i.i.util.IgniteUtils.dumpThreads(IgniteUtils.java:1368)
    at o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:128)
    - locked o.a.i.i.processors.failure.FailureProcessor@7e65ceba
    at o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:104)
    at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1829)
    at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
    at o.a.i.i.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
    at o.a.i.i.util.worker.GridWorker.onIdle(GridWorker.java:297)
    at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)
    at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker$$Lambda$47/1047515321.run(Unknown Source)
    at o.a.i.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)
    at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
    at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
    at o.a.i.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
    at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

In addition, all the system threads are in TIMED_WAITING state:

Thread [name="sys-#7099", id=9252, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@677ec573, ownerName=null, ownerId=-1]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

and all the query threads are in WAITING state:

Thread [name="query-#1062", id=3027, state=WAITING, blockCnt=11, waitCnt=3365]
    Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2061baf, ownerName=null, ownerId=-1]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1500)
    at o.a.i.i.processors.cache.GridCacheMapEntry.onExpired(GridCacheMapEntry.java:4049)
    at o.a.i.i.processors.cache.GridCacheMapEntry.unswap(GridCacheMapEntry.java:522)
    at o.a.i.i.processors.cache.GridCacheMapEntry.unswap(GridCacheMapEntry.java:472)
    at o.a.i.i.processors.cache.query.GridCacheQueryManager$ScanQueryIterator.advance(GridCacheQueryManager.java:3010)
    at o.a.i.i.processors.cache.query.GridCacheQueryManager$ScanQueryIterator.onHasNext(GridCacheQueryManager.java:2965)
    at o.a.i.i.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53)
    at o.a.i.i.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
    at o.a.i.i.processors.cache.query.GridCacheQueryManager.runQuery(GridCacheQueryManager.java:1266)
    at o.a.i.i.processors.cache.query.GridCacheDistributedQueryManager.processQueryRequest(GridCacheDistributedQueryManager.java:234)
    at o.a.i.i.processors.cache.query.GridCacheDistributedQueryManager$2.apply(GridCacheDistributedQueryManager.java:109)
    at o.a.i.i.processors.cache.query.GridCacheDistributedQueryManager$2.apply(GridCacheDistributedQueryManager.java:107)
    at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
    at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
    at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
    at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
    at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
    at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
    at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
    at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
    at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
    at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Locked synchronizers:
        java.util.concurrent.ThreadPoolExecutor$Worker@c1349cf
        java.util.concurrent.locks.ReentrantLock$NonfairSync@7bce13d5

Many other threads are in WAITING state; in particular, this one may be of interest:

Thread [name="srvc-deploy-#91", id=164, state=WAITING, blockCnt=0, waitCnt=5182]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@7310e7da, ownerName=null, ownerId=-1]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

as well as all the checkpoint-runner threads:

Thread [name="checkpoint-runner-#82", id=152, state=WAITING, blockCnt=2, waitCnt=3282307]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5c8ae71a, ownerName=null, ownerId=-1]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

and the db-checkpoint-thread:

Thread [name="db-checkpoint-thread-#78", id=148, state=WAITING, blockCnt=297, waitCnt=1337]
    Lock [object=java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2061baf, ownerName=null, ownerId=-1]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
    at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:3690)
    at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:3300)
    at o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:3199)
    at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Thread.java:748)

Thanks in advance,
Conrad Mukai
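P.S. In case it is relevant, we also made the two kernel settings above persistent across reboots with a sysctl drop-in file. A rough sketch of what we used is below; the file name is our own choice, and any file under /etc/sysctl.d would work:

```
# /etc/sysctl.d/99-tcp-tw.conf  (file name is arbitrary)
# same settings as the echo commands above
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
```

We then reloaded the settings with `sysctl --system` rather than rebooting each node.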