Re: Blocked Worker Thread
Hi Conrad, The reasons can be different. Could you please share the logs? BR, Andrei 11/7/2019 10:35 PM, Conrad Mukai (cmukai) пишет: We are running a cache in a 4 node cluster with atomicityMode set to ATOMIC and have persistence enabled. We repeatedly get a SYSTEM_WORKER_BLOCKED error on one node which is disabling the entire cluster. We were seeing a lot of sockets in TIME_WAIT state which was blocking clients from connecting so we did the following on all the nodes: /# ignore TIME_WAIT state on sockets /echo *"1" **> */proc/sys/net/ipv4/tcp_tw_reuse echo *"1" **> */proc/sys/net/ipv4/tcp_tw_recycle This made that issue go away, but may play a part in this new issue. First question is what is the root cause of the error? The second question is why does this bring down the entire cluster? Here is the error message: [2019-11-07 16:36:22,037][ERROR][tcp-disco-msg-worker-#2][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=null, finished=false, heartbeatTs=1573090509428]]] class org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=null, finished=false, heartbeatTs=1573090509428] at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663) at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181) at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119) at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) This is followed by a warning and a thread dump: [2019-11-07 16:36:22,038][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] No deadlocked threads detected. [2019-11-07 16:36:22,328][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] Thread dump at 2019/11/07 16:36:22 GMT For the particular thread in the error and warning messages here is the thread dump: Thread [name="tcp-disco-msg-worker-#2", id=113, state=RUNNABLE, blockCnt=211, waitCnt=4745368] at sun.management.ThreadImpl.dumpThreads0(Native Method) at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:454) at o.a.i.i.util.IgniteUtils.dumpThreads(IgniteUtils.java:1368) at o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:128) - locked o.a.i.i.processors.failure.FailureProcessor@7e65ceba at o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:104) at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1829) at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) at o.a.i.i.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) at o.a.i.i.util.worker.GridWorker.onIdle(GridWorker.java:297) at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663) at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker$$Lambda$47/1047515321.run(Unknown Source) at o.a.i.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181) at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at o.a.i.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119) at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) In addition all the system threads are in TIMED_WAITING state: Thread [name="sys-#7099", id=9252, state=TIMED_WAITING, blockCnt=0, waitCnt=1] Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@677ec573, ownerName=null, ownerId=-1] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at
Blocked Worker Thread
We are running a cache in a 4 node cluster with atomicityMode set to ATOMIC and have persistence enabled. We repeatedly get a SYSTEM_WORKER_BLOCKED error on one node which is disabling the entire cluster. We were seeing a lot of sockets in TIME_WAIT state which was blocking clients from connecting so we did the following on all the nodes: # ignore TIME_WAIT state on sockets echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle This made that issue go away, but may play a part in this new issue. First question is what is the root cause of the error? The second question is why does this bring down the entire cluster? Here is the error message: [2019-11-07 16:36:22,037][ERROR][tcp-disco-msg-worker-#2][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=null, finished=false, heartbeatTs=1573090509428]]] class org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=null, finished=false, heartbeatTs=1573090509428] at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663) at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181) at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119) at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) This is followed by a warning and a thread dump: [2019-11-07 16:36:22,038][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] No deadlocked threads detected. [2019-11-07 16:36:22,328][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] Thread dump at 2019/11/07 16:36:22 GMT For the particular thread in the error and warning messages here is the thread dump: Thread [name="tcp-disco-msg-worker-#2", id=113, state=RUNNABLE, blockCnt=211, waitCnt=4745368] at sun.management.ThreadImpl.dumpThreads0(Native Method) at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:454) at o.a.i.i.util.IgniteUtils.dumpThreads(IgniteUtils.java:1368) at o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:128) - locked o.a.i.i.processors.failure.FailureProcessor@7e65ceba at o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:104) at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1829) at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) at o.a.i.i.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) at o.a.i.i.util.worker.GridWorker.onIdle(GridWorker.java:297) at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663) at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker$$Lambda$47/1047515321.run(Unknown Source) at o.a.i.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181) at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at o.a.i.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119) at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) In addition all the system threads are in TIMED_WAITING state: Thread [name="sys-#7099", id=9252, state=TIMED_WAITING, blockCnt=0, waitCnt=1] Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@677ec573, ownerName=null, ownerId=-1] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at