Re: Blocked Worker Thread

2019-11-08 Thread Andrei Aleksandrov

Hi Conrad,

The reasons can vary. Could you please share the logs?

BR,
Andrei

On 11/7/2019 10:35 PM, Conrad Mukai (cmukai) wrote:



Blocked Worker Thread

2019-11-07 Thread Conrad Mukai (cmukai)
We are running a cache in a 4-node cluster with atomicityMode set to ATOMIC and
persistence enabled. We repeatedly get a SYSTEM_WORKER_BLOCKED error on one
node, which disables the entire cluster. We were seeing a lot of sockets in the
TIME_WAIT state, which was blocking clients from connecting, so we did the
following on all the nodes:

# ignore TIME_WAIT state on sockets
echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
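As an aside, those echo commands do not survive a reboot, and
net.ipv4.tcp_tw_recycle is known to misbehave for clients behind NAT (it was
removed entirely in Linux 4.12). A persistent sysctl fragment for the reuse
setting alone might look like this (the file name is hypothetical):

```
# /etc/sysctl.d/99-tcp-tw.conf (hypothetical file name)
# Allow reuse of TIME_WAIT sockets for new outgoing connections.
net.ipv4.tcp_tw_reuse = 1
```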

This made that issue go away, but it may play a part in this new issue. The
first question is: what is the root cause of the error? The second question is:
why does this bring down the entire cluster?
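For reference, the watchdog that raises SYSTEM_WORKER_BLOCKED has a
configurable threshold, systemWorkerBlockedTimeout on IgniteConfiguration
(it defaults to the failure detection timeout). A sketch of raising it in a
Spring XML config; the 60000 ms value is an illustrative assumption, not a
recommendation:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Raise the blocked-worker watchdog threshold (milliseconds) so that
         long stalls are not immediately reported as SYSTEM_WORKER_BLOCKED. -->
    <property name="systemWorkerBlockedTimeout" value="60000"/>
</bean>
```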

Here is the error message:

[2019-11-07 16:36:22,037][ERROR][tcp-disco-msg-worker-#2][root] Critical system 
error detected. Will be handled accordingly to configured handler 
[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=partition-exchanger, igniteInstanceName=null, finished=false, 
heartbeatTs=1573090509428]]]
class org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=null, finished=false, heartbeatTs=1573090509428]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
    at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
    at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

This is followed by a warning and a thread dump:

[2019-11-07 16:36:22,038][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] No 
deadlocked threads detected.
[2019-11-07 16:36:22,328][WARN ][tcp-disco-msg-worker-#2][FailureProcessor] 
Thread dump at 2019/11/07 16:36:22 GMT

For the particular thread in the error and warning messages, here is the thread
dump:

Thread [name="tcp-disco-msg-worker-#2", id=113, state=RUNNABLE, blockCnt=211, waitCnt=4745368]
    at sun.management.ThreadImpl.dumpThreads0(Native Method)
    at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:454)
    at o.a.i.i.util.IgniteUtils.dumpThreads(IgniteUtils.java:1368)
    at o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:128)
    - locked o.a.i.i.processors.failure.FailureProcessor@7e65ceba
    at o.a.i.i.processors.failure.FailureProcessor.process(FailureProcessor.java:104)
    at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1829)
    at o.a.i.i.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
    at o.a.i.i.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
    at o.a.i.i.util.worker.GridWorker.onIdle(GridWorker.java:297)
    at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)
    at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker$$Lambda$47/1047515321.run(Unknown Source)
    at o.a.i.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)
    at o.a.i.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
    at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
    at o.a.i.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
    at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

In addition, all the system threads are in the TIMED_WAITING state:

Thread [name="sys-#7099", id=9252, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@677ec573, ownerName=null, ownerId=-1]
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at
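For what it's worth, TIMED_WAITING on LockSupport.parkNanos inside a queue
poll is the normal state of an idle thread-pool worker waiting for its next
task, so the sys-# threads above look idle rather than stuck. A minimal
standalone sketch (class name and the 500 ms settle delay are my own):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IdlePoolState {
    // Runs one task on a cached pool, then samples the worker's state once
    // it has gone back to waiting for work in poll()/parkNanos().
    static Thread.State idleWorkerState() throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            Thread worker = pool.submit(() -> Thread.currentThread()).get();
            Thread.sleep(500); // let the worker park waiting for the next task
            return worker.getState();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(idleWorkerState()); // typically TIMED_WAITING
    }
}
```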