Hi,

According to the logs, you have the following:

[17:02:04,774][INFO][grid-nio-worker-tcp-comm-0-#40%MATCHERWORKER%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/xx.xx.xxx.IP2:47100, rmtAddr=/xx.xx.xxx.IP1:52166]
*[18:54:12,125][SEVERE][exchange-worker-#65%MATCHERWORKER%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=grid-nio-worker-tcp-comm-0, threadName=grid-nio-worker-tcp-comm-0-#40%MATCHERWORKER%, blockedFor=10s]*
[18:54:14,059][WARNING][exchange-worker-#65%MATCHERWORKER%][G] Thread [name="grid-nio-worker-tcp-comm-0-#40%MATCHERWORKER%", id=63, state=RUNNABLE, blockCnt=0, waitCnt=0]

[18:54:14,062][INFO][tcp-disco-sock-reader-[d6d591bd xx.xx.xxx.IP4:48633]-#8%MATCHERWORKER%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xx.xx.xxx.IP4:48633, rmtPort=48633
[18:54:14,067][INFO][tcp-disco-srvr-[:47500]-#3%MATCHERWORKER%][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/xx.xx.xxx.IP3, rmtPort=39939]
[18:54:14,067][INFO][tcp-disco-srvr-[:47500]-#3%MATCHERWORKER%][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/xx.xx.xxx.IP3, rmtPort=39939]
[18:54:14,068][INFO][tcp-disco-sock-reader-[]-#11%MATCHERWORKER%][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/xx.xx.xxx.IP3:39939, rmtPort=39939]
[18:54:14,072][INFO][tcp-disco-sock-reader-[439d63a3 xx.xx.xxx.IP3:39939]-#11%MATCHERWORKER%][TcpDiscoverySpi] Initialized connection with remote server node [nodeId=439d63a3-4d24-4e53-9507-581ce7e2347e, rmtAddr=/xx.xx.xxx.IP3:39939]
[18:54:14,075][INFO][tcp-disco-sock-reader-[439d63a3 xx.xx.xxx.IP3:39939]-#11%MATCHERWORKER%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xx.xx.xxx.IP3:39939, rmtPort=39939
[18:54:14,097][WARNING][exchange-worker-#65%MATCHERWORKER%][] Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=grid-nio-worker-tcp-comm-0, igniteInstanceName=MATCHERWORKER, finished=false, heartbeatTs=1660915454055]]]
class org.apache.ignite.IgniteException: GridWorker [name=grid-nio-worker-tcp-comm-0, igniteInstanceName=MATCHERWORKER, finished=false, heartbeatTs=1660915454055]
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$3.apply(IgnitionEx.java:1810)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$3.apply(IgnitionEx.java:1805)
    at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:234)
    at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3096)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3063)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.base/java.lang.Thread.run(Thread.java:829)

It would be great to have a thread dump for the corresponding time, to see what kind of work *grid-nio-worker-tcp-comm-0-#40%MATCHERWORKER%* was performing. (Actually, it's better to have logs and thread dumps for the corresponding time from the whole cluster.) You mentioned that this is one of the worker nodes; what does that mean? Do you run specific tasks there, such as IgniteCompute tasks?
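In case it helps with collecting the thread dump, here is a minimal sketch using the standard JDK ThreadMXBean. The class name ThreadDumpSketch and the dump(nameFragment) helper are made up for illustration and are not part of Ignite. It only sees threads of the JVM it runs in, so it would have to be called from inside the node's own process (for example from your application code); running the JDK's jstack tool against the node's PID from outside is the usual alternative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not an Ignite API: dumps threads of the current JVM only.
final class ThreadDumpSketch {
    /** Returns name, id, state and stack trace of every live thread whose name contains nameFragment. */
    static String dump(String nameFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();

        // false, false: do not collect locked monitors / ownable synchronizers.
        for (ThreadInfo ti : mx.dumpAllThreads(false, false)) {
            if (ti == null || !ti.getThreadName().contains(nameFragment))
                continue;

            sb.append('"').append(ti.getThreadName()).append('"')
              .append(" id=").append(ti.getThreadId())
              .append(" state=").append(ti.getThreadState())
              .append('\n');

            for (StackTraceElement el : ti.getStackTrace())
                sb.append("    at ").append(el).append('\n');

            sb.append('\n');
        }

        return sb.toString();
    }

    public static void main(String[] args) {
        // With no argument, every thread is printed (an empty fragment matches all names).
        System.out.print(dump(args.length > 0 ? args[0] : ""));
    }
}
```

Passing "grid-nio-worker-tcp-comm" as the fragment would narrow the output to the communication workers named in the log above; taking several dumps a few seconds apart shows whether the thread is stuck in the same place.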

Meanwhile, you can also try updating your setup to a more recent version, such as 2.13.



On 2022/08/25 05:04:43 BEELA GAYATRI via user wrote:
> TCS Confidential
>
> Dear Team,
>
>
> We have 4 worker nodes in the baseline topology. Every time, one of the nodes goes down with the message "Node is out of topology (probably, due to short-time network problems)". The Ignite log and configuration are attached to this mail.
> Here are a few queries:
>
> 1. Why does one of the nodes go down with that error every time?
>
> 2. If a node goes down, is there any way it can be restarted automatically, without manual intervention?
>
> Thanks & Regards,
> Gayatri Beela
>
>
