Hi Ilya,

Yes, that was exactly the case. What I've done is launch an async worker
from the listener, so the cache operation itself no longer runs on the
discovery thread.
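In case it helps anyone else hitting this, the change looks roughly like the
sketch below. It is a simplified outline, not our exact code: the class name,
the "processingRequests" cache, the key type and the single-thread executor
are placeholders. The important part is that the remoteListen callback only
schedules the work and returns immediately instead of blocking on a cache
future.

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgniteBiPredicate;

public class NodeLeftCleanup {
    // Dedicated worker so cache updates never run on the discovery
    // thread (disco-event-worker) that delivers the event.
    private final ExecutorService cleanupExec = Executors.newSingleThreadExecutor();

    public void register(Ignite ignite) {
        // Placeholder cache name and key type for the sketch.
        IgniteCache<UUID, Object> cache = ignite.cache("processingRequests");

        IgniteBiPredicate<UUID, DiscoveryEvent> locLsnr = (nodeId, evt) -> {
            UUID leftNodeId = evt.eventNode().id();

            // Hand the work off and return: no cache.remove()/put()
            // directly inside the callback.
            cleanupExec.submit(() -> {
                cache.remove(leftNodeId);
            });

            return true; // keep the listener registered
        };

        // EVT_NODE_LEFT has to be enabled via
        // IgniteConfiguration#setIncludeEventTypes for the event to fire.
        ignite.events(ignite.cluster().forLocal())
              .remoteListen(locLsnr, null, EventType.EVT_NODE_LEFT);
    }
}

I guess using the async cache API (IgniteCache#removeAsync) without blocking
on the returned future inside the listener would also avoid the problem, but
moving the work to a separate executor was the simplest change here.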
Cheers,

On Sat, 2 Apr 2022 at 22:30, Ilya Shishkov <[email protected]> wrote:

> I didn't read your post carefully, sorry.
>
> > The problem seems to be a listener that we register with
> > ignite.events(forLocal).remoteListen() that deletes/puts some cache
> > entries during NODE_LEFT events.
>
> Are you listening to NODE_LEFT events and then triggering some cache
> actions (put / get)?
>
> On Sat, 2 Apr 2022 at 22:34, Ilya Shishkov <[email protected]> wrote:
>
>> Hi,
>>
>> Could you please attach logs from the failed nodes?
>>
>> On Wed, 23 Mar 2022 at 14:07, Joan Pujol <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We had a cluster problem when two of the four server nodes left the
>>> cluster and started again, on a cluster running Ignite 2.10.
>>> The problem seems to be that the partition map exchange didn't finish.
>>>
>>> In particular, the problem seems to be on one of the remaining nodes,
>>> which reports a blocked system-critical thread (disco-event-worker)
>>> that has been blocked for more than 19s.
>>> I checked the GC logs and there isn't any relevant pause when the
>>> block occurs.
>>>
>>> Some relevant logs:
>>>
>>> Remaining node:
>>> >2022-03-22 13:33:42.439 WARN [services-deployment-worker-#95]
>>> o.a.i.i.p.s.ServiceDeploymentManager - Failed to wait service
>>> deployment process or timeout had been reached, timeout=10000,
>>> taskDepId=ServiceDeploymentProcessId [topVer=AffinityTopologyVersion
>>> [topVer=9, minorTopVer=0], reqId=null]
>>> >2022-03-22 13:33:42.687 WARN [exchange-worker-#54]
>>> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await
>>> partitions release latch within timeout. Some nodes have not sent
>>> acknowledgement for latch completion. It's possible due to unfinishined
>>> atomic updates, transactions or not released explicit locks on that nodes.
>>> Please check logs for errors on nodes with ids reported in latch
>>> `pendingAcks` collection [latch=ServerLatch [permits=1,
>>> pendingAcks=HashSet [6d63e1bd-9de4-4341-81bb-c3a293ebe2eb],
>>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>>> topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]]]]
>>> >2022-03-22 13:33:51.681 ERROR [tcp-disco-msg-worker-[5bc334df
>>> 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.i.i.u.typedef.G - Blocked
>>> system-critical thread has been detected. This can lead to cluster-wide
>>> undefined behaviour [workerName=disco-event-worker,
>>> threadName=disco-event-worker-#50, blockedFor=19s]
>>> >2022-03-22 13:33:51.701 WARN [tcp-disco-msg-worker-[5bc334df
>>> 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.ignite.Ignite - Possible failure
>>> suppressed accordingly to a configured handler [hnd=AbstractFailureHandler
>>> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>>> SYSTEM_CRITICAL_OPERATION_TIMEOUT]], failureCtx=FailureContext
>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException:
>>> GridWorker [name=disco-event-worker, igniteInstanceName=null,
>>> finished=false, heartbeatTs=1647956012230]]]
>>> org.apache.ignite.IgniteException: GridWorker [name=disco-event-worker,
>>> igniteInstanceName=null, finished=false, heartbeatTs=1647956012230]
>>> at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
>>> at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>>> at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>>> at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>>> at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
>>> >2022-03-22 13:33:51.701 WARN [tcp-disco-msg-worker-[5bc334df
>>> 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.i.i.p.f.FailureProcessor - No
>>> deadlocked threads detected.
>>> ...
>>> The thread dump of the blocked thread is:
>>> Thread [name="disco-event-worker-#50", id=306, state=WAITING,
>>> blockCnt=1, waitCnt=543567]
>>> at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
>>> at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>>> at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>>> at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>>> at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
>>> at o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3397)
>>> at o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3380)
>>> at o.a.i.i.processors.cache.IgniteCacheProxyImpl.remove(IgniteCacheProxyImpl.java:1437)
>>> at o.a.i.i.processors.cache.GatewayProtectedCacheProxy.remove(GatewayProtectedCacheProxy.java:952)
>>> at es.decide.dwm.service.IgniteManager.removeProcessingRequest(IgniteManager.java:186)
>>> at es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$1(IgniteManager.java:178)
>>> at es.decide.dwm.service.IgniteManager$$Lambda$8656/0x000000080019d040.accept(Unknown Source)
>>> at [email protected]/java.util.ArrayList.forEach(ArrayList.java:1541)
>>> at es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$3c60aaa7$1(IgniteManager.java:176)
>>> at es.decide.dwm.service.IgniteManager$$Lambda$3251/0x0000000802113c40.apply(Unknown Source)
>>> at o.a.i.i.GridEventConsumeHandler$2.onEvent(GridEventConsumeHandler.java:192)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1399)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:894)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:879)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:350)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:313)
>>> at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2894)
>>> at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:3110)
>>> at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2914)
>>> at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
>>> at [email protected]/java.lang.Thread.run(Thread.java:834)
>>>
>>> ...
>>> >2022-03-22 13:33:52.687 WARN [exchange-worker-#54]
>>> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await
>>> partitions release latch within timeout. Some nodes have not sent
>>> acknowledgement for latch completion. It's possible due to unfinishined
>>> atomic updates, transactions or not released explicit locks on that nodes.
>>> Please check logs for errors on nodes with ids reported in latch
>>> `pendingAcks` collection [latch=ServerLatch [permits=1,
>>> pendingAcks=HashSet [6d63e1bd-9de4-4341-81bb-c3a293ebe2eb],
>>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>>> topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]]]]
>>>
>>> The problem seems to be a listener that we register with
>>> ignite.events(forLocal).remoteListen() that deletes/puts some cache
>>> entries during NODE_LEFT events. What I don't understand is why it
>>> runs on a critical Ignite thread.
>>> Do you think that this listener is the possible cause of the cluster
>>> problem? Besides removing the listener, is there any way to run it in
>>> a safer way?
>>> And if the listener isn't the problem, any help or clue on where the
>>> problem could be?
>>>
>>> A lot of thanks in advance,
>>>
>>> --
>>> Joan Jesús Pujol Espinar
>>> http://www.joanpujol.cat
>>>

--
Joan Jesús Pujol Espinar
http://www.joanpujol.cat
https://play.google.com/store/apps/developer?id=lujop&hl=ca
