Re: Blocked system-critical due to a listener?

Вячеслав Коптилин Wed, 06 Apr 2022 04:31:56 -0700

Hello Joan,

> What I've done is launch an async worker from the listener to do that
operation.
Yep, that is true. Event listeners can be called from Ignite internal
threads (which can be critical ones).
I think it was done this way for performance reasons. At least, this allows
you to avoid transferring an event from critical thread to another
thread/executor. The user should take care about this "feature",
unfortunately.
So, it is recommended to avoid any cache operations from the event listener
(you need to use your own separate thread for this).


Thanks,
Slava.

вс, 3 апр. 2022 г. в 10:21, Joan Pujol <j...@joanpujol.cat>:

> Hi Ilya,
>
> Yes, that was exactly the case.
> What I've done is launch an async worker from the listener to do that
> operation.
>
> Cheers,
>
> On Sat, 2 Apr 2022 at 22:30, Ilya Shishkov <shishkovi...@gmail.com> wrote:
>
>> I didn't read your post carefully, sorry.
>>
>> >  The problem seems to be a listener that we register with
>>  ignite.events(forLocal).remoteListen() that deletes/puts some cache
>> entries during NODE_LEFT events.
>>
>> Are you listening to NODE_LEFT events and then trigger some cache actions
>> (put / get)?
>>
>>
>>
>> сб, 2 апр. 2022 г. в 22:34, Ilya Shishkov <shishkovi...@gmail.com>:
>>
>>> Hi,
>>>
>>> Could you, please, attach logs from failed nodes?
>>>
>>> ср, 23 мар. 2022 г. в 14:07, Joan Pujol <joanpu...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> We had a cluster problem while two of the four server nodes left the
>>>> cluster and started again in a cluster running Ignite 2.10
>>>> The problem seems to be that Partition Exchange didn't finish
>>>>
>>>> In particular, the problem seems to be in one of the remaining nodes
>>>> that has a Blocked system-critical error on disco-event-worker that is
>>>> locked for more than 19s.
>>>> I checked GC logs and there isn't any relevant pause when the block
>>>> occurs
>>>>
>>>> Some relevant logs:
>>>>
>>>> remaining node:
>>>> >2022-03-22 13:33:42.439 WARN  [services-deployment-worker-#95]
>>>> o.a.i.i.p.s.ServiceDeploymentManager - Failed to wait service
>>>> deployment process or timeout had been reached, timeout=10000,
>>>> taskDepId=ServiceDeploymentProcessId [topVer=AffinityTopologyVersion
>>>> [topVer=9, minorTopVer=0], reqId=null]
>>>> >2022-03-22 13:33:42.687 WARN  [exchange-worker-#54]
>>>> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await
>>>> partitions release latch within timeout. Some nodes have not sent
>>>> acknowledgement for latch completion. It's possible due to unfinishined
>>>> atomic updates, transactions or not released explicit locks on that nodes.
>>>> Please check logs for errors on nodes with ids reported in latch
>>>> `pendingAcks` collection [latch=ServerLatch [permits=1,
>>>> pendingAcks=HashSet [6d63e1bd-9de4-4341-81bb-c3a293ebe2eb],
>>>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>>>> topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]]]]
>>>> >2022-03-22 13:33:51.681 ERROR [tcp-disco-msg-worker-[5bc334df 0:0:0:0:
>>>> 0:0:0:1%lo:47500]-#2-#45] o.a.i.i.u.typedef.G - Blocked
>>>> system-critical thread has been detected. This can lead to cluster-wide
>>>> undefined behaviour [workerName=disco-event-worker,
>>>> threadName=disco-event-worker-#50, blockedFor=19s]
>>>> >2022-03-22 13:33:51.701 WARN  [tcp-disco-msg-worker-[5bc334df 0:0:0:0:
>>>> 0:0:0:1%lo:47500]-#2-#45] o.a.ignite.Ignite - Possible failure
>>>> suppressed accordingly to a configured handler [hnd=AbstractFailureHandler
>>>> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>>>> SYSTEM_CRITICAL_OPERATION_TIMEOUT]], failureCtx=FailureContext
>>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException:
>>>> GridWorker [name=disco-event-worker, igniteInstanceName=null, finished=
>>>> false, heartbeatTs=1647956012230]]]
>>>> org.apache.ignite.IgniteException: GridWorker
>>>> [name=disco-event-worker, igniteInstanceName=null, finished=false,
>>>> heartbeatTs=1647956012230]
>>>>   at java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native Method)
>>>>   at java.base@11.0.9
>>>> /java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>>>>   at
>>>> org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>>>>   at
>>>> org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>>>>   at
>>>> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
>>>> >2022-03-22 13:33:51.701 WARN  [tcp-disco-msg-worker-[5bc334df 0:0:0:0:
>>>> 0:0:0:1%lo:47500]-#2-#45] o.a.i.i.p.f.FailureProcessor - No deadlocked
>>>> threads detected.
>>>> ...
>>>> A thread dump where the blocked thread thread dump is:
>>>> Thread [name="disco-event-worker-#50", id=306, state=WAITING, blockCnt=
>>>> 1, waitCnt=543567]
>>>>         at java.base@11.0.9/jdk.internal.misc.Unsafe.park(Native
>>>> Method)
>>>>         at java.base@11.0.9
>>>> /java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>>>>         at
>>>> o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>>>>         at
>>>> o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>>>>         at
>>>> o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
>>>>         at
>>>> o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3397)
>>>>         at
>>>> o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3380)
>>>>         at
>>>> o.a.i.i.processors.cache.IgniteCacheProxyImpl.remove(IgniteCacheProxyImpl.java:1437)
>>>>         at
>>>> o.a.i.i.processors.cache.GatewayProtectedCacheProxy.remove(GatewayProtectedCacheProxy.java:952)
>>>>         at
>>>> es.decide.dwm.service.IgniteManager.removeProcessingRequest(IgniteManager.java:186)
>>>>         at
>>>> es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$1(IgniteManager.java:178)
>>>>         at
>>>> es.decide.dwm.service.IgniteManager$$Lambda$8656/0x000000080019d040.accept(Unknown
>>>> Source)
>>>>         at java.base@11.0.9
>>>> /java.util.ArrayList.forEach(ArrayList.java:1541)
>>>>         at
>>>> es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$3c60aaa7$1(IgniteManager.java:176)
>>>>         at
>>>> es.decide.dwm.service.IgniteManager$$Lambda$3251/0x0000000802113c40.apply(Unknown
>>>> Source)
>>>>         at
>>>> o.a.i.i.GridEventConsumeHandler$2.onEvent(GridEventConsumeHandler.java:192)
>>>>         at
>>>> o.a.i.i.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1399)
>>>>         at
>>>> o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:894)
>>>>         at
>>>> o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:879)
>>>>         at
>>>> o.a.i.i.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:350)
>>>>         at
>>>> o.a.i.i.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:313)
>>>>         at
>>>> o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2894)
>>>>         at
>>>> o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:3110)
>>>>         at
>>>> o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2914)
>>>>         at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
>>>>         at java.base@11.0.9/java.lang.Thread.run(Thread.java:834)
>>>>
>>>> ..
>>>> >2022-03-22 13:33:52.687 WARN  [exchange-worker-#54]
>>>> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await
>>>> partitions release latch within timeout. Some nodes have not sent
>>>> acknowledgement for latch completion. It's possible due to unfinishined
>>>> atomic updates, transactions or not released explicit locks on that nodes.
>>>> Please check logs for errors on nodes with ids reported in latch
>>>> `pendingAcks` collection [latch=ServerLatch [permits=1,
>>>> pendingAcks=HashSet [6d63e1bd-9de4-4341-81bb-c3a293ebe2eb],
>>>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>>>> topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]]]]
>>>>
>>>>
>>>> The problem seems to be a listener that we register with
>>>> ignite.events(forLocal).remoteListen() that deletes/puts some cache
>>>> entries during NODE_LEFT events. What I don't understand it's why it's
>>>> running in a critical ignite thread.
>>>> Do you think that this listener is the possible cause of the cluster
>>>> problem? Besides removing the listener is there any way to run it in a
>>>> safer way?
>>>> And if the listener isn't the problem any help or clue on where the
>>>> problem can be?
>>>>
>>>> A lot of thanks in advance,
>>>>
>>>> --
>>>> Joan Jesús Pujol Espinar
>>>> http://www.joanpujol.cat
>>>>
>>>
>
> --
> Joan Jesús Pujol Espinar
> http://www.joanpujol.cat
> https://play.google.com/store/apps/developer?id=lujop&hl=ca
>

Re: Blocked system-critical due to a listener?

Reply via email to