Hi Ilya,

Yes, that was exactly the case. What I've done is launch an async worker
from the listener, so the cache operation itself no longer runs on the
discovery thread.
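In case it helps anyone else hitting this, the change looks roughly like the
sketch below. It is a simplified outline, not our exact code: the class name,
the "processingRequests" cache, the key type and the single-thread executor
are placeholders. The important part is that the remoteListen callback only
schedules the work and returns immediately instead of blocking on a cache
future.

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgniteBiPredicate;

public class NodeLeftCleanup {
    // Dedicated worker so cache updates never run on the discovery
    // thread (disco-event-worker) that delivers the event.
    private final ExecutorService cleanupExec = Executors.newSingleThreadExecutor();

    public void register(Ignite ignite) {
        // Placeholder cache name and key type for the sketch.
        IgniteCache<UUID, Object> cache = ignite.cache("processingRequests");

        IgniteBiPredicate<UUID, DiscoveryEvent> locLsnr = (nodeId, evt) -> {
            UUID leftNodeId = evt.eventNode().id();

            // Hand the work off and return: no cache.remove()/put()
            // directly inside the callback.
            cleanupExec.submit(() -> {
                cache.remove(leftNodeId);
            });

            return true; // keep the listener registered
        };

        // EVT_NODE_LEFT has to be enabled via
        // IgniteConfiguration#setIncludeEventTypes for the event to fire.
        ignite.events(ignite.cluster().forLocal())
              .remoteListen(locLsnr, null, EventType.EVT_NODE_LEFT);
    }
}

I guess using the async cache API (IgniteCache#removeAsync) without blocking
on the returned future inside the listener would also avoid the problem, but
moving the work to a separate executor was the simplest change here.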
Cheers,

On Sat, 2 Apr 2022 at 22:30, Ilya Shishkov <[email protected]> wrote:

> I didn't read your post carefully, sorry.
>
> > The problem seems to be a listener that we register with
> > ignite.events(forLocal).remoteListen() that deletes/puts some cache
> > entries during NODE_LEFT events.
>
> Are you listening to NODE_LEFT events and then triggering some cache
> actions (put / get)?
>
> On Sat, 2 Apr 2022 at 22:34, Ilya Shishkov <[email protected]> wrote:
>
>> Hi,
>>
>> Could you please attach logs from the failed nodes?
>>
>> On Wed, 23 Mar 2022 at 14:07, Joan Pujol <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We had a cluster problem when two of the four server nodes left the
>>> cluster and started again, on a cluster running Ignite 2.10.
>>> The problem seems to be that the partition map exchange didn't finish.
>>>
>>> In particular, the problem seems to be on one of the remaining nodes,
>>> which reports a blocked system-critical thread (disco-event-worker)
>>> that has been blocked for more than 19s.
>>> I checked the GC logs and there isn't any relevant pause when the
>>> block occurs.
>>>
>>> Some relevant logs:
>>>
>>> Remaining node:
>>> >2022-03-22 13:33:42.439 WARN [services-deployment-worker-#95]
>>> o.a.i.i.p.s.ServiceDeploymentManager - Failed to wait service
>>> deployment process or timeout had been reached, timeout=10000,
>>> taskDepId=ServiceDeploymentProcessId [topVer=AffinityTopologyVersion
>>> [topVer=9, minorTopVer=0], reqId=null]
>>> >2022-03-22 13:33:42.687 WARN [exchange-worker-#54]
>>> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await
>>> partitions release latch within timeout. Some nodes have not sent
>>> acknowledgement for latch completion. It's possible due to unfinishined
>>> atomic updates, transactions or not released explicit locks on that nodes.
>>> Please check logs for errors on nodes with ids reported in latch
>>> `pendingAcks` collection [latch=ServerLatch [permits=1,
>>> pendingAcks=HashSet [6d63e1bd-9de4-4341-81bb-c3a293ebe2eb],
>>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>>> topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]]]]
>>> >2022-03-22 13:33:51.681 ERROR [tcp-disco-msg-worker-[5bc334df
>>> 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.i.i.u.typedef.G - Blocked
>>> system-critical thread has been detected. This can lead to cluster-wide
>>> undefined behaviour [workerName=disco-event-worker,
>>> threadName=disco-event-worker-#50, blockedFor=19s]
>>> >2022-03-22 13:33:51.701 WARN [tcp-disco-msg-worker-[5bc334df
>>> 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.ignite.Ignite - Possible failure
>>> suppressed accordingly to a configured handler [hnd=AbstractFailureHandler
>>> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>>> SYSTEM_CRITICAL_OPERATION_TIMEOUT]], failureCtx=FailureContext
>>> [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException:
>>> GridWorker [name=disco-event-worker, igniteInstanceName=null,
>>> finished=false, heartbeatTs=1647956012230]]]
>>> org.apache.ignite.IgniteException: GridWorker [name=disco-event-worker,
>>> igniteInstanceName=null, finished=false, heartbeatTs=1647956012230]
>>> at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
>>> at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>>> at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>>> at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>>> at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
>>> >2022-03-22 13:33:51.701 WARN [tcp-disco-msg-worker-[5bc334df
>>> 0:0:0:0:0:0:0:1%lo:47500]-#2-#45] o.a.i.i.p.f.FailureProcessor - No
>>> deadlocked threads detected.
>>> ...
>>> The thread dump of the blocked thread is:
>>> Thread [name="disco-event-worker-#50", id=306, state=WAITING,
>>> blockCnt=1, waitCnt=543567]
>>> at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
>>> at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
>>> at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>>> at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>>> at o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.remove0(GridDhtAtomicCache.java:746)
>>> at o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3397)
>>> at o.a.i.i.processors.cache.GridCacheAdapter.remove(GridCacheAdapter.java:3380)
>>> at o.a.i.i.processors.cache.IgniteCacheProxyImpl.remove(IgniteCacheProxyImpl.java:1437)
>>> at o.a.i.i.processors.cache.GatewayProtectedCacheProxy.remove(GatewayProtectedCacheProxy.java:952)
>>> at es.decide.dwm.service.IgniteManager.removeProcessingRequest(IgniteManager.java:186)
>>> at es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$1(IgniteManager.java:178)
>>> at es.decide.dwm.service.IgniteManager$$Lambda$8656/0x000000080019d040.accept(Unknown Source)
>>> at [email protected]/java.util.ArrayList.forEach(ArrayList.java:1541)
>>> at es.decide.dwm.service.IgniteManager.lambda$registerIgniteEvents$3c60aaa7$1(IgniteManager.java:176)
>>> at es.decide.dwm.service.IgniteManager$$Lambda$3251/0x0000000802113c40.apply(Unknown Source)
>>> at o.a.i.i.GridEventConsumeHandler$2.onEvent(GridEventConsumeHandler.java:192)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager$LocalListenerWrapper.onEvent(GridEventStorageManager.java:1399)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:894)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager.notifyListeners(GridEventStorageManager.java:879)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager.record0(GridEventStorageManager.java:350)
>>> at o.a.i.i.managers.eventstorage.GridEventStorageManager.record(GridEventStorageManager.java:313)
>>> at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.recordEvent(GridDiscoveryManager.java:2894)
>>> at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body0(GridDiscoveryManager.java:3110)
>>> at o.a.i.i.managers.discovery.GridDiscoveryManager$DiscoveryWorker.body(GridDiscoveryManager.java:2914)
>>> at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120)
>>> at [email protected]/java.lang.Thread.run(Thread.java:834)
>>>
>>> ...
>>> >2022-03-22 13:33:52.687 WARN [exchange-worker-#54]
>>> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await
>>> partitions release latch within timeout. Some nodes have not sent
>>> acknowledgement for latch completion. It's possible due to unfinishined
>>> atomic updates, transactions or not released explicit locks on that nodes.
>>> Please check logs for errors on nodes with ids reported in latch
>>> `pendingAcks` collection [latch=ServerLatch [permits=1,
>>> pendingAcks=HashSet [6d63e1bd-9de4-4341-81bb-c3a293ebe2eb],
>>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>>> topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]]]]
>>>
>>> The problem seems to be a listener that we register with
>>> ignite.events(forLocal).remoteListen() that deletes/puts some cache
>>> entries during NODE_LEFT events. What I don't understand is why it
>>> runs on a critical Ignite thread.
>>> Do you think that this listener is the possible cause of the cluster
>>> problem? Besides removing the listener, is there any way to run it in
>>> a safer way?
>>> And if the listener isn't the problem, any help or clue on where the
>>> problem could be?
>>>
>>> A lot of thanks in advance,
>>>
>>> --
>>> Joan Jesús Pujol Espinar
>>> http://www.joanpujol.cat
>>>

--
Joan Jesús Pujol Espinar
http://www.joanpujol.cat
https://play.google.com/store/apps/developer?id=lujop&hl=ca
