Hello, we recently had a production incident in which our application got stuck while connecting to the cluster. The *IgnitionEx.start0* method was blocked for more than 24 hours waiting for its startup latch to be notified, but that never happened. Eventually the container was restarted to recover the service.
This is the stack trace of that thread: <http://apache-ignite-users.70518.x6.nabble.com/file/t1923/Screen_Shot_2020-10-27_at_3.png>

This happened close to an Ignite server node restart due to SEGMENTATION. Here are some lines I extracted from that server's logs that may be relevant (not sure, though):

2020-10-22T13:33:03.348+00:00 a5912bf99152 ignite: tcp-disco-msg-worker-#2|WARN |o.a.i.s.d.tcp.TcpDiscoverySpi|Node is out of topology (probably, due to short-time network problems).
2020-10-22T13:33:03.349+00:00 a5912bf99152 ignite: disco-event-worker-#66|WARN |o.a.i.i.m.d.GridDiscoveryManager|Local node SEGMENTED: TcpDiscoveryNode [id=2296e9a7-96d6-44d9-af3b-4e22e33261ea, addrs=[10.133.3.6, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, a5912bf99152/10.133.3.6:47500], discPort=47500, order=276, intOrder=142, lastExchangeTime=1603373583342, loc=true, ver=2.7.6#20190911-sha1:21f7ca41, isClient=false]
2020-10-22T13:33:04.232+00:00 a5912bf99152 ignite: node-stopper|ERROR|ROOT|Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
2020-10-22T13:33:09.312+00:00 a5912bf99152 ignite: exchange-worker-#67|INFO |o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture|Coordinator changed, send partitions to new coordinator [ver=AffinityTopologyVersion [topVer=284, minorTopVer=0], crd=6293444a-0f6d-4946-b357-85a6d195a244, newCrd=ad701f62-28ee-4028-8981-8a19dd5de1f8]
2020-10-22T13:33:09.313+00:00 a5912bf99152 ignite: exchange-worker-#67|INFO |o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture|Coordinator failed, node is new coordinator [ver=AffinityTopologyVersion [topVer=284, minorTopVer=0], prev=ad701f62-28ee-4028-8981-8a19dd5de1f8 ]

During those 24 hours there are hundreds of messages about SYSTEM_WORKER_BLOCKED, but that event is ignored by the failure handler:

2020-10-22 06:33:12.732 PDT [grid-timeout-worker-#119] ERROR root -Critical system error detected. Will be handled accordingly to configured handler [hnd=ExpressIgnitionFailureHandler [], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=tcp-client-disco-msg-worker, igniteInstanceName=null, finished=false, heartbeatTs=1603373580648]]]

Based on the logs, it seems there was a network glitch during that interval, at the same time the client was trying to join the cluster. Do you think these events could be related to the blocked start0 method? Is it possible that the glitch/coordinator change caused the join request or response to get lost, making that latch block forever? Any suggestions for handling this case? (Is there any change in 2.8.1 or 2.9 that may apply?)

Thanks for your time.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
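P.S. To make the question concrete, this is roughly the kind of client-side configuration I imagine could mitigate the indefinite wait; whether it actually prevents this hang is exactly what I'm asking about. The 60-second value is an arbitrary example, and ExpressIgnitionFailureHandler above is our own handler, so the stock handler below is just for illustration:

    // Sketch against the Ignite 2.x public API (not our actual code).
    IgniteConfiguration cfg = new IgniteConfiguration();

    TcpDiscoverySpi disco = new TcpDiscoverySpi();
    // joinTimeout defaults to 0, i.e. wait indefinitely for the join to
    // complete. A finite timeout should make the client fail with an
    // exception instead of blocking forever on the startup latch.
    disco.setJoinTimeout(60_000);
    cfg.setDiscoverySpi(disco);

    // Stock handler that stops/halts the node on critical failures. Note
    // that SYSTEM_WORKER_BLOCKED is in the ignored failure types by
    // default, so it would still not trigger the handler unless the
    // ignored set is changed.
    cfg.setFailureHandler(new StopNodeOrHaltFailureHandler());

    Ignite ignite = Ignition.start(cfg);

Is setting a non-zero joinTimeout the recommended way to avoid this, or is there a better mechanism in 2.8.1/2.9?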