Hello,
we recently had a production incident in which our application got stuck
connecting to the cluster. The *IgnitionEx.start0* method was blocked for
more than 24 hours waiting for a latch that was never notified. Eventually,
the container was restarted in order to recover the service.

This is the stack trace of that thread:
<http://apache-ignite-users.70518.x6.nabble.com/file/t1923/Screen_Shot_2020-10-27_at_3.png>
 


This happened close to an Ignite server node restart due to SEGMENTATION.
These are some lines I extracted from that server's logs that may be
relevant (though I am not sure):

2020-10-22T13:33:03.348+00:00 a5912bf99152 ignite:
tcp-disco-msg-worker-#2|WARN |o.a.i.s.d.tcp.TcpDiscoverySpi|Node is out of
topology (probably, due to short-time network problems).

2020-10-22T13:33:03.349+00:00 a5912bf99152 ignite:
disco-event-worker-#66|WARN |o.a.i.i.m.d.GridDiscoveryManager|Local node
SEGMENTED: TcpDiscoveryNode [id=2296e9a7-96d6-44d9-af3b-4e22e33261ea,
addrs=[10.133.3.6, 127.0.0.1], sockAddrs=[/127.0.0.1:47500,
a5912bf99152/10.133.3.6:47500], discPort=47500, order=276, intOrder=142,
lastExchangeTime=1603373583342, loc=true, ver=2.7.6#20190911-sha1:21f7ca41,
isClient=false]

2020-10-22T13:33:04.232+00:00 a5912bf99152 ignite:
node-stopper|ERROR|ROOT|Stopping local node on Ignite failure:
[failureCtx=FailureContext [type=SEGMENTATION, err=null]]

2020-10-22T13:33:09.312+00:00 a5912bf99152 ignite: exchange-worker-#67|INFO
|o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture|Coordinator changed, send
partitions to new coordinator [ver=AffinityTopologyVersion [topVer=284,
minorTopVer=0], crd=6293444a-0f6d-4946-b357-85a6d195a244,
newCrd=ad701f62-28ee-4028-8981-8a19dd5de1f8]

2020-10-22T13:33:09.313+00:00 a5912bf99152 ignite: exchange-worker-#67|INFO
|o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture|Coordinator failed, node
is new coordinator [ver=AffinityTopologyVersion [topVer=284, minorTopVer=0],
prev=ad701f62-28ee-4028-8981-8a19dd5de1f8
]


During those 24 hours there were hundreds of messages about
SYSTEM_WORKER_BLOCKED, but that event is ignored by the failure handler:

2020-10-22 06:33:12.732 PDT [grid-timeout-worker-#119] ERROR root -
Critical system error detected. Will be handled accordingly to configured
handler [hnd=ExpressIgnitionFailureHandler [], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=tcp-client-disco-msg-worker, igniteInstanceName=null, finished=false,
heartbeatTs=1603373580648]]]


Based on the logs, it seems there was a network glitch during that
interval, at the same time the client was trying to join the cluster.
Do you think these events could be related to the blocked start0 method? Is
it possible that the glitch/coordinator change caused the join
request/response to get lost, making that latch block forever?

Any suggestions for handling this case? (Is there any change in 2.8.1 or
2.9 that may apply?)
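For reference, this is a rough sketch of the kind of client configuration I
am considering as a workaround (assuming an upgrade to 2.8.1+, where
setIgnoredFailureTypes on AbstractFailureHandler is available; the 60 s
join timeout is just an example value, not a recommendation):

```java
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class ClientStartSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true);

        // Fail the join attempt after 60 s instead of blocking in start0
        // forever if the join request/response is lost.
        TcpDiscoverySpi disco = new TcpDiscoverySpi();
        disco.setJoinTimeout(60_000);
        cfg.setDiscoverySpi(disco);

        // SYSTEM_WORKER_BLOCKED is in the handler's ignored set by default;
        // clearing the set makes the handler act on it instead of only
        // logging the error.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
        hnd.setIgnoredFailureTypes(Collections.emptySet());
        cfg.setFailureHandler(hnd);

        try (Ignite ignite = Ignition.start(cfg)) {
            // Application logic goes here.
        }
    }
}
```

With a join timeout the client start throws instead of hanging, which our
orchestration could then retry; but I would still like to understand the
root cause of the lost join.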
Thanks for your time. 












--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
