[ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
-----------------------------------
    Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#ff0000}724{color} > 910 > 655 > ...* 
****node failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

*Coordinator deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

*nodes 601, 724, 910, 655 were deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

*Spi on coordinator received node 724 failed message:*

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=190000.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

*topology rolled to version 187, then another node 931 failed:*

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.domain/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=190000.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=155, offline=5]

*topology rolled to version 188.*

*Coordinator rejects node 724 form joining the cluster:*

2018-09-24 15:17:27.491 [INFO 
][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
 Accepted incoming communication connection [locAddr=/10.116.206.11:47100, 
rmtAddr=/10.116.206.98:41611]
 2018-09-24 15:17:27.491 [WARN 
][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
 Close incoming connection, unknown node 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, ses=GridSelectorNioSessionImpl 
[worker=GridWorker [name=grid-nio-worker-tcp-comm-9, 
igniteInstanceName=DPL_GRID%DplGridNodeName, finished=false, 
hashCode=632168435, interrupted=false, 
runner=grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%]AbstractNioClientWorker
 [idx=9, bytesRcvd=326476100, bytesSent=39876884360, bytesRcvd0=42, 
bytesSent0=18, select=true, super=]DirectNioClientWorker [super=], 
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], 
readBuf=java.nio.DirectByteBuffer[pos=42 lim=42 cap=32768], inRecovery=null, 
outRecovery=null, super=GridNioSessionImpl [locAddr=/10.116.206.11:47100, 
rmtAddr=/10.116.206.98:41611, createTime=1537791447487, closeTime=0, 
bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, 
sndSchedTime=1537791447487, lastSndTime=1537791447487, 
lastRcvTime=1537791447487, readsPaused=false, 
filterChain=FilterChain[filters=[, GridConnectionBytesVerifyFilter], 
accepted=true, markedForClose=false]]] 

*Node 724 SEGMENTED before activation starts:*

2018-09-24 15:17:50.068 [WARN 
][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
 Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Local node SEGMENTED: TcpDiscoveryNode 
[id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]

2018-09-24 15:18:07.145 
[ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
 Critical system error detected. Will be handled accordingly to configured 
handler [hnd=class o.a.i.failure.StopNodeFailureHandler, 
failureCtx=FailureContext [type=SEGMENTATION, err=null]]

*Node 724 stopped:*

2018-09-24 15:18:09.931 [INFO 
][node-stopper][o.a.i.i.IgniteKernal%DPL_GRID%DplGridNodeName] 

*Coordinator started activating topology without nodes 724 and 931:*

2018-09-24 15:19:48.686 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

*Node 601 received activation message:*

2018-09-24 15:21:08.430 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

*But node 910 which is next to node 724 missed ChangeStateMessage activation 
message from 601.*

*Nevertheless node 601 has connected to node 910:*

2018-09-21 14:29:35.843 [INFO 
][grid-nio-worker-tcp-comm-2-#122%DPL_GRID%DplGridNodeName%|#122%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
 Established outgoing communication connection [locAddr=/10.116.214.83:59739, 
rmtAddr=grid910.domain/10.116.214.210:47100]

*Hovever 655 which is next after 910 received ChangeStateMessage:*

2018-09-24 15:21:08.816 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

*Coordinator received all expected messages except 910, that last was:*

2018-09-24 15:24:25.911 [INFO 
][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, 
minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, *allReceived=false*]

*GridDhtPartitionsExchangeFuture hangs forever:*

09-24 15:33:46.784 [WARN 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.apache.ignite.internal.diagnostic]
 Failed to wait for partition map exchange [topVer=AffinityTopologyVersion 
[topVer=188, minorTopVer=1], node=e002e011-8d1c-4353-a0f3-b71264c5b0f4]. 
Dumping pending objects that might be the cause:

So one node in the ring missed the message, hovever all other nodes in topology 
got it, how is that possible?

GC logs for node 609, 724, 910 attached. There is no anything unusual in GC for 
nodes 609 or 910.

  was:
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#ff0000}724{color} > 910 > 655 > ...* 
****node failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

*Coordinator deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

*nodes 601, 724, 910, 655 were deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

*Spi on coordinator received node 724 failed message:*

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=190000.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

*topology rolled to version 187, then another node 931 failed:*

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.domain/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=190000.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=155, offline=5]

*topology rolled to version 188.*

*Coordinator rejects node 724 form joining the cluster:*

2018-09-24 15:17:27.491 [INFO 
][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
 Accepted incoming communication connection [locAddr=/10.116.206.11:47100, 
rmtAddr=/10.116.206.98:41611]
 2018-09-24 15:17:27.491 [WARN 
][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
 Close incoming connection, unknown node 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, ses=GridSelectorNioSessionImpl 
[worker=GridWorker [name=grid-nio-worker-tcp-comm-9, 
igniteInstanceName=DPL_GRID%DplGridNodeName, finished=false, 
hashCode=632168435, interrupted=false, 
runner=grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%]AbstractNioClientWorker
 [idx=9, bytesRcvd=326476100, bytesSent=39876884360, bytesRcvd0=42, 
bytesSent0=18, select=true, super=]DirectNioClientWorker [super=], 
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], 
readBuf=java.nio.DirectByteBuffer[pos=42 lim=42 cap=32768], inRecovery=null, 
outRecovery=null, super=GridNioSessionImpl [locAddr=/10.116.206.11:47100, 
rmtAddr=/10.116.206.98:41611, createTime=1537791447487, closeTime=0, 
bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, 
sndSchedTime=1537791447487, lastSndTime=1537791447487, 
lastRcvTime=1537791447487, readsPaused=false, 
filterChain=FilterChain[filters=[, GridConnectionBytesVerifyFilter], 
accepted=true, markedForClose=false]]] 

*Node 724 SEGMENTED before activation starts:*

2018-09-24 15:17:50.068 [WARN 
][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
 Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Local node SEGMENTED: TcpDiscoveryNode 
[id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]

2018-09-24 15:18:07.145 
[ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
 Critical system error detected. Will be handled accordingly to configured 
handler [hnd=class o.a.i.failure.StopNodeFailureHandler, 
failureCtx=FailureContext [type=SEGMENTATION, err=null]]

*Node 724 stopped:*

2018-09-24 15:18:09.931 [INFO 
][node-stopper][o.a.i.i.IgniteKernal%DPL_GRID%DplGridNodeName] 

*Coordinator started activating topology without nodes 724 and 931:*

2018-09-24 15:19:48.686 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

*Node 601 received activation message:*

2018-09-24 15:21:08.430 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

*But node 910 which is next to node 724 missed ChangeStateMessage activation 
message from 601.*

*Nevertheless node 601 has connected to node 910:*

2018-09-21 14:29:35.843 [INFO 
][grid-nio-worker-tcp-comm-2-#122%DPL_GRID%DplGridNodeName%|#122%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
 Established outgoing communication connection [locAddr=/10.116.214.83:59739, 
rmtAddr=grid910.domain/10.116.214.210:47100]

*Hovever 655 which is next after 910 received ChangeStateMessage:*

2018-09-24 15:21:08.816 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

Coordinator received all expected messages except 910, that was:

2018-09-24 15:24:25.911 [INFO 
][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, 
minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, *allReceived=false*]

*GridDhtPartitionsExchangeFuture hangs forever:*

09-24 15:33:46.784 [WARN 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.apache.ignite.internal.diagnostic]
 Failed to wait for partition map exchange [topVer=AffinityTopologyVersion 
[topVer=188, minorTopVer=1], node=e002e011-8d1c-4353-a0f3-b71264c5b0f4]. 
Dumping pending objects that might be the cause:

 

So one node in the ring missed the message, hovever all other nodes in topology 
got it, how is that possible?

GC logs for node 609, 724, 910 attached. There is no anything unusual in GC for 
nodes 609 or 910.


> Deactivation, segmentation of one node, activation may lead to hang 
> activation forever
> --------------------------------------------------------------------------------------
>
>                 Key: IGNITE-9793
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9793
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.5
>            Reporter: Pavel Voronkin
>            Priority: Major
>         Attachments: 601_gc_server_memory.log.0.current.7z, 
> 724_gc_server_memory.log.0.current.7z, 910_gc_server_memory.log.0.current.7z
>
>
> There is topology with ring of nodes:
> *665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated
> *665(coordinator)* *> 601 > {color:#ff0000}724{color} > 910 > 655 > ...* 
> ****node failed
> *665(coordinator) > 601 > 910 > 655 > ...* **activated
> During activation node 910 haven't received StateChangedMessage, hovever 655 
> and all subsequent nodes received and responded to coordinator.
> So coordinator expects to have 154 messages but received only 153 that is why 
> activation hangs.
> Details bellow:
> *Coordinator deactivated:*
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
>  2018-09-24 15:09:01.620 
> [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
>  Server latch is created [latch=CompletableLatchUid
> {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}
> , participantsSize=160]
>  2018-09-24 15:09:01.621 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]
> *nodes 601, 724, 910, 655 were deactivated:*
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.328 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.334 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.332 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> *Spi on coordinator received node 724 failed message:*
> 2018-09-24 15:17:00.220 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
> addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
> [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, 
> intOrder=110, lastExchangeTime=1537528210290, loc=false, 
> ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
> offheap=190000.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Baseline [id=6, size=160, online=156, offline=4]
> *topology rolled to version 187, then another node 931 failed:*
> 2018-09-24 15:17:00.466 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
> addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
> [grid931.domain/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
> lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
> isClient=false]
>  2018-09-24 15:17:00.467 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
> offheap=190000.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.467 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.467 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Baseline [id=6, size=160, online=155, offline=5]
> *topology rolled to version 188.*
> *Coordinator rejects node 724 form joining the cluster:*
> 2018-09-24 15:17:27.491 [INFO 
> ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
>  Accepted incoming communication connection [locAddr=/10.116.206.11:47100, 
> rmtAddr=/10.116.206.98:41611]
>  2018-09-24 15:17:27.491 [WARN 
> ][grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
>  Close incoming connection, unknown node 
> [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, ses=GridSelectorNioSessionImpl 
> [worker=GridWorker [name=grid-nio-worker-tcp-comm-9, 
> igniteInstanceName=DPL_GRID%DplGridNodeName, finished=false, 
> hashCode=632168435, interrupted=false, 
> runner=grid-nio-worker-tcp-comm-9-#129%DPL_GRID%DplGridNodeName%|#129%DPL_GRID%DplGridNodeName%]AbstractNioClientWorker
>  [idx=9, bytesRcvd=326476100, bytesSent=39876884360, bytesRcvd0=42, 
> bytesSent0=18, select=true, super=]DirectNioClientWorker [super=], 
> writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], 
> readBuf=java.nio.DirectByteBuffer[pos=42 lim=42 cap=32768], inRecovery=null, 
> outRecovery=null, super=GridNioSessionImpl [locAddr=/10.116.206.11:47100, 
> rmtAddr=/10.116.206.98:41611, createTime=1537791447487, closeTime=0, 
> bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, 
> sndSchedTime=1537791447487, lastSndTime=1537791447487, 
> lastRcvTime=1537791447487, readsPaused=false, 
> filterChain=FilterChain[filters=[, GridConnectionBytesVerifyFilter], 
> accepted=true, markedForClose=false]]] 
> *Node 724 SEGMENTED before activation starts:*
> 2018-09-24 15:17:50.068 [WARN 
> ][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
>  Node is out of topology (probably, due to short-time network problems).
>  2018-09-24 15:17:50.069 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Local node SEGMENTED: TcpDiscoveryNode 
> [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
> sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
> order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
> ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
> 2018-09-24 15:18:07.145 
> [ERROR][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][org.apache.ignite.Ignite]
>  Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=class o.a.i.failure.StopNodeFailureHandler, 
> failureCtx=FailureContext [type=SEGMENTATION, err=null]]
> *Node 724 stopped:*
> 2018-09-24 15:18:09.931 [INFO 
> ][node-stopper][o.a.i.i.IgniteKernal%DPL_GRID%DplGridNodeName] 
> *Coordinator started activating topology without nodes 724 and 931:*
> 2018-09-24 15:19:48.686 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
> client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]
> *Node 601 received activation message:*
> 2018-09-24 15:21:08.430 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, 
> client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]
> *But node 910 which is next to node 724 missed ChangeStateMessage activation 
> message from 601.*
> *Nevertheless node 601 has connected to node 910:*
> 2018-09-21 14:29:35.843 [INFO 
> ][grid-nio-worker-tcp-comm-2-#122%DPL_GRID%DplGridNodeName%|#122%DPL_GRID%DplGridNodeName%][o.a.i.s.c.tcp.TcpCommunicationSpi]
>  Established outgoing communication connection [locAddr=/10.116.214.83:59739, 
> rmtAddr=grid910.domain/10.116.214.210:47100]
> *Hovever 655 which is next after 910 received ChangeStateMessage:*
> 2018-09-24 15:21:08.816 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Start activation process [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, 
> client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]
> *Coordinator received all expected messages except 910, that last was:*
> 2018-09-24 15:24:25.911 [INFO 
> ][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Coordinator received single message [ver=AffinityTopologyVersion 
> [topVer=188, minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, 
> *allReceived=false*]
> *GridDhtPartitionsExchangeFuture hangs forever:*
> 09-24 15:33:46.784 [WARN 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.apache.ignite.internal.diagnostic]
>  Failed to wait for partition map exchange [topVer=AffinityTopologyVersion 
> [topVer=188, minorTopVer=1], node=e002e011-8d1c-4353-a0f3-b71264c5b0f4]. 
> Dumping pending objects that might be the cause:
> So one node in the ring missed the message, hovever all other nodes in 
> topology got it, how is that possible?
> GC logs for node 609, 724, 910 attached. There is no anything unusual in GC 
> for nodes 609 or 910.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to