[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#ff}724{color} > 910 > 655 > ...* 
node failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

*Coordinator deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

*nodes 601, 724, 910, 655 were deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

*Spi on coordinator received node 724 failed message:*

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

*topology rolled to version 187, then another node 931 failed:*

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.domain/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#15

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#ff}724{color} > 910 > 655 > ...* 
node failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

*Coordinator deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

*nodes 601, 724, 910, 655 were deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

*Spi on coordinator received node 724 failed message:*

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

*topology rolled to version 187, then another node 931 failed:*

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.domain/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#15

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#ff}724{color} > 910 > 655 > ...* 
node failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

*Coordinator deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

*nodes 601, 724, 910, 655 were deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

*Spi on coordinator received node 724 failed message:*

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

*topology rolled to version 187, then another node 931 failed:*

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.domain/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#15

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#ff}724{color} > 910 > 655 > ...* 
node failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

*Coordinator deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

*nodes 601, 724, 910, 655 were deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

*Spi on coordinator received node 724 failed message:*

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

*topology rolled to version 187, then another node 931 failed:*

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#ff}724{color} > 910 > 655 > ...* 
node failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

*Coordinator deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

*nodes 601, 724, 910, 655 were deactivated:*

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

*Spi on coordinator received node 724 failed message:*

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

*topology rolled to version 187, then another node 931 failed:*

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Attachment: 910_gc_server_memory.log.0.current.7z

> Deactivation, segmentation of one node, activation may lead to hang 
> activation forever
> --
>
> Key: IGNITE-9793
> URL: https://issues.apache.org/jira/browse/IGNITE-9793
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.5
>Reporter: Pavel Voronkin
>Priority: Major
> Attachments: 601_gc_server_memory.log.0.current.7z, 
> 724_gc_server_memory.log.0.current.7z, 910_gc_server_memory.log.0.current.7z
>
>
> There is topology with ring of nodes:
> *665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated
> *665(coordinator)* *> 601 > {color:#FF}724{color} > 910 > 655 > ...* 
> node failed
> *665(coordinator) > 601 > 910 > 655 > ...* **activated
> During activation node 910 haven't received StateChangedMessage, hovever 655 
> and all subsequent nodes received and responded to coordinator.
> So coordinator expects to have 154 messages but received only 153 that is why 
> activation hangs.
> Details bellow:
>  
> Coordinator deactivated:
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
>  2018-09-24 15:09:01.620 
> [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
>  Server latch is created [latch=CompletableLatchUid
> {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}
> , participantsSize=160]
>  2018-09-24 15:09:01.621 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]
> nodes 601, 724, 910, 655 were deactivated:
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.328 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.334 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.332 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> Spi on coordinator received node 724 failed message:
> 2018-09-24 15:17:00.220 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
> addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
> [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, 
> intOrder=110, lastExchangeTime=1537528210290, loc=false, 
> ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
> offheap=19.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Baseline [id=6, size=160, online=156, offline=4]
> topology rolled to version 187, then another node 931 failed:
> 2018-09-24 15:17:00.466 [WARN 
> ][disco-event-worke

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Attachment: 724_gc_server_memory.log.0.current.7z

> Deactivation, segmentation of one node, activation may lead to hang 
> activation forever
> --
>
> Key: IGNITE-9793
> URL: https://issues.apache.org/jira/browse/IGNITE-9793
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.5
>Reporter: Pavel Voronkin
>Priority: Major
> Attachments: 601_gc_server_memory.log.0.current.7z, 
> 724_gc_server_memory.log.0.current.7z
>
>
> There is topology with ring of nodes:
> *665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated
> *665(coordinator)* *> 601 > {color:#FF}724{color} > 910 > 655 > ...* 
> node failed
> *665(coordinator) > 601 > 910 > 655 > ...* **activated
> During activation node 910 haven't received StateChangedMessage, hovever 655 
> and all subsequent nodes received and responded to coordinator.
> So coordinator expects to have 154 messages but received only 153 that is why 
> activation hangs.
> Details bellow:
>  
> Coordinator deactivated:
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
>  2018-09-24 15:09:01.620 
> [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
>  Server latch is created [latch=CompletableLatchUid
> {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}
> , participantsSize=160]
>  2018-09-24 15:09:01.621 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]
> nodes 601, 724, 910, 655 were deactivated:
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.328 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.334 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.332 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> Spi on coordinator received node 724 failed message:
> 2018-09-24 15:17:00.220 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
> addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
> [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, 
> intOrder=110, lastExchangeTime=1537528210290, loc=false, 
> ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
> offheap=19.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Baseline [id=6, size=160, online=156, offline=4]
> topology rolled to version 187, then another node 931 failed:
> 2018-09-24 15:17:00.466 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%D

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Attachment: 601_gc_server_memory.log.0.current.7z

> Deactivation, segmentation of one node, activation may lead to hang 
> activation forever
> --
>
> Key: IGNITE-9793
> URL: https://issues.apache.org/jira/browse/IGNITE-9793
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.5
>Reporter: Pavel Voronkin
>Priority: Major
> Attachments: 601_gc_server_memory.log.0.current.7z
>
>
> There is topology with ring of nodes:
> *665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated
> *665(coordinator)* *> 601 > {color:#FF}724{color} > 910 > 655 > ...* 
> node failed
> *665(coordinator) > 601 > 910 > 655 > ...* **activated
> During activation node 910 haven't received StateChangedMessage, hovever 655 
> and all subsequent nodes received and responded to coordinator.
> So coordinator expects to have 154 messages but received only 153 that is why 
> activation hangs.
> Details bellow:
>  
> Coordinator deactivated:
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
>  2018-09-24 15:09:01.620 
> [DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
>  Server latch is created [latch=CompletableLatchUid
> {id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}
> , participantsSize=160]
>  2018-09-24 15:09:01.621 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]
> nodes 601, 724, 910, 655 were deactivated:
> 2018-09-24 15:09:01.609 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.328 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.334 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> 2018-09-24 15:09:03.332 [INFO 
> ][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
>  Successfully deactivated data structures, services and caches 
> [nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
> topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
> Spi on coordinator received node 724 failed message:
> 2018-09-24 15:17:00.220 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
> addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
> [grid724.domain/10.116.206.98:47500], discPort=47500, order=110, 
> intOrder=110, lastExchangeTime=1537528210290, loc=false, 
> ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
> offheap=19.0GB, heap=4800.0GB]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
>  2018-09-24 15:17:00.221 [INFO 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
>  ^-- Baseline [id=6, size=160, online=156, offline=4]
> topology rolled to version 187, then another node 931 failed:
> 2018-09-24 15:17:00.466 [WARN 
> ][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.Grid

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator)* *> 601 > {color:#FF}724{color} > 910 > 655 > ...* 
node failed

*665(coordinator) > 601 > 910 > 655 > ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

 

Coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator) --> 601 -> {color:#ff}724{color} - > 910 -> 655-* -*> 
...* **node failed

*665(coordinator) -> 601 -> 910 -> 655-*  *> ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

 

Coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridN

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) > 601 > 724 > 910 > 655 > ...* deactivated

*665(coordinator) -> 601 > {color:#ff}724{color} > 910 > 655-* *> ...* 
**node failed

*665(coordinator) -> 601 -> 910 -> 655-*  *> ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

 

Coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeNam

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) -> 601 -> 724 - > 910 -> 655 -> ...* deactivated

*665(coordinator) --> 601 -> {color:#ff}724{color} - > 910 -> 655-* -*> 
...* **node failed

*665(coordinator) -> 601 -> 910 -> 655-*  *> ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

 

Coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%Dp

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is topology with ring of nodes:

*665(coordinator) -> 601 -> 724 - > 910 -> 655 -> ...* deactivated

*665(coordinator) -> 601 -> {color:#FF}724{color} - > 910 -> 655* *-> ...* 
**node failed

*665(coordinator) -> 601 -> 910 -> 655*  *-> ...* **activated

During activation node 910 haven't received StateChangedMessage, hovever 655 
and all subsequent nodes received and responded to coordinator.

So coordinator expects to have 154 messages but received only 153 that is why 
activation hangs.

Details bellow:

 

Coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
 2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4]

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
 2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplG

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is coordinator and ring of nodes

665(coordinator) -> 601 -> 724 - > 910 -> 655

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}, 
participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 601, 724, 910, 655 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Spi on coordinator received node 724 failed message:

2018-09-24 15:17:00.220 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, 
addrs=ArrayList [10.116.206.98], sockAddrs=HashSet 
[grid724.domain/10.116.206.98:47500], discPort=47500, order=110, intOrder=110, 
lastExchangeTime=1537528210290, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=187, servers=156, clients=0, CPUs=8736, 
offheap=19.0GB, heap=4800.0GB]
2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
2018-09-24 15:17:00.221 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=156, offline=4] 

topology rolled to version 187, then another node 931 failed:

2018-09-24 15:17:00.466 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=83536b6d-8aa3-4c85-b3da-5e577ae37ac6, 
addrs=ArrayList [10.116.215.3], sockAddrs=HashSet 
[grid931.ca.sbrf.ru/10.116.215.3:47500], discPort=47500, order=73, intOrder=73, 
lastExchangeTime=1537528186599, loc=false, ver=2.5.1#20180906-sha1:ebde6c79, 
isClient=false]
2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Topology snapshot [ver=188, servers=155, clients=0, CPUs=8680, 
offheap=19.0GB, heap=4800.0GB]
2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Node [id=E002E011-8D1C-4353-A0F3-B71264C5B0F4, clusterState=INACTIVE]
2018-09-24 15:17:00.467 [INFO 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 ^-- Baseline [id=6, size=160, online=155, offline=5]

topology rolled to version 188.

Node 724 SEGMENTED before activation starts:

2018-09-24 15:17:50.068 [WARN 
][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
 Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN 
][disc

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is coordinator and ring of nodes

coordinator -> 1 -> 2 - > 3 -> 4

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 1, 2, 3, 4 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Node 2 SEGMENTED

2018-09-24 15:17:50.068 [WARN 
][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
 Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Local node SEGMENTED: TcpDiscoveryNode 
[id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]

Coordinator started activation on topology without node2

2018-09-24 15:19:48.686 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

But node 3 which is next to node 2 haven't received activation message.

Coordinator sent activation to all except 3.

2018-09-24 15:24:25.911 [INFO 
][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, 
minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, allReceived=false]

GridDhtPartitionsExchangeFuture hangs forever.

 So one node in the ring missed the message, hovever all other nodes in 
topology got it, how is that possible?

 

  was:
There is coordinator and ring of nodes

coordinator -> 1 -> 2 - > 3 -> 4

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#1

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is coordinator and ring of nodes

coordinator -> 1 -> 2 - > 3 -> 4

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid

{id='exchange', topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}

, participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 1, 2, 3, 4 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Node 2 SEGMENTED

2018-09-24 15:17:50.068 [WARN 
][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
 Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Local node SEGMENTED: TcpDiscoveryNode 
[id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]

Coordinator started activation on topology without node2

2018-09-24 15:19:48.686 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

But node 3 which is next to node 2 haven't received activation message.

Coordinator sent activation to all except 3.

2018-09-24 15:24:25.911 [INFO 
][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, 
minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, allReceived=false]

GridDhtPartitionsExchangeFuture hangs forever.

 

 

  was:
There is coordinator and ring of nodes

coordinator -> 1 -> 2 - > 3 -> 4

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid{id='exchange', 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}, 
participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 1, 2, 3, 4 were deactivated:

2018-09-24 15:

[jira] [Updated] (IGNITE-9793) Deactivation, segmentation of one node, activation may lead to hang activation forever

2018-10-04 Thread Pavel Voronkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Voronkin updated IGNITE-9793:
---
Description: 
There is coordinator and ring of nodes

coordinator -> 1 -> 2 - > 3 -> 4

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
 2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid{id='exchange', 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}, 
participantsSize=160]
 2018-09-24 15:09:01.621 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%]

nodes 1, 2, 3, 4 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.328 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=22a58223-47b5-43c2-897b-e70e8e50edf7, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.334 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=973eb8ce-3b8c-463d-a6ab-00ac66d93f13, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

2018-09-24 15:09:03.332 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=a904bac4-aaed-4f69-90f3-c13bc4d331d1, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]

Node 2 SEGMENTED

2018-09-24 15:17:50.068 [WARN 
][tcp-disco-msg-worker-#2%DPL_GRID%DplGridNodeName%|#2%DPL_GRID%DplGridNodeName%][o.a.i.s.d.tcp.TcpDiscoverySpi]
 Node is out of topology (probably, due to short-time network problems).
 2018-09-24 15:17:50.069 [WARN 
][disco-event-worker-#152%DPL_GRID%DplGridNodeName%|#152%DPL_GRID%DplGridNodeName%][o.a.i.i.m.d.GridDiscoveryManager]
 Local node SEGMENTED: TcpDiscoveryNode 
[id=a904bac4-aaed-4f69-90f3-c13bc4d331d1, addrs=ArrayList [10.116.206.98], 
sockAddrs=HashSet [grid724.domain/10.116.206.98:47500], discPort=47500, 
order=110, intOrder=110, lastExchangeTime=1537791470063, loc=true, 
ver=2.5.1#20180906-sha1:ebde6c79, isClient=false]

Coordinator started activation on topology without node2

2018-09-24 15:19:48.686 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%|#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Start activation process [nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, 
client=false, topVer=AffinityTopologyVersion [topVer=188, minorTopVer=1]]

But node 3 which is next to node 2 haven't received activation message.

Coordinator sent activation to all except 3.

Node 3 haven't received activation message.

2018-09-24 15:24:25.911 [INFO 
][sys-#28144%DPL_GRID%DplGridNodeName%|#28144%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Coordinator received single message [ver=AffinityTopologyVersion [topVer=188, 
minorTopVer=1], node=073f1598-6b70-49df-8f45-126735611775, allReceived=false]

GridDhtPartitionsExchangeFuture hangs forever.

 

 

 

  was:
There is coordinator and ring of nodes

coordinator -> 1 -> 2 - > 3 -> 4

coordinator deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=e002e011-8d1c-4353-a0f3-b71264c5b0f4, client=false, 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]]
2018-09-24 15:09:01.620 
[DEBUG][exchange-worker-#153%DPL_GRID%DplGridNodeName%][o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager]
 Server latch is created [latch=CompletableLatchUid\{id='exchange', 
topVer=AffinityTopologyVersion [topVer=183, minorTopVer=1]}, 
participantsSize=160]
2018-09-24 15:09:01.621 [INFO ][exchange-worker-#153%DPL_GRID%DplGridNodeName%]

nodes 1, 2, 3, 4 were deactivated:

2018-09-24 15:09:01.609 [INFO 
][exchange-worker-#153%DPL_GRID%