We are running an application on two servers, each running an Ignite node in the same cluster. As the logs below show, at some point the nodes had trouble communicating with each other. What I would really like to know is why one of the nodes seemed to recover and the other did not. Is there something I should be looking for, or some setting that might be misconfigured?

Thanks,
Ralph

192.168.202.110

2018-01-04 22:16:52 WARN [ ] TcpDiscoverySpi:133 - Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=10000, rmtAddr=/192.168.202.111:47500, rmtPort=47500]
2018-01-04 22:16:52 WARN [ ] TcpDiscoverySpi:133 - Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=50668565061-9d6b5111-c433-4e43-997b-3c803c84ee45, verifierNodeId=9d6b5111-c433-4e43-997b-3c803c84ee45, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=225ed6e9-3116-465f-bc3a-94818278fd31, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.202.111], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.202.111:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1513268425829, loc=false, ver=2.3.0#20171027-sha1:8add7fd5, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=50668565061-9d6b5111-c433-4e43-997b-3c803c84ee45, verifierNodeId=9d6b5111-c433-4e43-997b-3c803c84ee45, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=225ed6e9-3116-465f-bc3a-94818278fd31, order=12, addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.202.111], daemon=false]]]
2018-01-04 22:16:52 WARN [ ] TcpDiscoverySpi:133 - Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
2018-01-04 22:16:52 WARN [ ] GridDiscoveryManager:133 - Node FAILED: TcpDiscoveryNode [id=225ed6e9-3116-465f-bc3a-94818278fd31, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.202.111], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.202.111:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1513268425829, loc=false, ver=2.3.0#20171027-sha1:8add7fd5, isClient=false]
2018-01-04 22:16:52 INFO [ ] GridDiscoveryManager:128 - Topology snapshot [ver=13, servers=1, clients=0, CPUs=4, heap=2.0GB]
2018-01-04 22:16:52 INFO [ ] time:128 - Started exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=225ed6e9-3116-465f-bc3a-94818278fd31, customEvt=null, allowMerge=true]
2018-01-04 22:16:52 INFO [ ] GridDhtPartitionsExchangeFuture:128 - Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], waitTime=0ms, futInfo=NA]
2018-01-04 22:16:52 INFO [ ] GridDhtPartitionsExchangeFuture:128 - Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=13, minorTopVer=0]]
2018-01-04 22:16:52 INFO [ ] GridDhtPartitionsExchangeFuture:128 - finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=13, minorTopVer=0]]
2018-01-04 22:16:53 INFO [ ] GridDhtPartitionsExchangeFuture:128 - Finish exchange future [startVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], err=null]
2018-01-04 22:16:53 INFO [ ] time:128 - Finished exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true]
2018-01-04 22:16:53 INFO [ ] GridCachePartitionExchangeManager:128 - Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=13, minorTopVer=0], evt=NODE_FAILED, node=225ed6e9-3116-465f-bc3a-94818278fd31]
2018-01-04 22:17:00 INFO [ ] IgniteKernal:128 -

192.168.202.111

2018-01-04 22:16:12 INFO [ ] IgniteKernal:128 - FreeList [name=null, buckets=256, dataPages=5, reusePages=0]
2018-01-04 22:16:12 INFO [ ] IgniteKernal:128 - FreeList [name=null, buckets=256, dataPages=5, reusePages=0]
2018-01-04 22:16:52 INFO [ ] TcpDiscoverySpi:128 - Finished serving remote node connection [rmtAddr=/192.168.202.110:55327, rmtPort=55327
2018-01-04 22:16:52 INFO [ ] TcpDiscoverySpi:128 - TCP discovery accepted incoming connection [rmtAddr=/192.168.202.110, rmtPort=51136]
2018-01-04 22:16:52 INFO [ ] TcpDiscoverySpi:128 - TCP discovery spawning a new thread for connection [rmtAddr=/192.168.202.110, rmtPort=51136]
2018-01-04 22:16:52 INFO [ ] TcpDiscoverySpi:128 - Started serving remote node connection [rmtAddr=/192.168.202.110:51136, rmtPort=51136]
2018-01-04 22:16:52 WARN [ ] TcpDiscoverySpi:133 - Node is out of topology (probably, due to short-time network problems).
2018-01-04 22:16:52 INFO [ ] TcpDiscoverySpi:128 - Finished serving remote node connection [rmtAddr=/192.168.202.110:51136, rmtPort=51136
2018-01-04 22:16:52 WARN [ ] GridDiscoveryManager:133 - Local node SEGMENTED: TcpDiscoveryNode [id=225ed6e9-3116-465f-bc3a-94818278fd31, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.202.111], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.202.111:47500], discPort=47500, order=12, intOrder=7, lastExchangeTime=1515129412681, loc=true, ver=2.3.0#20171027-sha1:8add7fd5, isClient=false]
2018-01-04 22:16:53 WARN [ ] GridDiscoveryManager:133 - Stopping local node according to configured segmentation policy.
2018-01-04 22:16:53 WARN [ ] GridDiscoveryManager:133 - Node FAILED: TcpDiscoveryNode [id=9d6b5111-c433-4e43-997b-3c803c84ee45, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.202.110], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.202.110:47500], discPort=47500, order=10, intOrder=6, lastExchangeTime=1513268425880, loc=false, ver=2.3.0#20171027-sha1:8add7fd5, isClient=false]
2018-01-04 22:16:53 INFO [ ] GridDiscoveryManager:128 - Topology snapshot [ver=13, servers=1, clients=0, CPUs=4, heap=2.0GB]
2018-01-04 22:16:53 INFO [ ] time:128 - Started exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=9d6b5111-c433-4e43-997b-3c803c84ee45, customEvt=null, allowMerge=true]
2018-01-04 22:16:53 INFO [ ] GridTcpRestProtocol:128 - Command protocol successfully stopped: TCP binary
2018-01-04 22:16:53 INFO [ ] GridDhtPartitionsExchangeFuture:128 - Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], waitTime=0ms, futInfo=NA]
2018-01-04 22:16:53 INFO [ ] GridDhtPartitionsExchangeFuture:128 - Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=13, minorTopVer=0]]
2018-01-04 22:16:53 INFO [ ] GridDhtPartitionsExchangeFuture:128 - finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=13, minorTopVer=0]]
2018-01-04 22:16:53 INFO [ ] GridDhtPartitionsExchangeFuture:128 - Finish exchange future [startVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Thread is interrupted: IgniteThread [compositeRwLockIdx=1, stripe=-1, plc=-1, name=exchange-worker-#42]]
2018-01-04 22:16:53 INFO [ ] time:128 - Finished exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true]
2018-01-04 22:16:53 INFO [ ] GridCacheProcessor:128 - Stopped cache [cacheName=loginCache]
2018-01-04 22:16:53 INFO [ ] GridCacheProcessor:128 - Stopped cache [cacheName=sessionCache]
2018-01-04 22:16:53 INFO [ ] GridCacheProcessor:128 - Stopped cache [cacheName=authCache]
2018-01-04 22:16:53 INFO [ ] GridCacheProcessor:128 - Stopped cache [cacheName=ignite-sys-cache]
2018-01-04 22:16:53 INFO [ ] GridCacheProcessor:128 - Stopped cache [cacheName=DirectoryContactCache]
2018-01-04 22:16:54 INFO [ ] IgniteKernal:128 -
