Node fails to recover

Ralph Goers Wed, 17 Jan 2018 07:12:08 -0800

We are running an application on two servers, each running Ignite in a 
cluster. As the logs below show, at some point the nodes had trouble 
communicating with each other. What I would really like to know is why one of 
the nodes seemed to recover and the other did not. Is there something I 
should be looking for, or some setting that might be misconfigured?


Thanks,
Ralph



192.168.202.110
 
2018-01-04 22:16:52 WARN  [ ] TcpDiscoverySpi:133 - Timed out waiting for 
message delivery receipt (most probably, the reason is in long GC pauses on 
remote node; consider tuning GC and increasing 'ackTimeout' configuration 
property). Will retry to send message with increased timeout 
[currentTimeout=10000, rmtAddr=/192.168.202.111:47500, rmtPort=47500]
2018-01-04 22:16:52 WARN  [ ] TcpDiscoverySpi:133 - Failed to send message to 
next node [msg=TcpDiscoveryMetricsUpdateMessage 
[super=TcpDiscoveryAbstractMessage [sndNodeId=null, 
id=50668565061-9d6b5111-c433-4e43-997b-3c803c84ee45, 
verifierNodeId=9d6b5111-c433-4e43-997b-3c803c84ee45, topVer=0, pendingIdx=0, 
failedNodes=null, isClient=false]], next=TcpDiscoveryNode 
[id=225ed6e9-3116-465f-bc3a-94818278fd31, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 
192.168.202.111], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
/192.168.202.111:47500], discPort=47500, order=12, intOrder=7, 
lastExchangeTime=1513268425829, loc=false, ver=2.3.0#20171027-sha1:8add7fd5, 
isClient=false], errMsg=Failed to send message to next node 
[msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage 
[sndNodeId=null, id=50668565061-9d6b5111-c433-4e43-997b-3c803c84ee45, 
verifierNodeId=9d6b5111-c433-4e43-997b-3c803c84ee45, topVer=0, pendingIdx=0, 
failedNodes=null, isClient=false]], next=ClusterNode 
[id=225ed6e9-3116-465f-bc3a-94818278fd31, order=12, addr=[0:0:0:0:0:0:0:1%lo, 
127.0.0.1, 192.168.202.111], daemon=false]]]
2018-01-04 22:16:52 WARN  [ ] TcpDiscoverySpi:133 - Local node has detected 
failed nodes and started cluster-wide procedure. To speed up failure detection 
please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
2018-01-04 22:16:52 WARN  [ ] GridDiscoveryManager:133 - Node FAILED: 
TcpDiscoveryNode [id=225ed6e9-3116-465f-bc3a-94818278fd31, 
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.202.111], 
sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
/192.168.202.111:47500], discPort=47500, order=12, intOrder=7, 
lastExchangeTime=1513268425829, loc=false, ver=2.3.0#20171027-sha1:8add7fd5, 
isClient=false]
2018-01-04 22:16:52 INFO  [ ] GridDiscoveryManager:128 - Topology snapshot 
[ver=13, servers=1, clients=0, CPUs=4, heap=2.0GB]
2018-01-04 22:16:52 INFO  [ ] time:128 - Started exchange init 
[topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true, 
evt=NODE_FAILED, evtNode=225ed6e9-3116-465f-bc3a-94818278fd31, customEvt=null, 
allowMerge=true]
2018-01-04 22:16:52 INFO  [ ] GridDhtPartitionsExchangeFuture:128 - Finished 
waiting for partition release future [topVer=AffinityTopologyVersion 
[topVer=13, minorTopVer=0], waitTime=0ms, futInfo=NA]
2018-01-04 22:16:52 INFO  [ ] GridDhtPartitionsExchangeFuture:128 - Coordinator 
received all messages, try merge [ver=AffinityTopologyVersion [topVer=13, 
minorTopVer=0]]
2018-01-04 22:16:52 INFO  [ ] GridDhtPartitionsExchangeFuture:128 - 
finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=13, 
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=13, minorTopVer=0]]
2018-01-04 22:16:53 INFO  [ ] GridDhtPartitionsExchangeFuture:128 - Finish 
exchange future [startVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], 
resVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], err=null]
2018-01-04 22:16:53 INFO  [ ] time:128 - Finished exchange init 
[topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true]
2018-01-04 22:16:53 INFO  [ ] GridCachePartitionExchangeManager:128 - Skipping 
rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=13, 
minorTopVer=0], evt=NODE_FAILED, node=225ed6e9-3116-465f-bc3a-94818278fd31]
2018-01-04 22:17:00 INFO  [ ] IgniteKernal:128 -
 
192.168.202.111
 
2018-01-04 22:16:12 INFO  [ ] IgniteKernal:128 - FreeList [name=null, 
buckets=256, dataPages=5, reusePages=0]
2018-01-04 22:16:12 INFO  [ ] IgniteKernal:128 - FreeList [name=null, 
buckets=256, dataPages=5, reusePages=0]
2018-01-04 22:16:52 INFO  [ ] TcpDiscoverySpi:128 - Finished serving remote 
node connection [rmtAddr=/192.168.202.110:55327, rmtPort=55327
2018-01-04 22:16:52 INFO  [ ] TcpDiscoverySpi:128 - TCP discovery accepted 
incoming connection [rmtAddr=/192.168.202.110, rmtPort=51136]
2018-01-04 22:16:52 INFO  [ ] TcpDiscoverySpi:128 - TCP discovery spawning a 
new thread for connection [rmtAddr=/192.168.202.110, rmtPort=51136]
2018-01-04 22:16:52 INFO  [ ] TcpDiscoverySpi:128 - Started serving remote node 
connection [rmtAddr=/192.168.202.110:51136, rmtPort=51136]
2018-01-04 22:16:52 WARN  [ ] TcpDiscoverySpi:133 - Node is out of topology 
(probably, due to short-time network problems).
2018-01-04 22:16:52 INFO  [ ] TcpDiscoverySpi:128 - Finished serving remote 
node connection [rmtAddr=/192.168.202.110:51136, rmtPort=51136
2018-01-04 22:16:52 WARN  [ ] GridDiscoveryManager:133 - Local node SEGMENTED: 
TcpDiscoveryNode [id=225ed6e9-3116-465f-bc3a-94818278fd31, 
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.202.111], 
sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
/192.168.202.111:47500], discPort=47500, order=12, intOrder=7, 
lastExchangeTime=1515129412681, loc=true, ver=2.3.0#20171027-sha1:8add7fd5, 
isClient=false]
2018-01-04 22:16:53 WARN  [ ] GridDiscoveryManager:133 - Stopping local node 
according to configured segmentation policy.
2018-01-04 22:16:53 WARN  [ ] GridDiscoveryManager:133 - Node FAILED: 
TcpDiscoveryNode [id=9d6b5111-c433-4e43-997b-3c803c84ee45, 
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.202.110], 
sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 
/192.168.202.110:47500], discPort=47500, order=10, intOrder=6, 
lastExchangeTime=1513268425880, loc=false, ver=2.3.0#20171027-sha1:8add7fd5, 
isClient=false]
2018-01-04 22:16:53 INFO  [ ] GridDiscoveryManager:128 - Topology snapshot 
[ver=13, servers=1, clients=0, CPUs=4, heap=2.0GB]
2018-01-04 22:16:53 INFO  [ ] time:128 - Started exchange init 
[topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true, 
evt=NODE_FAILED, evtNode=9d6b5111-c433-4e43-997b-3c803c84ee45, customEvt=null, 
allowMerge=true]
2018-01-04 22:16:53 INFO  [ ] GridTcpRestProtocol:128 - Command protocol 
successfully stopped: TCP binary
2018-01-04 22:16:53 INFO  [ ] GridDhtPartitionsExchangeFuture:128 - Finished 
waiting for partition release future [topVer=AffinityTopologyVersion 
[topVer=13, minorTopVer=0], waitTime=0ms, futInfo=NA]
2018-01-04 22:16:53 INFO  [ ] GridDhtPartitionsExchangeFuture:128 - Coordinator 
received all messages, try merge [ver=AffinityTopologyVersion [topVer=13, 
minorTopVer=0]]
2018-01-04 22:16:53 INFO  [ ] GridDhtPartitionsExchangeFuture:128 - 
finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=13, 
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=13, minorTopVer=0]]
2018-01-04 22:16:53 INFO  [ ] GridDhtPartitionsExchangeFuture:128 - Finish 
exchange future [startVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], 
resVer=null, err=class 
org.apache.ignite.internal.IgniteInterruptedCheckedException: Thread is 
interrupted: IgniteThread [compositeRwLockIdx=1, stripe=-1, plc=-1, 
name=exchange-worker-#42]]
2018-01-04 22:16:53 INFO  [ ] time:128 - Finished exchange init 
[topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=true]
2018-01-04 22:16:53 INFO  [ ] GridCacheProcessor:128 - Stopped cache 
[cacheName=loginCache]
2018-01-04 22:16:53 INFO  [ ] GridCacheProcessor:128 - Stopped cache 
[cacheName=sessionCache]
2018-01-04 22:16:53 INFO  [ ] GridCacheProcessor:128 - Stopped cache 
[cacheName=authCache]
2018-01-04 22:16:53 INFO  [ ] GridCacheProcessor:128 - Stopped cache 
[cacheName=ignite-sys-cache]
2018-01-04 22:16:53 INFO  [ ] GridCacheProcessor:128 - Stopped cache 
[cacheName=DirectoryContactCache]
2018-01-04 22:16:54 INFO  [ ] IgniteKernal:128 -
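
To compare the two logs side by side, it may help to reduce each one to its discovery state changes. The following is only a minimal sketch: the function name `discovery_events` and the choice of marker strings are assumptions made for illustration (the strings are copied from the messages above and are not an exhaustive list of Ignite discovery messages), and it assumes every record starts with a `YYYY-MM-DD HH:MM:SS LEVEL` prefix with wrapped lines following it.

```python
import re

# Marker substrings copied from the log messages above. Which lines matter is
# an assumption made for this sketch, not a complete Ignite message catalogue.
MARKERS = (
    "Timed out waiting for message delivery receipt",
    "Node is out of topology",
    "Local node SEGMENTED",
    "Node FAILED",
    "Stopping local node according to configured segmentation policy",
    "Topology snapshot",
)

# A record starts with "YYYY-MM-DD HH:MM:SS LEVEL"; any following lines that
# lack this prefix are treated as wrapped continuations of the same record.
RECORD_START = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) +(\w+)")


def discovery_events(log_text):
    """Return (timestamp, level, marker) for each record containing a marker."""
    records = []
    for line in log_text.splitlines():
        if RECORD_START.match(line):
            records.append(line.strip())
        elif records:
            # Re-join a wrapped record so markers split across lines still match.
            records[-1] += " " + line.strip()

    events = []
    for record in records:
        start = RECORD_START.match(record)
        for marker in MARKERS:
            if marker in record:
                events.append((start.group(1), start.group(2), marker))
                break  # report at most one marker per record
    return events
```

Run against the excerpts above, the 192.168.202.110 log contains the delivery-receipt timeout, the "Node FAILED" record for the remote node, and a topology snapshot, while the 192.168.202.111 log additionally contains "Node is out of topology", "Local node SEGMENTED", and "Stopping local node according to configured segmentation policy".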
