So if I understand correctly the logs below... The node that shut off was timing out trying to get partition exchange from the indicated nodes and it shut itself off correct? Does this mean this node was also the master?
1- The time indicated in the log is that UTC? 2- I'm trying to see if it was high IO, but the node that stopped had no high IO. 3- the other nodes had an average peek of about 0.3%-0.5% And that was around 16:30 EST So not sure if those times match with the log. 4- On super high load days we get up to 7%-9% IO Wait but we never lost a node on those occasions. 5- The nodes are persistent nodes so I'm guessing on write there a high IO can we reduce the IO, but even then I'm not sure it's related. [04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=132, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=e5645874-77e9-4455-8db5-07fa63984276, addrs=[127.0.0.1, 172.17.0.5], sockAddrs=[/172.17.0.5:0, /127.0.0.1:0], discPort=0, order=108, intOrder=60, lastExchangeTime=1600798045116, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765] [04:36:07,771][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=133, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx] [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=133, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=13f473cb-5441-460b-9b9f-a23c1bcf3b0b, addrs=[127.0.0.1, 172.17.0.3], sockAddrs=[/127.0.0.1:0, /172.17.0.3:0], discPort=0, order=110, intOrder=61, lastExchangeTime=1600798045127, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765] [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=134, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx] [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=134, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=ef9f3750-edfa-474e-8e68-6f0abee5095a, addrs=[127.0.0.1, 172.17.0.10], sockAddrs=[/127.0.0.1:0, /172.17.0.10:0], discPort=0, order=112, intOrder=62, lastExchangeTime=1600798045137, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true], done=true], topVer=null, durationFromInit=1605328567765] [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=135, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx] [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=135, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=e14e5bea-2784-4a6f-af2f-e98a4bf0ab52, addrs=[127.0.0.1, xxx.xxx.xxx.63], sockAddrs=[xxxxxx-0001/xxx.xxx.xxx.63:47500, / 127.0.0.1:47500], discPort=47500, order=118, intOrder=65, lastExchangeTime=1600798763474, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null, durationFromInit=1605328567765] [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=136, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: xxxxxx] [04:36:07,772][INFO][node-stopper][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=8d84b4e9-8c11-4166-a75a-9b3d959ff1fe, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=136, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=8facbe9c-6d9f-4c2d-8d9c-5c7f38dbb7da, addrs=[127.0.0.1, xxx.xxx.xxx.69], sockAddrs=[/127.0.0.1:47500, xxxxxx-0003/xxx.xxx.xxx.69:47500], discPort=47500, order=120, intOrder=66, lastExchangeTime=1600799256509, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=null, durationFromInit=1605328567765] [04:36:07,796][INFO][db-checkpoint-thread-#56%xxxxxx%][GridCacheDatabaseSharedManager] Skipping checkpoint (no pages were modified) [checkpointLockWait=0ms, checkpointLockHoldTime=14ms, reason='timeout'] [04:36:07,832][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=carrier-ids-for-phones] [04:36:07,839][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=twofa-otp] [04:36:07,840][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=SQL_PUBLIC_CARRIER_CODE] [04:36:07,841][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=carrier-lookups] [04:36:07,842][INFO][node-stopper][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-cache] [04:36:08,497][INFO][node-stopper][IgniteKernal%xxxxxx] >>> +---------------------------------------------------------------------------------+ >>> Ignite ver. 2.7.0#20181130-sha1:256ae4012cb143b4855b598b740a6f3499ead4db stopped OK >>> +---------------------------------------------------------------------------------+ >>> Ignite instance name: xxxxxx >>> Grid uptime: 52 days, 10:25:51.029