Hi, we have discovered that, seemingly at random, certain Ignite nodes shut down without any apparent external cause. The grid was not in use at the time this happened (no load on the system):
2017-11-11 06:06:26:752 +0000 [grid-timeout-worker-#23] INFO org.apache.ignite.internal.IgniteKernal - Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=9512cd22, uptime=18:03:23.333]
    ^-- H/N/C [hosts=1, nodes=6, CPUs=8]
    ^-- CPU [cur=0.1%, avg=0.09%, GC=0%]
    ^-- PageMemory [pages=2256]
    ^-- Heap [used=4831MB, free=60.68%, comm=12288MB]
    ^-- Non heap [used=102MB, free=96.91%, comm=106MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=0]
2017-11-11 06:06:26:752 +0000 [grid-timeout-worker-#23] INFO org.apache.ignite.internal.IgniteKernal - FreeList [name=null, buckets=256, dataPages=1, reusePages=0]
2017-11-11 06:07:08:789 +0000 [tcp-disco-sock-reader-#5] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Finished serving remote node connection [rmtAddr=/10.10.100.251:58485, rmtPort=58485]
2017-11-11 06:07:09:630 +0000 [tcp-disco-srvr-#3] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/10.10.100.251, rmtPort=45719]
2017-11-11 06:07:09:630 +0000 [tcp-disco-srvr-#3] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/10.10.100.251, rmtPort=45719]
2017-11-11 06:07:09:631 +0000 [tcp-disco-sock-reader-#10] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Started serving remote node connection [rmtAddr=/10.10.100.251:45719, rmtPort=45719]
2017-11-11 06:07:09:632 +0000 [tcp-disco-sock-reader-#10] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Received ping request from the remote node [rmtNodeId=c68d7211-41ac-4364-81b5-46f55f62463e, rmtAddr=/10.10.100.251:45719, rmtPort=45719]
2017-11-11 06:07:09:632 +0000 [tcp-disco-sock-reader-#10] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Finished writing ping response [rmtNodeId=c68d7211-41ac-4364-81b5-46f55f62463e, rmtAddr=/10.10.100.251:45719, rmtPort=45719]
2017-11-11 06:07:09:632 +0000 [tcp-disco-sock-reader-#10] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Finished serving remote node connection [rmtAddr=/10.10.100.251:45719, rmtPort=45719]
2017-11-11 06:07:09:860 +0000 [tcp-disco-srvr-#3] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/10.10.100.251, rmtPort=60777]
2017-11-11 06:07:09:860 +0000 [tcp-disco-srvr-#3] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/10.10.100.251, rmtPort=60777]
2017-11-11 06:07:09:860 +0000 [tcp-disco-sock-reader-#11] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Started serving remote node connection [rmtAddr=/10.10.100.251:60777, rmtPort=60777]
2017-11-11 06:07:09:864 +0000 [tcp-disco-msg-worker-#2] WARN org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Node is out of topology (probably, due to short-time network problems).
2017-11-11 06:07:09:864 +0000 [disco-event-worker-#41] WARN org.apache.ignite.internal.managers.discovery.GridDiscoveryManager - Local node SEGMENTED: TcpDiscoveryNode [id=9512cd22-4e04-4627-9cd7-902b0143725c, addrs=[127.0.0.1, 172.17.0.2], sockAddrs=[/127.0.0.1:31000, 2f35d5160c01/172.17.0.2:31000], discPort=31000, order=2, intOrder=2, lastExchangeTime=1510380429855, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
2017-11-11 06:07:09:865 +0000 [tcp-disco-sock-reader-#11] INFO org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Finished serving remote node connection [rmtAddr=/10.10.100.251:60777, rmtPort=60777]
2017-11-11 06:07:09:867 +0000 [disco-event-worker-#41] WARN org.apache.ignite.internal.managers.discovery.GridDiscoveryManager - Stopping local node according to configured segmentation policy.

The Ignite node then shuts down, which is the correct reaction to a node that has been declared failed. Our failsafe mechanisms can recover from this, but we would like to know how to prevent these node failures in the future. What could be the reason for such a node segmentation? Since this runs on AWS, and we see a similar error even when the nodes are running on the same VM, I am pretty sure it is NOT a network issue ...

Thanks,
Lukas

----
Lukas Lentner, B. Sc.
St.-Cajetan-Straße 13
81669 München
Deutschland
Fon: +49 / 89 / 44 38 61 27
Mobile: +49 / 176 / 24 77 09 22
E-Mail: kont...@lukaslentner.de
Website: www.LukasLentner.de
IBAN: DE33 7019 0000 0001 1810 17
BIC: GENODEF1M01 (Münchner Bank)
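PS: For completeness, here is a minimal sketch of the two configuration knobs that seem relevant to this behaviour, assuming programmatic configuration (the same properties can be set in Spring XML). The 30-second timeout is an arbitrary example value, and the GC-pause remark in the comments is only a hypothesis, not something we have confirmed:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.plugin.segmentation.SegmentationPolicy;

public class SegmentationTuning {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Headroom before the cluster declares this node failed. The default
        // is 10,000 ms; anything that stalls the JVM longer than that (for
        // example a long stop-the-world GC pause) could segment the node even
        // when the network itself is fine.
        cfg.setFailureDetectionTimeout(30_000); // example value, in ms

        // Reaction once the node IS segmented. STOP is the default; with
        // RESTART_JVM the node restarts automatically, but only when it was
        // started via ignite.sh/ignite.bat (or another wrapper that honours
        // Ignite's restart exit code).
        cfg.setSegmentationPolicy(SegmentationPolicy.RESTART_JVM);

        Ignition.start(cfg);
    }
}

Note that the segmentation policy only changes what happens after a node is segmented; the failure detection timeout is what influences whether segmentation happens at all.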