Hi,

pretty randomly we discover that certain Ignite nodes shut down without any 
reasonable external cause. The grid was not in use at the time this happened (no 
load on the system):

2017-11-11 06:06:26:752 +0000 [grid-timeout-worker-#23] INFO 
org.apache.ignite.internal.IgniteKernal -
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=9512cd22, uptime=18:03:23.333]
^-- H/N/C [hosts=1, nodes=6, CPUs=8]
^-- CPU [cur=0.1%, avg=0.09%, GC=0%]
^-- PageMemory [pages=2256]
^-- Heap [used=4831MB, free=60.68%, comm=12288MB]
^-- Non heap [used=102MB, free=96.91%, comm=106MB]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=6, qSize=0]
^-- Outbound messages queue [size=0]
2017-11-11 06:06:26:752 +0000 [grid-timeout-worker-#23] INFO 
org.apache.ignite.internal.IgniteKernal - FreeList [name=null, buckets=256, 
dataPages=1, reusePages=0]
2017-11-11 06:07:08:789 +0000 [tcp-disco-sock-reader-#5] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Finished serving remote 
node connection [rmtAddr=/10.10.100.251:58485, rmtPort=58485]
2017-11-11 06:07:09:630 +0000 [tcp-disco-srvr-#3] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery accepted 
incoming connection [rmtAddr=/10.10.100.251, rmtPort=45719]
2017-11-11 06:07:09:630 +0000 [tcp-disco-srvr-#3] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery spawning a 
new thread for connection [rmtAddr=/10.10.100.251, rmtPort=45719]
2017-11-11 06:07:09:631 +0000 [tcp-disco-sock-reader-#10] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Started serving remote 
node connection [rmtAddr=/10.10.100.251:45719, rmtPort=45719]
2017-11-11 06:07:09:632 +0000 [tcp-disco-sock-reader-#10] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Received ping request 
from the remote node [rmtNodeId=c68d7211-41ac-4364-81b5-46f55f62463e, 
rmtAddr=/10.10.100.251:45719, rmtPort=45719]
2017-11-11 06:07:09:632 +0000 [tcp-disco-sock-reader-#10] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Finished writing ping 
response [rmtNodeId=c68d7211-41ac-4364-81b5-46f55f62463e, 
rmtAddr=/10.10.100.251:45719, rmtPort=45719]
2017-11-11 06:07:09:632 +0000 [tcp-disco-sock-reader-#10] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Finished serving remote 
node connection [rmtAddr=/10.10.100.251:45719, rmtPort=45719]
2017-11-11 06:07:09:860 +0000 [tcp-disco-srvr-#3] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery accepted 
incoming connection [rmtAddr=/10.10.100.251, rmtPort=60777]
2017-11-11 06:07:09:860 +0000 [tcp-disco-srvr-#3] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery spawning a 
new thread for connection [rmtAddr=/10.10.100.251, rmtPort=60777]
2017-11-11 06:07:09:860 +0000 [tcp-disco-sock-reader-#11] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Started serving remote 
node connection [rmtAddr=/10.10.100.251:60777, rmtPort=60777]
2017-11-11 06:07:09:864 +0000 [tcp-disco-msg-worker-#2] WARN 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Node is out of topology 
(probably, due to short-time network problems).
2017-11-11 06:07:09:864 +0000 [disco-event-worker-#41] WARN 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager - Local node 
SEGMENTED: TcpDiscoveryNode [id=9512cd22-4e04-4627-9cd7-902b0143725c, 
addrs=[127.0.0.1, 172.17.0.2], sockAddrs=[/127.0.0.1:31000, 
2f35d5160c01/172.17.0.2:31000], discPort=31000, order=2, intOrder=2, 
lastExchangeTime=1510380429855, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, 
isClient=false]
2017-11-11 06:07:09:865 +0000 [tcp-disco-sock-reader-#11] INFO 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - Finished serving remote 
node connection [rmtAddr=/10.10.100.251:60777, rmtPort=60777]
2017-11-11 06:07:09:867 +0000 [disco-event-worker-#41] WARN 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager - Stopping 
local node according to configured segmentation policy.

The Ignite node then shuts down (which is the correct behaviour for a node that 
is considered failed). Our failsafe mechanisms can recover from this, but we 
would like to know how to prevent these node failures in the future. What could 
be the reason for such a node segmentation? As this runs on AWS, and we even see 
a similar error when the nodes are running on the same VM, I am pretty sure it 
is NOT a network issue ...
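
For reference, this is roughly where we would start tuning if it turns out to be 
a timeout problem: the failure detection timeout and the segmentation policy on 
IgniteConfiguration, plus the TcpDiscoverySpi socket/ack timeouts. This is only 
a minimal sketch; the concrete values (30 s failure detection, 10 s socket/ack) 
are illustrative assumptions, not what we currently run:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.plugin.segmentation.SegmentationPolicy;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class SegmentationTuning {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Give nodes more slack before they are considered failed
        // (default is 10_000 ms; 30_000 ms is an illustrative value only).
        cfg.setFailureDetectionTimeout(30_000);

        // STOP is the default; RESTART_JVM would restart a segmented node
        // instead of just shutting it down.
        cfg.setSegmentationPolicy(SegmentationPolicy.RESTART_JVM);

        // Discovery-level timeouts; note that setting these explicitly
        // overrides failureDetectionTimeout for the discovery SPI.
        TcpDiscoverySpi disco = new TcpDiscoverySpi();
        disco.setSocketTimeout(10_000);
        disco.setAckTimeout(10_000);
        cfg.setDiscoverySpi(disco);

        Ignite ignite = Ignition.start(cfg);
    }
}

Would tuning these be the right direction, or is the segmentation in the log 
above more likely caused by something else?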

Thankx
Lukas




----

Lukas Lentner, B. Sc.
St.-Cajetan-Straße 13
81669 München
Deutschland
Fon:     +49 / 89  / 44 38 61 27
Mobile:  +49 / 176 / 24 77 09 22
E-Mail:  kont...@lukaslentner.de
Website: www.LukasLentner.de

IBAN:    DE33 7019 0000 0001 1810 17
BIC:     GENODEF1M01 (Münchner Bank)
