This looks like segmentation caused by a garbage collector pause. Please read the article [1] carefully and collect GC logs. If GC is running with long pauses (around 10 seconds, which is the failureDetectionTimeout=10000 shown in your log), try tuning the JVM.
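For a quick first impression before digging into the GC logs, something like the following might help. It is only a rough sketch: the class and method names are made up for illustration, it uses only the standard java.lang.management API and nothing from Ignite, and the 10000 ms value is simply copied from failureDetectionTimeout in the log below. It reports the cumulative GC time and the longest stall a sleeping thread observed.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Ignite.
class PauseProbe {
    // Assumption: copied from "failureDetectionTimeout=10000" in the log.
    static final long FAILURE_DETECTION_TIMEOUT_MS = 10_000L;

    /** Sum of collection time reported by all collectors since JVM start, in ms. */
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /**
     * Sleeps in short steps and returns the largest overshoot in ms.
     * A large overshoot means the whole JVM (or the host) stalled,
     * which is what a stop-the-world GC pause looks like from inside.
     */
    static long maxStallMs(long stepMs, int steps) {
        long worst = 0;
        for (int i = 0; i < steps; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(stepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and stop
                break;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            worst = Math.max(worst, elapsedMs - stepMs);
        }
        return worst;
    }

    /** True when a stall is long enough to trip the failure detection timeout. */
    static boolean reachesTimeout(long stallMs, long timeoutMs) {
        return stallMs >= timeoutMs;
    }

    public static void main(String[] args) {
        long stall = maxStallMs(100, 50);
        System.out.println("Total GC time so far (ms): " + totalGcTimeMs());
        System.out.println("Largest stall seen (ms): " + stall);
        System.out.println("Reaches failure detection timeout: "
            + reachesTimeout(stall, FAILURE_DETECTION_TIMEOUT_MS));
    }
}
```

A probe like this cannot tell a GC pause from any other stall (swapping, a frozen VM), so the GC log is still the authoritative source. It only shows whether stalls in the range of the timeout happen at all.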
If the issue persists, please provide the GC log and the Ignite log for analysis.

[1] https://apacheignite.readme.io/docs/jvm-and-system-tuning

On Fri, Aug 5, 2016 at 2:57 PM, yucigou <yuci....@gmail.com> wrote:

> Hello,
>
> One of my Ignite nodes was stopped and the logs were appended as below. It
> seems that grid-timeout-worker checks the health of the cluster every
> minute. But then in my case, before the due time 23:34:19, at 23:34:03 it
> reported "Local node seems to be disconnected from topology (failure
> detection timeout is reached)", and the Ignite node got stopped. In turn,
> the web session clustering, and so on, stopped working.
>
> Just wonder what could cause this to happen? There should be no network
> issue etc with the host machine then. It is a bit scary to us, as it can
> happen to our production servers in the near future.
>
> Thank you for your help.
>
> Yuci
>
> ===================Ignite logs======================
> [23:31:19,896][INFO ][grid-timeout-worker-#33%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=9a069f70, name=null, uptime=10:37:03:793]
>     ^-- H/N/C [hosts=2, nodes=2, CPUs=4]
>     ^-- CPU [cur=43.17%, avg=12.83%, GC=1.1%]
>     ^-- Heap [used=2115MB, free=61.26%, comm=3955MB]
>     ^-- Non heap [used=138MB, free=-1%, comm=143MB]
>     ^-- Public thread pool [active=0, idle=16, qSize=0]
>     ^-- System thread pool [active=0, idle=16, qSize=0]
>     ^-- Outbound messages queue [size=0]
> [23:32:19,904][INFO ][grid-timeout-worker-#33%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=9a069f70, name=null, uptime=10:38:03:801]
>     ^-- H/N/C [hosts=2, nodes=2, CPUs=4]
>     ^-- CPU [cur=0.83%, avg=12.87%, GC=0%]
>     ^-- Heap [used=2638MB, free=51.69%, comm=3957MB]
>     ^-- Non heap [used=138MB, free=-1%, comm=143MB]
>     ^-- Public thread pool [active=0, idle=16, qSize=0]
>     ^-- System thread pool [active=0, idle=16, qSize=0]
>     ^-- Outbound messages queue [size=0]
> [23:33:19,913][INFO ][grid-timeout-worker-#33%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=9a069f70, name=null, uptime=10:39:03:808]
>     ^-- H/N/C [hosts=2, nodes=2, CPUs=4]
>     ^-- CPU [cur=0.5%, avg=12.86%, GC=0%]
>     ^-- Heap [used=796MB, free=85.41%, comm=3921MB]
>     ^-- Non heap [used=138MB, free=-1%, comm=143MB]
>     ^-- Public thread pool [active=0, idle=16, qSize=0]
>     ^-- System thread pool [active=0, idle=16, qSize=0]
>     ^-- Outbound messages queue [size=0]
> [23:34:03,752][INFO ][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi]
> Local
> node seems to be disconnected from topology (failure detection timeout is
> reached) [failureDetectionTimeout=10000, connCheckFreq=3333]
> [23:34:03,783][WARN ][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Node
> is out of topology (probably, due to short-time network problems).
> [23:34:03,786][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager]
> Local node SEGMENTED: TcpDiscoveryNode
> [id=9a069f70-d49d-472e-9771-7ac2353e751f, addrs=[10.3.0.64, 127.0.0.1],
> sockAddrs=[ves-hx-40.ebi.ac.uk/10.3.0.64:47500, /10.3.0.64:47500,
> /127.0.0.1:47500], discPort=47500, order=56, intOrder=29,
> lastExchangeTime=1470350043783, loc=true, ver=1.6.0#20160518-sha1:
> 0b22c45b,
> isClient=false]
> [23:34:03,819][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager]
> Stopping local node according to configured segmentation policy.
> [23:34:03,825][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager]
> Node FAILED: TcpDiscoveryNode [id=cef7fc5e-b854-4072-8e16-396a87d5d556,
> addrs=[10.3.0.65, 127.0.0.1],
> sockAddrs=[ves-hx-41.ebi.ac.uk/10.3.0.65:47500, /10.3.0.65:47500,
> /127.0.0.1:47500], discPort=47500, order=58, intOrder=30,
> lastExchangeTime=1470311808664, loc=false, ver=1.6.0#20160518-sha1:
> 0b22c45b,
> isClient=false]
> [23:34:03,827][INFO ][disco-event-worker-#44%null%][GridDiscoveryManager]
> Topology snapshot [ver=59, servers=1, clients=0, CPUs=2, heap=5.3GB]
> [23:34:03,874][INFO ][Thread-32][GridTcpRestProtocol] Command protocol
> successfully stopped: TCP binary
> [23:34:03,902][INFO ][Thread-32][GridJettyRestProtocol] Command protocol
> successfully stopped: Jetty REST
> [23:34:04,571][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> session-cache
> [23:34:04,572][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> ignite-marshaller-sys-cache
> [23:34:04,572][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> ignite-sys-cache
> [23:34:04,573][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> ignite-atomics-sys-cache
> [23:34:04,583][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> wicket-data-store
> [23:34:04,623][INFO ][Thread-32][IgniteKernal]
>
> >>> +-----------------------------------------------------------
> ----------------------+
> >>> Ignite ver. 1.6.0#20160518-sha1:0b22c45bb9b97692208fd0705ddf80
> 45ff34a031
> >>> stopped OK
> >>> +-----------------------------------------------------------
> ----------------------+
> >>> Grid uptime: 10:39:48:518
>
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Local-node-seems-to-be-disconnected-from-topology-failure-detection-timeout-is-reached-tp6797.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.