Re: Local node seems to be disconnected from topology (failure detection timeout is reached)

Vladislav Pyatkov Fri, 05 Aug 2016 05:53:28 -0700

This looks like segmentation caused by a long garbage collector pause.
Please read the article [1] carefully and collect GC logs. If GC runs with
long pauses (around 10 seconds, which matches failureDetectionTimeout=10000
in your log), try tuning the JVM.


If the issue persists, please provide the GC log and the Ignite log for analysis.
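
As a complement to the GC logs, here is a minimal sketch of a JVM stall
detector. It is not part of Ignite; the class name PauseDetector and its
methods are made up for illustration. The idea is that a thread which sleeps
for a fixed interval wakes up late whenever the whole JVM is paused (by GC,
swapping, and so on), so the lateness approximates the pause length. The
10000 ms threshold is simply the failureDetectionTimeout value from the
quoted log.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not an Ignite API: estimates JVM stalls from late wake-ups. */
public final class PauseDetector {
    // Taken from the quoted log line: failureDetectionTimeout=10000 (milliseconds).
    public static final long FAILURE_DETECTION_TIMEOUT_MS = 10_000L;

    private PauseDetector() {}

    /**
     * Returns the largest amount (in ms) by which the gap between two
     * consecutive wake-up timestamps exceeded the expected sleep interval.
     */
    public static long longestStallMs(long[] wakeUpsMs, long expectedIntervalMs) {
        long worst = 0L;
        for (int i = 1; i < wakeUpsMs.length; i++) {
            long stall = (wakeUpsMs[i] - wakeUpsMs[i - 1]) - expectedIntervalMs;
            worst = Math.max(worst, stall);
        }
        return worst;
    }

    /** True if a stall is at least as long as the timeout seen in the log. */
    public static boolean reachesFailureDetectionTimeout(long stallMs) {
        return stallMs >= FAILURE_DETECTION_TIMEOUT_MS;
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 100L;
        // Assumed default: 20 samples (about 2 seconds); pass a larger count to watch longer.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 20;

        long[] wakeUpsMs = new long[samples + 1];
        wakeUpsMs[0] = TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
        for (int i = 1; i <= samples; i++) {
            Thread.sleep(intervalMs);
            wakeUpsMs[i] = TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
        }

        long worst = longestStallMs(wakeUpsMs, intervalMs);
        System.out.println("Longest stall: " + worst + " ms, reaches failure detection timeout: "
            + reachesFailureDetectionTimeout(worst));
    }
}
```

A stall reported here does not prove that GC was the cause, so it only tells
you when to look in the GC log, it does not replace it.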

[1] https://apacheignite.readme.io/docs/jvm-and-system-tuning

On Fri, Aug 5, 2016 at 2:57 PM, yucigou <yuci....@gmail.com> wrote:

> Hello,
>
> One of my Ignite nodes was stopped; the logs are appended below. It
> seems that grid-timeout-worker checks the health of the cluster every
> minute. But then in my case, before the due time 23:34:19, at 23:34:03 it
> reported "Local node seems to be disconnected from topology (failure
> detection timeout is reached)", and the Ignite node got stopped. In turn,
> the web session clustering, and so on, stopped working.
>
> I just wonder what could cause this to happen. There should have been no
> network issue etc. with the host machine at that time. It is a bit scary
> to us, as it could happen to our production servers in the near future.
>
> Thank you for your help.
>
> Yuci
>
> ===================Ignite logs======================
> [23:31:19,896][INFO ][grid-timeout-worker-#33%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=9a069f70, name=null, uptime=10:37:03:793]
>     ^-- H/N/C [hosts=2, nodes=2, CPUs=4]
>     ^-- CPU [cur=43.17%, avg=12.83%, GC=1.1%]
>     ^-- Heap [used=2115MB, free=61.26%, comm=3955MB]
>     ^-- Non heap [used=138MB, free=-1%, comm=143MB]
>     ^-- Public thread pool [active=0, idle=16, qSize=0]
>     ^-- System thread pool [active=0, idle=16, qSize=0]
>     ^-- Outbound messages queue [size=0]
> [23:32:19,904][INFO ][grid-timeout-worker-#33%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=9a069f70, name=null, uptime=10:38:03:801]
>     ^-- H/N/C [hosts=2, nodes=2, CPUs=4]
>     ^-- CPU [cur=0.83%, avg=12.87%, GC=0%]
>     ^-- Heap [used=2638MB, free=51.69%, comm=3957MB]
>     ^-- Non heap [used=138MB, free=-1%, comm=143MB]
>     ^-- Public thread pool [active=0, idle=16, qSize=0]
>     ^-- System thread pool [active=0, idle=16, qSize=0]
>     ^-- Outbound messages queue [size=0]
> [23:33:19,913][INFO ][grid-timeout-worker-#33%null%][IgniteKernal]
> Metrics for local node (to disable set 'metricsLogFrequency' to 0)
>     ^-- Node [id=9a069f70, name=null, uptime=10:39:03:808]
>     ^-- H/N/C [hosts=2, nodes=2, CPUs=4]
>     ^-- CPU [cur=0.5%, avg=12.86%, GC=0%]
>     ^-- Heap [used=796MB, free=85.41%, comm=3921MB]
>     ^-- Non heap [used=138MB, free=-1%, comm=143MB]
>     ^-- Public thread pool [active=0, idle=16, qSize=0]
>     ^-- System thread pool [active=0, idle=16, qSize=0]
>     ^-- Outbound messages queue [size=0]
> [23:34:03,752][INFO ][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi]
> Local node seems to be disconnected from topology (failure detection
> timeout is reached) [failureDetectionTimeout=10000, connCheckFreq=3333]
> [23:34:03,783][WARN ][tcp-disco-msg-worker-#2%null%][TcpDiscoverySpi] Node
> is out of topology (probably, due to short-time network problems).
> [23:34:03,786][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager]
> Local node SEGMENTED: TcpDiscoveryNode
> [id=9a069f70-d49d-472e-9771-7ac2353e751f, addrs=[10.3.0.64, 127.0.0.1],
> sockAddrs=[ves-hx-40.ebi.ac.uk/10.3.0.64:47500, /10.3.0.64:47500,
> /127.0.0.1:47500], discPort=47500, order=56, intOrder=29,
> lastExchangeTime=1470350043783, loc=true,
> ver=1.6.0#20160518-sha1:0b22c45b, isClient=false]
> [23:34:03,819][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager]
> Stopping local node according to configured segmentation policy.
> [23:34:03,825][WARN ][disco-event-worker-#44%null%][GridDiscoveryManager]
> Node FAILED: TcpDiscoveryNode [id=cef7fc5e-b854-4072-8e16-396a87d5d556,
> addrs=[10.3.0.65, 127.0.0.1],
> sockAddrs=[ves-hx-41.ebi.ac.uk/10.3.0.65:47500, /10.3.0.65:47500,
> /127.0.0.1:47500], discPort=47500, order=58, intOrder=30,
> lastExchangeTime=1470311808664, loc=false,
> ver=1.6.0#20160518-sha1:0b22c45b, isClient=false]
> [23:34:03,827][INFO ][disco-event-worker-#44%null%][GridDiscoveryManager]
> Topology snapshot [ver=59, servers=1, clients=0, CPUs=2, heap=5.3GB]
> [23:34:03,874][INFO ][Thread-32][GridTcpRestProtocol] Command protocol
> successfully stopped: TCP binary
> [23:34:03,902][INFO ][Thread-32][GridJettyRestProtocol] Command protocol
> successfully stopped: Jetty REST
> [23:34:04,571][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> session-cache
> [23:34:04,572][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> ignite-marshaller-sys-cache
> [23:34:04,572][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> ignite-sys-cache
> [23:34:04,573][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> ignite-atomics-sys-cache
> [23:34:04,583][INFO ][Thread-32][GridCacheProcessor] Stopped cache:
> wicket-data-store
> [23:34:04,623][INFO ][Thread-32][IgniteKernal]
>
> >>> +---------------------------------------------------------------------------------+
> >>> Ignite ver. 1.6.0#20160518-sha1:0b22c45bb9b97692208fd0705ddf8045ff34a031
> >>> stopped OK
> >>> +---------------------------------------------------------------------------------+
> >>> Grid uptime: 10:39:48:518
