I see messages in the log that are not from Ignite, but which also suggest that 
there is some network issue:
========
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
========

In any case, given that there were no messages in the logs for ~35 minutes, it 
is unlikely that it is an Ignite issue – Ignite would at least print metrics or 
errors 
if it was running and getting any CPU time.

Stan

From: naresh.goty
Sent: 9 июня 2018 г. 18:16
To: user@ignite.apache.org
Subject: Re: Ignite Node failure - Node out of topology (SEGMENTED)

We are still seeing the NODE SEGMENTATION issue happening to one of the node
in our production even after JVM option is enabled (
-Djava.net.preferIPv4Stack=true).

We don't see any activity reported in the logs for a period of ~30min after
node failed.
The below logs are from the failed node, and it can be observed that till
this timestamp (9:43:25) node was up and running, and after that there are
no messages till 10:19:59 in the catalina logs. Thugh node was failed at
09:44, the node segmentation was reported at 10:19AM

Right before the node failure, the metrics clearly stated that CPU and
memory utilization are very low. Long GC pauses are also not an issue.
=================================================
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:42:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=9bbe0362, name=delivery, uptime=56:24:59.907]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=1.67%, avg=2.71%, GC=0%]
    ^-- PageMemory [pages=43003]
    ^-- Heap [used=605MB, free=85.11%, comm=4065MB]
    ^-- Non heap [used=204MB, free=-1%, comm=213MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=10239,
reusePages=6228]
Jun 09, 2018 9:43:25 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=278]
2018-06-09 10:19:59 [a9fc3882] warning [java  ] ... last message repeated 1
time ...
2018-06-09 10:19:59 [a9fc3882] severe  [native] Exception in controller:
receiveExact() ... error reading, 70014, End of file found. Retrying every
10 seconds.
2018-06-09 10:19:59 [abff8882] warning [native] Instrumentation channel
disconnected: server did not reply to ping request
Jun 09, 2018 10:19:59 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: Failed to send message: class o.a.i.IgniteCheckedException: Failed
to send message (connection was closed): GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=0,
bytesRcvd=24290335228, bytesSent=51080178562, bytesRcvd0=25800,
bytesSent0=183862, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-0, igniteInstanceName=delivery,
finished=false, hashCode=1045175627, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#25%delivery%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=3240064, resendCnt=0,
rcvCnt=3240098, sentCnt=3240092, reserved=false, lastAck=3240096,
nodeLeft=false, node=TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1528334301578, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=false,
connectCnt=1, queueLimit=131072, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/10.40.173.21:50320,
rmtAddr=/10.40.173.14:47100, createTime=1528334304232,
closeTime=1528537460337, bytesSent=13308284327, bytesRcvd=12937081481,
bytesSent0=52340, bytesRcvd0=7812, sndSchedTime=1528334304232,
lastSndTime=1528537460337, lastRcvTime=1528537460337, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@7d2bc944, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]
====================================================

>From one of the active node, it reported that other node (above) was failed
immediately at 09:44 (as reported in the logs below)

Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=b392f3c6, name=delivery, uptime=56:18:06.778]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=9.17%, avg=9%, GC=0%]
    ^-- PageMemory [pages=41038]
    ^-- Heap [used=3167MB, free=22.42%, comm=4082MB]
    ^-- Non heap [used=202MB, free=-1%, comm=210MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=9096, reusePages=4659]
Jun 09, 2018 9:43:50 AM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=240]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9bbe0362-00e0-4f16-abf0-52de5c62c51d, addrs=[10.40.173.21, 127.0.0.1],
sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=28,
intOrder=18, lastExchangeTime=1528334738846, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=40, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false, evt=NODE_FAILED,
evtNode=9bbe0362-00e0-4f16-abf0-52de5c62c51d, customEvt=null,
allowMerge=true]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=40, minorTopVer=0], waitTime=0ms,
futInfo=NA]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], crd=false]
Jun 09, 2018 9:44:21 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange
[node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion
[topVer=40, minorTopVer=0]]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info
INFO: Finish exchange future [startVer=AffinityTopologyVersion [topVer=40,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=40, minorTopVer=0],
err=null]
Jun 09, 2018 9:44:22 AM org.apache.ignite.logger.java.JavaLogger info


================================================================

Any advice in this issue would be appreciated?

-Naresh




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Reply via email to