Hi Dmitry,

We are again seeing a segmentation failure on one of the nodes in our prod environment.
This time we did not run jmap, but the node still failed.

-> CPU, memory utilization and network are all within normal limits.

We observed page faults in memory at the same time as the segmentation
failure, as reported by the Dynatrace agent (screenshot attached).

Can you please confirm whether page faults could cause network segmentation
of a node?
As far as I can tell, we do see page faults on nodes, but they do not always
result in a segmentation failure.
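One way to check whether the page faults Dynatrace reports are the expensive kind is to sample the Ignite JVM's own fault counters. On Linux, /proc/<pid>/stat exposes minflt (field 10, faults resolved without disk I/O) and majflt (field 12, faults that needed disk I/O, e.g. a swap-in). Minor faults are routine; a burst of major faults could plausibly stall JVM threads in a way similar to the long GC pause the coordinator's warning mentions. A minimal sketch, assuming a Linux host; the class name PageFaults is hypothetical, and this is a diagnostic aid, not a confirmed cause:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic: samples minor/major page faults of a Linux process.
public class PageFaults {

    // Returns {minflt, majflt} parsed from the contents of /proc/<pid>/stat.
    static long[] parse(String stat) {
        // Field 2 (comm) is wrapped in parentheses and may contain spaces,
        // so only split the text after the last ')'.
        String rest = stat.substring(stat.lastIndexOf(')') + 1).trim();
        String[] f = rest.split("\\s+");
        // f[0] is field 3 (state), so field N lives at index N - 3.
        long minflt = Long.parseLong(f[10 - 3]);
        long majflt = Long.parseLong(f[12 - 3]);
        return new long[] {minflt, majflt};
    }

    static long[] read(Path stat) throws IOException {
        return parse(new String(Files.readAllBytes(stat), StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Usage: java PageFaults <pid> [seconds]; defaults to this JVM for 5 s.
        String pid = args.length > 0 ? args[0] : "self";
        int seconds = args.length > 1 ? Integer.parseInt(args[1]) : 5;
        Path stat = Paths.get("/proc", pid, "stat");
        if (!Files.exists(stat)) {
            System.err.println(stat + " not found (Linux /proc required)");
            return;
        }
        long[] prev = read(stat);
        for (int i = 0; i < seconds; i++) {
            Thread.sleep(1000);
            long[] cur = read(stat);
            // Per-second deltas. Non-zero majflt means the process had to
            // wait for disk I/O (e.g. swap-in) to resolve some faults.
            System.out.printf("minflt/s=%d majflt/s=%d%n", cur[0] - prev[0], cur[1] - prev[1]);
            prev = cur;
        }
    }
}
```

Running it against the Ignite process (e.g. `java PageFaults <ignite-pid> 60`) across a failure window would show whether the faults around the segmentation are major or minor.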


Logs from the failed node:
================================
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=3f568bb8, name=delivery, uptime=24:31:12.859]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=7%, avg=9.06%, GC=0%]
    ^-- PageMemory [pages=30244]
    ^-- Heap [used=3184MB, free=22.09%, comm=4087MB]
    ^-- Non heap [used=213MB, free=-1%, comm=222MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4879, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery accepted incoming connection [rmtAddr=/10.40.173.14,
rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery spawning a new thread for connection
[rmtAddr=/10.40.173.14, rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started serving remote node connection [rmtAddr=/10.40.173.14:33762,
rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node is out of topology (probably, due to short-time network
problems).
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node SEGMENTED: TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529786434361,
loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.14:33762,
rmtPort=33762
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.41:52584,
rmtPort=52584
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Stopping local node according to configured segmentation policy.
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22,
intOrder=15, lastExchangeTime=1529050123714,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Command protocol successfully stopped: TCP binary
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=28.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=a26de809-dde1-41b8-87a3-d5576851a0be, addrs=[10.40.173.56, 127.0.0.1],
sockAddrs=[/10.40.173.56:47500, /127.0.0.1:47500], discPort=47500, order=23,
intOrder=16, lastExchangeTime=1529050123735,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=57, servers=7, clients=0, CPUs=14, heap=26.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=910ea19f-af5c-4745-a035-b24a3bb48206, addrs=[10.40.173.88, 127.0.0.1],
sockAddrs=[/10.40.173.88:47500, /127.0.0.1:47500], discPort=47500, order=25,
intOrder=17, lastExchangeTime=1529050123735,
loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=58, servers=6, clients=0, CPUs=12, heap=24.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=17f3ba9c-e32e-47e4-9ca2-136338d8c4ac, addrs=[10.40.173.39, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.39:47500], discPort=47500, order=30,
intOrder=19, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=59, servers=5, clients=0, CPUs=10, heap=20.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=b392f3c6-84fd-4cd9-a695-92d1ef3b4262, addrs=[10.40.173.11, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.11:47500], discPort=47500, order=34,
intOrder=21, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=60, servers=4, clients=0, CPUs=8, heap=16.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=e2eb6d96-e60e-4643-ac7a-2b750888079e, addrs=[10.40.173.21, 127.0.0.1],
sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=41,
intOrder=25, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=61, servers=3, clients=0, CPUs=6, heap=12.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=8975781d-ac95-49eb-9f17-4be2d3374b15, addrs=[10.40.173.74, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.74:47500], discPort=47500, order=45,
intOrder=27, lastExchangeTime=1529050123735, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=62, servers=2, clients=0, CPUs=4, heap=8.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning


==================================


Logs from the coordinator node (or the node that reported the failure):

Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=96268498, name=delivery, uptime=16:54:44.560]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=6.5%, avg=9.07%, GC=0%]
    ^-- PageMemory [pages=33192]
    ^-- Heap [used=3396MB, free=16.88%, comm=4086MB]
    ^-- Non heap [used=219MB, free=-1%, comm=228MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=5624, reusePages=63]
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO:
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=96268498, name=delivery, uptime=16:55:44.622]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=7.83%, avg=9.07%, GC=0%]
    ^-- PageMemory [pages=33192]
    ^-- Heap [used=3188MB, free=21.98%, comm=4086MB]
    ^-- Non heap [used=219MB, free=-1%, comm=228MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=5624, reusePages=63]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Timed out waiting for message delivery receipt (most probably, the
reason is in long GC pauses on remote node; consider tuning GC and
increasing 'ackTimeout' configuration property). Will retry to send message
with increased timeout [currentTimeout=9990, rmtAddr=/10.40.173.78:47500,
rmtPort=47500]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Failed to send message to next node
[msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage
[sndNodeId=8975781d-ac95-49eb-9f17-4be2d3374b15,
id=bd526e6e361-9165f32c-9765-49d7-8856-5b77b0bded6d,
verifierNodeId=9165f32c-9765-49d7-8856-5b77b0bded6d, topVer=0, pendingIdx=0,
failedNodes=null, isClient=false]], next=TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529050142836, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], errMsg=Failed to send
message to next node [msg=TcpDiscoveryMetricsUpdateMessage
[super=TcpDiscoveryAbstractMessage
[sndNodeId=8975781d-ac95-49eb-9f17-4be2d3374b15,
id=bd526e6e361-9165f32c-9765-49d7-8856-5b77b0bded6d,
verifierNodeId=9165f32c-9765-49d7-8856-5b77b0bded6d, topVer=0, pendingIdx=0,
failedNodes=null, isClient=false]], next=ClusterNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, order=54, addr=[10.40.173.78,
127.0.0.1], daemon=false]]]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node has detected failed nodes and started cluster-wide
procedure. To speed up failure detection please see 'Failure Detection'
section under javadoc for 'TcpDiscoverySpi'
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode
[id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54,
intOrder=32, lastExchangeTime=1529050142836, loc=false,
ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=56,
minorTopVer=0], crd=false, evt=NODE_FAILED,
evtNode=3f568bb8-813d-47f7-b8da-4ecbff3e9753, customEvt=null,
allowMerge=true]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], waitTime=0ms,
futInfo=NA]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=56,
minorTopVer=0], crd=false]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange
[node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion
[topVer=56, minorTopVer=0]]
===================================================
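The lastExchangeTime fields in the TcpDiscoveryNode entries above look like epoch milliseconds (an assumption; the logs do not label the unit). Decoding them helps line them up with the log timestamps: the failed node's own value on its SEGMENTED line, 1529786434361, decodes to 2018-06-23T20:40:34.361Z, which matches the 8:40:34 PM timestamp of that line if the host clock is in UTC. A minimal sketch using java.time.Instant; the class name ExchangeTime is hypothetical:

```java
import java.time.Instant;

// Decodes TcpDiscoveryNode lastExchangeTime values,
// ASSUMING they are milliseconds since the Unix epoch.
public class ExchangeTime {

    static String toUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        long[] values = {
            1529786434361L, // node 3f568bb8, from its own SEGMENTED line
            1529050142836L, // node 3f568bb8, as seen by the reporting node
            1529050123714L  // node 9165f32c, as seen by the failed node
        };
        for (long v : values) {
            System.out.println(v + " -> " + toUtc(v));
        }
    }
}
```

The other two values decode to 2018-06-15, so on this reading they do not reflect the time of this incident.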

<http://apache-ignite-users.70518.x6.nabble.com/file/t1286/pageFaults.png> 

Thanks
Naresh



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
