Hi Dmitry,

We are again seeing a segmentation failure on one of the nodes in our prod environment. This time we did not run jmap, but the node still failed.

CPU utilization, memory utilization, and network are all at normal levels. However, the Dynatrace agent reported page faults in memory at the same time as the segmentation failure (screenshot attached). Can you please confirm whether page faults could result in network segmentation of a node? As far as I can tell, we also see page faults on a node at other times, but they do not always result in a segmentation failure.

Logs from Failed Agent:
================================
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=3f568bb8, name=delivery, uptime=24:31:12.859]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=7%, avg=9.06%, GC=0%]
    ^-- PageMemory [pages=30244]
    ^-- Heap [used=3184MB, free=22.09%, comm=4087MB]
    ^-- Non heap [used=213MB, free=-1%, comm=222MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4879, reusePages=0]
Jun 23, 2018 8:40:00 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery accepted incoming connection [rmtAddr=/10.40.173.14, rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: TCP discovery spawning a new thread for connection [rmtAddr=/10.40.173.14, rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started serving remote node connection [rmtAddr=/10.40.173.14:33762, rmtPort=33762]
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node is out of topology (probably, due to short-time network problems).
Jun 23, 2018 8:40:34 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node SEGMENTED: TcpDiscoveryNode [id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54, intOrder=32, lastExchangeTime=1529786434361, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.14:33762, rmtPort=33762
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished serving remote node connection [rmtAddr=/10.40.173.41:52584, rmtPort=52584
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Stopping local node according to configured segmentation policy.
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode [id=9165f32c-9765-49d7-8856-5b77b0bded6d, addrs=[10.40.173.14, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.14:47500], discPort=47500, order=22, intOrder=15, lastExchangeTime=1529050123714, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Command protocol successfully stopped: TCP binary
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=28.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode [id=a26de809-dde1-41b8-87a3-d5576851a0be, addrs=[10.40.173.56, 127.0.0.1], sockAddrs=[/10.40.173.56:47500, /127.0.0.1:47500], discPort=47500, order=23, intOrder=16, lastExchangeTime=1529050123735, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=57, servers=7, clients=0, CPUs=14, heap=26.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode [id=910ea19f-af5c-4745-a035-b24a3bb48206, addrs=[10.40.173.88, 127.0.0.1], sockAddrs=[/10.40.173.88:47500, /127.0.0.1:47500], discPort=47500, order=25, intOrder=17, lastExchangeTime=1529050123735, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=58, servers=6, clients=0, CPUs=12, heap=24.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode [id=17f3ba9c-e32e-47e4-9ca2-136338d8c4ac, addrs=[10.40.173.39, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.39:47500], discPort=47500, order=30, intOrder=19, lastExchangeTime=1529050123735, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=59, servers=5, clients=0, CPUs=10, heap=20.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode [id=b392f3c6-84fd-4cd9-a695-92d1ef3b4262, addrs=[10.40.173.11, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.11:47500], discPort=47500, order=34, intOrder=21, lastExchangeTime=1529050123735, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=60, servers=4, clients=0, CPUs=8, heap=16.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode [id=e2eb6d96-e60e-4643-ac7a-2b750888079e, addrs=[10.40.173.21, 127.0.0.1], sockAddrs=[/10.40.173.21:47500, /127.0.0.1:47500], discPort=47500, order=41, intOrder=25, lastExchangeTime=1529050123735, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=61, servers=3, clients=0, CPUs=6, heap=12.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode [id=8975781d-ac95-49eb-9f17-4be2d3374b15, addrs=[10.40.173.74, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.74:47500], discPort=47500, order=45, intOrder=27, lastExchangeTime=1529050123735, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=62, servers=2, clients=0, CPUs=4, heap=8.0GB]
Jun 23, 2018 8:40:35 PM org.apache.ignite.logger.java.JavaLogger warning
==================================

Logs from Coordinator node (or reported agent)
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=96268498, name=delivery, uptime=16:54:44.560]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=6.5%, avg=9.07%, GC=0%]
    ^-- PageMemory [pages=33192]
    ^-- Heap [used=3396MB, free=16.88%, comm=4086MB]
    ^-- Non heap [used=219MB, free=-1%, comm=228MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=5624, reusePages=63]
Jun 23, 2018 8:39:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=96268498, name=delivery, uptime=16:55:44.622]
    ^-- H/N/C [hosts=9, nodes=9, CPUs=18]
    ^-- CPU [cur=7.83%, avg=9.07%, GC=0%]
    ^-- PageMemory [pages=33192]
    ^-- Heap [used=3188MB, free=21.98%, comm=4086MB]
    ^-- Non heap [used=219MB, free=-1%, comm=228MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
    ^-- Outbound messages queue [size=0]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=5624, reusePages=63]
Jun 23, 2018 8:40:18 PM org.apache.ignite.logger.java.JavaLogger info
INFO: FreeList [name=delivery, buckets=256, dataPages=4, reusePages=0]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=9990, rmtAddr=/10.40.173.78:47500, rmtPort=47500]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=8975781d-ac95-49eb-9f17-4be2d3374b15, id=bd526e6e361-9165f32c-9765-49d7-8856-5b77b0bded6d, verifierNodeId=9165f32c-9765-49d7-8856-5b77b0bded6d, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54, intOrder=32, lastExchangeTime=1529050142836, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=8975781d-ac95-49eb-9f17-4be2d3374b15, id=bd526e6e361-9165f32c-9765-49d7-8856-5b77b0bded6d, verifierNodeId=9165f32c-9765-49d7-8856-5b77b0bded6d, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, order=54, addr=[10.40.173.78, 127.0.0.1], daemon=false]]]
Jun 23, 2018 8:40:54 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger warning
WARNING: Node FAILED: TcpDiscoveryNode [id=3f568bb8-813d-47f7-b8da-4ecbff3e9753, addrs=[10.40.173.78, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, /10.40.173.78:47500], discPort=47500, order=54, intOrder=32, lastExchangeTime=1529050142836, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Topology snapshot [ver=56, servers=8, clients=0, CPUs=16, heap=26.0GB]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Started exchange init [topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], crd=false, evt=NODE_FAILED, evtNode=3f568bb8-813d-47f7-b8da-4ecbff3e9753, customEvt=null, allowMerge=true]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], waitTime=0ms, futInfo=NA]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Finished exchange init [topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0], crd=false]
Jun 23, 2018 8:40:55 PM org.apache.ignite.logger.java.JavaLogger info
INFO: Received full message, will finish exchange [node=9165f32c-9765-49d7-8856-5b77b0bded6d, resVer=AffinityTopologyVersion [topVer=56, minorTopVer=0]]
===================================================

<http://apache-ignite-users.70518.x6.nabble.com/file/t1286/pageFaults.png>

Thanks,
Naresh