RE: How to debug network issues in cluster

Stanislav Lukyanov Fri, 11 Jan 2019 05:19:56 -0800

+1 to all points.

The message “Local node SEGMENTED” generally means that the cluster 
decided that the node is dead and kicked it out.
The next time the node tried to send a message to the cluster, it received an 
answer “you’re segmented” meaning “we’ve kicked you out, sorry”.
It usually happens when the node is unavailable for some time: due to GC, 
network issues, the OS/supervisor not giving the node CPU time, etc.
The primary remedy for this issue is indeed increasing failureDetectionTimeout.
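
A minimal sketch of raising the timeout in Java configuration, assuming 
ignite-core 2.x is on the classpath. The class name and the 30-second value 
are illustrative only; the right value depends on how long the node can 
realistically be unresponsive.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical example class; only the setter call matters here.
public class FailureDetectionTimeoutExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // The default is 10 000 ms. 30 000 ms is an assumed value for
        // illustration, not a recommendation for any specific cluster.
        cfg.setFailureDetectionTimeout(30_000L);

        // Start a node with this configuration and print the effective value.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("failureDetectionTimeout = "
                + ignite.configuration().getFailureDetectionTimeout() + " ms");
        }
    }
}
```

A longer timeout makes the cluster more tolerant of pauses, at the cost of 
taking longer to notice a node that is really dead.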


Stan

From: Loredana Radulescu Ivanoff
Sent: January 7, 2019, 20:29
To: user@ignite.apache.org
Subject: Re: How to debug network issues in cluster

As an Ignite user, here are my two cents:

- if you were never able to get the node to join the cluster, check that there 
are no firewalls/rules blocking the Ignite ports (telnet might be a quick way 
to do that)
- check that the IPs printed by TcpDiscoverySpi are the correct ones; if you 
have virtual network adapters enabled then the wrong IP might be chosen, so the 
IP discovery will fail. This can happen if you use VirtualBox or Docker, for 
instance.
- for intermittent issues, you can try increasing the default failure detection 
timeout, which is 10s, I think. Somewhere in the Ignite doc it's recommended to 
use 30s if the JVM is on AWS.
- how did you configure IP discovery? In my case, I've always used static IP 
discovery with shared enabled - TcpDiscoveryVmIpFinder 
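
If telnet is not available on the host, the port check from the first point 
can be sketched with plain JDK sockets. This is only a rough reachability 
probe: the class and method names are made up for the example, 47500 is the 
discovery port seen in the logs in this thread, and 47100 is assumed to be 
the communication port (the Ignite default) unless it was reconfigured.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: tries to open a TCP connection, like "telnet host port".
final class PortCheck {
    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unroutable: treat all as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the remote node's address as the first argument.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // Assumed ports: discovery (47500) and communication (47100).
        int[] ports = {47500, 47100};
        for (int port : ports) {
            System.out.println(host + ":" + port + " reachable = "
                + isReachable(host, port, 2000));
        }
    }
}
```

Run it from each node against the other node's address; a port that is 
reachable in one direction only usually points at a firewall rule.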

On Sun, Jan 6, 2019 at 6:04 AM Prasad Bhalerao <prasadbhalerao1...@gmail.com> 
wrote:
Hi,

I am consistently getting the "Node is out of topology" message in the logs on 
node-1, and on the other node, node-2, I get the message "Timed out waiting for message 
delivery receipt (most probably, the reason is in long GC pauses on remote 
node; consider tuning GC and increasing '"

I have checked the network bandwidth using iperf and it is 470 Mbit/s. I 
have also checked the GC logs and the max pause time is 140 ms.

If it is really happening because of network issues, is there any way to debug 
it?

If it were happening because of GC, I would have seen it in the GC logs.

Can someone please help me out with this? 

Log messages on node-1:
2019-01-06 13:48:19,036 125016 [tcp-disco-srvr-#3%springDataNode%] INFO  
o.a.i.s.d.tcp.TcpDiscoverySpi - TCP discovery accepted incoming connection 
[rmtAddr=/10.114.113.65, rmtPort=35651]
2019-01-06 13:48:19,037 125017 [tcp-disco-srvr-#3%springDataNode%] INFO  
o.a.i.s.d.tcp.TcpDiscoverySpi - TCP discovery spawning a new thread for 
connection [rmtAddr=/10.114.113.65, rmtPort=35651]
2019-01-06 13:48:19,037 125017 [tcp-disco-sock-reader-#5%springDataNode%] INFO  
o.a.i.s.d.tcp.TcpDiscoverySpi - Started serving remote node connection 
[rmtAddr=/10.114.113.65:35651, rmtPort=35651]
2019-01-06 13:48:19,040 125020 [tcp-disco-msg-worker-#2%springDataNode%] WARN  
o.a.i.s.d.tcp.TcpDiscoverySpi - Node is out of topology (probably, due to 
short-time network problems).
2019-01-06 13:48:19,041 125021 [disco-event-worker-#62%springDataNode%] WARN  
o.a.i.i.m.d.GridDiscoveryManager - Local node SEGMENTED: TcpDiscoveryNode 
[id=a5827f51-096a-4c98-af4f-564d2d3e769d, addrs=[10.114.113.53, 127.0.0.1], 
sockAddrs=[/127.0.0.1:47500, 
qagmscore02.p13.eng.in03.qualys.com/10.114.113.53:47500], discPort=47500, 
order=2, intOrder=2, lastExchangeTime=1546782499034, loc=true, 
ver=2.7.0#20181130-sha1:256ae401, isClient=false]
2019-01-06 13:48:19,041 125021 [tcp-disco-sock-reader-#5%springDataNode%] INFO  
o.a.i.s.d.tcp.TcpDiscoverySpi - Finished serving remote node connection 
[rmtAddr=/10.114.113.65:35651, rmtPort=35651]
2019-01-06 13:48:19,866 125846 [tcp-comm-worker-#1%springDataNode%] INFO  
o.a.i.s.d.tcp.TcpDiscoverySpi - Pinging node: 
cd9803ac-b810-447e-818e-ab51dada59d8
