[jira] [Created] (ZOOKEEPER-3698) NoRouteToHostException when starting large ZooKeeper cluster on localhost

Mate Szalay-Beko (Jira) Fri, 17 Jan 2020 02:16:48 -0800

Mate Szalay-Beko created ZOOKEEPER-3698:
-------------------------------------------


             Summary: NoRouteToHostException when starting large ZooKeeper 
cluster on localhost
                 Key: ZOOKEEPER-3698
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3698
             Project: ZooKeeper
          Issue Type: Bug
            Reporter: Mate Szalay-Beko
            Assignee: Mate Szalay-Beko
             Fix For: 3.6.0


While testing the release candidate for 3.6.0, we found that a ZooKeeper 
cluster with a large number of ensemble members (e.g. 23) cannot start 
properly. We see a lot of warnings in the log:
{code:java}
2020-01-15 20:02:13,431 [myid:13] - WARN
 [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691]
- None of the addresses (/192.168.1.91:4190) are reachable for sid 10
java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]
{code}
 

The exception is thrown when the new MultiAddress feature tries to filter 
unreachable hosts out of the address list. This involves calling the 
InetAddress.isReachable method with a default timeout of 500ms, which goes down 
to a native call in Java and essentially tries to ping the host (an ICMP echo 
request). Naturally, localhost should always be reachable. For some reason, 
this call times out on Mac if we have many ensemble members. I tested with 9 
members and the cluster started properly. With 11, 13 and 15 members it took 
more and more time to get the cluster to start, and the 
"NoRouteToHostException" started to appear in the logs. After around 1 minute 
the 15-member cluster did start, but a startup this slow is clearly not 
acceptable. (I also tried with JDK 11 and found the same behaviour.)
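
The filtering step described above can be sketched roughly as follows. This is an illustration only, not the actual ZooKeeper code: the class name ReachabilityFilterSketch, the methods probe and filterReachable, and the injectable reachability check are all hypothetical. Only InetAddress.isReachable, the 500ms timeout and the exception type come from the report.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; not ZooKeeper's actual implementation.
class ReachabilityFilterSketch {

    // Timeout value quoted in the report above.
    static final int TIMEOUT_MS = 500;

    // Probes one address with InetAddress.isReachable.
    // Unresolved hosts and I/O errors are treated as unreachable.
    static boolean probe(InetSocketAddress address) {
        InetAddress ip = address.getAddress(); // null if the host name is unresolved
        if (ip == null) {
            return false;
        }
        try {
            return ip.isReachable(TIMEOUT_MS);
        } catch (IOException e) {
            return false;
        }
    }

    // Keeps only the addresses accepted by the given check and fails with
    // NoRouteToHostException when none of them pass.
    static List<InetSocketAddress> filterReachable(
            List<InetSocketAddress> candidates,
            Predicate<InetSocketAddress> reachabilityCheck) throws NoRouteToHostException {
        List<InetSocketAddress> reachable = new ArrayList<>();
        for (InetSocketAddress candidate : candidates) {
            if (reachabilityCheck.test(candidate)) {
                reachable.add(candidate);
            }
        }
        if (reachable.isEmpty()) {
            throw new NoRouteToHostException("No valid address among " + candidates);
        }
        return reachable;
    }

    public static void main(String[] args) {
        // A single loopback address; the port is only carried along, the probe ignores it.
        List<InetSocketAddress> candidates = Collections.singletonList(
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 4190));
        long start = System.nanoTime();
        try {
            List<InetSocketAddress> reachable =
                    filterReachable(candidates, ReachabilityFilterSketch::probe);
            System.out.println("Reachable: " + reachable);
        } catch (NoRouteToHostException e) {
            System.out.println("Filter failed: " + e.getMessage());
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Probe took " + elapsedMs + " ms");
    }
}
```

Running main from many threads or processes at once on the affected machine may help show whether the loopback probe itself exceeds the timeout under load. Passing the check as a Predicate keeps the filtering logic testable without touching the network.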

 

On Linux, I haven't been able to reproduce the problem. I tried with 5, 9, 15 
and 23 ensemble members and the quorum always seems to start properly in a few 
seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
