Option 1 sounds good to me. However, I'd recommend documenting it (setting
net.inet.icmp.icmplim if the error is hit) within ZK itself. Also happy to
update zkconf with the detail.

https://github.com/phunt/zkconf/commit/281ad019e1d497a94f7168aa1c74053687667225

Thanks for digging into this!

Regards,

Patrick

On Fri, Jan 17, 2020 at 5:17 AM Szalay-Bekő Máté <szalay.beko.m...@gmail.com>
wrote:

> TLDR:
> While testing the RC for 3.6.0, we found that a ZooKeeper cluster with a
> large number of ensemble members (e.g. 23) cannot start properly. This
> issue seems to happen only on Mac, and a workaround is to disable the ICMP
> throttling. The question is whether this workaround is enough for the RC,
> or whether we should change the code in ZooKeeper to limit the number of
> ICMP requests.
>
>
> The problem:
>
> On Linux, I haven't been able to reproduce the problem. I tried with 5, 9,
> 15 and 23 ensemble members and the quorum always seems to start properly in
> a few seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04)
>
> On Mac, the problem happens consistently for large ensembles. The server
> is very slow to start and we see a lot of warnings in the log like these:
>
> 2020-01-15 20:02:13,431 [myid:13] - WARN
>  [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691]
> - None of the addresses (/192.168.1.91:4190) are reachable for sid 10
> java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]
>
> 2020-01-17 11:02:26,177 [myid:4] - WARN
>  [Thread-2531:QuorumCnxManager$SendWorker@1269] - destination address
> /127.0.0.1 not reachable anymore, shutting down the SendWorker for sid 6
>
> The exception happens when the new MultiAddress feature tries to filter
> the unreachable hosts out of the address list while deciding which
> election address to connect to. This involves calling the
> InetAddress.isReachable method with a default timeout of 500ms, which goes
> down to a native call in Java and basically tries to ping the host (an
> ICMP echo request). Naturally, localhost should always be reachable. This
> call times out on Mac if we have many ensemble members. I tested with 9
> members and the cluster started properly. With 11, 13 and 15 members it
> took more and more time to get the cluster to start, and the
> "NoRouteToHostException" started to appear in the logs. After around 1
> minute the 15-member cluster started, but obviously this is way too long.
>
> On Mac, the ICMP rate limit is set to 250 by default. You can turn this
> off using the following command:
>
>   sudo sysctl -w net.inet.icmp.icmplim=0
>
> (see https://krypted.com/mac-os-x/disable-icmp-rate-limiting-os-x/)
>
> Running the above command before starting the 23-member cluster locally
> seems to solve the problem for me. (Can someone verify?) The question is
> whether this workaround is enough.
>
> As far as I can tell, the current code will generate 2*A*(M-1) ICMP calls
> in each ZooKeeper server during startup, where 'M' is the number of
> ensemble members and 'A' is the number of election addresses provided for
> each member. This is not that high if each ZooKeeper server is started on
> a different machine, but if we start a lot of ZooKeeper servers on a
> single machine, then it can quickly go beyond the predefined limit of 250
> on Mac.
>
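
A quick back-of-the-envelope sketch of the 2*A*(M-1) estimate above, in Python. The function names and the single-host total (M servers each issuing that many probes) are illustrative assumptions, not anything taken from the ZooKeeper code. Comparing a startup total against 250 is only a rough indicator, since the limit is a rate and the probes are spread over the startup period.

```python
def icmp_probes_per_server(members: int, addresses_per_member: int) -> int:
    """Startup reachability probes issued by one server: 2 * A * (M - 1)."""
    return 2 * addresses_per_member * (members - 1)


def icmp_probes_on_single_host(members: int, addresses_per_member: int) -> int:
    """Total probes when all M servers run on the same machine (assumption)."""
    return members * icmp_probes_per_server(members, addresses_per_member)


MAC_ICMP_LIMIT = 250  # default net.inet.icmp.icmplim value quoted in this thread

if __name__ == "__main__":
    for m in (5, 9, 15, 23):
        per_server = icmp_probes_per_server(m, 1)
        total = icmp_probes_on_single_host(m, 1)
        # Columns: members, probes per server, total on one host, above limit?
        print(m, per_server, total, total > MAC_ICMP_LIMIT)
```

With one election address per member, 23 members gives 44 probes per server and 1012 in total on a single host, while 9 members gives 144 in total.
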
> OPTION 1: we keep the code as it is. We might change the documentation for
> zkconf to mention this Mac-specific issue and how to disable the ICMP rate
> limit.
>
> OPTION 2: we change the code not to filter the list of election addresses
> if the list has only a single element. This seems to be a logical way to
> decrease the ICMP requests. However, if we ran a large number of ZooKeeper
> servers on a single machine using multiple election addresses for each
> server, we would get the same problem (most probably even quicker).
>
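
Option 2 could look roughly like the sketch below. It is written in Python for brevity and is not the actual QuorumCnxManager or MultiAddress code. `filter_reachable` and the injected `is_reachable` probe are hypothetical names standing in for the isReachable-based check.

```python
from typing import Callable, List


def filter_reachable(
    addresses: List[str], is_reachable: Callable[[str], bool]
) -> List[str]:
    """Return the addresses worth trying for an election connection."""
    # With zero or one address there is nothing to choose between,
    # so skip the probe (and the ICMP traffic it would cause).
    if len(addresses) <= 1:
        return list(addresses)
    # With several addresses, keep only those that answer the probe.
    return [addr for addr in addresses if is_reachable(addr)]


if __name__ == "__main__":
    probed: List[str] = []

    def fake_probe(addr: str) -> bool:
        # Hypothetical stand-in for a real reachability check.
        probed.append(addr)
        return addr != "10.0.0.2:4190"

    print(filter_reachable(["127.0.0.1:4190"], fake_probe))
    print(filter_reachable(["10.0.0.1:4190", "10.0.0.2:4190"], fake_probe))
    print(probed)  # only the two-address case was probed
```

As noted in the quoted text, this only helps single-address setups. With several election addresses per server the probes are still sent.
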
> OPTION 3: make the address filtering configurable and change zkconf to
> disable it by default. (But disabling it will make the quorum potentially
> unable to recover during network failures, so it is not recommended in
> production.)
>
> OPTION 4: refactor the MultiAddress feature and remove the ICMP calls from
> the ZooKeeper code. However, they clearly help with quick recovery during
> network failures... at the moment I can't think of any good solution that
> avoids them.
>
> Kind regards,
> Mate
>
