Thanks for all the comments! I agree that the 23-node ZK cluster is not a production-like setup, but it is still valuable for testing. I also agree with Enrico. I can actually imagine a production environment where ICMP is completely disabled for security reasons. Although it is not very likely, as ZooKeeper usually runs deep in the backend. But who knows... I think there is value in making this 'new behaviour' configurable. We already did it with a small patch: https://github.com/apache/zookeeper/pull/1228.
I was also thinking about how we could remove ICMP completely from the multi-address feature in the long term, so I created a follow-up ticket here: https://issues.apache.org/jira/browse/ZOOKEEPER-3705 (hopefully I can work on it during February)

Kind regards,
Mate

On Thu, Jan 23, 2020 at 8:23 PM Enrico Olivelli <[email protected]> wrote:

> I feel that Option 2 is more conservative. The multi-address feature is new in 3.6 and in my opinion it won't be used by current users of 3.4 and 3.5, at least not immediately after an upgrade, because it needs a different network architecture.
>
> If you do not use the multi-address property, then with option 1 you will be seeing extra traffic (ICMP) between your hosts, and maybe this fact won't be well received. With option 2 the behaviour is the same as ZK 3.5. This is why I prefer option 2.
>
> Enrico
>
> On Thu, Jan 23, 2020 at 8:15 PM Patrick Hunt <[email protected]> wrote:
>
> > Agree with both folks (Ted/Michael) - I view this as a "chaos monkey" of sorts. If it runs with 5, shouldn't it run with 7, and so on... I don't remember why I chose 23; it's been 10 years or so that I've been running this test. Don't do this at home, folks. ;-) Also, I don't just try starting the cluster: I also kill servers, restart them and so on. It's a very good stress test for the quorum protocol, etc. Option 1 sounds fine to me, but I wanted to make sure the community reviewed it before signing off on letting the code stand (or whatever), as long as it's reviewed/understood, given it was/is new behaviour in 3.6 especially. Conscious decision at the EOD.
> >
> > Regards,
> >
> > Patrick
> >
> > On Thu, Jan 23, 2020 at 11:05 AM Michael K. Edwards <[email protected]> wrote:
> >
> > > While I agree that this is not a very production-like configuration, I think it's good to recognize that there are plenty of clusters out there where more than five ZooKeeper nodes are called for.
> > > I run systems routinely with seven voting members plus three or more observers, for reasons having to do with system behavior during network split scenarios in AWS EC2. Mac OS specific issues aside, it would be unfortunate if there were an artificial cap on the number of nodes in a machine-local test cluster, especially if it were related to an ICMP storm scenario.
> > >
> > > On Thu, Jan 23, 2020, 8:11 AM Ted Dunning <[email protected]> wrote:
> > >
> > > > I think that this is far outside the normal operation bounds and has an easy work-around.
> > > >
> > > > First, it is very uncommon to run more than 5 ZK nodes. Running 23 on a single host is bizarre (viewed through an operational lens).
> > > >
> > > > Second, there is a simple configuration change that makes the strange configuration work anyway.
> > > >
> > > > A third point, unrelated to operational considerations, is that there is risk in making last-minute changes to code. That risk is borne by normal configurations as well as these unusual ones.
> > > >
> > > > In sum:
> > > >
> > > > - this might look like a P1 (system down) issue, but there is a workaround, so it is certainly no more than P2
> > > >
> > > > - even P2 is unwarranted because this is a non-production configuration
> > > >
> > > > - a P3 issue isn't a stop-ship issue.
> > > >
> > > > On Fri, Jan 17, 2020 at 5:17 AM Szalay-Bekő Máté <[email protected]> wrote:
> > > >
> > > > > TLDR:
> > > > > During testing of the RC for 3.6.0, we found that a ZooKeeper cluster with a large number of ensemble members (e.g. 23) cannot start properly. This issue seems to happen only on Mac, and a workaround is to disable the ICMP throttling.
> > > > > The question is if this workaround is enough for the RC, or if we should change the code in ZooKeeper to limit the number of ICMP requests.
> > > > >
> > > > > The problem:
> > > > >
> > > > > On Linux, I haven't been able to reproduce the problem. I tried with 5, 9, 15 and 23 ensemble members and the quorum always seems to start properly in a few seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04)
> > > > >
> > > > > On Mac, the problem happens consistently for large ensembles. The server is very slow to start and we see a lot of warnings in the log like these:
> > > > >
> > > > > 2020-01-15 20:02:13,431 [myid:13] - WARN [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691] - None of the addresses (/192.168.1.91:4190) are reachable for sid 10
> > > > > java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]
> > > > >
> > > > > 2020-01-17 11:02:26,177 [myid:4] - WARN [Thread-2531:QuorumCnxManager$SendWorker@1269] - destination address /127.0.0.1 not reachable anymore, shutting down the SendWorker for sid 6
> > > > >
> > > > > The exception happens when the new MultiAddress feature tries to filter the unreachable hosts from the address list when deciding which election address to connect to. This involves calling the InetAddress.isReachable method with a default timeout of 500 ms, which goes down to a native call in Java and basically tries to do a ping (an ICMP echo request) to the host. Naturally, localhost should always be reachable. This call times out on Mac if we have many ensemble members. I tested with 9 members and the cluster started properly.
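For context, here is a minimal sketch of the kind of reachability filtering described above. The class and method names are illustrative, not the actual QuorumCnxManager code; only the use of InetAddress.isReachable with a 500 ms timeout is taken from the report:

```java
import java.net.InetAddress;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ReachabilityFilterSketch {

    // Keep only the addresses the predicate reports as reachable. In the
    // multi-address feature the check is (roughly) addr.isReachable(500),
    // i.e. an ICMP echo request with a 500 ms timeout - which is what runs
    // into the Mac ICMP rate limit when many servers start on one host.
    static List<InetAddress> filterReachable(List<InetAddress> addrs,
                                             Predicate<InetAddress> reachable) {
        return addrs.stream().filter(reachable).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<InetAddress> addrs =
                Arrays.asList(InetAddress.getLoopbackAddress());

        // The real reachability check (may be rate-limited on Mac, see above):
        Predicate<InetAddress> ping = a -> {
            try {
                return a.isReachable(500);
            } catch (Exception e) {
                return false;
            }
        };
        System.out.println("reachable: " + filterReachable(addrs, ping));
    }
}
```

Note that isReachable itself is best-effort: without raised privileges the JDK may fall back to a TCP probe instead of a real ICMP echo, which is part of why its behavior differs across platforms.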
> > > > > With 11-13-15 members it took more and more time to get the cluster to start, and the "NoRouteToHostException" started to appear in the logs. After around 1 minute the cluster with 15 ensemble members started, but obviously this is way too long.
> > > > >
> > > > > On Mac, the ICMP rate limit is set to 250 by default. You can turn this off using the following command: sudo sysctl -w net.inet.icmp.icmplim=0
> > > > > (see https://krypted.com/mac-os-x/disable-icmp-rate-limiting-os-x/)
> > > > >
> > > > > Using the above command before starting the 23-member cluster locally seems to solve the problem for me. (Can someone verify?) The question is if this workaround is enough or not.
> > > > >
> > > > > As far as I can tell, the current code will generate 2*A*(M-1) ICMP calls in each ZooKeeper server during startup, where 'M' is the number of ensemble members and 'A' is the number of election addresses provided for each member. This is not that high if each ZooKeeper server is started on a different machine, but if we start a lot of ZooKeeper servers on a single machine, then it can quickly go beyond the predefined limit of 250 on Mac.
> > > > >
> > > > > OPTION 1: we keep the code as it is. We might change the documentation for zkconf, mentioning this Mac-specific issue and how to disable the ICMP rate limit.
> > > > >
> > > > > OPTION 2: we change the code not to filter the list of election addresses if the list has only a single element. This seems to be a logical way to decrease the ICMP requests.
> > > > > However, if we ran a large number of ZooKeeper servers on a single machine using multiple election addresses for each server, we would get the same problem (most probably even quicker).
> > > > >
> > > > > OPTION 3: make the address filtering configurable and change zkconf to disable it by default. (But disabling will make the quorum potentially unable to recover during network failures, so it is not recommended in production.)
> > > > >
> > > > > OPTION 4: refactor the MultiAddress feature and remove the ICMP calls from the ZooKeeper code. However, it clearly helps with quick recovery during network failures... at the moment I can't think of any good solution to avoid it.
> > > > >
> > > > > Kind regards,
> > > > > Mate
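To make the 2*A*(M-1) estimate quoted above concrete, here is a back-of-the-envelope sketch (not ZooKeeper code; the numbers just follow the formula from the email):

```java
public class IcmpBudgetSketch {

    // ICMP echo requests issued by one server during startup, per the
    // 2*A*(M-1) estimate above (M ensemble members, A election addresses
    // each). The formula is taken as stated in the email.
    static int icmpCallsPerServer(int members, int addressesPerMember) {
        return 2 * addressesPerMember * (members - 1);
    }

    public static void main(String[] args) {
        int m = 23, a = 1;                        // the 23-node local test cluster
        int perServer = icmpCallsPerServer(m, a); // 2*1*22 = 44
        int singleHost = m * perServer;           // all 23 servers on one host: 1012
        System.out.println(perServer + " ICMP calls per server, " + singleHost
                + " on a single host, vs. the Mac default limit of 250");
    }
}
```

So a 23-member ensemble on one machine generates roughly a thousand echo requests in a short startup window, which is well past the default net.inet.icmp.icmplim of 250 and explains why the problem only shows up for the machine-local many-member case.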
