Thanks for all the comments!

I agree that the 23-node ZK cluster is not a production-like setup, but
it is still valuable for testing.
Also, I agree with Enrico. I can actually imagine a production environment
where ICMP is completely disabled for security reasons. It is not very
likely, as ZooKeeper usually runs deep in the backend, but who knows... I
think there is value in making this 'new behaviour' configurable. We just
did it with a small patch already:
https://github.com/apache/zookeeper/pull/1228.
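
Just as a rough sketch of what the toggle could look like in zoo.cfg (the
property name below is only illustrative; the actual name is in the PR):

    # illustrative name only; when disabled, the multi-address feature
    # skips the ICMP reachability check and uses the configured addresses
    # directly
    multiAddress.reachabilityCheckEnabled=false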

I was also thinking about how we could remove ICMP completely from the
multi-address feature in the long term, so I created a follow-up ticket
here: https://issues.apache.org/jira/browse/ZOOKEEPER-3705
(hopefully I can work on it in February)

Kind regards,
Mate

On Thu, Jan 23, 2020 at 8:23 PM Enrico Olivelli <[email protected]> wrote:

> I feel that Option 2 is more conservative: the multi-address feature is
> new in 3.6 and in my opinion it won't be used by current users of 3.4
> and 3.5, at least not immediately after an upgrade, because it needs a
> different network architecture.
>
> With option 1, even if you do not use the multi-address property, you
> will be seeing extra traffic (ICMP) between your hosts, and maybe this
> fact won't be well received.
> With option 2 the behaviour is the same as in ZK 3.5.
> This is why I prefer option 2.
>
>
> Enrico
>
> On Thu, 23 Jan 2020, 20:15 Patrick Hunt <[email protected]> wrote:
>
> > Agree with both folks (ted/michael) - I view this as a "chaos monkey"
> > of sorts. If it runs with 5, shouldn't it run with 7 and so on.... I
> > don't remember why I chose 23; it's been 10 years or so that I've been
> > running this test. Don't do this at home, folks. ;-) Also, I don't just
> > try starting the cluster, I also kill servers, restart them and so on;
> > it's a very good stress test for the quorum protocol, etc... Option 1
> > sounds fine to me, but I wanted to make sure the community reviewed it
> > before signing off on letting the code stand, or whatever, as long as
> > it's reviewed/understood given it was/is new behavior in 3.6 esp.
> > Conscious decision at the eod.
> >
> > Regards,
> >
> > Patrick
> >
> > On Thu, Jan 23, 2020 at 11:05 AM Michael K. Edwards
> > <[email protected]> wrote:
> >
> > > While I agree that this is not a very production-like configuration,
> > > I think it's good to recognize that there are plenty of clusters out
> > > there where more than five ZooKeeper nodes are called for.  I run
> > > systems routinely with seven voting members plus three or more
> > > observers, for reasons having to do with system behavior during
> > > network split scenarios in AWS EC2.  Mac OS specific issues aside, it
> > > would be unfortunate if there were an artificial cap on the number of
> > > nodes in a machine-local test cluster, especially if it were related
> > > to an ICMP storm scenario.
> > >
> > > On Thu, Jan 23, 2020, 8:11 AM Ted Dunning <[email protected]> wrote:
> > >
> > > > I think that this is far outside the normal operational bounds and
> > > > has an easy work-around.
> > > >
> > > > First, it is very uncommon to run more than 5 ZK nodes. Running 23
> > > > on a single host is bizarre (viewed through an operational lens).
> > > >
> > > > Second, there is a simple configuration change that makes the
> > > > strange configuration work anyway.
> > > >
> > > > A third point, unrelated to operational considerations, is that
> > > > there is risk in making last-minute changes to code. That risk is
> > > > borne by normal configurations as well as these unusual ones.
> > > >
> > > > In sum,
> > > >
> > > > - this might look like a P1 (system down) issue, but there is a
> > > > workaround, so it is certainly no more than P2
> > > >
> > > > - even P2 is unwarranted because this is a non-production
> > > > configuration
> > > >
> > > > - a P3 issue isn't a stop-ship issue.
> > > >
> > > >
> > > >
> > > > On Fri, Jan 17, 2020 at 5:17 AM Szalay-Bekő Máté
> > > > <[email protected]> wrote:
> > > >
> > > > > TLDR:
> > > > > While testing the RC for 3.6.0, we found that a ZooKeeper cluster
> > > > > with a large number of ensemble members (e.g. 23) cannot start
> > > > > properly. This issue seems to happen only on Mac, and a workaround
> > > > > is to disable the ICMP throttling. The question is whether this
> > > > > workaround is enough for the RC, or if we should change the code
> > > > > in ZooKeeper to limit the number of ICMP requests.
> > > > >
> > > > >
> > > > > The problem:
> > > > >
> > > > > On Linux, I haven't been able to reproduce the problem. I tried
> > > > > with 5, 9, 15 and 23 ensemble members and the quorum always seems
> > > > > to start properly in a few seconds. (I used OpenJDK 1.8.232 on
> > > > > Ubuntu 18.04)
> > > > >
> > > > > On Mac, the problem happens consistently with large ensembles.
> > > > > The server is very slow to start and we see a lot of warnings in
> > > > > the log like these:
> > > > >
> > > > > 2020-01-15 20:02:13,431 [myid:13] - WARN
> > > > > [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691]
> > > > > - None of the addresses (/192.168.1.91:4190) are reachable for
> > > > > sid 10
> > > > > java.net.NoRouteToHostException: No valid address among
> > > > > [/192.168.1.91:4190]
> > > > >
> > > > > 2020-01-17 11:02:26,177 [myid:4] - WARN
> > > > > [Thread-2531:QuorumCnxManager$SendWorker@1269] - destination
> > > > > address /127.0.0.1 not reachable anymore, shutting down the
> > > > > SendWorker for sid 6
> > > > >
> > > > > The exception happens when the new MultiAddress feature tries to
> > > > > filter the unreachable hosts from the address list while deciding
> > > > > which election address to connect to. This involves calling the
> > > > > InetAddress.isReachable method with a default timeout of 500ms,
> > > > > which goes down to a native call in Java and basically tries to
> > > > > do a ping (an ICMP echo request) to the host. Naturally, localhost
> > > > > should always be reachable. This call times out on Mac if we have
> > > > > many ensemble members. I tested with 9 members and the cluster
> > > > > started properly. With 11, 13 and 15 members it took more and more
> > > > > time to get the cluster to start, and the "NoRouteToHostException"
> > > > > started to appear in the logs. After around 1 minute the 15-member
> > > > > cluster started, but obviously this is way too long.
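> > > > >
> > > > > Just to illustrate the mechanism, here is a simplified sketch in
> > > > > Java (not the actual QuorumCnxManager code) of what such filtering
> > > > > boils down to; every isReachable() call can trigger an ICMP echo
> > > > > request:
> > > > >
> > > > > import java.net.InetAddress;
> > > > > import java.util.List;
> > > > > import java.util.stream.Collectors;
> > > > >
> > > > > class ReachabilityFilterSketch {
> > > > >     // default timeout mentioned above
> > > > >     private static final int TIMEOUT_MS = 500;
> > > > >
> > > > >     // keep only the addresses that answer within the timeout
> > > > >     static List<InetAddress> reachableOnly(List<InetAddress> addresses) {
> > > > >         return addresses.stream()
> > > > >                 .filter(a -> { try { return a.isReachable(TIMEOUT_MS); }
> > > > >                                catch (Exception e) { return false; } })
> > > > >                 .collect(Collectors.toList());
> > > > >     }
> > > > > }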
> > > > >
> > > > > On Mac, the ICMP rate limit is set to 250 by default. You can
> > > > > turn this off using the following command:
> > > > > sudo sysctl -w net.inet.icmp.icmplim=0
> > > > > (see https://krypted.com/mac-os-x/disable-icmp-rate-limiting-os-x/)
> > > > >
> > > > > Using the above command before starting the 23-member cluster
> > > > > locally seems to solve the problem for me. (Can someone verify?)
> > > > > The question is whether this workaround is enough or not.
> > > > >
> > > > > As far as I can tell, the current code will generate 2*A*(M-1)
> > > > > ICMP calls in each ZooKeeper server during startup, where 'M' is
> > > > > the number of ensemble members and 'A' is the number of election
> > > > > addresses provided for each member. This is not that high if each
> > > > > ZooKeeper server is started on a different machine, but if we
> > > > > start a lot of ZooKeeper servers on a single machine, then it can
> > > > > quickly go beyond the predefined limit of 250 for Mac.
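> > > > >
> > > > > For example (just rough numbers): with the 23-member test cluster
> > > > > and a single election address per member, each server makes
> > > > > 2*1*22 = 44 reachability checks, so the 23 servers running on one
> > > > > machine issue roughly 1000 ICMP echo requests during startup; if
> > > > > these arrive in a short burst, they easily exceed the 250 limit.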
> > > > >
> > > > > OPTION 1: we keep the code as it is. We might change the
> > > > > documentation for zkconf to mention this Mac-specific issue and
> > > > > how to disable the ICMP rate limit.
> > > > >
> > > > > OPTION 2: we change the code not to filter the list of election
> > > > > addresses if the list has only a single element. This seems to be
> > > > > a logical way to decrease the number of ICMP requests. However, if
> > > > > we ran a large number of ZooKeeper servers on a single machine
> > > > > using multiple election addresses for each server, we would get
> > > > > the same problem (most probably even quicker). A minimal sketch of
> > > > > this guard is shown after the options.
> > > > >
> > > > > OPTION 3: make the address filtering configurable and change
> > > > > zkconf to disable it by default. (But disabling it will make the
> > > > > quorum potentially unable to recover during network failures, so
> > > > > it is not recommended in production.)
> > > > >
> > > > > OPTION 4: refactor the MultiAddress feature and remove the ICMP
> > > > > calls from the ZooKeeper code. However, it clearly helps with
> > > > > quick recovery during network failures... at the moment I can't
> > > > > think of any good solution that avoids it.
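> > > > >
> > > > > Just to make OPTION 2 concrete, the guard mentioned above would
> > > > > roughly look like this (again only a sketch, not the actual code),
> > > > > reusing the reachableOnly() helper sketched earlier:
> > > > >
> > > > > static List<InetAddress> filterForElection(List<InetAddress> addresses) {
> > > > >     // single (or no) address: nothing to choose from, so skip the
> > > > >     // reachability check entirely and generate no ICMP at all
> > > > >     if (addresses.size() <= 1) {
> > > > >         return addresses;
> > > > >     }
> > > > >     return reachableOnly(addresses);
> > > > > }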
> > > > >
> > > > > Kind regards,
> > > > > Mate
> > > > >
> > > >
> > >
> >
>
