I feel fairly certain that this thread will be an annoyance. I don't know enough about ZooKeeper to answer the questions that are being asked, so I apologize for needing to relay questions about ZK fault tolerance in two datacenters.

It seems that everyone wants to avoid the expense of a tie-breaker ZK VM in a third datacenter. The scenario, which this list has seen over and over: DC1 - three ZK servers, one or more Solr servers. DC2 - two ZK servers, one or more Solr servers. I've already explained that if DC2 goes down, everything's fine, but if DC1 goes down, Solr goes read-only, because the two surviving ZK servers can't form a majority of the five-member ensemble, and there's no way to prevent that.

The conversation went further, and I'm sure you guys have seen this before too: "Is there any way we can get DC2 back to operational with manual intervention if DC1 goes down?" I explained that any manual intervention would briefly take Solr down ... at which point the following proposal was mentioned: add an observer node to DC2, and in the event DC1 goes down, run a script that reconfigures all the ZK servers to change the observer to a voting member and does rolling restarts.
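If I've understood their proposal correctly, the configs would look something like this. The hostnames are invented for illustration. During normal operation, every node would carry the same server list, with server 6 marked as an observer (server 6 would also need peerType=observer in its own zoo.cfg):

  # zoo.cfg during normal operation, same on all six nodes.
  # Servers 1-3 are the DC1 voters, 4-5 the DC2 voters, and
  # 6 is the DC2 observer.
  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=/var/lib/zookeeper
  clientPort=2181
  server.1=zk1-dc1:2888:3888
  server.2=zk2-dc1:2888:3888
  server.3=zk3-dc1:2888:3888
  server.4=zk1-dc2:2888:3888
  server.5=zk2-dc2:2888:3888
  server.6=zk3-dc2:2888:3888:observer

After DC1 fails, their script would rewrite zoo.cfg on the three DC2 nodes to drop the DC1 servers and the observer marker (and remove peerType=observer from server 6's own config), then do rolling restarts, leaving a three-member ensemble where two votes make a quorum:

  # Server list in the emergency zoo.cfg, on the DC2 nodes only.
  server.4=zk1-dc2:2888:3888
  server.5=zk2-dc2:2888:3888
  server.6=zk3-dc2:2888:3888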
Will their proposal work? What happens when DC1 comes back online? As you know, DC1 will contain a partial ensemble that still has quorum under its old five-member config, about to rejoin what it THINKS is a partial ensemble *without* quorum, which is not what it will find: the reconfigured DC2 ensemble will have a quorum of its own. I'm guessing that ZK assumes the question of who has the "real" quorum shouldn't ever need to be negotiated, because the rules prevent multiple partitions from gaining quorum.

Solr currently ships with ZooKeeper 3.4.6, but the next version of Solr (about to drop any day now) will have 3.4.10. Once 3.5 is released and Solr is updated to use it, does the situation I've described above change in any meaningful way?
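For what it's worth, my understanding is that 3.5's dynamic reconfiguration would at least let a script like theirs skip the config rewrites and rolling restarts. I haven't tried this, but I believe the zkCli.sh commands would look something like the following (same invented hostnames):

  # Drop the unreachable DC1 voters, then re-add server 6 as a
  # participant instead of an observer.
  reconfig -remove 1,2,3
  reconfig -remove 6
  reconfig -add server.6=zk3-dc2:2888:3888:participant;2181

Though I'd guess a reconfig still has to be committed by a quorum of the existing ensemble, which DC2 wouldn't have after losing DC1, so perhaps even that doesn't change anything.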
Thanks,
Shawn