I feel fairly certain that this thread will be an annoyance. I don't know enough about ZooKeeper to answer the questions that are being asked, so I apologize for needing to relay questions about ZK fault tolerance in two datacenters.

It seems that everyone wants to avoid the expense of a tie-breaker ZK VM in a third datacenter. The scenario, which this list has seen over and over: DC1 - three ZK servers, one or more Solr servers. DC2 - two ZK servers, one or more Solr servers. I've already explained that if DC2 goes down, everything's fine, but if DC1 goes down, Solr goes read-only, because the two surviving ZK servers can't form a majority of the five-member ensemble, and there's no way to prevent that.

The conversation went further, and I'm sure you guys have seen this before too: "Is there any way we can get DC2 back to operational with manual intervention if DC1 goes down?" I explained that any manual intervention would briefly take Solr down ... at which point the following proposal was mentioned: add an observer node to DC2, and in the event DC1 goes down, run a script that reconfigures all the ZK servers to change the observer to a voting member and does rolling restarts.
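If I've understood their proposal correctly, the configs would look something like this. The hostnames are invented for illustration. During normal operation, every node would carry the same server list, with server 6 marked as an observer (server 6 would also need peerType=observer in its own zoo.cfg):

  # zoo.cfg during normal operation, same on all six nodes.
  # Servers 1-3 are the DC1 voters, 4-5 the DC2 voters, and
  # 6 is the DC2 observer.
  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=/var/lib/zookeeper
  clientPort=2181
  server.1=zk1-dc1:2888:3888
  server.2=zk2-dc1:2888:3888
  server.3=zk3-dc1:2888:3888
  server.4=zk1-dc2:2888:3888
  server.5=zk2-dc2:2888:3888
  server.6=zk3-dc2:2888:3888:observer

After DC1 fails, their script would rewrite zoo.cfg on the three DC2 nodes to drop the DC1 servers and the observer marker (and remove peerType=observer from server 6's own config), then do rolling restarts, leaving a three-member ensemble where two votes make a quorum:

  # Server list in the emergency zoo.cfg, on the DC2 nodes only.
  server.4=zk1-dc2:2888:3888
  server.5=zk2-dc2:2888:3888
  server.6=zk3-dc2:2888:3888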
Will their proposal work? What happens when DC1 comes back online? As you know, DC1 will contain a partial ensemble that still has quorum under its old five-member config, about to rejoin what it THINKS is a partial ensemble *without* quorum, which is not what it will find: the reconfigured DC2 ensemble will have a quorum of its own. I'm guessing that ZK assumes the question of who has the "real" quorum shouldn't ever need to be negotiated, because the rules prevent multiple partitions from gaining quorum.

Solr currently ships with ZooKeeper 3.4.6, but the next version of Solr (about to drop any day now) will have 3.4.10. Once 3.5 is released and Solr is updated to use it, does the situation I've described above change in any meaningful way?
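For what it's worth, my understanding is that 3.5's dynamic reconfiguration would at least let a script like theirs skip the config rewrites and rolling restarts. I haven't tried this, but I believe the zkCli.sh commands would look something like the following (same invented hostnames):

  # Drop the unreachable DC1 voters, then re-add server 6 as a
  # participant instead of an observer.
  reconfig -remove 1,2,3
  reconfig -remove 6
  reconfig -add server.6=zk3-dc2:2888:3888:participant;2181

Though I'd guess a reconfig still has to be committed by a quorum of the existing ensemble, which DC2 wouldn't have after losing DC1, so perhaps even that doesn't change anything.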
Thanks,
Shawn