I suspect you may be hitting ZOOKEEPER-2325 <https://issues.apache.org/jira/browse/ZOOKEEPER-2325> / ZOOKEEPER-261 <https://issues.apache.org/jira/browse/ZOOKEEPER-261>, which could possibly cause data loss. Consider this case: we have servers A, B, and C, but for some reason A and B get replaced by Exhibitor with empty data directories. Then C is down (or slow to respond), so A and B alone form a quorum and one of them gets elected leader. When C later reaches the leader, it truncates its own data to match. This is an extreme case (complete data loss), but it sounds possible.
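
To make that sequence concrete, here is a toy Python sketch of the outcome. It is not ZooKeeper code: the `Server` class, `elect`, and `sync_from_leader` are hypothetical names made up for illustration, and the rules are deliberately simplified (real ZooKeeper votes on epoch, zxid, and server id, and has several ways of syncing a follower). It only assumes what is described above: the leader is picked from whichever servers take part in the election, and a follower that joins afterwards ends up with the leader's history.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for one ensemble member."""
    sid: str
    txns: list = field(default_factory=list)  # applied transactions, oldest first

    @property
    def last_zxid(self) -> int:
        # Toy zxid: just the number of transactions this server has seen.
        return len(self.txns)


def elect(voters):
    # Simplified rule: highest last zxid wins, ties broken by server id.
    return max(voters, key=lambda s: (s.last_zxid, s.sid))


def sync_from_leader(follower, leader):
    # Simplified rule: the follower ends up with exactly the leader's history,
    # so anything the leader never saw is dropped.
    follower.txns = list(leader.txns)


history = ["create /configs/solr", "set /configs/solr"]

# A and B were replaced with empty data directories; only C still has the data.
a, b, c = Server("A"), Server("B"), Server("C", list(history))

# C is down or slow, so only A and B vote (2 of 3 is still a quorum).
leader = elect([a, b])

# C comes back and syncs to a leader that has no history.
sync_from_leader(c, leader)
print(leader.sid, c.txns)

# For contrast: had C taken part in the election, it would have won.
winner_with_c = elect([Server("A"), Server("B"), Server("C", list(history))]).sid
```

In this toy model the only thing that decides whether the data survives is whether C takes part in the election, which is why the timing of the restarts matters.
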
Do we have Exhibitor logs showing what Exhibitor did? As you mentioned, things were running fine before Exhibitor was introduced, so something Exhibitor did could have caused this, such as reinitializing a server or purging a data directory.

On Thu, Jan 5, 2017 at 2:27 PM, Washko, Daniel <[email protected]> wrote:

> I am trying to get to the bottom of the cause for loss of configurations
> for Solr cloud stored in a Zookeeper ensemble. We have been running 4 Solr
> clouds in our data centers for about 5 years now with no problems. About 2
> years ago we started adding more clouds specifically in AWS. During those
> two years, we have had instances where the Solr configurations stored in
> Zookeeper have just disappeared. About a year ago we added some new Solr
> clouds to our own datacenters and experienced two instances of the Solr
> configurations disappearing in Zookeeper. The difference between our
> original Solr Clouds instances and the ones we have spun up in the past two
> years is that we are using Exhibitor for Zookeeper Ensemble management.
>
> We have not been able to find anything in the logs indicating why this
> problem happens. We have not been able to replicate the problem reliably.
> The closest I have come is when adding new Zookeepers to an ensemble and
> performing a rolling restart via Exhibitor, there have been a few instances
> where pretty much everything stored in Zookeeper has been deleted.
> Everything except the Zookeeper information itself. We have asked around on
> Exhibitor support channels and done a lot of searching but have come up
> empty handed in regards to a solution or discovering other people who have
> had this issue.
>
> What I suspect is happening is that when rolling restarts happen, if the
> node that becomes the leader is a new node that has not had the data
> replicated to it, when new nodes join to this leader, they see the leader
> is without the data they have stored and thus they should delete said data.
> In the cases where we are not adding new nodes, I suspect that there might
> be an issue causing the zookeeper node to fail or appear failed to Exhibitor.
> A rolling restart occurs to remove this node. When exhibitor registers the
> zookeeper is available, Exhibitor initiates a rolling restart to bring the
> node back in. For some reason the data is corrupted or lost on that node
> and this is the node that becomes the leader. The remaining nodes that join
> to this leader then dump their data to match the leader.
>
> Does this scenario sound plausible? If a newly added node that does not
> have data replicated to it is added to a zookeeper ensemble and the
> zookeepers are restarted with the new node becoming the leader, could this
> prompt the data stored in Zookeeper to be deleted?
>
> --
> *Daniel S Washko*
> Solutions Architect
>
> Phone: 757 667 1463
> [email protected]
> gannett.com <http://www.gannett.com/>

--
Cheers
Michael.
