Hi Thomas, I did not get these split-brain cases (probably our use case is simpler), but we did get the spammed ZK phenomenon.
The easiest way to fix it is to:

1. Shut down all the Solr servers in the failing cluster
2. Connect to ZK using its CLI
3. rmr /overseer/queue
4. Restart Solr

I think this is way faster than the gist you posted.

Ugo

On Jan 7, 2015 11:02 AM, "Thomas Lamy" <t.l...@cytainment.de> wrote:

> Hi there,
>
> we are running a 3-server cloud serving a dozen
> single-shard/replicate-everywhere collections. The 2 biggest collections
> are ~15M docs, and about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZK 3.4.5,
> Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.
>
> 10 of the 12 collections (the small ones) get filled by DIH full-import
> once a day starting at 1am. The second biggest collection is updated using
> DIH delta-import every 10 minutes; the biggest one gets bulk JSON updates
> with commits once every 5 minutes.
>
> On a regular basis, we have a leader information mismatch:
> org.apache.solr.update.processor.DistributedUpdateProcessor; Request says
> it is coming from leader, but we are the leader
> or the opposite:
> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
> says we are the leader, but locally we don't think so
>
> One of these pops up once a day at around 8am, sending either some cores
> into "recovery failed" state, or all cores of at least one cloud node
> into state "gone".
> This started out of the blue about 2 weeks ago, without any changes to
> software, data, or client behaviour.
>
> Most of the time, we get things going again by restarting Solr on the
> current leader node, forcing a new election - can this be triggered while
> keeping Solr (and the caches) up?
> But sometimes this doesn't help; we had an incident last weekend where our
> admins didn't restart in time, creating millions of entries in
> /solr/overseer/queue, making ZK close the connection, and leader
> re-election fail. I had to flush ZK and re-upload the collection config to
> get Solr up again (just like in
> https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
>
> We have a much bigger cloud (7 servers, ~50GiB data in 8 collections, 1500
> requests/s) up and running, which has not had these problems since
> upgrading to 4.10.2.
>
> Any hints on where to look for a solution?
>
> Kind regards
> Thomas
>
> --
> Thomas Lamy
> Cytainment AG & Co KG
> Nordkanalstrasse 52
> 20097 Hamburg
>
> Tel.: +49 (40) 23 706-747
> Fax: +49 (40) 23 706-139
> Sitz und Registergericht Hamburg
> HRA 98121
> HRB 86068
> Ust-ID: DE213009476
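
P.S. In case it helps, step 3 can also be done programmatically instead of via zkCli. Below is a minimal sketch using the plain ZooKeeper Java client; the connect string and the /solr chroot are assumptions based on the path you mentioned, so adjust them for your setup. It clears the queued Overseer operations (the children of /overseer/queue), which is the part the rmr really needs to accomplish.

    import java.util.List;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class OverseerQueueCleanup {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble backing the failing cluster; host names and the
            // /solr chroot are assumed here and should be replaced with your own.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181/solr", 30000,
                    new Watcher() {
                        public void process(WatchedEvent event) { /* no-op watcher */ }
                    });
            try {
                String queue = "/overseer/queue";
                // The queue entries (qn-...) are leaf nodes, so a single level of
                // deletes is enough; version -1 means "delete regardless of version".
                List<String> children = zk.getChildren(queue, false);
                for (String child : children) {
                    zk.delete(queue + "/" + child, -1);
                }
                System.out.println("Removed " + children.size() + " queued operations");
            } finally {
                zk.close();
            }
        }
    }

Run it only while all Solr nodes are stopped (step 1), then restart Solr as in step 4.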