Hi Alan,
thanks for the pointer, I'll take a look at our GC logs.
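
In case GC logging isn't enabled on those nodes yet, this is roughly what I'd 
add to Tomcat's bin/setenv.sh for our Java 7 / Tomcat 7 setup (flag selection 
and log path are just a sketch, not our actual config):

  # enable detailed GC logging (HotSpot 1.7 flags)
  CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
    -Xloggc:/var/log/tomcat7/gc.log"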

On 07.01.2015 at 15:46, Alan Woodward wrote:
I had a similar issue, which was caused by 
https://issues.apache.org/jira/browse/SOLR-6763.  Are you getting long GC 
pauses or similar before the leader mismatches occur?

Alan Woodward
www.flax.co.uk


On 7 Jan 2015, at 10:01, Thomas Lamy wrote:

Hi there,

we are running a 3-server cloud serving a dozen 
single-shard/replicate-everywhere collections. The two biggest collections hold 
~15M docs and are about 13 GiB / 2.5 GiB in size. Solr is 4.10.2, ZK 3.4.5, 
Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.

10 of the 12 collections (the small ones) get filled by a DIH full-import once a 
day, starting at 1am. The second biggest collection is updated using DIH 
delta-import every 10 minutes; the biggest one gets bulk JSON updates with 
commits every 5 minutes.
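
For reference, the updates are triggered roughly like this (host, port and 
collection names are placeholders, and the /dataimport path is whatever the 
handler is registered under in solrconfig.xml):

  # nightly full import (small collections)
  curl 'http://localhost:8080/solr/<collection>/dataimport?command=full-import'
  # delta import every 10 minutes (second biggest collection)
  curl 'http://localhost:8080/solr/<collection>/dataimport?command=delta-import'
  # bulk JSON updates with a commit (biggest collection)
  curl 'http://localhost:8080/solr/<collection>/update?commit=true' \
       -H 'Content-Type: application/json' --data-binary @bulk.json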

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it is 
coming from leader, but we are the leader
or the opposite:
org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says 
we are the leader, but locally we don't think so

One of these pops up once a day at around 8am, sending either some cores into 
"recovery failed" state, or all cores of at least one cloud node into "gone" state. 
This started out of the blue about 2 weeks ago, without any changes to 
software, data, or client behaviour.

Most of the time we get things going again by restarting Solr on the current 
leader node, forcing a new election - can this be triggered while keeping Solr 
(and its caches) up?
But sometimes this doesn't help: we had an incident last weekend where our 
admins didn't restart in time, which created millions of entries in 
/solr/overseer/queue, made ZK close the connection, and left the leader 
re-election failing. I had to flush ZK and re-upload the collection configs to 
get Solr up again 
(just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
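
For the record, that recovery went roughly like this (host names are 
placeholders, and the /solr prefix reflects the ZK chroot in our setup):

  # drop the flooded overseer queue with the ZooKeeper CLI
  zkCli.sh -server zk1:2181 rmr /solr/overseer/queue
  # re-upload the collection config with Solr's cloud-scripts zkcli.sh
  zkcli.sh -zkhost zk1:2181/solr -cmd upconfig \
    -confdir /path/to/conf -confname <configname>
  # then restart the Solr nodes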

We have a much bigger cloud (7 servers, ~50 GiB of data in 8 collections, 1500 
requests/s) up and running, which has not had these problems since upgrading 
to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.:     +49 (40) 23 706-747
Fax:     +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476




