Re: leader split-brain at least once a day - need help

2015-01-13 Thread Thomas Lamy

Hi Mark,

we're currently at 4.10.2, the update to 4.10.3 is scheduled for tomorrow.

T

Am 12.01.15 um 17:30 schrieb Mark Miller:

bq. ClusterState says we are the leader, but locally we don't think so

Generally this is due to some bug. One bug that can lead to it was recently
fixed in 4.10.3 I think. What version are you on?

- Mark

On Mon Jan 12 2015 at 7:35:47 AM Thomas Lamy t.l...@cytainment.de wrote:


Hi,

I found no big/unusual GC pauses in the log (at least manually; I found
no free solution to analyze them that worked out of the box on a
headless Debian Wheezy box). Eventually I tried -Xmx8G (was 64G
before) on one of the nodes, after checking that allocation after 1 hour
of run time was only about 2-3GB. That didn't move the time frame where
a restart was needed, so I don't think Solr's JVM GC is the problem.
We're trying to get all of our nodes' logs (zookeeper and solr) into
Splunk now, just to get a better sorted view of what's going on in the
cloud once a problem occurs. We're also enabling GC logging for
zookeeper; maybe we were missing problems there while focusing on solr
logs.

Thomas


Am 08.01.15 um 16:33 schrieb Yonik Seeley:

It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called split brain).
The async nature of updates (and thread scheduling), along with
stop-the-world GC pauses that can change leadership, causes these
little windows of inconsistency that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de wrote:

Hi there,

we are running a 3 server cloud serving a dozen
single-shard/replicate-everywhere collections. The 2 biggest collections
are ~15M docs, and about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZK 3.4.5,
Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.

10 of the 12 collections (the small ones) get filled by DIH full-import
once a day starting at 1am. The second biggest collection is updated using
DIH delta-import every 10 minutes, the biggest one gets bulk json updates
with commits every 5 minutes.

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request says
it is coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
says we are the leader, but locally we don't think so

One of these pops up once a day at around 8am, sending either some cores
into "recovery failed" state, or all cores of at least one cloud node into
state "gone".
This started out of the blue about 2 weeks ago, without changes to
software, data, or client behaviour.

Most of the time, we get things going again by restarting solr on the
current leader node, forcing a new election - can this be triggered while
keeping solr (and the caches) up?
But sometimes this doesn't help; we had an incident last weekend where our
admins didn't restart in time, creating millions of entries in
/solr/overseer/queue, making zk close the connection, and the leader
re-election fails. I had to flush zk and re-upload the collection config
to get solr up again (just like in
https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50GiB data in 8 collections, 1500
requests/s) up and running, which has not had these problems since
upgrading to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476





--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: leader split-brain at least once a day - need help

2015-01-13 Thread Shawn Heisey
On 1/12/2015 5:34 AM, Thomas Lamy wrote:
 I found no big/unusual GC pauses in the log (at least manually; I
 found no free solution to analyze them that worked out of the box on a
 headless Debian Wheezy box). Eventually I tried -Xmx8G (was 64G
 before) on one of the nodes, after checking that allocation after 1 hour
 of run time was only about 2-3GB. That didn't move the time frame where a
 restart was needed, so I don't think Solr's JVM GC is the problem.
 We're trying to get all of our nodes' logs (zookeeper and solr) into
 Splunk now, just to get a better sorted view of what's going on in the
 cloud once a problem occurs. We're also enabling GC logging for
 zookeeper; maybe we were missing problems there while focusing on
 solr logs.

If you make a copy of the gc log, you can put it on another system with
a GUI and graph it with this:

http://sourceforge.net/projects/gcviewer

Just double-click on the jar to run the program.  I find it is useful
for clarity on the graph to go to the View menu and uncheck everything
except the two GC Times options.  You can also change the zoom to a
lower percentage so you can see more of the graph.
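
If there's no GUI on the Solr box itself, the log can also be opened from
a shell on any desktop machine; a minimal sketch, assuming the jar was
downloaded as gcviewer-1.34.jar and the log copied over as solr-gc.log
(host name, paths and jar version are placeholders):

  # copy the GC log off the headless server, then open it in GCViewer
  scp solr-node1:/var/log/tomcat7/gc.log ./solr-gc.log
  java -jar gcviewer-1.34.jar ./solr-gc.log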

That program is how I got the graph you can see on my wiki page about GC
tuning:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Another possible problem is that your install is exhausting the thread
pool.  Tomcat defaults to a maxThreads value of only 200.  There's a
good chance that your setup will need more than 200 threads at least
occasionally.  If you're near the limit, having a thread problem once
per day based on index activity seems like a good possibility.  Try
setting maxThreads to 1 in the Tomcat config.
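
To get a rough idea of how close the install is to that limit while the
problem is happening, you could count the connector threads inside the
Tomcat JVM; a sketch only, assuming a single Tomcat 7 instance with the
default BIO HTTP connector on port 8080 (thread-name prefix and port are
assumptions for your setup):

  # count live HTTP request-processing threads in the Tomcat process
  jstack $(pgrep -f org.apache.catalina.startup.Bootstrap) \
    | grep -c 'http-bio-8080-exec'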

Thanks,
Shawn



Re: leader split-brain at least once a day - need help

2015-01-12 Thread Thomas Lamy

Hi,

I found no big/unusual GC pauses in the log (at least manually; I found
no free solution to analyze them that worked out of the box on a
headless Debian Wheezy box). Eventually I tried -Xmx8G (was 64G
before) on one of the nodes, after checking that allocation after 1 hour
of run time was only about 2-3GB. That didn't move the time frame where
a restart was needed, so I don't think Solr's JVM GC is the problem.
We're trying to get all of our nodes' logs (zookeeper and solr) into
Splunk now, just to get a better sorted view of what's going on in the
cloud once a problem occurs. We're also enabling GC logging for
zookeeper; maybe we were missing problems there while focusing on solr
logs.
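
For reference, the GC logging we're enabling on the ZooKeeper side is
roughly the following; a minimal sketch, assuming ZooKeeper 3.4.x started
via zkServer.sh, which sources conf/java.env if present (the log path is
a placeholder):

  # conf/java.env -- picked up by zkEnv.sh when zkServer.sh starts
  export JVMFLAGS="$JVMFLAGS -verbose:gc -Xloggc:/var/log/zookeeper/zk-gc.log \
    -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"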


Thomas


Am 08.01.15 um 16:33 schrieb Yonik Seeley:

It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called split brain).
The async nature of updates (and thread scheduling), along with
stop-the-world GC pauses that can change leadership, causes these
little windows of inconsistency that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de wrote:

Hi there,

we are running a 3 server cloud serving a dozen
single-shard/replicate-everywhere collections. The 2 biggest collections are
~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat
7.0.56, Oracle Java 1.7.0_72-b14

10 of the 12 collections (the small ones) get filled by DIH full-import once
a day starting at 1am. The second biggest collection is updated using DIH
delta-import every 10 minutes, the biggest one gets bulk json updates with
commits every 5 minutes.

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it
is coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
says we are the leader, but locally we don't think so

One of these pops up once a day at around 8am, sending either some cores
into "recovery failed" state, or all cores of at least one cloud node into
state "gone".
This started out of the blue about 2 weeks ago, without changes to
software, data, or client behaviour.

Most of the time, we get things going again by restarting solr on the
current leader node, forcing a new election - can this be triggered while
keeping solr (and the caches) up?
But sometimes this doesn't help; we had an incident last weekend where our
admins didn't restart in time, creating millions of entries in
/solr/overseer/queue, making zk close the connection, and the leader
re-election fails. I had to flush zk and re-upload the collection config to
get solr up again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500
requests/s) up and running, which does not have these problems since
upgrading to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476




--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: leader split-brain at least once a day - need help

2015-01-12 Thread Mark Miller
bq. ClusterState says we are the leader, but locally we don't think so

Generally this is due to some bug. One bug that can lead to it was recently
fixed in 4.10.3 I think. What version are you on?

- Mark

On Mon Jan 12 2015 at 7:35:47 AM Thomas Lamy t.l...@cytainment.de wrote:

 Hi,

 I found no big/unusual GC pauses in the log (at least manually; I found
 no free solution to analyze them that worked out of the box on a
 headless Debian Wheezy box). Eventually I tried -Xmx8G (was 64G
 before) on one of the nodes, after checking that allocation after 1 hour
 of run time was only about 2-3GB. That didn't move the time frame where a
 restart was needed, so I don't think Solr's JVM GC is the problem.
 We're trying to get all of our nodes' logs (zookeeper and solr) into
 Splunk now, just to get a better sorted view of what's going on in the
 cloud once a problem occurs. We're also enabling GC logging for
 zookeeper; maybe we were missing problems there while focusing on solr
 logs.

 Thomas


 Am 08.01.15 um 16:33 schrieb Yonik Seeley:
  It's worth noting that those messages alone don't necessarily signify
  a problem with the system (and it wouldn't be called split brain).
  The async nature of updates (and thread scheduling), along with
  stop-the-world GC pauses that can change leadership, causes these
  little windows of inconsistency that we detect and log.
 
  -Yonik
  http://heliosearch.org - native code faceting, facet functions,
  sub-facets, off-heap data
 
 
  On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de wrote:

  Hi there,

  we are running a 3 server cloud serving a dozen
  single-shard/replicate-everywhere collections. The 2 biggest collections
  are ~15M docs, and about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZK 3.4.5,
  Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.

  10 of the 12 collections (the small ones) get filled by DIH full-import
  once a day starting at 1am. The second biggest collection is updated using
  DIH delta-import every 10 minutes, the biggest one gets bulk json updates
  with commits every 5 minutes.

  On a regular basis, we have a leader information mismatch:
  org.apache.solr.update.processor.DistributedUpdateProcessor; Request says
  it is coming from leader, but we are the leader
  or the opposite
  org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
  says we are the leader, but locally we don't think so

  One of these pops up once a day at around 8am, sending either some cores
  into "recovery failed" state, or all cores of at least one cloud node into
  state "gone".
  This started out of the blue about 2 weeks ago, without changes to
  software, data, or client behaviour.

  Most of the time, we get things going again by restarting solr on the
  current leader node, forcing a new election - can this be triggered while
  keeping solr (and the caches) up?
  But sometimes this doesn't help; we had an incident last weekend where our
  admins didn't restart in time, creating millions of entries in
  /solr/overseer/queue, making zk close the connection, and the leader
  re-election fails. I had to flush zk and re-upload the collection config
  to get solr up again (just like in
  https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

  We have a much bigger cloud (7 servers, ~50GiB data in 8 collections, 1500
  requests/s) up and running, which has not had these problems since
  upgrading to 4.10.2.
 
 
  Any hints on where to look for a solution?
 
  Kind regards
  Thomas
 
  --
  Thomas Lamy
  Cytainment AG & Co KG
  Nordkanalstrasse 52
  20097 Hamburg
 
  Tel.: +49 (40) 23 706-747
  Fax: +49 (40) 23 706-139
  Sitz und Registergericht Hamburg
  HRA 98121
  HRB 86068
  Ust-ID: DE213009476
 


 --
 Thomas Lamy
 Cytainment AG & Co KG
 Nordkanalstrasse 52
 20097 Hamburg

 Tel.: +49 (40) 23 706-747
 Fax: +49 (40) 23 706-139

 Sitz und Registergericht Hamburg
 HRA 98121
 HRB 86068
 Ust-ID: DE213009476




Re: leader split-brain at least once a day - need help

2015-01-08 Thread Thomas Lamy

Hi Alan,
thanks for the pointer, I'll look at our GC logs.

Am 07.01.2015 um 15:46 schrieb Alan Woodward:

I had a similar issue, which was caused by 
https://issues.apache.org/jira/browse/SOLR-6763.  Are you getting long GC 
pauses or similar before the leader mismatches occur?

Alan Woodward
www.flax.co.uk


On 7 Jan 2015, at 10:01, Thomas Lamy wrote:


Hi there,

we are running a 3 server cloud serving a dozen 
single-shard/replicate-everywhere collections. The 2 biggest collections are 
~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat 
7.0.56, Oracle Java 1.7.0_72-b14

10 of the 12 collections (the small ones) get filled by DIH full-import once a
day starting at 1am. The second biggest collection is updated using DIH
delta-import every 10 minutes, the biggest one gets bulk json updates with
commits every 5 minutes.

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it is 
coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says 
we are the leader, but locally we don't think so

One of these pops up once a day at around 8am, sending either some cores into
"recovery failed" state, or all cores of at least one cloud node into state "gone".
This started out of the blue about 2 weeks ago, without changes to
software, data, or client behaviour.

Most of the time, we get things going again by restarting solr on the current 
leader node, forcing a new election - can this be triggered while keeping solr 
(and the caches) up?
But sometimes this doesn't help; we had an incident last weekend where our
admins didn't restart in time, creating millions of entries in
/solr/overseer/queue, making zk close the connection, and the leader
re-election fails. I had to flush zk and re-upload the collection config to get solr up again
(just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500 
requests/s) up and running, which does not have these problems since upgrading 
to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476






--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: leader split-brain at least once a day - need help

2015-01-08 Thread Yonik Seeley
It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called split brain).
The async nature of updates (and thread scheduling), along with
stop-the-world GC pauses that can change leadership, causes these
little windows of inconsistency that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de wrote:
 Hi there,

 we are running a 3 server cloud serving a dozen
 single-shard/replicate-everywhere collections. The 2 biggest collections are
 ~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat
 7.0.56, Oracle Java 1.7.0_72-b14

 10 of the 12 collections (the small ones) get filled by DIH full-import once
 a day starting at 1am. The second biggest collection is updated using DIH
 delta-import every 10 minutes, the biggest one gets bulk json updates with
 commits every 5 minutes.

 On a regular basis, we have a leader information mismatch:
 org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it
 is coming from leader, but we are the leader
 or the opposite
 org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
 says we are the leader, but locally we don't think so

 One of these pops up once a day at around 8am, sending either some cores
 into "recovery failed" state, or all cores of at least one cloud node into
 state "gone".
 This started out of the blue about 2 weeks ago, without changes to
 software, data, or client behaviour.

 Most of the time, we get things going again by restarting solr on the
 current leader node, forcing a new election - can this be triggered while
 keeping solr (and the caches) up?
 But sometimes this doesn't help; we had an incident last weekend where our
 admins didn't restart in time, creating millions of entries in
 /solr/overseer/queue, making zk close the connection, and the leader
 re-election fails. I had to flush zk and re-upload the collection config to
 get solr up again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

 We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500
 requests/s) up and running, which does not have these problems since
 upgrading to 4.10.2.


 Any hints on where to look for a solution?

 Kind regards
 Thomas

 --
 Thomas Lamy
 Cytainment AG & Co KG
 Nordkanalstrasse 52
 20097 Hamburg

 Tel.: +49 (40) 23 706-747
 Fax: +49 (40) 23 706-139
 Sitz und Registergericht Hamburg
 HRA 98121
 HRB 86068
 Ust-ID: DE213009476



leader split-brain at least once a day - need help

2015-01-07 Thread Thomas Lamy

Hi there,

we are running a 3 server cloud serving a dozen
single-shard/replicate-everywhere collections. The 2 biggest collections
are ~15M docs, and about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZK 3.4.5,
Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.


10 of the 12 collections (the small ones) get filled by DIH full-import
once a day starting at 1am. The second biggest collection is updated
using DIH delta-import every 10 minutes, the biggest one gets bulk json
updates with commits every 5 minutes.


On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request 
says it is coming from leader, but we are the leader

or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; 
ClusterState says we are the leader, but locally we don't think so


One of these pops up once a day at around 8am, sending either some cores
into "recovery failed" state, or all cores of at least one cloud
node into state "gone".
This started out of the blue about 2 weeks ago, without changes to
software, data, or client behaviour.


Most of the time, we get things going again by restarting solr on the
current leader node, forcing a new election - can this be triggered
while keeping solr (and the caches) up?
But sometimes this doesn't help; we had an incident last weekend where
our admins didn't restart in time, creating millions of entries in
/solr/overseer/queue, making zk close the connection, and the leader
re-election fails. I had to flush zk and re-upload the collection config
to get solr up again (just like in
https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
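
A rough sketch of that kind of recovery, run with all Solr nodes stopped
(not our exact commands; the zkcli.sh location assumes a stock Solr 4.x
install, and zk host, /solr chroot, config directory and collection name
are placeholders):

  cd /opt/solr/example/scripts/cloud-scripts
  # wipe the flooded overseer queue
  ./zkcli.sh -zkhost zk1:2181/solr -cmd clear /overseer/queue
  # re-upload the collection's config set
  ./zkcli.sh -zkhost zk1:2181/solr -cmd upconfig \
      -confdir /opt/solr/configs/mycollection/conf -confname mycollection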


We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 
1500 requests/s) up and running, which does not have these problems 
since upgrading to 4.10.2.



Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: leader split-brain at least once a day - need help

2015-01-07 Thread Ugo Matrangolo
Hi Thomas,

I did not get these split brains (probably our use case is simpler), but we
did get the spammed ZK phenomenon.

The easiest way to fix it is to:
1. Shut down all the Solr servers in the failing cluster
2. Connect to zk using its CLI
3. rmr /overseer/queue
4. Restart Solr

I think this is way faster than the gist you posted.
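
For steps 2 and 3, a minimal sketch with ZooKeeper's own CLI; host, port
and the /solr chroot are placeholders and depend on how Solr is pointed
at ZooKeeper:

  # with all Solr instances stopped:
  /opt/zookeeper/bin/zkCli.sh -server zk1:2181
  # inside the CLI prompt, recursively delete the overseer queue, then exit
  rmr /solr/overseer/queue
  quit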

Ugo
On Jan 7, 2015 11:02 AM, Thomas Lamy t.l...@cytainment.de wrote:

 Hi there,

 we are running a 3 server cloud serving a dozen 
 single-shard/replicate-everywhere
 collections. The 2 biggest collections are ~15M docs, and about 13GiB /
 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat 7.0.56, Oracle Java
 1.7.0_72-b14

 10 of the 12 collections (the small ones) get filled by DIH full-import
 once a day starting at 1am. The second biggest collection is updated using
 DIH delta-import every 10 minutes, the biggest one gets bulk json updates
 with commits every 5 minutes.

 On a regular basis, we have a leader information mismatch:
 org.apache.solr.update.processor.DistributedUpdateProcessor; Request says
 it is coming from leader, but we are the leader
 or the opposite
 org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
 says we are the leader, but locally we don't think so

 One of these pops up once a day at around 8am, sending either some cores
 into "recovery failed" state, or all cores of at least one cloud node
 into state "gone".
 This started out of the blue about 2 weeks ago, without changes to
 software, data, or client behaviour.

 Most of the time, we get things going again by restarting solr on the
 current leader node, forcing a new election - can this be triggered while
 keeping solr (and the caches) up?
 But sometimes this doesn't help; we had an incident last weekend where our
 admins didn't restart in time, creating millions of entries in
 /solr/overseer/queue, making zk close the connection, and the leader
 re-election fails. I had to flush zk and re-upload the collection config
 to get solr up again (just like in
 https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

 We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500
 requests/s) up and running, which does not have these problems since
 upgrading to 4.10.2.


 Any hints on where to look for a solution?

 Kind regards
 Thomas

 --
 Thomas Lamy
 Cytainment AG & Co KG
 Nordkanalstrasse 52
 20097 Hamburg

 Tel.: +49 (40) 23 706-747
 Fax: +49 (40) 23 706-139
 Sitz und Registergericht Hamburg
 HRA 98121
 HRB 86068
 Ust-ID: DE213009476




Re: leader split-brain at least once a day - need help

2015-01-07 Thread Alan Woodward
I had a similar issue, which was caused by 
https://issues.apache.org/jira/browse/SOLR-6763.  Are you getting long GC 
pauses or similar before the leader mismatches occur?

Alan Woodward
www.flax.co.uk


On 7 Jan 2015, at 10:01, Thomas Lamy wrote:

 Hi there,
 
 we are running a 3 server cloud serving a dozen 
 single-shard/replicate-everywhere collections. The 2 biggest collections are 
 ~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat 
 7.0.56, Oracle Java 1.7.0_72-b14
 
 10 of the 12 collections (the small ones) get filled by DIH full-import once
 a day starting at 1am. The second biggest collection is updated using DIH
 delta-import every 10 minutes, the biggest one gets bulk json updates with
 commits every 5 minutes.
 
 On a regular basis, we have a leader information mismatch:
 org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it 
 is coming from leader, but we are the leader
 or the opposite
 org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState 
 says we are the leader, but locally we don't think so
 
 One of these pops up once a day at around 8am, sending either some cores
 into "recovery failed" state, or all cores of at least one cloud node into
 state "gone".
 This started out of the blue about 2 weeks ago, without changes to
 software, data, or client behaviour.
 
 Most of the time, we get things going again by restarting solr on the current 
 leader node, forcing a new election - can this be triggered while keeping 
 solr (and the caches) up?
 But sometimes this doesn't help; we had an incident last weekend where our
 admins didn't restart in time, creating millions of entries in
 /solr/overseer/queue, making zk close the connection, and the leader
 re-election fails. I had to flush zk and re-upload the collection config to get solr up
 again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
 
 We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500 
 requests/s) up and running, which does not have these problems since 
 upgrading to 4.10.2.
 
 
 Any hints on where to look for a solution?
 
 Kind regards
 Thomas
 
 -- 
 Thomas Lamy
 Cytainment AG & Co KG
 Nordkanalstrasse 52
 20097 Hamburg
 
 Tel.: +49 (40) 23 706-747
 Fax: +49 (40) 23 706-139
 Sitz und Registergericht Hamburg
 HRA 98121
 HRB 86068
 Ust-ID: DE213009476