Hi,

We have been running Solr 6.5.1 with SolrCloud backed by ZooKeeper in a
multi-client, multi-node cluster for several months now, and we have hit a few
stability/recovery issues we'd like to confirm are fixed in Solr 7. We run 3
large Solr nodes in the cluster (each with 64 GB of heap, hundreds of
collections, and 9000+ cores). We have found that if anything (e.g. a long
garbage-collection pause) puts replicas of the same shard on more than one
node into the down/recovering state at the same time, they
usually fail to complete the recovery process and reach the active state again, 
even though all of the nodes in the cluster are up and running. The same thing 
happens if we shut down multiple nodes and try to start them up at the same 
time. It looks like each replica thinks the other is supposed to be the
leader, so neither ever accepts leadership. Is this a known issue in
SolrCloud, and, to the best of the list's knowledge, has it been fixed in any
Solr 7 release?
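For what it's worth, we have been detecting the leaderless-shard symptom by pulling the cluster state and checking each shard for an active leader. A minimal sketch (assuming the standard CLUSTERSTATUS response shape; the collection and replica names in the test data are made up):

```python
def leaderless_shards(cluster_status):
    """Return (collection, shard) pairs that have no active replica
    flagged as leader.

    Expects the JSON body of /admin/collections?action=CLUSTERSTATUS,
    already parsed into a dict. Note the leader flag is the string
    "true" in the response, not a boolean.
    """
    bad = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            has_leader = any(
                rep.get("leader") == "true" and rep.get("state") == "active"
                for rep in shard.get("replicas", {}).values()
            )
            if not has_leader:
                bad.append((coll_name, shard_name))
    return bad
```

Shards this reports are the ones we end up nursing by hand; we understand the Collections API also has a FORCELEADER action in recent Solr 6 releases, but we have treated that as a last resort.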

A second issue we have faced with the same cluster is that when the above
situation arises, our recovery procedure is to stop all the impacted nodes
(sometimes all 3 nodes in the cluster) and start them one at a time, waiting
for all cores on each node to recover before starting the next one.
During this process we have found two different issues. One issue is that 
recovering the nodes seems to take a long time, sometimes more than one hour 
for all the cores to move into the active state. Another and more pressing 
issue we have run into is that the order in which we start the servers back
up sometimes matters. We sometimes hit cases where we start a node and all
cores recover except a few (fewer than 10), which never leave the down state
even after several hours and several restarts of that node. To work around
this we have to stop the node and start other nodes until we find one that
fully recovers, then return to the originally problematic node and try again;
this has always worked in the end, but only after a lot of pain. Our hope is
to reach the point where we can start all the nodes in the cluster at the
same time and have it “just work”, i.e. converge on a fully active state
100% of the time, instead of managing this one-node-at-a-time process. Our
expectation is that if all the nodes in the cluster are up and healthy, then
all of the cores in the cluster will eventually reach the active state, but
that just does not seem to be the case. Are there any fixes in Solr 7, or
perhaps some configuration we could apply to Solr 6, that resolve either of
these issues?
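In case it helps clarify what our restart procedure looks like, the wait-for-recovery step between node starts could be scripted rather than eyeballed. A rough sketch (the base URL and polling interval are placeholders, and this again assumes the standard CLUSTERSTATUS response shape):

```python
import json
import time
import urllib.request


def non_active_replicas(cluster_status):
    """List (collection, shard, replica, state) for every replica whose
    state is anything other than "active"."""
    out = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for rep_name, rep in shard.get("replicas", {}).items():
                if rep.get("state") != "active":
                    out.append((coll_name, shard_name, rep_name,
                                rep.get("state")))
    return out


def wait_until_recovered(base_url="http://localhost:8983/solr",
                         poll_secs=30):
    """Block until every replica in the cluster reports active, polling
    CLUSTERSTATUS. Run this after starting each node, before starting
    the next one."""
    url = base_url + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    while True:
        with urllib.request.urlopen(url) as resp:
            status = json.load(resp)
        pending = non_active_replicas(status)
        if not pending:
            return
        print(f"{len(pending)} replicas not yet active, e.g. {pending[0]}")
        time.sleep(poll_secs)
```

Even with something like this in place, the underlying question stands: we would much rather the cluster converge on its own than have to sequence node starts at all.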

Best,
Charlie Johnston


This message may contain information that is confidential or privileged. If you 
are not the intended recipient, please advise the sender immediately and delete 
this message. See 
http://www.blackrock.com/corporate/compliance/email-disclaimers for further 
information.  Please refer to 
http://www.blackrock.com/corporate/compliance/privacy-policy for more 
information about BlackRock’s Privacy Policy.
For a list of BlackRock's office addresses worldwide, see 
http://www.blackrock.com/corporate/about-us/contacts-locations.

© 2019 BlackRock, Inc. All rights reserved.
