RETRY: SolrCloud does not recover after ZooKeeper ensemble loses (and then regains) a quorum

Kelly, Frank Sat, 19 Mar 2016 21:23:20 -0700

<This time without images :-) >

I am wondering whether the SolrCloud behavior I observe after ZooKeeper loses 
its quorum is normal or to be expected.


Version of Solr: 5.3.1
Version of ZooKeeper: 3.4.7
Using SolrCloud with external ZooKeeper
Deployed on AWS

Our Solr cluster has 3 nodes

Our ZooKeeper ensemble consists of three nodes with the same config, using DNS 
names, e.g.

$ more ../conf/zoo.cfg
tickTime=2000
dataDir=/var/zookeeper
dataLogDir=/var/log/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
standaloneEnabled=false
server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888

If we terminate one of the ZooKeeper nodes, we get a ZK election and (I think) 
a quorum is maintained.
Operation continues OK; we detect the terminated instance and relaunch a new 
ZK node, which comes up fine.
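
As a sanity check on the "I think" above: ZooKeeper needs a strict majority of the ensemble to be up, so a three-node ensemble should survive the loss of one node but not two. Here is a minimal Java sketch of that arithmetic. The class and method names are hypothetical helpers of mine, not part of ZooKeeper or Solr.

```java
// Sketch of majority-quorum arithmetic for a ZooKeeper-style ensemble.
// QuorumMath is a hypothetical helper, not ZooKeeper or Solr code.
final class QuorumMath {

    /** Smallest number of servers that forms a strict majority of the ensemble. */
    static int majority(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensembleSize must be >= 1");
        }
        return ensembleSize / 2 + 1;
    }

    /** How many servers can be down while a majority is still up. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    /** True if the number of live servers is at least a majority. */
    static boolean hasQuorum(int ensembleSize, int liveServers) {
        return liveServers >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        int n = 3; // server.1 .. server.3 in zoo.cfg
        System.out.println("majority needed: " + majority(n));                   // 2
        System.out.println("one node down, quorum: " + hasQuorum(n, n - 1));     // true
        System.out.println("two nodes down, quorum: " + hasQuorum(n, n - 2));    // false
    }
}
```

This matches what we see: one terminated node leaves 2 of 3 (quorum kept), two terminated nodes leave 1 of 3 (quorum lost).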

If we terminate two of the ZK nodes, we lose the quorum and then observe the 
following:

1.1) Admin UI shows an error that it is unable to contact ZooKeeper: "Could not 
connect to ZooKeeper"

1.2) SolrJ returns the following:

org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
... 24 more

This makes sense based on our understanding.
When our AutoScale groups launch two new ZooKeeper nodes, initialize them, fix 
the DNS, etc., we regain a quorum, but at this point:

2.1) Admin UI shows the shards as “GONE” (all greyed out)

2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now 
bound to new IP addresses
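
To confirm from inside a client JVM that the names really do point at new addresses, something like the following could be run before and after the ZooKeeper nodes are replaced. This is only a diagnostic sketch: `ZkDnsCheck` and its methods are hypothetical names, it uses nothing but the JDK, and it does not show what the ZooKeeper client inside Solr or SolrJ has resolved. Note also that the JVM caches successful lookups (see the `networkaddress.cache.ttl` security property), so a long-running process may keep reporting old addresses for a while.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical diagnostic helper; not part of Solr, SolrJ, or ZooKeeper.
final class ZkDnsCheck {

    /** IP addresses this JVM currently resolves for a host (empty if unresolvable). */
    static Set<String> resolve(String host) {
        Set<String> ips = new TreeSet<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                ips.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // Unresolvable host: report it as an empty address set.
        }
        return ips;
    }

    /** Resolves every host and records the result, keeping the input order. */
    static Map<String, Set<String>> snapshot(List<String> hosts) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String host : hosts) {
            result.put(host, resolve(host));
        }
        return result;
    }

    /** Hosts whose address set differs between two snapshots. */
    static Set<String> changedHosts(Map<String, Set<String>> before,
                                    Map<String, Set<String>> after) {
        Set<String> allHosts = new TreeSet<>(before.keySet());
        allHosts.addAll(after.keySet());
        Set<String> changed = new TreeSet<>();
        for (String host : allHosts) {
            Set<String> oldIps = before.getOrDefault(host, Collections.emptySet());
            Set<String> newIps = after.getOrDefault(host, Collections.emptySet());
            if (!oldIps.equals(newIps)) {
                changed.add(host);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Host names taken from the server.N lines of zoo.cfg.
        List<String> hosts = Arrays.asList(
                "zookeeper1.qa.eu-west-1.mysearch.com",
                "zookeeper2.qa.eu-west-1.mysearch.com",
                "zookeeper3.qa.eu-west-1.mysearch.com");
        System.out.println(snapshot(hosts));
    }
}
```

Comparing a snapshot saved before the outage with one taken afterwards via `changedHosts` would show which names moved.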

So at this point I restart the Solr nodes. After the restart:

3.1) Admin UI shows the collections as OK (all shards are green) – yeah the 
nodes are back!

3.2) SolrJ Client still shows the same error – namely

org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
.
.
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)

I have a few questions:
1) Is this behavior (lack of self-healing) a known behavior?
2) Is this the same as, or similar to, the behavior documented here: 
https://issues.apache.org/jira/browse/SOLR-5129 ?
3) If it is not covered by #2, should I log it in JIRA?

Thanks and Best Wishes,

-Frank

P.S. I can add Solr log files if they will help.


Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)

HERE
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W
