<This time without images :-)>

Just wondering whether the SolrCloud behavior I observed after ZooKeeper loses its quorum is normal or to be expected.
Version of Solr: 5.3.1
Version of ZooKeeper: 3.4.7
Using SolrCloud with external ZooKeeper
Deployed on AWS

Our Solr cluster has 3 nodes.

Our ZooKeeper ensemble consists of three nodes with the same config, using DNS names, e.g.

    $ more ../conf/zoo.cfg
    tickTime=2000
    dataDir=/var/zookeeper
    dataLogDir=/var/log/zookeeper
    clientPort=2181
    initLimit=10
    syncLimit=5
    standaloneEnabled=false
    server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
    server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
    server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888

If we terminate one of the ZooKeeper nodes we get a ZK election and (I think) a quorum is maintained.
Operation continues OK, and we detect the terminated instance and relaunch a new ZK node, which comes up fine.

If we terminate two of the ZK nodes we lose a quorum, and then we observe the following:

1.1) Admin UI shows an error that it is unable to contact ZooKeeper: "Could not connect to ZooKeeper"
1.2) SolrJ returns the following:

    org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
        at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
        at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
        at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
    Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
        at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
        ... 24 more

This makes sense based on our understanding.

When our AutoScale groups launch two new ZooKeeper nodes, initialize them, fix the DNS etc., we regain a quorum, but at this point:

2.1) Admin UI shows the shards as "GONE" (all greyed out)
2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now bound to new IP addresses

So at this point I restart the Solr nodes. Then:

3.1) Admin UI shows the collections as OK (all shards are green) – yeah the nodes are back!
3.2) SolrJ client still shows the same error, namely:

    org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
        at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
        at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
        at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
        at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
    . .
    Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
        at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)

I have a few questions:

1) Is this behavior (lack of self-healing) a known behavior?
2) Is this the same or similar behavior as documented here: https://issues.apache.org/jira/browse/SOLR-5129
3) If it is not covered by #2, should I log it in JIRA?

Thanks and Best Wishes,
-Frank

p.s. I can add Solr log files if they will help

Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)
HERE
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W
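
p.p.s. For reference, the quorum arithmetic behind the one-node and two-node termination scenarios above can be sketched as follows. This is only a minimal Python sketch, assuming ZooKeeper's default majority quorum (more than half of the configured servers must be up); the function names quorum_size and has_quorum are hypothetical helpers for illustration, not part of ZooKeeper or Solr.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, servers_up: int) -> bool:
    """True if the running servers form a majority of the configured ensemble."""
    if not 0 <= servers_up <= ensemble_size:
        raise ValueError("servers_up must be between 0 and ensemble_size")
    return servers_up >= quorum_size(ensemble_size)


if __name__ == "__main__":
    ensemble = 3  # matches the three server.N entries in zoo.cfg above
    for terminated in range(ensemble + 1):
        up = ensemble - terminated
        print(f"{terminated} terminated, {up} up: quorum={has_quorum(ensemble, up)}")
```

With three servers, one termination leaves 2 of 3 up, which is still a majority; two terminations leave 1 of 3, which is not, matching what we observed.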