[ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704663#comment-14704663 ]
Adrian Fitzpatrick commented on SOLR-7021:
------------------------------------------

We have also seen this issue on Solr 4.10.3, on a 3-node cluster. The issue affected only one of three collections, and each of the three collections is configured with 5 shards and 3 replicas. In the affected collection, the leader of each of the 5 shards was on the same node (hadoopnode02) and was showing as down for all shards. The other replicas of each shard reported that they were waiting for the leader, e.g. "I was asked to wait on state recovering for shard3 in the_collection_20150818161800 on hadoopnode01:8983_solr but I still do not see the requested state. I see state: recovering live:true leader from ZK: http://hadoopnode02:8983/solr/the_collection_20150818161800_shard3_replica2".

Something like the work-around suggested by Andrey worked for us: we shut down the whole cluster, then brought back up all nodes except the one that was reporting leader errors (hadoopnode02). This seemed to trigger a leader election, but without a quorum. We then brought up hadoopnode02; the election completed successfully and the cluster state returned to normal. (A SolrJ sketch for checking per-shard leader state during this procedure is included after the quoted issue below.)

> Leader will not publish core as active without recovering first, but never
> recovers
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7021
>                 URL: https://issues.apache.org/jira/browse/SOLR-7021
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>            Priority: Critical
>              Labels: recovery, solrcloud, zookeeper
>
> A little background: 1 core solr-cloud cluster across 3 nodes, each with its
> own shard and each shard with a single replica, hence each replica is itself a
> leader.
> For reasons we won't get into, we witnessed a shard go down in our cluster.
> We restarted the cluster but our core/shards still did not come back up.
> After inspecting the logs, we found this:
> {code}
> 015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO cloud.ZkController - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is http://xxx.xxx.xxx.35:8081/solr/xyzcore/
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - I am the leader, no recovery necessary
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - publishing core=xyzcore state=active collection=xyzcore
> 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO cloud.ZkController - publishing core=xyzcore state=down collection=xyzcore
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer - :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' as active without recovering first!
> at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
> {code}
> And at this point the necessary shards never recover correctly and hence our
> core never returns to a functional state.
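A minimal SolrJ (4.x) sketch, not part of the original report, for checking the per-shard leader state described in the comment above before and after each restart step. The ZooKeeper ensemble address is an assumption based on the node names mentioned in the comment; the collection name is taken from the quoted log message. Adjust both for your cluster.

{code}
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class LeaderStateCheck {
  public static void main(String[] args) throws Exception {
    // Assumed ZooKeeper ensemble; replace with the zkHost your Solr nodes use.
    String zkHost = "hadoopnode01:2181,hadoopnode02:2181,hadoopnode03:2181";
    // Collection name taken from the log message quoted in the comment.
    String collection = "the_collection_20150818161800";

    CloudSolrServer client = new CloudSolrServer(zkHost);
    client.connect();
    try {
      // Read the cluster state that Solr publishes to ZooKeeper.
      ClusterState clusterState = client.getZkStateReader().getClusterState();
      DocCollection coll = clusterState.getCollection(collection);
      for (Slice slice : coll.getSlices()) {
        Replica leader = slice.getLeader();
        // Print which node holds the leader for each shard and its published state.
        System.out.printf("%s leader=%s state=%s%n",
            slice.getName(),
            leader == null ? "none" : leader.getNodeName(),
            leader == null ? "-" : leader.getStr(ZkStateReader.STATE_PROP));
      }
    } finally {
      client.shutdown();
    }
  }
}
{code}

Running this between the restart steps shows whether every shard has an elected leader published as "active" before the problematic node (hadoopnode02 in the comment) is brought back.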