[ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393589#comment-14393589 ]
Nishanth Shajahan commented on SOLR-7021:
-----------------------------------------

We had to use the same workaround mentioned by Shalin. Thanks for that. This was in version 4.10.3 though, where we took down individual Solr nodes one by one for failover testing and encountered the same bug. Shard 4 did not come up at all. The setup is that each shard has a leader and a follower. The leader was in the recovery failed state and the follower was in the recovering state.

2015-04-02 21:55:05,932 [coreZkRegister-1-thread-2] INFO org.apache.solr.cloud.ShardLeaderElectionContext- I am the new leader: http://xxxx:8082/solr/coll1_replica1/ shard4
2015-04-02 21:55:05,933 [coreZkRegister-1-thread-2] INFO org.apache.solr.common.cloud.SolrZkClient- makePath: /collections/coll1/leaders/shard4
2015-04-02 21:55:06,089 [zkCallback-2-thread-1] INFO org.apache.solr.common.cloud.ZkStateReader- A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 16)
2015-04-02 21:55:06,116 [coreZkRegister-1-thread-2] INFO org.apache.solr.cloud.ZkController- We are http://xxx:8082/solr/coll1_replica1/ and leader is http://xxx:8082/solr/coll1_replica1/
2015-04-02 21:55:06,117 [coreZkRegister-1-thread-2] INFO org.apache.solr.cloud.ZkController- No LogReplay needed for core=coll1_replica1 baseURL=http://xx:8082/solr
2015-04-02 21:55:06,117 [coreZkRegister-1-thread-2] INFO org.apache.solr.cloud.ZkController- I am the leader, no recovery necessary
2015-04-02 21:55:06,117 [coreZkRegister-1-thread-2] INFO org.apache.solr.cloud.ZkController- publishing core=coll1_replica1 state=active collection=coll1
2015-04-02 21:55:06,121 [ExecutorThreadPool_SOLR_9] INFO org.apache.solr.handler.admin.CoreAdminHandler- Going to wait for coreNodeName: core_node5, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
2015-04-02 21:55:06,123 [coreZkRegister-1-thread-2] INFO org.apache.solr.cloud.ZkController- publishing core=coll1_replica1 state=down collection=coll1
2015-04-02 21:55:06,132 [coreZkRegister-1-thread-2] ERROR org.apache.solr.core.ZkContainer- :org.apache.solr.common.SolrException: Cannot publish state of core 'coll1_replica1' as active without recovering first!
	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1082)
	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1045)
	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1041)
	at org.apache.solr.cloud.ZkController.register(ZkController.java:856)
	at org.apache.solr.cloud.ZkController.register(ZkController.java:770)
	at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:221)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

> Leader will not publish core as active without recovering first, but never recovers
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7021
>                 URL: https://issues.apache.org/jira/browse/SOLR-7021
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>            Priority: Critical
>              Labels: recovery, solrcloud, zookeeper
>
> A little background: 1 core solr-cloud cluster across 3 nodes, each with its
> own shard and each shard with a single replica hence each replica is itself a
> leader.
> For reasons we won't get into, we witnessed a shard go down in our cluster.
> We restarted the cluster but our core/shards still did not come back up.
> After inspecting the logs, we found this:
> {code}
> 2015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO cloud.ZkController - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is http://xxx.xxx.xxx.35:8081/solr/xyzcore/
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - I am the leader, no recovery necessary
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO cloud.ZkController - publishing core=xyzcore state=active collection=xyzcore
> 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO cloud.ZkController - publishing core=xyzcore state=down collection=xyzcore
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO cloud.ZkController - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer - :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' as active without recovering first!
> 	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
> {code}
> And at this point the necessary shards never recover correctly and hence our
> core never returns to a functional state.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org