[jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers

Nishanth Shajahan (JIRA) Thu, 02 Apr 2015 15:23:43 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393589#comment-14393589
 ]


Nishanth Shajahan commented on SOLR-7021:
-----------------------------------------

We had to use the same workaround mentioned by Shalin. Thanks for that. This was 
in version 4.10.3, though: we took down individual Solr nodes one by one for 
failover testing and encountered the same bug. Shard 4 did not come up at all. 
The setup is that each shard has a leader and a follower. The leader was in the 
recovery failed state and the follower was in the recovering state.


2015-04-02 21:55:05,932 [coreZkRegister-1-thread-2] INFO  
org.apache.solr.cloud.ShardLeaderElectionContext- I am the new leader: 
http://xxxx:8082/solr/coll1_replica1/ shard4
2015-04-02 21:55:05,933 [coreZkRegister-1-thread-2] INFO  
org.apache.solr.common.cloud.SolrZkClient- makePath: 
/collections/coll1/leaders/shard4
2015-04-02 21:55:06,089 [zkCallback-2-thread-1] INFO  
org.apache.solr.common.cloud.ZkStateReader- A cluster state change: 
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, 
has occurred - updating... (live nodes size: 16)
2015-04-02 21:55:06,116 [coreZkRegister-1-thread-2] INFO  
org.apache.solr.cloud.ZkController- We are http://xxx:8082/solr/coll1_replica1/ 
and leader is http://xxx:8082/solr/coll1_replica1/
2015-04-02 21:55:06,117 [coreZkRegister-1-thread-2] INFO  
org.apache.solr.cloud.ZkController- No LogReplay needed for core=coll1_replica1 
baseURL=http://xx:8082/solr
2015-04-02 21:55:06,117 [coreZkRegister-1-thread-2] INFO  
org.apache.solr.cloud.ZkController- I am the leader, no recovery necessary
2015-04-02 21:55:06,117 [coreZkRegister-1-thread-2] INFO  
org.apache.solr.cloud.ZkController- publishing core=coll1_replica1 state=active 
collection=coll1
2015-04-02 21:55:06,121 [ExecutorThreadPool_SOLR_9] INFO  
org.apache.solr.handler.admin.CoreAdminHandler- Going to wait for coreNodeName: 
core_node5, state: recovering, checkLive: true, onlyIfLeader: true, 
onlyIfLeaderActive: true
2015-04-02 21:55:06,123 [coreZkRegister-1-thread-2] INFO  
org.apache.solr.cloud.ZkController- publishing core=coll1_replica1 state=down 
collection=coll1
2015-04-02 21:55:06,132 [coreZkRegister-1-thread-2] ERROR 
org.apache.solr.core.ZkContainer- :org.apache.solr.common.SolrException: Cannot 
publish state of core 'coll1_replica1' as active without recovering first!
        at org.apache.solr.cloud.ZkController.publish(ZkController.java:1082)
        at org.apache.solr.cloud.ZkController.publish(ZkController.java:1045)
        at org.apache.solr.cloud.ZkController.publish(ZkController.java:1041)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:856)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:770)
        at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:221)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
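
For anyone checking whether other nodes hit the same sequence, here is a rough 
sketch of a log scan. It looks for a core that publishes state=active, then 
state=down, and then logs the "without recovering first" error, which is the 
order seen in the log above. Assumptions: each log event sits on a single line 
(not wrapped as in this mail), and the name find_stuck_leaders is hypothetical 
and not part of Solr.

```python
import re

# "publishing core=<name> state=<state>" lines, as logged by ZkController above.
PUBLISH = re.compile(r"publishing core=(?P<core>\S+) state=(?P<state>\w+)")
# The SolrException message from the log above.
ERROR = re.compile(
    r"Cannot publish state of core '(?P<core>[^']+)' "
    r"as active without recovering first"
)


def find_stuck_leaders(lines):
    """Return cores that published active, then down, then hit the error.

    Hypothetical helper: assumes one log event per line.
    """
    published = {}  # core name -> list of published states, in log order
    stuck = []
    for line in lines:
        match = PUBLISH.search(line)
        if match:
            published.setdefault(match.group("core"), []).append(
                match.group("state")
            )
            continue
        match = ERROR.search(line)
        if match:
            core = match.group("core")
            # Only flag the active -> down -> error order seen in this issue.
            last_two = published.get(core, [])[-2:]
            if last_two == ["active", "down"] and core not in stuck:
                stuck.append(core)
    return stuck
```

It can be fed an open log file directly, for example 
find_stuck_leaders(open(path)) with path pointing at the Solr log. It only 
reports the signature; it does not say why the core was published as down.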

> Leader will not publish core as active without recovering first, but never 
> recovers
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7021
>                 URL: https://issues.apache.org/jira/browse/SOLR-7021
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>            Priority: Critical
>              Labels: recovery, solrcloud, zookeeper
>
> A little background: a 1-core SolrCloud cluster across 3 nodes, each with its 
> own shard, and each shard with a single replica; hence each replica is itself 
> a leader. 
> For reasons we won't get into, we witnessed a shard go down in our cluster. 
> We restarted the cluster but our core/shards still did not come back up. 
> After inspecting the logs, we found this:
> {code}
> 2015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is 
> http://xxx.xxx.xxx.35:8081/solr/xyzcore/
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - I am the leader, no recovery necessary
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - publishing core=xyzcore state=active collection=xyzcore
> 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - publishing core=xyzcore state=down collection=xyzcore
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  
> - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - 
> :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' 
> as active without recovering first!
>       at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
> {code}
> And at this point the necessary shards never recover correctly and hence our 
> core never returns to a functional state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
