Re: 6.4.0 collection leader election and recovery issues

Ravi Solr Thu, 02 Feb 2017 01:54:07 -0800

Following up on my previous email, the intermittent server unavailability
seems to be linked to the interaction between Solr and ZooKeeper. Can
somebody help me understand what this error means and how to recover from
it?


2017-02-02 09:44:24.648 ERROR
(recoveryExecutor-3-thread-16-processing-n:xx.xxx.xxx.xxx:1234_solr
x:clicktrack_shard1_replica4 s:shard1 c:clicktrack r:core_node3)
[c:clicktrack s:shard1 r:core_node3 x:clicktrack_shard1_replica4]
o.a.s.c.RecoveryStrategy Error while trying to recover.
core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer/queue/qn-
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
    at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
    at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:244)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1215)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1128)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1124)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:334)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
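
One rough way to check whether the outages line up with ZooKeeper session
expirations would be to count SessionExpiredException occurrences per minute
across the Solr logs. The sketch below is only an illustration: the function
name, the regex, and the shortened sample lines are my own assumptions, and
it expects each log entry to start with a timestamp like the ones above
(optionally prefixed by a grep-style file name).

```python
import re
from collections import Counter

# Leading timestamp of a log entry, truncated to the minute. An optional
# "filename:" prefix (as produced by grep) is skipped. Assumed format only.
TS = re.compile(r"^(?:\S+:)?(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def count_by_minute(lines, marker="SessionExpiredException"):
    """Count lines containing `marker`, keyed by the minute of the most
    recent timestamped line seen before (or on) the matching line."""
    counts = Counter()
    last_minute = None
    for line in lines:
        m = TS.match(line)
        if m:
            last_minute = m.group(1)
        if marker in line and last_minute is not None:
            counts[last_minute] += 1
    return counts


# Shortened sample lines modelled on the log excerpts in this thread.
sample = [
    "2017-02-02 09:44:24.648 ERROR o.a.s.c.RecoveryStrategy Error while trying to recover.",
    "core=clicktrack_shard1_replica4:org.apache.zookeeper.KeeperException$SessionExpiredException:",
    "KeeperErrorCode = Session expired for /overseer/queue/qn-",
    "solr.log.9:2017-02-02 01:43:41.336 INFO o.a.s.c.c.ZkStateReader A cluster state change",
]
result = count_by_minute(sample)
for minute, n in sorted(result.items()):
    print(minute, n)
```

Minutes with a burst of expirations on several nodes at once would point at
the ZooKeeper connection (timeouts, long GC pauses, network) rather than at
the index itself.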

Thanks

Ravi Kiran Bhaskar

On Thu, Feb 2, 2017 at 2:27 AM, Ravi Solr <ravis...@gmail.com> wrote:

> Hello,
>          Yesterday I upgraded from 6.0.1 to 6.4.0, and it has been a
> straight 12-hour debugging spree! Can somebody kindly help me out of this
> misery?
>
> I have a set of 8 single-shard collections with 3 replicas. As soon as I
> updated the configs and started the servers, one of my collections got
> stuck with no leader. I restarted Solr to no avail, and I also tried to
> force a leader via the Collections API, but that didn't work either. I also
> see that, from time to time, multiple Solr nodes go down all at the same
> time, and only a restart resolves the issue.
>
> The error snippets are shown below:
>
> 2017-02-02 01:43:42.785 ERROR (recoveryExecutor-3-thread-6-processing-n:
> 10.128.159.245:9001_solr x:clicktrack_shard1_replica1 s:shard1
> c:clicktrack r:core_node1) [c:clicktrack s:shard1 r:core_node1
> x:clicktrack_shard1_replica1] o.a.s.c.RecoveryStrategy Error while trying
> to recover. 
> core=clicktrack_shard1_replica1:org.apache.solr.common.SolrException:
> No registered leader was found after waiting for 4000ms , collection:
> clicktrack slice: shard1
>
> solr.log.9:2017-02-02 01:43:41.336 INFO  (zkCallback-4-thread-29-
> processing-n:10.128.159.245:9001_solr) [   ] o.a.s.c.c.ZkStateReader A
> cluster state change: [WatchedEvent state:SyncConnected
> type:NodeDataChanged path:/collections/clicktrack/state.json] for
> collection [clicktrack] has occurred - updating... (live nodes size: [1])
> solr.log.9:2017-02-02 01:43:42.224 INFO  (zkCallback-4-thread-29-
> processing-n:10.128.159.245:9001_solr) [   ] o.a.s.c.c.ZkStateReader A
> cluster state change: [WatchedEvent state:SyncConnected
> type:NodeDataChanged path:/collections/clicktrack/state.json] for
> collection [clicktrack] has occurred - updating... (live nodes size: [1])
> solr.log.9:2017-02-02 01:43:43.767 INFO  (zkCallback-4-thread-23-
> processing-n:10.128.159.245:9001_solr) [   ] o.a.s.c.c.ZkStateReader A
> cluster state change: [WatchedEvent state:SyncConnected
> type:NodeDataChanged path:/collections/clicktrack/state.json] for
> collection [clicktrack] has occurred - updating... (live nodes size: [1])
>
>
> Suspecting the worst, I backed up the index, renamed the collection's
> data folder, and restarted the servers; this time the collection got a
> proper leader. So is my index really corrupted? The Solr UI showed live
> nodes just like the logs, but without any leader. Even with the leader issue
> somewhat alleviated after renaming the data folder and letting Solr create
> a new one, my servers did go down a couple of times.
>
> I am not all that well versed in ZooKeeper... is there any trick to make
> ZooKeeper pick a leader and be happy? Did anybody else have Solr/ZooKeeper
> issues with 6.4.0?
>
> Thanks
>
> Ravi Kiran Bhaskar
>
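
For reference, the Collections API calls mentioned in the quoted message
(checking cluster state and forcing a leader) can be sketched as below. This
only builds the request URLs and does not send anything. The helper name is
hypothetical, and the base URL is an assumption taken from the node name
10.128.159.245:9001_solr in the logs; CLUSTERSTATUS and FORCELEADER are the
documented Collections API actions.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Solr Collections API URL (hypothetical helper; no request is sent)."""
    query = urlencode({"action": action, **params, "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


# Assumed base URL, derived from the node name seen in the logs.
base = "http://10.128.159.245:9001/solr"

# Inspect which replicas are live and whether shard1 has a leader.
status_url = collections_api_url(base, "CLUSTERSTATUS", collection="clicktrack")

# Ask Solr to force a leader election for the leaderless shard.
force_url = collections_api_url(
    base, "FORCELEADER", collection="clicktrack", shard="shard1"
)

print(status_url)
print(force_url)
```

Comparing the CLUSTERSTATUS output before and after a FORCELEADER attempt
should at least show whether the request changed any replica's state.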
