[ https://issues.apache.org/jira/browse/SOLR-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200061#comment-16200061 ]

Shalin Shekhar Mangar commented on SOLR-11445:
----------------------------------------------

I think it is better to check explicitly for NoNode or NodeExists exceptions 
in the isBadMessageOrInvalidState() method; most other KeeperExceptions 
shouldn't cause us to poll items off the queue. The same kind of handling 
should also be applied to exceptions thrown while processing messages from 
the state update queue.
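
A minimal sketch of the explicit check being suggested. The method name 
isBadMessageOrInvalidState() comes from the discussion; its exact signature 
here is an assumption for illustration, not the attached patch:

    import org.apache.zookeeper.KeeperException;

    // Treat only NoNode and NodeExists as "bad message / invalid state":
    // the message itself can never succeed, so it is safe to poll it off
    // the queue. Other KeeperExceptions (connection loss, session expiry,
    // ...) are transient and must not cause the item to be removed.
    private boolean isBadMessageOrInvalidState(Exception e) {
      if (!(e instanceof KeeperException)) return false;
      KeeperException.Code code = ((KeeperException) e).code();
      return code == KeeperException.Code.NONODE
          || code == KeeperException.Code.NODEEXISTS;
    }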

> Overseer.processQueueItem().... zkStateWriter.enqueueUpdate might ideally have a try{}catch{} around it
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11445
>                 URL: https://issues.apache.org/jira/browse/SOLR-11445
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: 6.6.1, 7.0, master (8.0)
>            Reporter: Greg Harris
>         Attachments: SOLR-11445.patch
>
>
> So we had the following stack trace with a customer:
> 2017-10-04 11:25:30.339 ERROR (xxxx) [ ] o.a.s.c.Overseer Exception in Overseer main queue loop
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/xxxx/state.json
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>     at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:235)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:199)
>     at java.lang.Thread.run(Thread.java:748)
> I want to highlight:
>     at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
> This ends up coming from Overseer:
>     while (data != null) {
>       final ZkNodeProps message = ZkNodeProps.load(data);
>       log.debug("processMessage: workQueueSize: {}, message = {}",
>           workQueue.getStats().getQueueLength(), message);
>       // force flush to ZK after each message because there is no fallback
>       // if workQueue items are removed from workQueue but fail to be written to ZK
>       clusterState = processQueueItem(message, clusterState, zkStateWriter, false, null);
>       workQueue.poll(); // poll-ing removes the element we got by peek-ing
>       data = workQueue.peek();
>     }
> Note: processQueueItem() is called before poll(), so when it throws, the 
> message that failed to process stays at the head of the queue and fails 
> again on the next iteration. This made a large cluster unable to come up 
> on its own without deleting the problem node.
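
For concreteness, a rough sketch of the try/catch the issue summary asks 
for, reusing the hypothetical isBadMessageOrInvalidState() helper sketched 
above (illustrative only, not the attached patch):

    while (data != null) {
      final ZkNodeProps message = ZkNodeProps.load(data);
      try {
        clusterState = processQueueItem(message, clusterState, zkStateWriter, false, null);
      } catch (Exception e) {
        if (isBadMessageOrInvalidState(e)) {
          // The message can never succeed (e.g. the collection's state.json
          // node is gone), so drop it rather than leave it at the head of
          // the queue, where it would wedge the loop again after a restart.
          workQueue.poll();
        }
        throw e;
      }
      workQueue.poll(); // poll-ing removes the element we got by peek-ing
      data = workQueue.peek();
    }

Rethrowing after polling keeps the existing error logging in the outer 
Overseer loop while still letting the queue make progress on the next pass.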


