[ 
https://issues.apache.org/jira/browse/SOLR-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201324#comment-16201324
 ] 

Cao Manh Dat commented on SOLR-11445:
-------------------------------------

bq. I think it is better that we explicitly check for NoNode or NodeExists 
exceptions in the isBadMessageOrInvalidState() method.
Yeah, that's a good idea.

bq. Also, the same kind of handling should be done for exceptions thrown when 
processing messages from state update queue.
We can't do this. We process the state update queue in batches, so we don't 
know which message in a batch is the bad one. So we must fall back on the 
work queue and re-read the cluster state from ZK.
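
To make the batching problem concrete, here is a toy model in plain Java. It is 
only a sketch, not the Overseer code: BatchFallbackSketch, writeBatch and the 
isBad predicate are hypothetical names, and the "write" is simulated. A batch 
write fails as a whole without saying which message caused it, so the fallback 
replays the messages one at a time, the way the work queue loop flushes after 
each message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class BatchFallbackSketch {
    /**
     * Simulated batch write (hypothetical): fails as a whole if any message
     * in the batch is bad, without reporting which one it was.
     */
    static boolean writeBatch(List<String> batch, Predicate<String> isBad) {
        return batch.stream().noneMatch(isBad);
    }

    /** Returns the messages that ended up applied. */
    static List<String> process(List<String> batch, Predicate<String> isBad) {
        if (writeBatch(batch, isBad)) {
            // Fast path: a single write covers the whole batch.
            return new ArrayList<>(batch);
        }
        // Fallback: the batch failed and the culprit is unknown. A real
        // implementation would first re-read the cluster state from ZK;
        // here we only replay message by message so that a failure
        // identifies exactly one message.
        List<String> applied = new ArrayList<>();
        for (String message : batch) {
            if (writeBatch(List.of(message), isBad)) {
                applied.add(message);
            }
        }
        return applied;
    }
}
```

With the batch [a, bad, c] the first write fails, and the replay applies a and 
c while leaving out the bad message.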

> Overseer.processQueueItem()....  zkStateWriter.enqueueUpdate might ideally 
> have a try{}catch{} around it
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11445
>                 URL: https://issues.apache.org/jira/browse/SOLR-11445
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 6.6.1, 7.0, master (8.0)
>            Reporter: Greg Harris
>         Attachments: SOLR-11445.patch
>
>
> So we had the following stack trace with a customer:
> 2017-10-04 11:25:30.339 ERROR (xxxx) [ ] o.a.s.c.Overseer Exception in Overseer main queue loop
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/xxxx/state.json
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>     at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>     at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:235)
>     at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:199)
>     at java.lang.Thread.run(Thread.java:748)
> I want to highlight: 
>     at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
> This ends up coming from Overseer:
> while (data != null)  {
>   final ZkNodeProps message = ZkNodeProps.load(data);
>   log.debug("processMessage: workQueueSize: {}, message = {}", workQueue.getStats().getQueueLength(), message);
>   // force flush to ZK after each message because there is no fallback if workQueue items
>   // are removed from workQueue but fail to be written to ZK
>   *clusterState = processQueueItem(message, clusterState, zkStateWriter, false, null);
>   workQueue.poll(); // poll-ing removes the element we got by peek-ing*
>   data = workQueue.peek();
> }
> Note: processQueueItem comes before the poll, so when it throws, the message 
> that cannot be processed is never removed and stays stuck at the head of the 
> work queue. This left a large cluster unable to come up on its own without 
> deleting the problem node. 
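
The stuck-message behaviour of a peek / process / poll loop, and the try/catch 
the issue title asks for, can be sketched in plain Java. This is a minimal 
illustration under stated assumptions, not the attached patch: 
PeekProcessPollSketch, Processor and drain are hypothetical names, and an 
in-memory Deque stands in for the ZK-backed work queue. It also catches every 
exception for brevity, whereas the thread suggests treating only specific 
cases such as NoNode or NodeExists as bad messages.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class PeekProcessPollSketch {
    /** Hypothetical stand-in for processQueueItem. */
    interface Processor {
        void process(String message) throws Exception;
    }

    /**
     * Drains the queue in peek / process / poll order and returns the
     * messages that were skipped because processing them failed.
     */
    static List<String> drain(Deque<String> workQueue, Processor processor) {
        List<String> skipped = new ArrayList<>();
        String data = workQueue.peek();
        while (data != null) {
            try {
                processor.process(data);
            } catch (Exception e) {
                // Without this catch the exception escapes before poll(),
                // so the same head message is peeked again on the next
                // pass and the queue never advances.
                skipped.add(data);
            }
            workQueue.poll(); // removes the element we got by peeking
            data = workQueue.peek();
        }
        return skipped;
    }
}
```

Draining a queue of [a, bad, c] with a processor that throws on bad empties 
the queue and reports bad as skipped, instead of looping on it forever.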



