[ https://issues.apache.org/jira/browse/SOLR-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200061#comment-16200061 ]
Shalin Shekhar Mangar commented on SOLR-11445:
----------------------------------------------

I think it is better that we explicitly check for NoNode or NodeExists exceptions in the isBadMessageOrInvalidState() method. Most other KeeperExceptions shouldn't cause us to poll items off the queue.

Also, the same kind of handling should be done for exceptions thrown when processing messages from the state update queue.

> Overseer.processQueueItem().... zkStateWriter.enqueueUpdate might ideally have a try{}catch{} around it
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11445
>                 URL: https://issues.apache.org/jira/browse/SOLR-11445
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>   Affects Versions: 6.6.1, 7.0, master (8.0)
>            Reporter: Greg Harris
>         Attachments: SOLR-11445.patch
>
>
> So we had the following stack trace with a customer:
> 2017-10-04 11:25:30.339 ERROR (xxxx) [ ] o.a.s.c.Overseer Exception in Overseer main queue loop
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/xxxx/state.json
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>         at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:391)
>         at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:388)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>         at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:388)
>         at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:235)
>         at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:199)
>         at java.lang.Thread.run(Thread.java:748)
> I want to highlight:
>         at org.apache.solr.cloud.overseer.ZkStateWriter.enqueueUpdate(ZkStateWriter.java:152)
>         at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:271)
> This ends up coming from Overseer:
>     while (data != null) {
>       final ZkNodeProps message = ZkNodeProps.load(data);
>       log.debug("processMessage: workQueueSize: {}, message = {}", workQueue.getStats().getQueueLength(), message);
>       // force flush to ZK after each message because there is no fallback if workQueue items
>       // are removed from workQueue but fail to be written to ZK
>       *clusterState = processQueueItem(message, clusterState, zkStateWriter, false, null);
>       workQueue.poll(); // poll-ing removes the element we got by peek-ing*
>       data = workQueue.peek();
>     }
> Note: The processQueueItem call comes before the poll, so when it throws an exception, the message that cannot be processed is never removed and stays stuck at the head of the queue. This made a large cluster unable to come up on its own without deleting the problem node.
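
To make the handling discussed above concrete, here is a minimal sketch of a peek/process/poll loop that polls an item off the queue only when it was processed successfully or failed with a NoNode-like or NodeExists-like error. It is not the attached patch. The exception classes, the Processor interface, and the drain() method are hypothetical stand-ins rather than Solr or ZooKeeper APIs, and the real signature of isBadMessageOrInvalidState() is not shown in this message, so it may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class QueueLoopSketch {
    // Hypothetical stand-ins for ZooKeeper's KeeperException hierarchy,
    // so the sketch compiles without any ZooKeeper dependency.
    static class KeeperLikeException extends Exception {
        KeeperLikeException(String msg) { super(msg); }
    }
    static class NoNodeLikeException extends KeeperLikeException {
        NoNodeLikeException(String msg) { super(msg); }
    }
    static class NodeExistsLikeException extends KeeperLikeException {
        NodeExistsLikeException(String msg) { super(msg); }
    }
    static class ConnectionLossLikeException extends KeeperLikeException {
        ConnectionLossLikeException(String msg) { super(msg); }
    }

    // Hypothetical callback standing in for processQueueItem(...).
    interface Processor {
        void process(String message) throws Exception;
    }

    // Assumed shape: only NoNode-like / NodeExists-like failures mean the
    // message itself can never succeed and is safe to discard.
    static boolean isBadMessageOrInvalidState(Exception e) {
        return e instanceof NoNodeLikeException
            || e instanceof NodeExistsLikeException;
    }

    // Drains the queue; returns the messages that were skipped as bad.
    static List<String> drain(Deque<String> workQueue, Processor processor)
            throws Exception {
        List<String> skipped = new ArrayList<>();
        String data = workQueue.peek();
        while (data != null) {
            try {
                processor.process(data);
            } catch (Exception e) {
                if (!isBadMessageOrInvalidState(e)) {
                    // Any other failure: rethrow WITHOUT polling, so the
                    // item stays at the head of the queue for a retry.
                    throw e;
                }
                // Bad message: remember it and fall through to poll().
                skipped.add(data);
            }
            workQueue.poll(); // removes the element we got by peek-ing
            data = workQueue.peek();
        }
        return skipped;
    }
}
```

With this shape, a message that fails with a NoNode-like error is dropped and the loop moves on, while a connection-loss-like failure propagates and leaves the queue untouched. Whether skipped messages should also be logged or counted is left open here.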