[ https://issues.apache.org/jira/browse/SOLR-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188305#comment-14188305 ]
Timothy Potter commented on SOLR-6631:
--------------------------------------

Thanks for the feedback Hoss. I was actually wondering if it would suffice to handle NodeChildrenChanged EventTypes in the LatchChildWatcher process method, i.e. change the code to:

if (eventType == Event.EventType.NodeChildrenChanged) { ... }

[~markrmil...@gmail.com] or [~andyetitmoves] do either of you have any insight you can share on this? Specifically, I'd like to change the LatchChildWatcher.process to set the event member and notifyAll only if the EventType is NodeChildrenChanged, i.e.

{code}
@Override
public void process(WatchedEvent event) {
  Event.EventType eventType = event.getType();
  LOG.info("LatchChildWatcher fired on path: " + event.getPath() + " state: " + event.getState() + " type " + eventType);
  if (eventType == Event.EventType.NodeChildrenChanged) {
    synchronized (lock) {
      this.event = event;
      lock.notifyAll();
    }
  }
}
{code}

Or do we need to handle the other event types and just not affect the event if the type is None as originally suggested by [~mewmewball]?
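To make the two options easier to compare, here is a minimal, self-contained sketch. It does not use the ZooKeeper API: `EventType` is a simplified stand-in for `Watcher.Event.EventType`, and `LatchWatcherSketch`, `Policy`, `LatchWatcher` and `latchedAfter` are hypothetical names invented for illustration. The `await` behaviour (return immediately once an event has been recorded) is an assumption based on the issue description, not a copy of the DistributedQueue code.

```java
class LatchWatcherSketch {
    // Simplified stand-in for org.apache.zookeeper.Watcher.Event.EventType.
    enum EventType { None, NodeCreated, NodeDeleted, NodeDataChanged, NodeChildrenChanged }

    // Hypothetical switch between the behaviours discussed in this issue.
    enum Policy { ANY, NOT_NONE, CHILDREN_CHANGED_ONLY }

    static final class LatchWatcher {
        private final Object lock = new Object();
        private final Policy policy;
        private EventType event; // stays null until an accepted event arrives

        LatchWatcher(Policy policy) { this.policy = policy; }

        // Record the event and wake waiters only if the policy accepts its type.
        void process(EventType type) {
            boolean accept;
            switch (policy) {
                case ANY: accept = true; break;
                case NOT_NONE: accept = type != EventType.None; break;
                default: accept = type == EventType.NodeChildrenChanged; break;
            }
            if (!accept) return;
            synchronized (lock) {
                event = type;
                lock.notifyAll();
            }
        }

        // Waits up to timeoutMs unless an event is already recorded;
        // returns true if an event is recorded when the call finishes.
        boolean await(long timeoutMs) throws InterruptedException {
            synchronized (lock) {
                if (event == null) lock.wait(timeoutMs);
                return event != null;
            }
        }
    }

    // Fires a single event and reports whether the latch counts as released.
    static boolean latchedAfter(Policy policy, EventType type) {
        LatchWatcher w = new LatchWatcher(policy);
        w.process(type);
        try {
            return w.await(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p
                + ": None=" + latchedAfter(p, EventType.None)
                + " NodeDeleted=" + latchedAfter(p, EventType.NodeDeleted)
                + " NodeChildrenChanged=" + latchedAfter(p, EventType.NodeChildrenChanged));
        }
    }
}
```

In this sketch only the ANY policy releases the latch on a None event, which is the condition the issue describes as leading to the spin. The two filters differ only for types such as NodeDeleted, so the open question is whether a watch registered through getChildren() can deliver anything other than None or NodeChildrenChanged that the queue needs to react to.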
Need to get this one committed soon ;-)

> DistributedQueue spinning on calling zookeeper getChildren()
> ------------------------------------------------------------
>
>                 Key: SOLR-6631
>                 URL: https://issues.apache.org/jira/browse/SOLR-6631
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Jessica Cheng Mallet
>            Assignee: Timothy Potter
>              Labels: solrcloud
>         Attachments: SOLR-6631.patch
>
>
> The change from SOLR-6336 introduced a bug where now I'm stuck in a loop
> making getChildren() requests to zookeeper with this thread dump:
> {quote}
> Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
> java.lang.Object.wait()
> org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
> org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
> org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
> org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
> org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
> org.apache.solr.cloud.DistributedQueue.getChildren(long)
> org.apache.solr.cloud.DistributedQueue.peek(long)
> org.apache.solr.cloud.DistributedQueue.peek(boolean)
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
> java.lang.Thread.run()
> {quote}
> Looking at the code, I think the issue is that LatchChildWatcher#process
> always sets the event to its member variable event, regardless of its type,
> but the problem is that once the member event is set, the await no longer
> waits. In this state, the while loop in getChildren(long), when called with
> wait being Long.MAX_VALUE, will loop back, NOT wait at await because event
> != null, but then it still will not get any children.
> {quote}
> while (true) \{
>   if (!children.isEmpty()) break;
>   watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
>   if (watcher.getWatchedEvent() != null)
>   \{ children = orderedChildren(null); \}
>   if (wait != Long.MAX_VALUE) break;
> \}
> {quote}
> I think the fix would be to only set the event in the watcher if the type is
> not None.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)