[ https://issues.apache.org/jira/browse/SOLR-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timothy Potter updated SOLR-6631:
---------------------------------
    Attachment: SOLR-6631.patch

Thanks for the help [~markrmil...@gmail.com] and [~mewmewball]! I think this patch is ready to go (and no hangs this time)! Please give a quick review and I'll get it committed.

> DistributedQueue spinning on calling zookeeper getChildren()
> ------------------------------------------------------------
>
>                 Key: SOLR-6631
>                 URL: https://issues.apache.org/jira/browse/SOLR-6631
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Jessica Cheng Mallet
>            Assignee: Timothy Potter
>              Labels: solrcloud
>         Attachments: SOLR-6631.patch, SOLR-6631.patch
>
>
> The change from SOLR-6336 introduced a bug where now I'm stuck in a loop
> making getChildren() requests to zookeeper with this thread dump:
> {quote}
> Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
> java.lang.Object.wait()
> org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
> org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
> org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
> org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
> org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
> org.apache.solr.cloud.DistributedQueue.getChildren(long)
> org.apache.solr.cloud.DistributedQueue.peek(long)
> org.apache.solr.cloud.DistributedQueue.peek(boolean)
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
> java.lang.Thread.run()
> {quote}
> Looking at the code, I think the issue is that LatchChildWatcher#process
> always sets the event to its member variable event, regardless of its type,
> but the problem is that once the member event is set, the await no longer
> waits.
> In this state, the while loop in getChildren(long), when called with wait
> being Long.MAX_VALUE, will loop back and NOT wait at await because event
> != null, but then it still will not get any children.
> {quote}
> while (true) {
>   if (!children.isEmpty()) break;
>   watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
>   if (watcher.getWatchedEvent() != null) {
>     children = orderedChildren(null);
>   }
>   if (wait != Long.MAX_VALUE) break;
> }
> {quote}
> I think the fix would be to only set the event in the watcher if the type is
> not None.
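
The fix proposed in the issue description (record the event only when its type is not None) can be sketched roughly as follows. This is a minimal illustration, not the attached patch and not the actual Solr code: `LatchWatcherSketch`, its nested `EventType` enum, and `LatchWatcher` are hypothetical stand-ins so the example compiles without ZooKeeper on the classpath. In real code the type would presumably come from the `WatchedEvent` passed to `process`.

```java
// Hypothetical stand-ins; not the real Solr or ZooKeeper classes.
final class LatchWatcherSketch {

    /** Stand-in for ZooKeeper's event types; only the two needed here. */
    enum EventType { None, NodeChildrenChanged }

    /** Watcher that records an event only when it signals a real change. */
    static final class LatchWatcher {
        private final Object lock = new Object();
        private EventType event; // stays null until a non-None event arrives

        void process(EventType type) {
            // Proposed fix: ignore None events so they cannot leave the
            // latch permanently "open" and make await() return immediately.
            if (type == EventType.None) {
                return;
            }
            synchronized (lock) {
                event = type;
                lock.notifyAll();
            }
        }

        /** Waits up to timeoutMs unless an event has already been recorded. */
        void await(long timeoutMs) throws InterruptedException {
            synchronized (lock) {
                if (event != null) {
                    return;
                }
                lock.wait(timeoutMs);
            }
        }

        EventType getWatchedEvent() {
            synchronized (lock) {
                return event;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LatchWatcher watcher = new LatchWatcher();

        watcher.process(EventType.None);
        System.out.println("after None: " + watcher.getWatchedEvent());

        watcher.process(EventType.NodeChildrenChanged);
        watcher.await(1000); // returns at once, an event is already recorded
        System.out.println("after change: " + watcher.getWatchedEvent());
    }
}
```

With this guard, a None event leaves `getWatchedEvent()` returning null, so the loop quoted above would block in `await` again instead of spinning on `getChildren()`.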