[ https://issues.apache.org/jira/browse/SOLR-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188373#comment-14188373 ]
ASF subversion and git services commented on SOLR-6631:
-------------------------------------------------------
Commit 1635142 from [~thelabdude] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1635142 ]
SOLR-6631: DistributedQueue spinning on calling zookeeper getChildren()
> DistributedQueue spinning on calling zookeeper getChildren()
> ------------------------------------------------------------
>
> Key: SOLR-6631
> URL: https://issues.apache.org/jira/browse/SOLR-6631
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Jessica Cheng Mallet
> Assignee: Timothy Potter
> Labels: solrcloud
> Attachments: SOLR-6631.patch
>
>
> The change from SOLR-6336 introduced a bug: I'm now stuck in a loop making
> getChildren() requests to ZooKeeper, with this thread dump:
> {quote}
> Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
> java.lang.Object.wait()
> org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
> org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
> org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
> org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
> org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
> org.apache.solr.cloud.DistributedQueue.getChildren(long)
> org.apache.solr.cloud.DistributedQueue.peek(long)
> org.apache.solr.cloud.DistributedQueue.peek(boolean)
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
> java.lang.Thread.run()
> {quote}
> Looking at the code, I think the issue is that LatchChildWatcher#process
> always stores the incoming event in its member variable event, regardless of
> the event's type. Once that member is set, await no longer waits. In this
> state, the while loop in getChildren(long), when called with wait equal to
> Long.MAX_VALUE, loops back and does NOT block in await because event != null,
> yet it still gets no children, so it spins.
> {quote}
> while (true) {
>   if (!children.isEmpty()) break;
>   watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
>   if (watcher.getWatchedEvent() != null) {
>     children = orderedChildren(null);
>   }
>   if (wait != Long.MAX_VALUE) break;
> }
> {quote}
> I think the fix would be to only set the event in the watcher if the type is
> not None.
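
For illustration, the proposed fix could look roughly like the sketch below. This is not the committed patch. To keep it runnable without ZooKeeper on the classpath, EventType and WatchedEvent here are hypothetical stand-ins for ZooKeeper's Watcher.Event.EventType and WatchedEvent, and the field and lock names are assumptions rather than the actual DistributedQueue code.

```java
public class LatchWatcherSketch {
    // Hypothetical stand-in for org.apache.zookeeper.Watcher.Event.EventType.
    enum EventType { None, NodeChildrenChanged }

    // Hypothetical stand-in for org.apache.zookeeper.WatchedEvent.
    static final class WatchedEvent {
        private final EventType type;
        WatchedEvent(EventType type) { this.type = type; }
        EventType getType() { return type; }
    }

    // Sketch of a latch-style watcher that ignores events of type None.
    static final class LatchChildWatcher {
        private final Object lock = new Object();
        private WatchedEvent event; // stays null until a non-None event arrives

        void process(WatchedEvent e) {
            // Events of type None carry session/connection state changes, not
            // changes to the watched node, so they must not release the latch.
            if (e.getType() == EventType.None) {
                return;
            }
            synchronized (lock) {
                event = e;
                lock.notifyAll();
            }
        }

        // Blocks for up to timeoutMs (must be > 0) unless an event is already recorded.
        void await(long timeoutMs) throws InterruptedException {
            synchronized (lock) {
                if (event != null) {
                    return;
                }
                lock.wait(timeoutMs);
            }
        }

        WatchedEvent getWatchedEvent() {
            synchronized (lock) {
                return event;
            }
        }
    }

    public static void main(String[] args) {
        LatchChildWatcher watcher = new LatchChildWatcher();
        watcher.process(new WatchedEvent(EventType.None));
        System.out.println("recorded after None event: " + (watcher.getWatchedEvent() != null));
        watcher.process(new WatchedEvent(EventType.NodeChildrenChanged));
        System.out.println("recorded after child event: " + (watcher.getWatchedEvent() != null));
    }
}
```

With a guard like this, a None event leaves the member unset, so the next await in the getChildren(long) loop blocks again instead of returning immediately.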
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)