[ https://issues.apache.org/jira/browse/SOLR-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338301#comment-14338301 ]

ASF subversion and git services commented on SOLR-6631:
-------------------------------------------------------

Commit 1662429 from sha...@apache.org in branch 'dev/branches/lucene_solr_4_10'
[ https://svn.apache.org/r1662429 ]

SOLR-6631: DistributedQueue spinning on calling zookeeper getChildren()

> DistributedQueue spinning on calling zookeeper getChildren()
> ------------------------------------------------------------
>
>                 Key: SOLR-6631
>                 URL: https://issues.apache.org/jira/browse/SOLR-6631
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Jessica Cheng Mallet
>            Assignee: Timothy Potter
>              Labels: solrcloud
>             Fix For: 4.10.4, 5.0
>
>         Attachments: SOLR-6631.patch, SOLR-6631.patch
>
>
> The change from SOLR-6336 introduced a bug: I'm now stuck in a loop 
> making getChildren() requests to ZooKeeper, as this thread dump shows:
> {quote}
> Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
> java.lang.Object.wait()
> org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
> org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
> org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
> org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
> org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
> org.apache.solr.cloud.DistributedQueue.getChildren(long)
> org.apache.solr.cloud.DistributedQueue.peek(long)
> org.apache.solr.cloud.DistributedQueue.peek(boolean)
> org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
> java.lang.Thread.run()
> {quote}
> Looking at the code, I think the issue is that LatchChildWatcher#process 
> always stores the incoming event in its member variable event, regardless of 
> the event's type. Once that member is set, await() no longer waits. In this 
> state, the while loop in getChildren(long), when called with wait equal to 
> Long.MAX_VALUE, loops back and does NOT block at await() because event != 
> null, yet it still gets no children, so it keeps spinning.
> {quote}
> while (true) {
>   if (!children.isEmpty()) break;
>   watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
>   if (watcher.getWatchedEvent() != null) {
>     children = orderedChildren(null);
>   }
>   if (wait != Long.MAX_VALUE) break;
> }
> {quote}
> I think the fix would be to only set the event in the watcher if the type is 
> not None.
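
The proposed fix can be sketched roughly as below. This is a simplified model for illustration, not the actual Solr class or the committed patch: LatchWatcherSketch and the two-value EventType enum are hypothetical stand-ins for LatchChildWatcher and ZooKeeper's Watcher.Event.EventType, so that the example compiles without the ZooKeeper jar. It assumes the watcher is a plain lock-and-notify latch in which await() returns immediately once an event has been recorded.

```java
// Hypothetical stand-in for ZooKeeper's Watcher.Event.EventType,
// reduced to the two values this sketch needs.
enum EventType { None, NodeChildrenChanged }

// Simplified model of a latch-style child watcher (assumed shape,
// not the real LatchChildWatcher).
final class LatchWatcherSketch {
    private final Object lock = new Object();
    private EventType event; // stays null until a non-None event arrives

    // Called when a watch fires. Events of type None only signal a
    // connection-state change, so they are not recorded and do not
    // wake up a waiting thread.
    void process(EventType type) {
        synchronized (lock) {
            if (type == EventType.None) {
                return;
            }
            event = type;
            lock.notifyAll();
        }
    }

    // Waits up to timeoutMs (must be > 0) unless an event is already recorded.
    void await(long timeoutMs) throws InterruptedException {
        synchronized (lock) {
            if (event == null) {
                lock.wait(timeoutMs);
            }
        }
    }

    EventType getWatchedEvent() {
        synchronized (lock) {
            return event;
        }
    }
}
```

With this guard, a None event leaves getWatchedEvent() returning null, so the loop in getChildren(long) blocks in await() for the timeout again instead of spinning on getChildren() calls.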



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
