[ https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955908#comment-14955908 ]
ASF subversion and git services commented on SOLR-8152:
-------------------------------------------------------

Commit 1708538 from gcha...@apache.org in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1708538 ]

SOLR-8152: Overseer Task Processor/Queue can miss responses, leading to timeouts

> Overseer Task Processor/Queue can miss responses, leading to timeouts
> ---------------------------------------------------------------------
>
>                 Key: SOLR-8152
>                 URL: https://issues.apache.org/jira/browse/SOLR-8152
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Gregory Chanan
>            Assignee: Gregory Chanan
>         Attachments: SOLR-8152.patch
>
>
> I noticed some Jenkins reports of timeouts in the
> TestConfigSetsAPIExclusivityTest, which seemed strange given that the amount
> of work to be done is small and the timeout is a generous 300 seconds.
> I added some statistics gathering and started beasting the test and, sure
> enough, some tests reported tasks taking slightly more than 300 seconds,
> while most tests ran with a maximum task run of less than a second. This
> suggested something was hanging until the timeout.
> Some investigation led to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appear to be a few issues here:
> {code}
>       String path = createData(dir + "/" + PREFIX, data,
>           CreateMode.PERSISTENT_SEQUENTIAL);
>       String watchID = createData(
>           dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") + 1),
>           null, CreateMode.EPHEMERAL);
>       Object lock = new Object();
>       LatchWatcher watcher = new LatchWatcher(lock);
>       synchronized (lock) {
>         if (zookeeper.exists(watchID, watcher, true) != null) {
>           watcher.await(timeout);
>         }
>       }
> {code}
> For one, the request object is created before the response object. If the
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until
> the timeout even though the response is ready. This will still pass the test
> because the response is available; the client will just wait needlessly.
> 2) The processor attempts to write the response before the response node is
> even created. The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a
> failure because there is no response.
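
The ordering problem in case 2 can be reproduced without ZooKeeper. What follows is a minimal in-memory sketch, not the code from SOLR-8152.patch: FakeStore, requestFirst, responseFirst and fastWorker are hypothetical names, the node paths are fixed strings (the real response name is derived from the request's sequence number), and the "fast" worker is simulated by running it synchronously right after the request node is created. It only illustrates why creating and watching the response node before publishing the request avoids the lost response.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory model; none of these names are Solr or ZooKeeper APIs.
final class ResponseOrderingSketch {

    /** Tiny stand-in for a node store with one-shot data watches. */
    static final class FakeStore {
        private final Map<String, byte[]> nodes = new ConcurrentHashMap<>();
        private final Map<String, CountDownLatch> watches = new ConcurrentHashMap<>();

        void create(String path) {
            nodes.put(path, new byte[0]);
        }

        /** Registers a watch on the path and returns the latch it releases. */
        CountDownLatch watch(String path) {
            return watches.computeIfAbsent(path, p -> new CountDownLatch(1));
        }

        /** Models the behaviour described above: a write to a missing node is silently dropped. */
        boolean setDataIfExists(String path, byte[] data) {
            if (nodes.computeIfPresent(path, (p, old) -> data) == null) {
                return false; // node does not exist, response is lost
            }
            CountDownLatch latch = watches.get(path);
            if (latch != null) {
                latch.countDown(); // wake up the waiting client
            }
            return true;
        }

        byte[] data(String path) {
            return nodes.get(path);
        }
    }

    static final String REQUEST = "/queue/qn-1";   // assumed fixed paths
    static final String RESPONSE = "/queue/qnr-1";

    /** Simulated worker that finishes as soon as it sees the request. */
    static void fastWorker(FakeStore store) {
        store.setDataIfExists(RESPONSE, "done".getBytes(StandardCharsets.UTF_8));
    }

    /** Ordering from the quoted snippet: request node, response node, then watch. */
    static byte[] requestFirst(FakeStore store, long timeoutMs) throws InterruptedException {
        store.create(REQUEST);
        fastWorker(store);                       // runs before the response node exists
        store.create(RESPONSE);
        CountDownLatch latch = store.watch(RESPONSE);
        latch.await(timeoutMs, TimeUnit.MILLISECONDS); // waits for the full timeout
        return store.data(RESPONSE);
    }

    /** Alternative ordering: response node and watch first, request node last. */
    static byte[] responseFirst(FakeStore store, long timeoutMs) throws InterruptedException {
        store.create(RESPONSE);
        CountDownLatch latch = store.watch(RESPONSE);
        store.create(REQUEST);
        fastWorker(store);                       // response node and watch already in place
        latch.await(timeoutMs, TimeUnit.MILLISECONDS); // released by the worker's write
        return store.data(RESPONSE);
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] lost = requestFirst(new FakeStore(), 100);
        byte[] kept = responseFirst(new FakeStore(), 100);
        System.out.println("request first:  response bytes = " + lost.length);
        System.out.println("response first: " + new String(kept, StandardCharsets.UTF_8));
    }
}
```

In the request-first ordering the worker's write is dropped and the client waits out the timeout before seeing an empty response. In the response-first ordering the same worker's write lands and releases the latch. A real implementation would also have to tell the worker which response node belongs to the request, since the name can no longer be derived after the fact.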