Gregory Chanan created SOLR-8152:
------------------------------------

             Summary: Overseer Task Processor/Queue can miss responses, leading to timeouts
                 Key: SOLR-8152
                 URL: https://issues.apache.org/jira/browse/SOLR-8152
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
            Reporter: Gregory Chanan
            Assignee: Gregory Chanan


I noticed some Jenkins reports of timeouts in 
TestConfigSetsAPIExclusivityTest, which seemed strange given that the amount of 
work to be done is small and the timeout is a generous 300 seconds.

I added some statistics gathering and started beasting the test and sure 
enough, some tests reported tasks taking slightly more than 300 seconds, while 
most tests ran with a maximum task run of less than a second.  This suggested 
something was hanging until the timeout.

Some investigation led to this code:
https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194

There appear to be a few issues here:
{code}
String path = createData(dir + "/" + PREFIX, data,
    CreateMode.PERSISTENT_SEQUENTIAL);
String watchID = createData(
    dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") + 1),
    null, CreateMode.EPHEMERAL);

Object lock = new Object();
LatchWatcher watcher = new LatchWatcher(lock);
synchronized (lock) {
  if (zookeeper.exists(watchID, watcher, true) != null) {
    watcher.await(timeout);
  }
}
{code}

For one, the request object is created before the response object.  If the 
request is picked up and processed quickly, two things can happen:
1) The response is written before the watch is set, so the watch never sees 
the change and we wait for the full timeout even though the response is ready.  
The test still passes because the response is available; the client just waits 
needlessly.
2) The processor tries to write the response before the response node has even 
been created.  The fact that the response node doesn't exist is ignored:
https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
In this case, the task is processed but the client actually sees a failure 
because there is no response.
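
To make the two cases concrete, here is a minimal sketch of one possible 
ordering that avoids both of them: create the response node, set the watch, and 
only then publish the request.  It is not the Solr code and not necessarily the 
eventual fix.  FakeStore is a hypothetical in-memory stand-in for ZooKeeper, and 
submitAndWait, the "/response-" path and the Runnable that enqueues the request 
are all names made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class ResponseRaceSketch {

    /** Hypothetical in-memory stand-in for ZooKeeper; not a real API. */
    static class FakeStore {
        private final Map<String, byte[]> nodes = new ConcurrentHashMap<>();
        private final Map<String, CountDownLatch> watches = new ConcurrentHashMap<>();

        void create(String path) {
            nodes.put(path, new byte[0]);
        }

        /** Registers a one-shot data watch; returns false if the node is missing. */
        boolean watch(String path, CountDownLatch latch) {
            if (!nodes.containsKey(path)) {
                return false;
            }
            watches.put(path, latch);
            return true;
        }

        /** A write to a missing node is dropped, mirroring case 2 above. */
        boolean setData(String path, byte[] data) {
            if (nodes.computeIfPresent(path, (k, v) -> data) == null) {
                return false;
            }
            CountDownLatch latch = watches.remove(path);
            if (latch != null) {
                latch.countDown(); // fire the watch exactly once
            }
            return true;
        }

        byte[] getData(String path) {
            return nodes.get(path);
        }
    }

    /** Assumed ordering: response node, then watch, then request. */
    static byte[] submitAndWait(FakeStore store, String id,
                                Runnable enqueueRequest, long timeoutMs) {
        String responsePath = "/response-" + id;
        CountDownLatch latch = new CountDownLatch(1);
        store.create(responsePath);        // 1. response node exists first
        store.watch(responsePath, latch);  // 2. watch is set before any work starts
        enqueueRequest.run();              // 3. only now can a worker see the request
        try {
            latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Read the data whether or not the latch fired, so a response that
        // arrived without a notification is still returned.
        return store.getData(responsePath);
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        // The worker answers immediately, the worst case for the original ordering.
        byte[] response = submitAndWait(store, "0001",
            () -> store.setData("/response-0001", "ok".getBytes(StandardCharsets.UTF_8)),
            5000);
        System.out.println(new String(response, StandardCharsets.UTF_8)); // prints "ok"
    }
}
```

With this ordering an instant response still trips the latch, so the caller 
does not sit out the timeout, and the write cannot land on a node that does not 
exist yet.  Reading the data after the wait regardless of how the wait ended is 
a second safeguard against a missed notification.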


