Houston Putman created SOLR-17709:
-------------------------------------
Summary: Fix race condition when checking distrib async cmd status
Key: SOLR-17709
URL: https://issues.apache.org/jira/browse/SOLR-17709
Project: Solr
Issue Type: Bug
Reporter: Houston Putman
Assignee: Houston Putman
The {{DistributedApiAsyncTracker}} mentions a possible race condition between
completing an asynchronous request and checking its status.
This causes very infrequent test failures, such as
{{ReindexCollectionTest.testAbort}}.
The solution is to check the ZK paths in the reverse of the order in which
they are updated.
When a task is completed or canceled, the paths are updated in the following
order:
# {{trackedAsyncTasks.put(asyncId, ...)}} or
{{trackedAsyncTasks.remove(asyncId)}}
# {{inFlightAsyncTasks.deleteInFlightTask(asyncId)}}
Therefore, in {{getAsyncTaskRequestStatus(asyncId)}}, we need to check
{{inFlightAsyncTasks}} before {{trackedAsyncTasks}}. This means we can get
a false-positive "Submitted" or "Running" result (race condition described
below). But that only leads to the client checking again at a later time.
By the next call, {{inFlightAsyncTasks}} will have been updated and
we will get the actual response from {{trackedAsyncTasks}}.
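A minimal sketch of this ordering follows. It assumes in-memory collections as stand-ins for the two ZK-backed structures. The class {{AsyncStatusSketch}}, the {{submit}} method, and the three-value {{State}} enum are hypothetical simplifications for illustration, not the actual {{DistributedApiAsyncTracker}} code.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in; the real structures are backed by ZooKeeper.
class AsyncStatusSketch {
  enum State { RUNNING, COMPLETED, FAILED }

  final Map<String, String> trackedAsyncTasks = new ConcurrentHashMap<>();
  final Set<String> inFlightAsyncTasks = ConcurrentHashMap.newKeySet();

  // Assumed submission step: mark the task as in flight.
  void submit(String asyncId) {
    inFlightAsyncTasks.add(asyncId);
  }

  // Writer order from the issue: store the response first,
  // delete the in-flight entry last.
  void setTaskCompleted(String asyncId, String response) {
    trackedAsyncTasks.put(asyncId, response);
    inFlightAsyncTasks.remove(asyncId);
  }

  // Reader order after the fix: in-flight entry first, stored response second.
  State getAsyncTaskRequestStatus(String asyncId) {
    if (inFlightAsyncTasks.contains(asyncId)) {
      return State.RUNNING;
    }
    return trackedAsyncTasks.containsKey(asyncId) ? State.COMPLETED : State.FAILED;
  }

  public static void main(String[] args) {
    AsyncStatusSketch tracker = new AsyncStatusSketch();
    tracker.submit("task-1");
    System.out.println(tracker.getAsyncTaskRequestStatus("task-1"));
    tracker.setTaskCompleted("task-1", "success");
    System.out.println(tracker.getAsyncTaskRequestStatus("task-1"));
  }
}
```

Because the reader looks at the in-flight entry first, any window between the writer's two steps is reported as still running rather than as a failure.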
Before this PR, the race condition would give us a false-negative "Operation
failed. Please resubmit" result (race condition described below). This would
tell the client to try again when in fact the task could have been successful.
This false negative is much worse than the false positive described above.
Race condition before this PR (false-negative):
# {{getAsyncTaskRequestStatus()}} -- {{trackedAsyncTasks}} is checked -- no
response is found
# {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response is
put into ZK
# {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId is
deleted from ZK
# {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
asyncId is not found
** Return a failure - Assume node died because {{inFlightAsyncTasks}}
ephemeral node is gone
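The interleaving above can be replayed deterministically on a single thread. This is only an illustrative sketch: the class {{OldOrderReplay}}, the plain {{HashMap}}/{{HashSet}} stand-ins, and the lower-case result strings are assumptions, not the real ZK-backed implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical replay of the pre-fix read order (tracked first, in-flight second).
class OldOrderReplay {
  static String replay() {
    Map<String, String> tracked = new HashMap<>();
    Set<String> inFlight = new HashSet<>();
    String asyncId = "task-1";
    inFlight.add(asyncId); // the task is running

    // 1. Reader checks trackedAsyncTasks: no response yet.
    String response = tracked.get(asyncId);
    // 2. Writer stores the response.
    tracked.put(asyncId, "success");
    // 3. Writer deletes the in-flight entry.
    inFlight.remove(asyncId);
    // 4. Reader checks inFlightAsyncTasks: the entry is gone.
    if (response != null) {
      return "completed";
    }
    return inFlight.contains(asyncId) ? "running" : "failed";
  }

  public static void main(String[] args) {
    System.out.println(replay()); // prints "failed" although the task succeeded
  }
}
```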
Race condition after this PR (false-positive):
# {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response is
put into ZK
# {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
asyncId is found
** Return that the task is in progress
# {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId is
deleted from ZK
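The same kind of single-threaded replay shows why this outcome is harmless: a second status call made after the last step sees the stored response. As before, {{NewOrderReplay}}, its {{status}} helper, and the result strings are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical replay of the post-fix read order (in-flight first, tracked second).
class NewOrderReplay {
  static String status(Map<String, String> tracked, Set<String> inFlight, String asyncId) {
    if (inFlight.contains(asyncId)) {
      return "running";
    }
    return tracked.containsKey(asyncId) ? "completed" : "failed";
  }

  static List<String> replay() {
    Map<String, String> tracked = new HashMap<>();
    Set<String> inFlight = new HashSet<>();
    String asyncId = "task-1";
    inFlight.add(asyncId); // the task is running

    // 1. Writer stores the response.
    tracked.put(asyncId, "success");
    // 2. Reader runs between the writer's two steps: still marked in flight.
    String first = status(tracked, inFlight, asyncId);
    // 3. Writer deletes the in-flight entry.
    inFlight.remove(asyncId);
    // The client's later retry now reads the stored response.
    String second = status(tracked, inFlight, asyncId);
    return List.of(first, second);
  }

  public static void main(String[] args) {
    System.out.println(replay()); // prints [running, completed]
  }
}
```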