[
https://issues.apache.org/jira/browse/SOLR-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Houston Putman resolved SOLR-17709.
-----------------------------------
Fix Version/s: 9.9
Resolution: Fixed
> Fix race condition when checking distrib async cmd status
> ---------------------------------------------------------
>
> Key: SOLR-17709
> URL: https://issues.apache.org/jira/browse/SOLR-17709
> Project: Solr
> Issue Type: Bug
> Reporter: Houston Putman
> Assignee: Houston Putman
> Priority: Major
> Labels: pull-request-available
> Fix For: 9.9
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The {{DistributedApiAsyncTracker}} mentions that there could be a race
> condition between completing an asynchronous request and checking its status.
> This is causing very infrequent test failures, such as in
> {{ReindexCollectionTest.testAbort}}.
> The solution is to check the ZK paths in the reverse of the order in which
> they are updated.
> When a task is completed or canceled, the paths are updated in the following
> order:
> # {{trackedAsyncTasks.put(asyncId, ...)}} or
> {{trackedAsyncTasks.remove(asyncId)}}
> # {{inFlightAsyncTasks.deleteInFlightTask(asyncId)}}
> Therefore, in {{getAsyncTaskRequestStatus(asyncId)}}, we need to check
> {{inFlightAsyncTasks}} before {{trackedAsyncTasks}}. This means we can
> get a false-positive "Submitted" or "Running" result (see the race condition
> described below). But that only leads to the client checking again at a
> later time, and by the next call {{inFlightAsyncTasks}} will have
> been updated, so we will get the actual response from
> {{trackedAsyncTasks}}.
> Before this PR, the race condition would give us a false-negative "Operation
> failed. Please resubmit" result (see the race condition described below). This
> would tell the client to try again, when in fact the task could have been
> successful. This false negative is much worse than the false positive
> described above.
> Race condition before this PR (false-negative):
> # {{getAsyncTaskRequestStatus()}} -- {{trackedAsyncTasks}} is checked -- no
> response is found
> # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response
> is put into ZK
> # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId
> is deleted from ZK
> # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
> asyncId is not found
> ** Return a failure: assume the node died because the {{inFlightAsyncTasks}}
> ephemeral node is gone
> Race condition after this PR (false-positive):
> # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response
> is put into ZK
> # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
> asyncId is found
> ** Return that the task is in progress
> # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId
> is deleted from ZK
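
The two interleavings described above can be replayed deterministically with a small sketch. This is not the Solr implementation: the class name, the {{Status}} enum, and the helper methods are hypothetical, and plain in-memory collections stand in for the two ZK paths. It only illustrates why checking {{inFlightAsyncTasks}} first turns the false-negative into a harmless false-positive.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model: in-memory collections stand in for the two ZK paths.
class AsyncStatusOrderSketch {
    enum Status { COMPLETED, RUNNING, FAILED_RESUBMIT }

    final Map<String, String> trackedAsyncTasks = new ConcurrentHashMap<>();
    final Set<String> inFlightAsyncTasks = ConcurrentHashMap.newKeySet();

    // A submitted task starts out with only its in-flight marker.
    static AsyncStatusOrderSketch withSubmittedTask(String asyncId) {
        AsyncStatusOrderSketch s = new AsyncStatusOrderSketch();
        s.inFlightAsyncTasks.add(asyncId);
        return s;
    }

    // Status check in the order proposed in the issue:
    // in-flight marker first, stored response second.
    Status statusInFlightFirst(String asyncId) {
        if (inFlightAsyncTasks.contains(asyncId)) {
            return Status.RUNNING;
        }
        return trackedAsyncTasks.containsKey(asyncId)
                ? Status.COMPLETED
                : Status.FAILED_RESUBMIT;
    }

    // Replays the pre-fix interleaving: the reader checks the response first,
    // the writer completes both updates, then the reader checks the marker.
    static Status replayOldOrder() {
        String id = "task-1";
        AsyncStatusOrderSketch s = withSubmittedTask(id);
        boolean hasResponse = s.trackedAsyncTasks.containsKey(id); // reader: no response yet
        s.trackedAsyncTasks.put(id, "success");                    // writer: response stored
        s.inFlightAsyncTasks.remove(id);                           // writer: marker deleted
        boolean inFlight = s.inFlightAsyncTasks.contains(id);      // reader: marker is gone
        if (hasResponse) {
            return Status.COMPLETED;
        }
        return inFlight ? Status.RUNNING : Status.FAILED_RESUBMIT;
    }

    // Replays the post-fix interleaving and returns {first poll, second poll}.
    static Status[] replayNewOrder() {
        String id = "task-1";
        AsyncStatusOrderSketch s = withSubmittedTask(id);
        s.trackedAsyncTasks.put(id, "success");     // writer: response stored
        Status first = s.statusInFlightFirst(id);   // reader: marker still present
        s.inFlightAsyncTasks.remove(id);            // writer: marker deleted
        Status second = s.statusInFlightFirst(id);  // reader polls again later
        return new Status[] {first, second};
    }

    public static void main(String[] args) {
        System.out.println("old order: " + replayOldOrder());
        Status[] polls = replayNewOrder();
        System.out.println("new order: " + polls[0] + ", then " + polls[1]);
    }
}
```

In this model the old order reports {{FAILED_RESUBMIT}} for a task that actually succeeded, while the new order reports {{RUNNING}} on the first poll and {{COMPLETED}} on the next one.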