[
https://issues.apache.org/jira/browse/SOLR-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936920#comment-17936920
]
ASF subversion and git services commented on SOLR-17709:
--------------------------------------------------------
Commit e51dd47d88445259d57fc63dc655aecaafecf265 in solr's branch
refs/heads/branch_9x from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=e51dd47d884 ]
SOLR-17709: Fix race condition when checking distrib async cmd status (#3268)
(cherry picked from commit d0d4f280b6410d8996fa998620d9b6661848d1f0)
> Fix race condition when checking distrib async cmd status
> ---------------------------------------------------------
>
> Key: SOLR-17709
> URL: https://issues.apache.org/jira/browse/SOLR-17709
> Project: Solr
> Issue Type: Bug
> Reporter: Houston Putman
> Assignee: Houston Putman
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {{DistributedApiAsyncTracker}} already noted a possible race condition
> between completing an asynchronous request and checking its status.
> This race causes very infrequent test failures, such as
> {{ReindexCollectionTest.testAbort}}.
> The fix is to check the ZK paths in the reverse of the order in which they
> are updated.
> So when completing or canceling tasks, they are updated in the following
> order:
> # {{trackedAsyncTasks.put(asyncId, ...)}} or
> {{trackedAsyncTasks.remove(asyncId)}}
> # {{inFlightAsyncTasks.deleteInFlightTask(asyncId)}}
> Therefore in {{getAsyncTaskRequestStatus(asyncId)}}, we need to check
> {{inFlightAsyncTasks}} before {{trackedAsyncTasks}}. This means we can
> get a false-positive "Submitted" or "Running" result (race condition
> described below). But that will just lead to the client checking again at a
> later time, and the next time they call, {{inFlightAsyncTasks}} will have
> been updated and we will get the actual response from
> {{trackedAsyncTasks}}.
> Before this PR, the race condition would give us a false-negative "Operation
> failed. Please resubmit" result (race condition described below). This would
> tell the client to try again, when in fact the task could have been
> successful. This false-negative is much worse than the false-positive
> described above.
> Race condition before this PR: (false-negative)
> # {{getAsyncTaskRequestStatus()}} -- {{trackedAsyncTasks}} is checked -- no
> response is found
> # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response
> is put into ZK
> # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId
> is deleted from ZK
> # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
> asyncId is not found
> ** Return a failure - Assume node died because {{inFlightAsyncTasks}}
> ephemeral node is gone
> Race condition after this PR: (false-positive)
> # {{setTaskCompleted()}} -- {{trackedAsyncTasks}} id is updated -- response
> is put into ZK
> # {{getAsyncTaskRequestStatus()}} -- {{inFlightAsyncTasks}} is checked --
> asyncId is found
> ** Return that the task is in progress
> # {{setTaskCompleted()}} -- {{inFlightAsyncTasks}} id is deleted -- asyncId
> is deleted from ZK
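
The ordering described above can be illustrated with a minimal sketch. This is not Solr's actual code: the class AsyncStatusSketch, its in-memory map and set (standing in for the two ZK-backed structures), the putResponse() and submit() helpers, and the status strings are all hypothetical names chosen for illustration. Completion is split into its two steps so that a status check can be interleaved between them.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the two ZK-backed structures; not Solr's real classes.
class AsyncStatusSketch {
  // asyncId -> response; an empty Optional means "submitted, no response stored yet".
  private final Map<String, Optional<String>> trackedAsyncTasks = new ConcurrentHashMap<>();
  // Stand-in for the ephemeral in-flight nodes.
  private final Set<String> inFlightAsyncTasks = ConcurrentHashMap.newKeySet();

  void submit(String asyncId) {
    trackedAsyncTasks.put(asyncId, Optional.empty());
    inFlightAsyncTasks.add(asyncId);
  }

  // Completion step 1: store the response.
  void putResponse(String asyncId, String response) {
    trackedAsyncTasks.put(asyncId, Optional.of(response));
  }

  // Completion step 2: drop the in-flight marker.
  void deleteInFlightTask(String asyncId) {
    inFlightAsyncTasks.remove(asyncId);
  }

  // Reads in the reverse of the write order: in-flight marker first, response second.
  String getStatus(String asyncId) {
    if (inFlightAsyncTasks.contains(asyncId)) {
      return "RUNNING"; // may be a false-positive; the client simply polls again
    }
    Optional<String> response = trackedAsyncTasks.get(asyncId);
    if (response == null) {
      return "NOT_FOUND";
    }
    // No in-flight marker and no stored response: assume the executing node died.
    return response.orElse("FAILED");
  }

  public static void main(String[] args) {
    AsyncStatusSketch tracker = new AsyncStatusSketch();
    tracker.submit("a1");
    tracker.putResponse("a1", "COMPLETED");
    System.out.println(tracker.getStatus("a1")); // status check lands between the two steps
    tracker.deleteInFlightTask("a1");
    System.out.println(tracker.getStatus("a1")); // next poll sees the stored response
  }
}
```

In this sketch, a status check that lands between the two completion steps reports the task as still running, and the following check returns the stored response. Because the in-flight marker is only removed after the response has been written, a reader that finds no marker will always find the response of a completed task.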