[
https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882245#comment-16882245
]
Hoss Man commented on SOLR-13616:
---------------------------------
{quote}I'm not sure we should change the waitForState logic to rethrow
Exceptions or revert back PrepRecoveryOp to its previous version ...
{quote}
{quote}Hoss and Dat – thank you for investigating this! All usages of
CollectionStateWatcher or LiveNodesWatcher will suffer from this problem i.e.
the thread that runs the watcher swallows the exception ...
{quote}
Well, generally speaking there isn't any way (i can think of) for the thread
executing a Watcher to do anything _but_ swallow any exceptions from the
watcher – it can't propagate them back to the "caller" of registerWatcher or
anything like that .. if the caller wanted to be informed then the Watcher it
registered should be catching the exceptions itself.
But to Dat's point: in the specific case of {{waitForState}} – there
ZkStateReader *is* creating its own Watcher to wrap the input Predicate, and
we could in fact make waitForState do something inside that Watcher that
catches any Exception thrown by the Predicate and short circuits out of the
{{waitForState}} call, wrapping/re-throwing the exception in the meantime.
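To make that idea concrete, here is a minimal, self-contained sketch of the pattern. It is a model, not Solr code: {{StateSource}}, {{Watcher}} and this {{waitForState}} signature are hypothetical stand-ins, and the real ZkStateReader plumbing is assumed to differ in its details. The only point is the mechanism: the wrapping Watcher catches whatever the Predicate throws, stops watching, and hands the exception to the waiting thread, which re-throws it wrapped.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

// Illustrative model only: none of these types are Solr classes.
final class WaitForStateSketch {

  /** Hypothetical watcher: returns true once it no longer wants notifications. */
  public interface Watcher<S> {
    boolean onStateChanged(S state);
  }

  /** Hypothetical state source whose notifying thread swallows watcher exceptions. */
  public static final class StateSource<S> {
    private final List<Watcher<S>> watchers = new CopyOnWriteArrayList<>();

    public void register(Watcher<S> watcher) {
      watchers.add(watcher);
    }

    public void publish(S state) {
      for (Watcher<S> watcher : watchers) {
        try {
          if (watcher.onStateChanged(state)) {
            watchers.remove(watcher);
          }
        } catch (RuntimeException e) {
          // Nothing to propagate to: whoever registered the watcher is not on this stack.
        }
      }
    }
  }

  /** Blocks until the predicate matches, the predicate throws, or the timeout expires. */
  public static <S> void waitForState(StateSource<S> source, Predicate<S> predicate,
                                      long timeout, TimeUnit unit)
      throws InterruptedException, TimeoutException {
    CountDownLatch done = new CountDownLatch(1);
    AtomicReference<RuntimeException> failure = new AtomicReference<>();

    source.register(state -> {
      boolean finished;
      try {
        finished = predicate.test(state);
      } catch (RuntimeException e) {
        failure.set(e);   // remember the failure for the waiting thread
        finished = true;  // stop watching instead of waiting for the timeout
      }
      if (finished) {
        done.countDown();
      }
      return finished;
    });

    if (!done.await(timeout, unit)) {
      throw new TimeoutException("state did not match within the timeout");
    }
    RuntimeException cause = failure.get();
    if (cause != null) {
      throw new IllegalStateException("predicate failed while waiting for state", cause);
    }
  }
}
```

With this shape, a predicate that throws on the first notification makes the waiting call fail right away with the original exception as its cause, rather than blocking until the timeout.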
But those seem like "broader" problems with regards to where/how the different
callers are using the Watcher/waitForState APIs that we should probably create
a new issue to track (for auditing all of them and clarifying the behavior in
the javadocs) ... frankly i think in this specific jira we should be asking a
lot more questions about the _specific_ predicate used in PrepForRecovery's
waitForState call ... notably what exactly is the expectation here when the
SolrCore (that prepRecovery wants to recover from) can't be found _in the local
CoreContainer_ ... deleting the collection is just one example, are there other
situations where the core may not be found at this point in the code? (node
shutdown perhaps? autoscaling removing a replica) ?
what about a few lines later...
{code:java}
if (onlyIfLeader != null && onlyIfLeader) {
  if (!core.getCoreDescriptor().getCloudDescriptor().isLeader()) {
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "We are not the leader");
  }
}
{code}
...even if the SolrCore is found, if we expect it to be the shard leader, and
it's not (what if there has been a leader election in the meantime?) then that's
another type of problem that will also cause the predicate to throw an
exception that will (apparently) cause PrepRecovery to stall. what should
PrepRecovery do here?
i suspect that in general the use of waitForState here in PrepRecoveryOp is "ok
in concept" ... we just need to make the predicate smarter about exiting
immediately in these situations instead of throwing an exception that gets
swallowed ... i'm just not sure what the right behavior for PrepRecovery *is*
in these situations.
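As a rough illustration of a predicate that exits immediately instead of throwing, the sketch below records _why_ it stopped and returns true so the wait ends. Everything in it is hypothetical ({{Snapshot}}, {{Outcome}}, {{RecoveryPredicate}} are not Solr types), and it deliberately leaves open what PrepRecovery should do with each outcome, since that is the unanswered question.

```java
import java.util.function.Predicate;

// Illustrative model only: Snapshot and Outcome are hypothetical, not Solr types.
final class ExitEarlyPredicateSketch {

  public enum Outcome { STILL_WAITING, STATE_REACHED, CORE_NOT_FOUND, NOT_LEADER }

  /** Hypothetical view of what the predicate inspects on each notification. */
  public static final class Snapshot {
    final boolean coreFound;
    final boolean isLeader;
    final boolean replicaInExpectedState;

    public Snapshot(boolean coreFound, boolean isLeader, boolean replicaInExpectedState) {
      this.coreFound = coreFound;
      this.isLeader = isLeader;
      this.replicaInExpectedState = replicaInExpectedState;
    }
  }

  /** Returns true whenever waiting should stop, whether for a good or a bad reason. */
  public static final class RecoveryPredicate implements Predicate<Snapshot> {
    private final boolean onlyIfLeader;
    private volatile Outcome outcome = Outcome.STILL_WAITING;

    public RecoveryPredicate(boolean onlyIfLeader) {
      this.onlyIfLeader = onlyIfLeader;
    }

    @Override
    public boolean test(Snapshot s) {
      if (!s.coreFound) {
        outcome = Outcome.CORE_NOT_FOUND;   // e.g. the collection was deleted
      } else if (onlyIfLeader && !s.isLeader) {
        outcome = Outcome.NOT_LEADER;       // e.g. a leader election happened meanwhile
      } else if (s.replicaInExpectedState) {
        outcome = Outcome.STATE_REACHED;
      } else {
        outcome = Outcome.STILL_WAITING;
      }
      return outcome != Outcome.STILL_WAITING;
    }

    /** The reason the last call to test() stopped (or did not stop) the wait. */
    public Outcome outcome() {
      return outcome;
    }
  }
}
```

After the wait returns, the caller would inspect {{outcome()}} and decide whether to proceed, fail the request, or retry; which of those is correct for each case is exactly the open question above.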
----
I don't suppose either of you were able to spot what's "wrong" with my test
that it doesn't force a failure in this situation?
> Possible racecondition/deadlock between collection DELETE and PrepRecovery ?
> (TestPolicyCloud failures)
> -------------------------------------------------------------------------------------------------------
>
> Key: SOLR-13616
> URL: https://issues.apache.org/jira/browse/SOLR-13616
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: SOLR-13616.test-incomplete.patch,
> thetaphi_Lucene-Solr-master-Linux_24358.log.txt
>
>
> Based on some recent jenkins failures in TestPolicyCloud, I suspect there is
> a possible deadlock condition when attempting to delete a collection while
> recovery is in progress.
> I haven't been able to identify exactly where/why/how the problem occurs, but
> it does not appear to be a test specific problem, and seems like it could
> potentially affect anyone unlucky enough to issue a poorly timed DELETE.
> Details to follow in comments...