[ 
https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882245#comment-16882245
 ] 

Hoss Man commented on SOLR-13616:
---------------------------------

{quote}I'm not sure we should change the waitForState logic to rethrow 
Exceptions or revert back PrepRecoveryOp to its previous version ...
{quote}
{quote}Hoss and Dat – thank you for investigating this! All usages of 
CollectionStateWatcher or LiveNodesWatcher will suffer from this problem i.e. 
the thread that runs the watcher swallows the exception ...
{quote}
Well, generally speaking there isn't any way (i can think of) for the thread 
executing a Watcher to do anything _but_ swallow any exceptions from the 
watcher – it can't propagate them back to the "caller" of registerWatcher or 
anything like that ... if the caller wanted to be informed then the Watcher it 
registered should be catching the exceptions itself.

But to Dat's point: in the specific case of {{waitForState}} – there, 
ZkStateReader *is* creating its own Watcher to wrap the input Predicate, and 
we could in fact make waitForState do something inside that Watcher that 
catches any Exception thrown by the Predicate and short-circuits out of the 
{{waitForState}} call, wrapping/re-throwing the exception in the meantime.
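
A rough sketch of that idea, in plain Java rather than against the real ZkStateReader: {{StateHolder}}, its {{Watcher}} interface, and the "fire once on registration" behavior are all made-up stand-ins for illustration, not Solr APIs. The wrapping Watcher catches whatever the Predicate throws, releases the waiting thread, and {{waitForState}} re-throws it wrapped.
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

/** Hypothetical stand-in for a state reader; NOT the real ZkStateReader. */
class StateHolder<S> {
  /** A watcher returns true once it is finished and should be removed. */
  interface Watcher<T> {
    boolean onStateChanged(T state);
  }

  private final List<Watcher<S>> watchers = new CopyOnWriteArrayList<>();
  private volatile S state;

  StateHolder(S initial) {
    this.state = initial;
  }

  /** Assumption: a new watcher is fired once with the current state. */
  void registerWatcher(Watcher<S> w) {
    watchers.add(w);
    fire(w, state);
  }

  /** Runs on the notifier thread, so every registered watcher is fired here. */
  void setState(S newState) {
    state = newState;
    for (Watcher<S> w : watchers) {
      fire(w, newState);
    }
  }

  private void fire(Watcher<S> w, S s) {
    try {
      if (w.onStateChanged(s)) {
        watchers.remove(w);
      }
    } catch (RuntimeException e) {
      // There is no caller to propagate to here, so the exception is swallowed.
    }
  }

  /** Waits for the predicate, but surfaces any exception the predicate throws. */
  void waitForState(long timeout, TimeUnit unit, Predicate<S> predicate)
      throws InterruptedException, TimeoutException {
    CountDownLatch done = new CountDownLatch(1);
    AtomicReference<RuntimeException> failure = new AtomicReference<>();
    Watcher<S> w = s -> {
      try {
        if (!predicate.test(s)) {
          return false; // keep watching
        }
      } catch (RuntimeException e) {
        failure.set(e); // remember it for the waiting thread
      }
      done.countDown(); // release the waiter on success *or* failure
      return true;
    };
    registerWatcher(w);
    try {
      if (!done.await(timeout, unit)) {
        throw new TimeoutException("state predicate not satisfied in time");
      }
      RuntimeException e = failure.get();
      if (e != null) {
        throw new IllegalStateException("state predicate threw", e);
      }
    } finally {
      watchers.remove(w);
    }
  }
}
{code}
With something like this, a Predicate that throws makes {{waitForState}} fail right away with the original exception as the cause, instead of leaving the caller to sit there until the timeout.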

But those seem like "broader" problems with regards to where/how the different 
callers are using the Watcher/waitForState APIs that we should probably create 
a new issue to track (for auditing all of them and clarifying the behavior in 
the javadocs) ... frankly i think in this specific jira we should be asking a 
lot more questions about the _specific_ predicate used in PrepForRecovery's 
waitForState call ... notably what exactly is the expectation here when the 
SolrCore (that prepRecovery wants to recover from) can't be found _in the local 
CoreContainer_ ... deleting the collection is just one example, are there other 
situations where the core may not be found at this point in the code? (node 
shutdown perhaps? autoscaling removing a replica) ?

what about a few lines later...
{code:java}
          if (onlyIfLeader != null && onlyIfLeader) {
            if (!core.getCoreDescriptor().getCloudDescriptor().isLeader()) {
              throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "We are not the leader");
            }
          }
{code}
...even if the SolrCore is found, if we expect it to be the shard leader, and 
it's not (what if there has been a leader election in the meantime?) then that's 
another type of problem that will also cause the predicate to throw an 
exception that will (apparently) cause PrepRecovery to stall. what should 
PrepRecovery do here?

i suspect that in general the use of waitForState here in PrepRecoveryOp is "ok 
in concept" ... we just need to make the predicate smarter about exiting 
immediately in these situations instead of throwing an exception that gets 
swallowed ... i'm just not sure what the right behavior for PrepRecovery *is* 
in these situations.
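
To make "exiting immediately instead of throwing" a bit more concrete, here is one possible shape for such a predicate. It is only a sketch: {{PrepRecoveryCheck}}, {{Snapshot}} and {{Outcome}} are hypothetical names, the three boolean fields are simplified stand-ins for whatever the real predicate inspects, and it takes no position on what PrepRecovery should actually do with each outcome.
{code:java}
import java.util.function.Predicate;

/** Hypothetical predicate that stops waiting instead of throwing. */
class PrepRecoveryCheck implements Predicate<PrepRecoveryCheck.Snapshot> {
  enum Outcome { STILL_WAITING, READY, CORE_MISSING, NOT_LEADER }

  /** Simplified, made-up view of what the predicate would look at. */
  static final class Snapshot {
    final boolean coreFound;
    final boolean isLeader;
    final boolean replicaInExpectedState;

    Snapshot(boolean coreFound, boolean isLeader, boolean replicaInExpectedState) {
      this.coreFound = coreFound;
      this.isLeader = isLeader;
      this.replicaInExpectedState = replicaInExpectedState;
    }
  }

  private final boolean onlyIfLeader;
  private volatile Outcome outcome = Outcome.STILL_WAITING;

  PrepRecoveryCheck(boolean onlyIfLeader) {
    this.onlyIfLeader = onlyIfLeader;
  }

  @Override
  public boolean test(Snapshot s) {
    if (!s.coreFound) {
      return stop(Outcome.CORE_MISSING); // e.g. the collection was deleted
    }
    if (onlyIfLeader && !s.isLeader) {
      return stop(Outcome.NOT_LEADER); // e.g. a leader election happened
    }
    if (s.replicaInExpectedState) {
      return stop(Outcome.READY);
    }
    return false; // nothing conclusive yet, keep waiting
  }

  /** Records why we stopped; returning true ends the wait. */
  private boolean stop(Outcome o) {
    outcome = o;
    return true;
  }

  Outcome outcome() {
    return outcome;
  }
}
{code}
The caller would check {{outcome()}} after the wait returns and decide there, on its own thread, whether to fail the request, retry, or give up quietly.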
----
I don't suppose either of you were able to spot what's "wrong" with my test 
that it doesn't force a failure in this situation?

> Possible racecondition/deadlock between collection DELETE and PrepRecovery ? 
> (TestPolicyCloud failures)
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13616
>                 URL: https://issues.apache.org/jira/browse/SOLR-13616
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13616.test-incomplete.patch, 
> thetaphi_Lucene-Solr-master-Linux_24358.log.txt
>
>
> Based on some recent jenkins failures in TestPolicyCloud, I suspect there is 
> a possible deadlock condition when attempting to delete a collection while 
> recovery is in progress.
> I haven't been able to identify exactly where/why/how the problem occurs, but 
> it does not appear to be a test specific problem, and seems like it could 
> potentially affect anyone unlucky enough to issue poorly timed DELETE.
> Details to follow in comments...



