[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

Chris M. Hostetter (Jira) Tue, 17 Jun 2025 11:14:05 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980470#comment-17980470
 ]


Chris M. Hostetter commented on SOLR-17764:
-------------------------------------------

{quote}• JettySolrRunner.stop() short-circuits the normal Jetty life-cycle so 
that unit tests finish quickly
• it explicitly calls coreContainer.shutdown() before it invokes Server.stop();
{quote}
This does not seem to be true, nor was it true before SOLR-17744.
{quote}Lots of tests may never add the StatisticsHandler?
{quote}
SOLR-17744 added this to JettySolrRunner so it should be in any jetty based 
test: [https://github.com/apache/solr/commit/fe7fe7966a6]
{quote}I honestly think its some kind of regression that its not retried. These 
kinds of tests always would have been very flakey otherwise, and I seem to 
remember making shutdown throw 503 so you could retry just for this issue. 
{quote}
AFAICT:
 * *BEFORE* SOLR-17744:
 ** on shutdown, jetty immediately closed any open connections causing clients 
to get a {{SocketException}} (or maybe a {{ConnectException}} if it hasn't 
fully established the connection yet at that point in time)
 *** Solr "server" code may have thrown a 503 error on shutdown, that you would 
see in the logs, but that never made it to the client
 ** {{CloudSolrClient}} instances (like the one used in this test) 
automatically retry on any {{SocketException}} or {{ConnectException}}
 * *AFTER* SOLR-17744:
 ** on shutdown, jetty waits for requests on currently open connections to 
finish...
 *** But Solr "server" code may see that shutdown has been called, and return a 
503 exception to the client (at which point the request completes w/o any sort 
of socket/network error)
 ** {{CloudSolrClient}} instances – like the one used in this test – get the 
503 exception and *_do not automatically retry 503 errors_*

 *** {{CloudSolrClient}} _*only*_ retries on 503 errors if the error was a 
{{RouteException}}
 *** The *_only_* code path in SolrJ that will ever throw a {{RouteException}} 
is the {{directUpdate(...)}} code path – which doesn't happen in these failures 
because test randomization decided to use a CloudSolrClient that has 
{{isUpdatesToLeaders()==false}}

----
As I mentioned before...
{quote}While it would be easy to "fix" this test by forcing 
isUpdatesToLeaders()==true, I'm not sure what the best "fix" is for the 
underlying behavior in Solr/SolrJ is?
{quote}
My biggest concern is not this test. My biggest concern is the broader 
questions this test has raised about how/when/why SolrJ decides to "retry" on 
exceptions, and what exceptions trigger that retry logic.

It doesn't seem like it does us much good to have Jetty allow in-flight 
requests to finish if the Solr code that "finishes" that request throws a 503 
error that SolrJ does not recognize as a "communication error", when it _only_ 
retries on communication errors...
{code:java}
      int errorCode =
          (rootCause instanceof SolrException)
              ? ((SolrException) rootCause).code()
              : SolrException.ErrorCode.UNKNOWN.code;

      boolean wasCommError =
          (rootCause instanceof ConnectException
              || rootCause instanceof SocketException
              || wasCommError(rootCause));

      if (wasCommError
          || (exc instanceof RouteException
              && (errorCode == 503)) ...
{code}

So perhaps the "real" question to ask is:

* Why does this code care if {{(exc instanceof RouteException)}} ? ... why 
doesn't it retry on {{(wasCommError || (errorCode == 503))}} ?
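
To make that question concrete, here is a minimal, self-contained sketch that compares the two predicates. It is not SolrJ code: {{StatusException}} and {{RoutedStatusException}} are hypothetical stand-ins for {{SolrException}} and {{RouteException}}, the method names are invented for illustration, and the extra {{wasCommError(rootCause)}} helper plus whatever follows the elided {{...}} in the real condition are left out.

```java
import java.net.ConnectException;
import java.net.SocketException;

public class RetryRuleSketch {

  /** Hypothetical stand-in for SolrException: carries only an HTTP-style status code. */
  static class StatusException extends RuntimeException {
    final int code;

    StatusException(int code, String message) {
      super(message);
      this.code = code;
    }
  }

  /** Hypothetical stand-in for RouteException (thrown only by the direct-update path). */
  static class RoutedStatusException extends StatusException {
    RoutedStatusException(int code, String message) {
      super(code, message);
    }
  }

  /** Walks the cause chain to its end. */
  static Throwable rootCause(Throwable t) {
    while (t.getCause() != null && t.getCause() != t) {
      t = t.getCause();
    }
    return t;
  }

  /** Status code of the root cause, or -1 when it carries none (assumed placeholder). */
  static int errorCode(Throwable root) {
    return (root instanceof StatusException) ? ((StatusException) root).code : -1;
  }

  /** Simplified: only the two java.net checks quoted above. */
  static boolean wasCommError(Throwable root) {
    return root instanceof ConnectException || root instanceof SocketException;
  }

  /** Shape of the quoted condition: a 503 counts only when wrapped in a route exception. */
  static boolean retriesToday(Throwable exc) {
    Throwable root = rootCause(exc);
    return wasCommError(root)
        || (exc instanceof RoutedStatusException && errorCode(root) == 503);
  }

  /** The alternative asked about above: any 503 is retried. */
  static boolean retriesIfRelaxed(Throwable exc) {
    Throwable root = rootCause(exc);
    return wasCommError(root) || errorCode(root) == 503;
  }

  public static void main(String[] args) {
    Throwable connectionReset = new SocketException("connection reset");
    Throwable plain503 = new StatusException(503, "shutting down");
    Throwable routed503 = new RoutedStatusException(503, "shutting down");

    for (Throwable t : new Throwable[] {connectionReset, plain503, routed503}) {
      System.out.println(
          t.getClass().getSimpleName()
              + ": today=" + retriesToday(t)
              + ", relaxed=" + retriesIfRelaxed(t));
    }
  }
}
```

Under these assumptions the only case where the two rules disagree is the plain (non-routed) 503, which is the case a client with {{isUpdatesToLeaders()==false}} hits in these failures.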

> "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest 
> failures
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-17764
>                 URL: https://issues.apache.org/jira/browse/SOLR-17764
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: 
> E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
>
>
> Reviewing recent jenkins test failure metrics, I noticed that (Nightly) test 
> ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time 
> right around the time that SOLR-17744 was committed.
> Things I have observed:
>  * Seeds from failing runs seem to reliably reproduce the failure
>  ** These failures do *NOT* reproduce if I revert to just before SOLR-17744
>  * Ad-hoc testing I've done of seeds that do _not_ fail on first attempt seem 
> to reliably succeed on all subsequent attempts
>  ** Suggesting that the root cause is something deterministic in the 
> {{random()}}-ness of the test, and not something dependent on timing or 
> concurrency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

