[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

Mark Robert Miller (Jira) Mon, 16 Jun 2025 17:27:33 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17979336#comment-17979336
 ]


Mark Robert Miller commented on SOLR-17764:
-------------------------------------------

I don't have the code in front of me, so maybe you changed this or it was 
changed, but from memory:

        • JettySolrRunner.stop() short-circuits the normal Jetty life-cycle so 
that unit tests finish quickly.
        • It explicitly calls coreContainer.shutdown() before it invokes 
Server.stop().
        • It sets server.setStopTimeout(0) so Jetty never blocks waiting for 
in-flight requests.
        • Lots of tests may never add the StatisticsHandler?
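
To make the ordering above concrete, here is a toy model of the window between 
closing the container and stopping the HTTP layer. It is only a sketch: 
FakeContainer, FakeServer and their methods are hypothetical stand-ins, not 
Solr or Jetty code, and the 503 mapping is an assumption taken from the 
behaviour described in this comment.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only: these classes are hypothetical stand-ins, not Solr or Jetty code.
final class ShutdownOrderSketch {

    /** Stand-in for a core container guarded by an "is closed" check. */
    static final class FakeContainer {
        private final AtomicBoolean closed = new AtomicBoolean(false);

        void shutdown() {
            closed.set(true);
        }

        /** Returns an HTTP-like status: 200 while open, 503 once closed. */
        int handle() {
            return closed.get() ? 503 : 200;
        }
    }

    /** Stand-in for the HTTP layer, which keeps accepting requests until stopped. */
    static final class FakeServer {
        private final FakeContainer container;
        private final AtomicBoolean running = new AtomicBoolean(true);

        FakeServer(FakeContainer container) {
            this.container = container;
        }

        void stop() {
            running.set(false);
        }

        int request() {
            if (!running.get()) {
                throw new IllegalStateException("server stopped");
            }
            return container.handle();
        }
    }

    /** Status seen by a request while everything is still up. */
    static int statusWhileOpen() {
        FakeServer server = new FakeServer(new FakeContainer());
        return server.request();
    }

    /** Status seen by a request that lands after the container closes but before the server stops. */
    static int statusInShutdownWindow() {
        FakeContainer container = new FakeContainer();
        FakeServer server = new FakeServer(container);
        container.shutdown();            // container goes away first
        int during = server.request();   // the server still accepts this request
        server.stop();                   // only now does the HTTP layer stop
        return during;
    }

    public static void main(String[] args) {
        System.out.println(statusWhileOpen());        // prints 200
        System.out.println(statusInShutdownWindow()); // prints 503
    }
}
```

With no grace period, any request that arrives inside that window reaches a 
closed container, which is why a test that does not retry can fail here.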

In which case, tests in general would not be testing graceful shutdown, and 
should expect to hit a 503 or some random failure from something already being 
closed, depending on races and on how heavily that is-closed check is peppered 
through the code. 

503 should mean retry: that won't bulletproof the test if it's counting on a 
request finishing after the whole cluster or a whole shard has been shut down, 
but it should be fairly bulletproof when a single instance, or all instances 
in a shard but one, get shut down. 
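
A minimal sketch of the "503 means retry" idea, assuming the caller can wrap 
one request attempt in a function that returns the HTTP status. The class 
name, sendWithRetry and its parameters are hypothetical, not part of Solr or 
SolrJ; a real test would put its actual client call inside the supplier and 
decide which node each attempt goes to.

```java
import java.util.function.IntSupplier;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class RetryOn503Sketch {

    /**
     * Runs the attempt up to maxAttempts times, retrying only while the
     * status is 503. Returns the last status seen.
     */
    static int sendWithRetry(IntSupplier attempt, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int status = attempt.getAsInt();
        for (int i = 1; i < maxAttempts && status == 503; i++) {
            try {
                Thread.sleep(backoffMillis); // fixed pause between attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return status;
            }
            status = attempt.getAsInt();
        }
        return status;
    }

    public static void main(String[] args) {
        // Simulated node that answers 503 twice, then recovers.
        final int[] statuses = {503, 503, 200};
        final int[] next = {0};
        int result = sendWithRetry(() -> statuses[next[0]++], 5, 10L);
        System.out.println(result); // prints 200
    }
}
```

As the comment says, this only helps while some instance can still answer; 
if every replica of the shard is down, the retries run out and the caller 
still sees the 503.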

> "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest 
> failures
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-17764
>                 URL: https://issues.apache.org/jira/browse/SOLR-17764
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: 
> E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
>
>
> Reviewing recent jenkins test failure metrics, I noticed that (Nightly) test 
> ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time 
> right around the time that SOLR-17744 was committed.
> Things i have observed:
>  * Seeds from failing runs seem to reliably reproduce the failure
>  ** These failures do *NOT* reproduce if i revert to just before SOLR-17744
>  * Ad-hoc testing I've done of seeds that do _not_ fail on first attempt seem 
> to reliably succeed on all subsequent attempts
>  ** Suggesting that the root cause is something deterministic in the 
> {{random()}}-ness of the test, and not something dependent on timing or 
> concurrency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

