[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

Chris M. Hostetter (Jira) Wed, 18 Jun 2025 10:08:12 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17982750#comment-17982750
 ]


Chris M. Hostetter commented on SOLR-17764:
-------------------------------------------

{quote}If Jetty in tests is no longer does a hard shutdown and is doing 
graceful shutdowns, those 503s mean graceful shutdown does not work.

If you think 503 did not ever cause retries, those chaos monkey tests would 
have shown this same fail all the time previously.

The main thing is, these two things can't be true if you see the 503 exception:
 * Jetty shuts down gracefully in tests
 * graceful shutdown works properly{quote}
Your claims here don't make sense to me logically – and they don't match the 
reality you can see in the logs when running tests like this one.

Before we added Jetty's graceful shutdown module, stopping a jetty instance 
would cause jetty to immediately terminate any open connections. Solr server 
code _may/could_ throw a SolrException with a 503 status if it noticed 
{{cc..isShutDown()==true}} while handling a request, but the client would never 
get it that 503 status code, because the client would get a connection error 
(not an HTTP response).

You can see this by running tests like 
ChaosMonkeySafeLeaderWithPullReplicasTest, using the seed mentioned above, 
against GIT SHA {{d0d71a7b38f~}} (ie: one commit before the graceful module was 
added to main) and looking at the logs from the successful test...
{noformat}
hossman@slate:~/lucene/solr [j21] [7288eb38863] $ ./gradlew clean test 
--max-workers=1 --tests ChaosMonkeySafeLeaderWithPullReplicasTest.test 
-Dtests.seed=E7F93005B9386058 -Dtests.multiplier=1 -Dtests.nightly=true 
-Dtests.locale=fr-SN -Dtests.timezone=America/Anguilla -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8 -Ptests.verbose=true &> tmp.txt
hossman@slate:~/lucene/solr [j21] [7288eb38863] $ grep 'Request to collection ' 
tmp.txt
  2> 20491 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(0) org.apache.http.NoHttpResponseException: 127.0.0.1:43553 failed to respond, 
retry=0 maxRetries=5 commError=true errorCode=0 - retrying
  2> 24586 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(0) org.apache.http.NoHttpResponseException: 127.0.0.1:38055 failed to respond, 
retry=0 maxRetries=5 commError=true errorCode=0 - retrying
  2> 25106 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(0) org.apache.http.NoHttpResponseException: 127.0.0.1:38439 failed to respond, 
retry=0 maxRetries=5 commError=true errorCode=0 - retrying
{noformat}
...the CloudSolrClient retry logic does *NOT* get an HTTP response from Solr 
with a 503 error code, it gets a NoHttpResponseException from the Apache 
HttpClient which it considers a "commError" so it retries.

(IIRC tests using CloudHttp2SolrClient in a similar situation would get a 
SocketException in this case, which would also be considered a "commError" and 
retried)

The *ONLY* way the CloudSolrClient can get a 503 error (thrown by Solr server 
code) from a node that is being shutdown, is _because_ Jetty's graceful module 
is allowing the solr code to finish handling an in flight request. And when/if 
Solr sends that 503, the client (in this test, with this seed) does *NOT* retry 
– which can be seen by running the same seed above against GIT SHA 
{{d0d71a7b38f}} (when the graceful module was added to main)...
{noformat}
hossman@slate:~/lucene/solr [j21] [d0d71a7b38f] $ ./gradlew clean test 
--max-workers=1 --tests ChaosMonkeySafeLeaderWithPullReplicasTest.test 
-Dtests.seed=E7F93005B9386058 -Dtests.multiplier=1 -Dtests.nightly=true 
-Dtests.locale=fr-SN -Dtests.timezone=America/Anguilla -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8 -Ptests.verbose=true &> tmp.txt
hossman@slate:~/lucene/solr [j21] [d0d71a7b38f] $ grep 'Request to collection 
\|errorCode' tmp.txt | head -4
  2> 19272 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(503) org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from 
server at http://127.0.0.1:43665/solr: Expected mime type in 
[application/octet-stream, application/vnd.apache.solr.javabin] but got 
text/html. <html>
  2> , retry=0 maxRetries=5 commError=false errorCode=503 
  2> 19282 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(503) org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from 
server at http://127.0.0.1:43665/solr: Expected mime type in 
[application/octet-stream, application/vnd.apache.solr.javabin] but got 
text/html. <html>
  2> , retry=0 maxRetries=5 commError=false errorCode=503 
{noformat}
...with the graceful module in place, the client _does_ get a 
RemoteSolrException with status=503, but as you can see from the tail end of 
the log message it is not {{retrying}} (which is why the test ultimately fails)
----
{quote}Tests do not test graceful shutdown. They don't do a graceful shutdown. 
They do a hard interrupt driven stop, just as they have in one way or another 
for over a decade.
{quote}
I feel like we're going in circles here?

As I said before (when you commented "Lots of tests may never add the 
StatisticsHandler") ...
{quote}SOLR-17744 added this to JettySolrRunner so it should be in any jetty 
based test: [https://github.com/apache/solr/commit/fe7fe7966a6]
{quote}
...when a test "stops" a JettySolrRunner, the jetty graceful module *DOES* 
allow Solr to finish handling in flight requests, while rejecting new requests.
----
{quote}You can see retry code looking for 503 right here in the cloud client:
{code:java}
      if (wasCommError
          || (exc instanceof RouteException
              && (errorCode == 503)) 
{code}
{quote}
Except that you seem to be blindly ignoring 1/3rd of the conditional ! ! !

That if statement doe *NOT* say {{(wasCommError OR 503)}} it says 
{{(wasCommError OR (RouteException AND 503))}} – and I've already discussed, in 
depth, that RouteException is only used in a very limited circumstances – and 
does not come into play in this test...
{quote}The only code path in SolrJ that will ever throw a RouteException is the 
directUpdate(...) code path – which doesn't happen in these failures because 
test randomization decided to use a CloudSolrClient that has 
isUpdatesToLeaders()==false
{quote}

> "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest 
> failures
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-17764
>                 URL: https://issues.apache.org/jira/browse/SOLR-17764
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: 
> E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
>
>
> Reviewing recent jenkins test failure metrics, I noticed that (Nightly) test 
> ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time 
> right around the time that SOLR-17744 was committed.
> Things i have observed:
>  * Seeds from failing runs seem to reliably reproduce the failure
>  ** These failures do *NOT* reproduce if i revert to just before SOLR-17744
>  * Ad-hoc testing I've done of seeds that do _not_ fail on first attempt seem 
> to reliably succeed on all subsequent attempts
>  ** Suggesting that the root cause is something deterministic in the 
> {{{}random(){}}}-ness of the test, and not something dependent on timing or 
> concurrency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

Reply via email to