[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-07-15 Thread Houston Putman (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007347#comment-18007347
 ] 

Houston Putman commented on SOLR-17764:
---

I haven't had time to get into this yet, but it will be complex to get it 
working correctly in Solr. I just want to add that this is also causing the 
failures for LeaderTragicEventTest that we are seeing.

> "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest 
> failures
> ---
>
> Key: SOLR-17764
> URL: https://issues.apache.org/jira/browse/SOLR-17764
> Project: Solr
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments: 
> E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
>
>
> Reviewing recent Jenkins test failure metrics, I noticed that (Nightly) test 
> ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time 
> right around the time that SOLR-17744 was committed.
> Things I have observed:
>  * Seeds from failing runs seem to reliably reproduce the failure
>  ** These failures do *NOT* reproduce if I revert to just before SOLR-17744
>  * In ad-hoc testing, seeds that do _not_ fail on the first attempt seem 
> to reliably succeed on all subsequent attempts
>  ** This suggests that the root cause is something deterministic in the 
> {{random()}}-ness of the test, and not something dependent on timing or 
> concurrency.
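
The observation that a failing seed keeps failing is what one would expect if a test-time decision is derived purely from the seed. The following is a minimal sketch of that idea only. The class and method names are hypothetical, and it uses plain `java.util.Random` as a stand-in; it is not how the Solr test framework actually derives its random choices.

```java
import java.util.Random;

final class SeedDeterminismSketch {
  // Hypothetical stand-in for one randomized test decision, for example
  // whether the client under test sends updates directly to shard leaders.
  static boolean pickSendsUpdatesToLeaders(long seed) {
    return new Random(seed).nextBoolean();
  }

  public static void main(String[] args) {
    // The seed string from the attachment name, parsed as an unsigned hex long.
    long seed = Long.parseUnsignedLong("E7F93005B9386058", 16);
    boolean first = pickSendsUpdatesToLeaders(seed);
    boolean second = pickSendsUpdatesToLeaders(seed);
    // The same seed always yields the same decision, so this prints "true".
    System.out.println("same decision on both runs: " + (first == second));
  }
}
```

A failure that depends only on such a decision reproduces on every run with that seed, which is different from a failure that depends on thread timing.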



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986925#comment-17986925
 ] 

ASF subversion and git services commented on SOLR-17764:


Commit 0fda63b788bbd232039f079d1ccd3ac452b29b85 in solr's branch 
refs/heads/fix-native-access-warning from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=0fda63b788b ]

SOLR-17744: Change default SOLR_JETTY_GRACEFUL value to false

Based on test failures noted in SOLR-17764, the implications of what this might 
mean for some SolrJ users, and a lack of consensus on how to move forward, I'm 
changing this default to minimize the risk of impact on users when upgrading





[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985681#comment-17985681
 ] 

ASF subversion and git services commented on SOLR-17764:


Commit 0fda63b788bbd232039f079d1ccd3ac452b29b85 in solr's branch 
refs/heads/main from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=0fda63b788b ]

SOLR-17744: Change default SOLR_JETTY_GRACEFUL value to false

Based on test failures noted in SOLR-17764, the implications of what this might 
mean for some SolrJ users, and a lack of consensus on how to move forward, I'm 
changing this default to minimize the risk of impact on users when upgrading





[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-23 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985687#comment-17985687
 ] 

Chris M. Hostetter commented on SOLR-17764:
---

As alluded to in the above commit messages...

Since it sounds like a 9.9 release is likely to happen this week, and we have 
not yet been able to agree on what SolrJ is currently doing (let alone have a 
discussion about what it _should_ be doing), I changed the default value of 
{{SOLR_JETTY_GRACEFUL}} in {{bin/solr}} to false, out of an abundance of 
caution to avoid impacting existing users.

I did *NOT* revert any of the changes to JettySolrRunner, in the hopes that we 
can move _forward_ with a discussion of the nature of these failures and how to 
improve Solr/SolrJ.




[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985684#comment-17985684
 ] 

ASF subversion and git services commented on SOLR-17764:


Commit 35db065e2a987f37e037aef0c8efda23d7ee3cb7 in solr's branch 
refs/heads/branch_9x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=35db065e2a9 ]

SOLR-17744: Change default SOLR_JETTY_GRACEFUL value to false

Based on test failures noted in SOLR-17764, the implications of what this might 
mean for some SolrJ users, and a lack of consensus on how to move forward, I'm 
changing this default to minimize the risk of impact on users when upgrading

(cherry picked from commit 0fda63b788bbd232039f079d1ccd3ac452b29b85)





[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-18 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17982750#comment-17982750
 ] 

Chris M. Hostetter commented on SOLR-17764:
---

{quote}If Jetty in tests no longer does a hard shutdown and is doing 
graceful shutdowns, those 503s mean graceful shutdown does not work.

If you think 503 did not ever cause retries, those chaos monkey tests would 
have shown this same fail all the time previously.

The main thing is, these two things can't be true if you see the 503 exception:
 * Jetty shuts down gracefully in tests
 * graceful shutdown works properly{quote}
Your claims here don't make sense to me logically – and they don't match the 
reality you can see in the logs when running tests like this one.

Before we added Jetty's graceful shutdown module, stopping a jetty instance 
would cause jetty to immediately terminate any open connections. Solr server 
code _may/could_ throw a SolrException with a 503 status if it noticed 
{{cc.isShutDown()==true}} while handling a request, but the client would never 
get that 503 status code, because the client would get a connection error 
(not an HTTP response).

You can see this by running tests like 
ChaosMonkeySafeLeaderWithPullReplicasTest, using the seed mentioned above, 
against GIT SHA {{d0d71a7b38f~}} (ie: one commit before the graceful module was 
added to main) and looking at the logs from the successful test...
{noformat}
hossman@slate:~/lucene/solr [j21] [7288eb38863] $ ./gradlew clean test 
--max-workers=1 --tests ChaosMonkeySafeLeaderWithPullReplicasTest.test 
-Dtests.seed=E7F93005B9386058 -Dtests.multiplier=1 -Dtests.nightly=true 
-Dtests.locale=fr-SN -Dtests.timezone=America/Anguilla -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8 -Ptests.verbose=true &> tmp.txt
hossman@slate:~/lucene/solr [j21] [7288eb38863] $ grep 'Request to collection ' 
tmp.txt
  2> 20491 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(0) org.apache.http.NoHttpResponseException: 127.0.0.1:43553 failed to respond, 
retry=0 maxRetries=5 commError=true errorCode=0 - retrying
  2> 24586 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(0) org.apache.http.NoHttpResponseException: 127.0.0.1:38055 failed to respond, 
retry=0 maxRetries=5 commError=true errorCode=0 - retrying
  2> 25106 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(0) org.apache.http.NoHttpResponseException: 127.0.0.1:38439 failed to respond, 
retry=0 maxRetries=5 commError=true errorCode=0 - retrying
{noformat}
...the CloudSolrClient retry logic does *NOT* get an HTTP response from Solr 
with a 503 error code, it gets a NoHttpResponseException from the Apache 
HttpClient which it considers a "commError" so it retries.

(IIRC tests using CloudHttp2SolrClient in a similar situation would get a 
SocketException in this case, which would also be considered a "commError" and 
retried)
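
The distinction drawn above, between a dropped connection and an HTTP 503 that actually reaches the client, can be sketched as a small simulation. Everything in it is an assumption made for illustration: `NoResponseException`, `HttpStatusException`, `clientSees` and `isCommError` are hypothetical stand-ins, not SolrJ, Apache HttpClient or Jetty classes.

```java
import java.io.IOException;
import java.net.SocketException;

final class ShutdownOutcomeSketch {
  /** Hypothetical stand-in for a client-side "server failed to respond" failure. */
  static final class NoResponseException extends IOException {
    NoResponseException(String msg) {
      super(msg);
    }
  }

  /** Hypothetical stand-in for an HTTP error response that reached the client. */
  static final class HttpStatusException extends Exception {
    final int status;

    HttpStatusException(int status) {
      super("HTTP " + status);
      this.status = status;
    }
  }

  /** Simulates what a client sees for an in-flight request when a node stops. */
  static Exception clientSees(boolean gracefulStop) {
    if (gracefulStop) {
      // The connection stays open long enough for the server code to answer,
      // and that answer is the 503 it produces while shutting down.
      return new HttpStatusException(503);
    }
    // The connection is torn down first, so no HTTP response is ever delivered.
    return new NoResponseException("failed to respond");
  }

  /** Transport-level failures only; an HTTP status is not one of them. */
  static boolean isCommError(Exception e) {
    return e instanceof NoResponseException || e instanceof SocketException;
  }

  public static void main(String[] args) {
    for (boolean graceful : new boolean[] {false, true}) {
      Exception e = clientSees(graceful);
      System.out.println(
          "graceful=" + graceful
              + " -> " + e.getClass().getSimpleName()
              + ", commError=" + isCommError(e));
    }
  }
}
```

In this model only the non-graceful stop produces something a retry-on-communication-error rule would act on, which matches the `commError=true` lines in the log above.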

The *ONLY* way the CloudSolrClient can get a 503 error (thrown by Solr server 
code) from a node that is being shut down is _because_ Jetty's graceful module 
is allowing the Solr code to finish handling an in-flight request. And when/if 
Solr sends that 503, the client (in this test, with this seed) does *NOT* retry 
– which can be seen by running the same seed above against GIT SHA 
{{d0d71a7b38f}} (when the graceful module was added to main)...
{noformat}
hossman@slate:~/lucene/solr [j21] [d0d71a7b38f] $ ./gradlew clean test 
--max-workers=1 --tests ChaosMonkeySafeLeaderWithPullReplicasTest.test 
-Dtests.seed=E7F93005B9386058 -Dtests.multiplier=1 -Dtests.nightly=true 
-Dtests.locale=fr-SN -Dtests.timezone=America/Anguilla -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8 -Ptests.verbose=true &> tmp.txt
hossman@slate:~/lucene/solr [j21] [d0d71a7b38f] $ grep 'Request to collection 
\|errorCode' tmp.txt | head -4
  2> 19272 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(503) org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from 
server at http://127.0.0.1:43665/solr: Expected mime type in 
[application/octet-stream, application/vnd.apache.solr.javabin] but got 
text/html. 
  2> , retry=0 maxRetries=5 commError=false errorCode=503 
  2> 19282 INFO  (StoppableIndexingThread) [n: c: s: r: x: t:] 
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to 
(503) org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from 
server at http://127.0.0.1:43665/solr: Expected mime type in 
[application/octet-stream, application/vnd.apache.solr.javabin] but got 
text/html. 
  2> , retry=0 maxRetries=5 commError=false errorCode=503 
{noformat}

[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-17 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980492#comment-17980492
 ] 

Mark Robert Miller commented on SOLR-17764:
---

You can see retry code looking for 503 right here in the cloud client:


{noformat}
  if (wasCommError
      || (exc instanceof RouteException
          && (errorCode == 503)) ...
{noformat}

If it's not retrying on 503 on some code path, it's just a bug.





[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-17 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980488#comment-17980488
 ] 

Mark Robert Miller commented on SOLR-17764:
---

Never mind, the default stop timeout for QueuedThreadPool is 5000, for Server 
it's 0. That comment is still correct.

Tests do not test graceful shutdown. They don't do a graceful shutdown. They do 
a hard, interrupt-driven stop, just as they have in one way or another for over 
a decade.




[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-17 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980485#comment-17980485
 ] 

Mark Robert Miller commented on SOLR-17764:
---

You can see that the tests did that kind of hard shutdown here (how it was done 
varied over time):
   // stop timeout is 0, so we will interrupt right away

And that was the case, but now it appears that the comment is out of date and it 
uses the default 5 second grace period. Unless your graceful shutdown made 
things even worse than a 5 second wait, it doesn't make any sense that it would 
correlate with this failure.

{noformat}
// Do not let Jetty/Solr pollute the MDC for this thread
Map<String, String> prevContext = MDC.getCopyOfContextMap();
MDC.clear();
try {
  QueuedThreadPool qtp = (QueuedThreadPool) server.getThreadPool();
  ReservedThreadExecutor rte = qtp.getBean(ReservedThreadExecutor.class);

  server.stop();

  // stop timeout is 0, so we will interrupt right away
  while (!qtp.isStopped()) {
    qtp.stop();
    if (qtp.isStopped()) {
      Thread.sleep(50);
    }
  }

  // we tried to kill everything, now we wait for executor to stop
  qtp.setStopTimeout(Integer.MAX_VALUE);
  qtp.stop();
  qtp.join();
{noformat}





[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-17 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980482#comment-17980482
 ] 

Mark Robert Miller commented on SOLR-17764:
---

“these two things can't be true”

Well, I suppose, this could be after a graceful shutdown timeout. But it would 
have to be a pretty darn short timeout if a bunch of indexing requests or 
queries can't finish.

Anyway, these tests (safe and leader killing chaos monkey tests) fire requests 
at core containers that are shutting down all day. They would have had to 
explicitly not care about a 503 or retry on it. I don't know what they do now, 
but I don't buy any historical analysis that says otherwise. 




[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-17 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980479#comment-17980479
 ] 

Mark Robert Miller commented on SOLR-17764:
---

Couldn't tell you. 

If Jetty in tests no longer does a hard shutdown and is doing graceful 
shutdowns, those 503s mean graceful shutdown does not work. 

If you think 503 did not ever cause retries, those chaos monkey tests would 
have shown this same fail all the time previously. 

The main thing is, these two things can't be true if you see the 503 exception:

* Jetty shuts down gracefully in tests
* graceful shutdown works properly

Whether you would retry on 503 is a no-brainer. Of course you would. 




[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-17 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980470#comment-17980470
 ] 

Chris M. Hostetter commented on SOLR-17764:
---

{quote}• JettySolrRunner.stop() short-circuits the normal Jetty life-cycle so 
that unit tests finish quickly
• it explicitly calls coreContainer.shutdown() before it invokes Server.stop();
{quote}
This does not seem to be true, nor was it true before SOLR-17744.
{quote}Lots of tests may never add the StatisticsHandler?
{quote}
SOLR-17744 added this to JettySolrRunner so it should be in any jetty based 
test: [https://github.com/apache/solr/commit/fe7fe7966a6]
{quote}I honestly think it's some kind of regression that it's not retried. These 
kinds of tests always would have been very flaky otherwise, and I seem to 
remember making shutdown throw 503 so you could retry just for this issue. 
{quote}
AFAICT:
 * *BEFORE* SOLR-17744:
 ** on shutdown, jetty immediately closed any open connections, causing clients 
to get a {{SocketException}} (or maybe a {{ConnectException}} if the connection 
had not been fully established yet at that point in time)
 *** Solr "server" code may have thrown a 503 error on shutdown, which you would 
see in the logs, but that error never made it to the client
 ** {{CloudSolrClient}} instances – like the one used in this test – 
automatically retry on all {{SocketException}} (and all {{ConnectException}}) 
errors
 * *AFTER* SOLR-17744:
 ** on shutdown, jetty waits for requests on currently open connections to 
finish...
 *** But Solr "server" code may see that shutdown has been called, and return a 
503 error to the client (at which point the request completes w/o any sort 
of socket/network error)
 ** {{CloudSolrClient}} instances – like the one used in this test – get the 
503 error and *_do not automatically retry 503 errors_*
 *** {{CloudSolrClient}} _*only*_ retries on 503 errors if the error was a 
{{RouteException}}
 *** The *_only_* code path in SolrJ that will ever throw a {{RouteException}} 
is the {{directUpdate(...)}} code path – which doesn't happen in these failures 
because test randomization decided to use a CloudSolrClient that has 
{{isUpdatesToLeaders()==false}}


As I mentioned before...
{quote}While it would be easy to "fix" this test by forcing 
isUpdatesToLeaders()==true, I'm not sure what the best "fix" for the 
underlying behavior in Solr/SolrJ is?
{quote}
My biggest concern is not this test. My biggest concern is the broader 
questions this test has raised about how/when/why SolrJ decides to "retry" on 
exceptions, and what exceptions trigger that retry logic.

It doesn't seem like it does us much good to have jetty allow in-flight 
requests to finish if the Solr code that "finishes" that request throws a 503 
error that SolrJ does not recognize as a "communication error" – and it _only_ 
retries on communication errors...
{code:java}
  int errorCode =
      (rootCause instanceof SolrException)
          ? ((SolrException) rootCause).code()
          : SolrException.ErrorCode.UNKNOWN.code;

  boolean wasCommError =
      (rootCause instanceof ConnectException
          || rootCause instanceof SocketException
          || wasCommError(rootCause));

  if (wasCommError
      || (exc instanceof RouteException
          && (errorCode == 503)) ...
{code}

So perhaps the "real" question to ask is:

* Why does this code care if {{(exc instanceof RouteException)}} ? ... why 
doesn't it retry on {{(wasCommError || (errorCode == 503))}} ?
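
To make that question concrete, here is a minimal sketch that puts the quoted condition next to the alternative and evaluates both for the case seen in the failing logs ({{commError=false errorCode=503}}, no {{RouteException}}). The class and method names are hypothetical, and the booleans stand in for the real `instanceof` checks; this is not SolrJ code.

```java
final class RetryPredicateSketch {
  /** The condition as quoted from the client code above. */
  static boolean currentRule(boolean wasCommError, boolean isRouteException, int errorCode) {
    return wasCommError || (isRouteException && errorCode == 503);
  }

  /** The alternative raised in the question above. */
  static boolean proposedRule(boolean wasCommError, boolean isRouteException, int errorCode) {
    return wasCommError || errorCode == 503;
  }

  public static void main(String[] args) {
    // The failing case: a plain 503 response, not a comm error, not a RouteException.
    System.out.println("current:  " + currentRule(false, false, 503)); // prints false
    System.out.println("proposed: " + proposedRule(false, false, 503)); // prints true
  }
}
```

The two rules agree on every input except a 503 that is neither a communication error nor a {{RouteException}}, which is the case this test hits when {{isUpdatesToLeaders()==false}}.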


[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-16 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17979858#comment-17979858
 ] 

Mark Robert Miller commented on SOLR-17764:
---

I honestly think it's some kind of regression that it's not retried. These kinds 
of tests always would have been very flaky otherwise, and I seem to remember 
making shutdown throw 503 so you could retry just for this issue. 

But who knows - if jetty is still stopped how it used to be in tests, I don't 
see how this could be caused by graceful shutdown being added either, or how it 
could be a relatively recent thing to pop up. 







[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-16 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17979340#comment-17979340
 ] 

David Smiley commented on SOLR-17764:
-

Agreed that 503 is retry-able.







[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-16 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17979336#comment-17979336
 ] 

Mark Robert Miller commented on SOLR-17764:
---

I don't have the code in front of me, so maybe you changed this or it was changed, 
but from memory:

• JettySolrRunner.stop() short-circuits the normal Jetty life-cycle so 
that unit tests finish quickly
• it explicitly calls coreContainer.shutdown() before it invokes 
Server.stop()
• it sets server.setStopTimeout(0) so Jetty never blocks waiting for 
in-flight requests
• lots of tests may never add the StatisticsHandler?

In which case, tests in general would not be testing graceful shutdown and 
would expect to hit a 503, or a random failure from something being closed, 
depending on races and on how liberally that is-closed check is peppered 
through the code. 

503 should mean retry. That won't bulletproof the test if it's counting on a 
request finishing after the whole cluster or a whole shard is shut down, but it 
should be fairly bulletproof when a single instance, or all instances in a shard 
but one, are shut down. 
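
To make the "503 should mean retry" point concrete, here is a minimal sketch in plain Java. It is not SolrJ code: RetryOn503, HttpStatusException, maxAttempts and backoffMillis are hypothetical names invented for illustration, and the linear backoff is an assumption rather than anything the client is known to do.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Solr or SolrJ. */
final class RetryOn503 {

  /** Stand-in for an HTTP failure that carries a status code (assumed type). */
  static final class HttpStatusException extends Exception {
    final int code;

    HttpStatusException(int code) {
      super("HTTP " + code);
      this.code = code;
    }
  }

  /**
   * Runs the request and retries only when it fails with a 503,
   * making at most maxAttempts attempts in total.
   */
  static <T> T call(Callable<T> request, int maxAttempts, long backoffMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return request.call();
      } catch (HttpStatusException e) {
        // Any other status, or an exhausted budget, is passed to the caller.
        if (e.code != 503 || attempt >= maxAttempts) {
          throw e;
        }
        // Simple linear backoff before trying again.
        Thread.sleep(backoffMillis * attempt);
      }
    }
  }
}
```

As noted above, a retry like this only helps when some other node can still serve the request; it cannot rescue a request that needs a shard or cluster that has been shut down entirely.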







[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-16 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17979335#comment-17979335
 ] 

Mark Robert Miller commented on SOLR-17764:
---

Assuming https://issues.apache.org/jira/browse/SOLR-17744 works correctly, I’d 
then assume it's due to the difference in how tests manage Jetty vs. production?







[jira] [Commented] (SOLR-17764) "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest failures

2025-06-16 Thread Mark Robert Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17979332#comment-17979332
 ] 

Mark Robert Miller commented on SOLR-17764:
---


 * Should code in Solr that checks for shutdown and throws a 503 _stop_ doing 
that so that in-flight requests can finish?

It's just showing that your graceful shutdown is half-baked. If things are 
closing, outstanding requests are going to randomly choke. If you want a graceful 
shutdown, you can't shut down Solr until Jetty has stopped accepting new 
requests and allowed all in-flight requests to finish. 

 * Should CloudSolrClient retry on any 503?

Yes. But graceful shutdown will still be half-baked, though it may allow that 
test to skirt by. 
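
The ordering described above (stop accepting, drain in-flight requests, only then close things) can be sketched in plain Java. This is an illustration under stated assumptions, not how Jetty or Solr implement it: DrainGate, tryEnter, exit and the closeResources callback are hypothetical names, and closing after the timeout even when requests remain is a design choice made here for the sketch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical gate for illustration only; not Jetty or Solr code. */
final class DrainGate {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition idle = lock.newCondition();
  private boolean accepting = true;
  private int inFlight = 0;

  /** Called at the start of a request; returns false once shutdown has begun. */
  boolean tryEnter() {
    lock.lock();
    try {
      if (!accepting) {
        return false; // a caller would typically answer 503 here
      }
      inFlight++;
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** Called when a request that entered successfully has finished. */
  void exit() {
    lock.lock();
    try {
      if (--inFlight == 0) {
        idle.signalAll();
      }
    } finally {
      lock.unlock();
    }
  }

  /**
   * Stops accepting, waits up to the timeout for in-flight requests to
   * finish, and only then runs closeResources. Returns true if every
   * request drained before the resources were closed.
   */
  boolean shutdown(long timeout, TimeUnit unit, Runnable closeResources)
      throws InterruptedException {
    boolean drained;
    lock.lock();
    try {
      accepting = false;
      long nanos = unit.toNanos(timeout);
      while (inFlight > 0 && nanos > 0) {
        nanos = idle.awaitNanos(nanos);
      }
      drained = (inFlight == 0);
    } finally {
      lock.unlock();
    }
    closeResources.run();
    return drained;
  }
}
```

With a stop timeout of zero the drain step is effectively skipped, so a request that is still running can see its resources closed underneath it, which is the kind of random choking described above.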



