[
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980470#comment-17980470
]
Chris M. Hostetter commented on SOLR-17764:
-------------------------------------------
{quote}• JettySolrRunner.stop() short-circuits the normal Jetty life-cycle so
that unit tests finish quickly
• it explicitly calls coreContainer.shutdown() before it invokes Server.stop();
{quote}
This does not seem to be true, nor was it true before SOLR-17744.
{quote}Lots of tests may never add the StatisticsHandler?
{quote}
SOLR-17744 added this to JettySolrRunner, so it should be present in any
Jetty-based test: [https://github.com/apache/solr/commit/fe7fe7966a6]
{quote}I honestly think its some kind of regression that its not retried. These
kinds of tests always would have been very flakey otherwise, and I seem to
remember making shutdown throw 503 so you could retry just for this issue.
{quote}
AFAICT:
* *BEFORE* SOLR-17744:
** on shutdown, Jetty immediately closed any open connections, causing clients
to get a {{SocketException}} (or maybe a {{ConnectException}} if the connection
had not been fully established at that point in time)
*** Solr "server" code may have thrown a 503 error on shutdown, which you would
see in the logs, but that error never made it to the client
** {{CloudSolrClient}} instances (like the one used in this test)
automatically retry on all {{SocketException}} (and all {{ConnectException}})
errors
* *AFTER* SOLR-17744:
** on shutdown, Jetty waits for requests on currently open connections to
finish...
*** But Solr "server" code may see that shutdown has been called and return a
503 error to the client (at which point the request completes without any sort
of socket/network error)
** {{CloudSolrClient}} instances (like the one used in this test) get the
503 error and *_do not automatically retry 503 errors_*
*** {{CloudSolrClient}} _*only*_ retries on 503 errors if the error was a
{{RouteException}}
*** The *_only_* code path in SolrJ that will ever throw a {{RouteException}}
is the {{directUpdate(...)}} code path, which is not taken in these failures
because test randomization chose a {{CloudSolrClient}} that has
{{isUpdatesToLeaders()==false}}
----
As I mentioned before...
{quote}While it would be easy to "fix" this test by forcing
isUpdatesToLeaders()==true, I'm not sure what the best "fix" for the
underlying behavior in Solr/SolrJ is.
{quote}
My biggest concern is not this test, but the broader questions it has raised
about how/when/why SolrJ decides to "retry" on exceptions, and which
exceptions trigger that retry logic.
It doesn't seem to do us much good to have Jetty allow in-flight requests to
finish if the Solr code that "finishes" that request throws a 503 error that
SolrJ does not recognize as a "communication error", given that it _only_
retries on communication errors...
{code:java}
int errorCode =
    (rootCause instanceof SolrException)
        ? ((SolrException) rootCause).code()
        : SolrException.ErrorCode.UNKNOWN.code;
boolean wasCommError =
    (rootCause instanceof ConnectException
        || rootCause instanceof SocketException
        || wasCommError(rootCause));
if (wasCommError
    || (exc instanceof RouteException
        && (errorCode == 503)) ...
{code}
So perhaps the "real" question to ask is:
* Why does this code care whether {{(exc instanceof RouteException)}}? Why
doesn't it simply retry on {{(wasCommError || (errorCode == 503))}}?
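To make the difference between the two conditions concrete, here is a minimal, self-contained sketch. It is not SolrJ code: {{StatusException}} and {{RouteStatusException}} are hypothetical stand-ins for {{SolrException}} and {{RouteException}}, the unknown error code is assumed to be 0, the extra {{wasCommError(rootCause)}} helper check is left out, and only the part of the condition quoted above is modeled (whatever follows the "..." is ignored).

```java
import java.net.ConnectException;
import java.net.SocketException;

final class RetryDecisionSketch {

  /** Hypothetical stand-in for SolrException: an error that carries a status code. */
  static class StatusException extends RuntimeException {
    final int code;

    StatusException(int code, String message) {
      super(message);
      this.code = code;
    }
  }

  /** Hypothetical stand-in for RouteException (thrown only by the direct-update path). */
  static class RouteStatusException extends StatusException {
    RouteStatusException(int code, String message) {
      super(code, message);
    }
  }

  /** Walks the cause chain to the innermost exception. */
  static Throwable rootCause(Throwable t) {
    Throwable current = t;
    while (current.getCause() != null && current.getCause() != current) {
      current = current.getCause();
    }
    return current;
  }

  /** Status code of the root cause, or 0 (assumed "unknown") if it carries none. */
  static int errorCode(Throwable rootCause) {
    return (rootCause instanceof StatusException) ? ((StatusException) rootCause).code : 0;
  }

  /** Socket-level failures only; the additional helper check in the real code is omitted. */
  static boolean isCommError(Throwable rootCause) {
    return rootCause instanceof ConnectException || rootCause instanceof SocketException;
  }

  /** Models the quoted condition: a 503 is retried only when it arrives as a route error. */
  static boolean currentRule(Throwable exc) {
    Throwable root = rootCause(exc);
    return isCommError(root) || (exc instanceof RouteStatusException && errorCode(root) == 503);
  }

  /** Models the alternative raised in the question: retry on any 503. */
  static boolean proposedRule(Throwable exc) {
    Throwable root = rootCause(exc);
    return isCommError(root) || errorCode(root) == 503;
  }

  public static void main(String[] args) {
    // "Before" behavior: the connection is closed underneath the client.
    Throwable socketClosed = new RuntimeException(new SocketException("connection reset"));
    // "After" behavior: the request completes with a 503 outside the direct-update path.
    Throwable plain503 = new StatusException(503, "shutting down");
    // A 503 surfaced through the direct-update path.
    Throwable routed503 = new RouteStatusException(503, "shutting down");

    for (Throwable t : new Throwable[] {socketClosed, plain503, routed503}) {
      System.out.println(
          t.getClass().getSimpleName()
              + " current=" + currentRule(t)
              + " proposed=" + proposedRule(t));
    }
  }
}
```

Under these assumptions the two rules agree on the socket failure and on the routed 503; they differ only for the plain 503, which is the case these test failures hit.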
> "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest
> failures
> -----------------------------------------------------------------------------------
>
> Key: SOLR-17764
> URL: https://issues.apache.org/jira/browse/SOLR-17764
> Project: Solr
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments:
> E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
>
>
> Reviewing recent Jenkins test failure metrics, I noticed that (Nightly) test
> ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time
> right around the time that SOLR-17744 was committed.
> Things I have observed:
> * Seeds from failing runs seem to reliably reproduce the failure
> ** These failures do *NOT* reproduce if I revert to just before SOLR-17744
> * In ad-hoc testing, seeds that do _not_ fail on the first attempt seem
> to reliably succeed on all subsequent attempts
> ** Suggesting that the root cause is something deterministic in the
> {{random()}}-ness of the test, and not something dependent on timing or
> concurrency.