[
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris M. Hostetter updated SOLR-17764:
--------------------------------------
Attachment:
E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
Status: Open (was: Open)
(Note all comments below relate to testing done against GIT SHA
d0d71a7b38f87f3c70551678b2eab62d925483ef)
I'm attaching an example
OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
using seed=E7F93005B9386058
While the test "officially" fails at
{{ChaosMonkeySafeLeaderWithPullReplicasTest.java:210}} this is just an
assertion that none of the {{StoppableIndexingThread}} encountered any
failures. If we search the test for the first instance of {{REQUEST FAILED}}
logged by any {{StoppableIndexingThread}} we see the following: Note the port
number of request URI in the error...
{noformat}
2> 25374 ERROR (StoppableIndexingThread) [n: c: s: r: x: t:]
o.a.s.c.StoppableIndexingThread REQUEST FAILED for id=0-14
2> => org.apache.solr.client.solrj.SolrClient$RemoteSolrException:
Error from server at http://127.0.0.1:33083/solr: Expected mime type in
[application/octet-stream, application/vnd.apache.solr.javabin] but got
text/html. <html>
2> <head>
2> org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from
server at http://127.0.0.1:33083/solr: Expected mime type in
[application/octet-stream, application/vnd.apache.solr.javabin] but got
text/html. <html>
2> <head>
2> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
2> <title>Error 503 Service Unavailable</title>
2> </head>
2> <body><h2>HTTP ERROR 503 Service Unavailable</h2>
2> <table>
2> <tr><th>URI:</th><td>/solr/collection1_shard1_replica_p9/update</td></tr>
2> <tr><th>STATUS:</th><td>503</td></tr>
2> <tr><th>MESSAGE:</th><td>Service Unavailable</td></tr>
2> <tr><th>SERVLET:</th><td>-</td></tr>
2> </table>
2> <hr/><a href="https://jetty.org">Powered by Jetty:// 10.0.22</a><hr/>
2>
2> </body>
2> </html>
2>
2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:651)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:250)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:515)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:497)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:524)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:466)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1151)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:844)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:775)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:285)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:305)
~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT
1a8131c45868ee606348f2502fe57233e3d4489c [snapshot build, details omitted]]
2> at
org.apache.solr.cloud.StoppableIndexingThread.indexDocs(StoppableIndexingThread.java:193)
~[solr-test-framework-10.0.0-SNAPSHOT.jar:?]
2> at
org.apache.solr.cloud.StoppableIndexingThread.run(StoppableIndexingThread.java:124)
[solr-test-framework-10.0.0-SNAPSHOT.jar:?]
{noformat}
If we go backwards a little bit in the log file, looking specifically for port
number {{33083}} we see...
{noformat}
2> 25336 INFO (ChaosMonkey) [n: c: s: r: x: t:] o.a.s.c.ChaosMonkey monkey:
chose a victim! 33083
2> 25337 INFO (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.Server Stopped
Server@69659ae4{STOPPING}[10.0.22,sto=15000]
2> 25337 INFO (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.Server Shutdown
Server@69659ae4{STOPPING}[10.0.22,sto=15000]
...
2> 25373 INFO (StoppableIndexingThread) [n: c: s: r: x: t:]
o.a.s.c.s.i.CloudSolrClient request was not communication error it seems
2> 25373 INFO (StoppableIndexingThread) [n: c: s: r: x: t:]
o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to
(503) org.apache.solr.client.solrj.SolrClient$RemoteSolrException: Error from
server at http://127.0.0.1:33083/solr: Expected mime type in
[application/octet-stream, application/vnd.apache.solr.javabin] but got
text/html. <html>
2> <head>
2> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
2> <title>Error 503 Service Unavailable</title>
2> </head>
2> <body><h2>HTTP ERROR 503 Service Unavailable</h2>
2> <table>
2> <tr><th>URI:</th><td>/solr/collection1_shard1_replica_p9/update</td></tr>
2> <tr><th>STATUS:</th><td>503</td></tr>
2> <tr><th>MESSAGE:</th><td>Service Unavailable</td></tr>
2> <tr><th>SERVLET:</th><td>-</td></tr>
2> </table>
2> <hr/><a href="https://jetty.org">Powered by Jetty:// 10.0.22</a><hr/>
2>
2> </body>
2> </html>
2> , retry=0 maxRetries=5 commError=false errorCode=503
2> 25374 INFO (ChaosMonkey) [n: c: s: r: x: t:] o.e.j.s.AbstractConnector
Stopped ServerConnector@37112d56{HTTP/1.1, (http/1.1, h2c)}{127.0.0.1:0}
{noformat}
Here's what seems to be happening AFAICT:
* FYI: This test seems to put a single replica on each jetty for simplicity
* The test randomization can randomly decide to use a CloudSolrClient that has
{{isUpdatesToLeaders()==false}}
** When that happens, updates may be sent to a pull replica, which is expected
to forward the request to it's leader
* _Before_ we had graceful jetty shutdown:
** If ChaosMonkey decided to shutdown a pull-replica node, any existing HTTP
connections from a StoppableIndexingThread would fail immediately with a
connection level error
** At which point the CloudSolrClient used by the StoppableIndexingThread
considers it a "communications error" and will retry a different replica/jetty
* _After_ the addition of graceful jetty shutdown:
** if ChaosMonkey decided to shutdown a pull-replica node, any existing HTTP
connections from a StoppableIndexingThread stay alive until the server is
"finished" with the HTTP request
** But logic in solr may decide to "finish" the request prematurely by
throwing a 503 SolrException if it notices CoreContainer shutdown is in
progress. Example: {{DistributedZkUpdateProcessor.zkCheck()}}
** At which point the CloudSolrClient used by the StoppableIndexingThread gets
a {{RemoteSolrException}} , and does _not_ consider that to be a communcation
error, and does _not_ retry the update with a different replica/jetty
----
While it would be easy to "fix" this test by forcing
{{{}isUpdatesToLeaders()==true{}}}, I'm not sure what the best "fix" is for the
underlying behavior in Solr/SolrJ is?
* Should code in Solr that checks for shutdown and throws a 503 _stop_ doing
that so that in flight requests can finish?
* Should CloudSolrClient retry on any 503 ?
* Should we revert the use of jetty's graceful shutdown (or change it's
default to 'false') ?
** This seems kind of insane, but at the moment: the behavior of
CloudSolrClient + solr-core shutdown checks means CloudSolrClient is more
likely to retry other nodes w/o jetty's graceful shutdown logic in place.
> "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest
> failures
> -----------------------------------------------------------------------------------
>
> Key: SOLR-17764
> URL: https://issues.apache.org/jira/browse/SOLR-17764
> Project: Solr
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments:
> E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
>
>
> Reviewing recent jenkins test failure metrics, I noticed that (Nightly) test
> ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time
> right around the time that SOLR-17744 was committed.
> Things i have observed:
> * Seeds from failing runs seem to reliably reproduce the failure
> ** These failures do *NOT* reproduce if i revert to just before SOLR-17744
> * Ad-hoc testing I've done of seeds that do _not_ fail on first attempt seem
> to reliably succeed on all subsequent attempts
> ** Suggesting that the root cause is something deterministic in the
> {{{}random(){}}}-ness of the test, and not something dependent on timing or
> concurrency.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]