[ 
https://issues.apache.org/jira/browse/SOLR-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18066548#comment-18066548
 ] 

Jan Høydahl commented on SOLR-17916:
------------------------------------

Question about this JIRA. It is still OPEN, but has a commit to main which is 
included in the 10.0.0 release.

The commit (PR #3695) has the title "Bump up jetty to 12.0.27" which is 
different from the title of this JIRA. This commit is not mentioned in 
CHANGELOG for 10.0, nor is Jetty 12.0.27 mentioned.

The commit did not only upgrade to Jetty 12.0.27, it also disabled request 
cancellation 
([https://github.com/apache/solr/blob/main/solr/solrj-jetty/src/java/org/apache/solr/client/solrj/jetty/HttpJettySolrClient.java#L446-L451])
without any real justification or new unit tests. The request cancellation 
code was not removed, just commented out. Was this a mistake? 
The same code was commented out in 9.x with the justification that it was 
needed due to a Jetty 10.x bug that was fixed in Jetty 12. So why was this needed?
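
For readers unfamiliar with what "request cancellation" means here, the following is a minimal sketch of propagating a cancelled future into a request abort, and of what changes when that propagation is switched off. It uses only the JDK. {{FakeRequest}}, {{send}} and the {{propagateCancel}} flag are hypothetical names made up for illustration; this is not the code in HttpJettySolrClient.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelPropagationSketch {

    /** Hypothetical stand-in for an in-flight HTTP request. */
    static final class FakeRequest {
        final AtomicBoolean aborted = new AtomicBoolean(false);

        void abort(Throwable cause) {
            // A real HTTP/2 client would reset the stream here.
            aborted.set(true);
        }
    }

    /**
     * Returns a future for the (never completed) response. When
     * propagateCancel is true, cancelling the future aborts the request;
     * when false, the request is left running after the cancel.
     */
    static CompletableFuture<String> send(FakeRequest request, boolean propagateCancel) {
        CompletableFuture<String> future = new CompletableFuture<>();
        if (propagateCancel) {
            future.whenComplete((result, error) -> {
                if (error instanceof CancellationException) {
                    request.abort(error);
                }
            });
        }
        return future;
    }

    public static void main(String[] args) {
        FakeRequest withPropagation = new FakeRequest();
        send(withPropagation, true).cancel(true);

        FakeRequest withoutPropagation = new FakeRequest();
        send(withoutPropagation, false).cancel(true);

        // prints "true false"
        System.out.println(withPropagation.aborted.get() + " " + withoutPropagation.aborted.get());
    }
}
```

In the second case the caller sees a cancelled future, but the underlying request keeps running until it completes or times out on its own.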

We're seeing deadlocks and timeouts in a customer production environment on 
9.10.1, which is what led me here. I'll post a problem description and 
analysis elsewhere that will also link to this issue.

So, should this Jira be closed as fixed in 10.0? And perhaps we should add a 
changelog file that will show Solr 10.1 users that 10.0 included both the new 
Jetty version and the request cancellation change.

> Jetty 12.0.25 upgrade exposes RST_STREAM burst issue
> ----------------------------------------------------
>
>                 Key: SOLR-17916
>                 URL: https://issues.apache.org/jira/browse/SOLR-17916
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Sanjay Dutt
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After upgrading Jetty from {*}12.0.19 → 12.0.25{*}, the test 
> {{DistributedDebugComponentTest.testTolerantSearch}} starts failing.
> The test sets up a query with a deliberately bad shard:
> {code:java}
> String badShard = DEAD_HOST_1 + "/solr/collection1";
> query.set("shards", badShard+ "," + shard2 + "," + shard1);
> for (int i = 0; i < (TEST_NIGHTLY ? 500 : 200); i++) {
>       // verify that the request would fail if shards.tolerant=false
>       query.set(ShardParams.SHARDS_TOLERANT, "false");
>       ignoreException("Connection refused");
>       expectThrows(SolrException.class, () -> collection1.query(query));
>       // verify that the request would succeed if shards.tolerant=true
>       query.set(ShardParams.SHARDS_TOLERANT, "true");
>       QueryResponse response = collection1.query(query); // fail here!
> ....
> {code}
> For each iteration, it issues:
>  * *shards.tolerant = false* → as expected, the coordinator fails fast 
> because one shard is dead.
>  * *shards.tolerant = true* → expected to succeed using results from the good 
> shard(s), but {*}fails after the Jetty upgrade{*}.
>
> *Observed behavior*
>  * In the non-tolerant branch, {{SearchHandler}} throws early on the shard 
> exception.
>  * At this point {{HttpShardHandler}} cancels the outstanding async requests 
> to the other shards, calling {{future.cancel(true)}} / 
> {{{}request.abort(){}}}.
>  * That abort translates into *RST_STREAM* frames sent to Jetty.
>  * With the loop running hundreds of iterations, these cancels accumulate on 
> a single HTTP/2 session.
>  * Jetty 12.0.25 enforces stricter HTTP/2 rate control:
> GoAwayFrame\{... enhance_your_calm_error/invalid_rst_stream_frame_rate}
>  * Once the rate limit is tripped, the server responds with GOAWAY and closes 
> the connection.
>  * The subsequent tolerant request then fails, even though at least one shard 
> is healthy.
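
The failure mode in the quoted description can be illustrated with a toy model of a per-session rate limit on RST_STREAM frames. This is only a sketch under stated assumptions: the sliding-window algorithm, the limit of 128 frames and the one-second window are all hypothetical values chosen for illustration, and none of this is Jetty's actual implementation or its defaults.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a per-session RST_STREAM rate limit; NOT Jetty's implementation. */
public class RstStreamRateSketch {
    private final int maxFramesPerWindow; // hypothetical limit
    private final long windowNanos;       // hypothetical window length
    private final Deque<Long> timestamps = new ArrayDeque<>();
    private boolean goAwaySent = false;

    public RstStreamRateSketch(int maxFramesPerWindow, long windowNanos) {
        this.maxFramesPerWindow = maxFramesPerWindow;
        this.windowNanos = windowNanos;
    }

    /**
     * Records one RST_STREAM frame arriving at nowNanos. Returns false once
     * the limit has been exceeded, modelling GOAWAY plus connection close.
     */
    public boolean onRstStream(long nowNanos) {
        if (goAwaySent) {
            return false;
        }
        // Forget frames that have fallen out of the sliding window.
        while (!timestamps.isEmpty() && nowNanos - timestamps.peekFirst() >= windowNanos) {
            timestamps.pollFirst();
        }
        timestamps.addLast(nowNanos);
        if (timestamps.size() > maxFramesPerWindow) {
            goAwaySent = true; // the session is unusable from here on
            return false;
        }
        return true;
    }

    public boolean isOpen() {
        return !goAwaySent;
    }

    public static void main(String[] args) {
        RstStreamRateSketch session = new RstStreamRateSketch(128, 1_000_000_000L);
        int accepted = 0;
        // 400 cancels on one session, 1 ms apart, as a tight test loop might produce.
        for (int i = 0; i < 400; i++) {
            if (session.onRstStream(i * 1_000_000L)) {
                accepted++;
            }
        }
        // prints "accepted=128, open=false"
        System.out.println("accepted=" + accepted + ", open=" + session.isOpen());
    }
}
```

The point of the model is that the limit applies to the whole session, so cancels from earlier loop iterations count against later requests that share the same connection, including the tolerant request that is expected to succeed.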



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
