HoustonPutman commented on PR #3398: URL: https://github.com/apache/solr/pull/3398#issuecomment-3074670118
So this broke `DistributedDebugComponentTest.testTolerantSearch` and `TestJsonFacets.testTolerant`. The reason is that originally `HttpShardHandler.take()` would return on an error, cancelling all other responses. Now the logic is that on an error, outstanding requests will be cancelled, but if any responses come in in the meantime, they will be processed. However, the code currently expects that the last response will have the error. So since the last response might not have the error anymore, we have to change the code that checks the responses for errors. However, after fixing that and beasting the tests, there is a really weird error around cancelling requests. The `DistributedDebugComponentTest.testTolerantSearch` does a non-tolerant search then does a tolerant search. The error I described above was breaking the non-tolerant search. That is easily fixable. The second part of the test, testing tolerant search fails very occasionally (but only when the non-tolerant search is done first, when that is commented out, the tolerant search does not fail). The tolerant search fails (occasionally) because all three shard requests fail instead of just 1 of the shard requests failing (because of a non-exisistant endpoint). the bad shard has the failure that the test expects, but the good shards both fail with `java.io.IOException: cancel_stream_error/unexpected_data_frame` meaning that the requests were cancelled, even thought the request is "tolerant". I did a lot of debugging here, and noticed that Solr is behaving correctly and we are not cancelling shard requests for tolerant solr requests. And the fact that if the "non-tolerant search" request case right before the tolerant search request is commented out, the failures stop, tell us that the cancellations from the non-tolerant request are bleeding into the tolerant request. This is bad. I also confirmed this by commenting out the line that actually cancels the HTTP requests: https://github.com/apache/solr/blob/branch_9_9/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2S olrClient.java#L570-L574 This only happens on branch_9x (and presumable branch_9_9), not on main. So I believe it's a bug in Jetty 10, which Jetty 12 has solved. So we are probably fine just fixing this part on branch_9x and branch_9_9, and leaving the request cancellation enabled on main (10.x). Amazingly, when beasting, there is a big difference in whether the non-existent endpoint is put first or last in the list of shards. The failure rate is much higher when the bad shard is the first listed rather the last one listed. I'm going to create another ticket for the cancellation bug. But I'll create another PR to fix what this PR broke (the first part of this long comment). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
