[jira] [Comment Edited] (SOLR-17792) ParallelHttpShardHandler has massive performance issues.

Houston Putman (Jira) Tue, 15 Jul 2025 10:54:04 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007303#comment-18007303
 ]


Houston Putman edited comment on SOLR-17792 at 7/15/25 5:53 PM:
----------------------------------------------------------------

Copying from the PR:

So this broke {{DistributedDebugComponentTest.testTolerantSearch}} and 
{{{}TestJsonFacets.testTolerant{}}}. The reason is that originally 
{{HttpShardHandler.take()}} would return on an error, cancelling all other 
responses. Now the logic is that on an error, outstanding requests will be 
cancelled, but if any responses come in in the meantime, they will be 
processed. However, the code currently expects that the last response will have 
the error. So since the last response might not have the error anymore, we have 
to change the code that checks the responses for errors.

However, after fixing that and beasting the tests, there is a really weird 
error around cancelling requests. The 
{{DistributedDebugComponentTest.testTolerantSearch}} does a non-tolerant search 
then does a tolerant search. The error I described above was breaking the 
non-tolerant search. That is easily fixable. The second part of the test, 
testing tolerant search fails very occasionally (but only when the non-tolerant 
search is done first, when that is commented out, the tolerant search does not 
fail).

The tolerant search fails (occasionally) because all three shard requests fail 
instead of just 1 of the shard requests failing (because of a non-exisistant 
endpoint). the bad shard has the failure that the test expects, but the good 
shards both fail with {{java.io.IOException: 
cancel_stream_error/unexpected_data_frame}} meaning that the requests were 
cancelled, even thought the request is "tolerant". I did a lot of debugging 
here, and noticed that Solr is behaving correctly and we are not cancelling 
shard requests for tolerant solr requests. And the fact that if the 
"non-tolerant search" request case right before the tolerant search request is 
commented out, the failures stop, tell us that the cancellations from the 
non-tolerant request are bleeding into the tolerant request. This is bad. I 
also confirmed this by commenting out the line that actually cancels the HTTP 
requests: 
[https://github.com/apache/solr/blob/branch_9_9/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L570-L574]

This only happens on branch_9x (and presumable branch_9_9), not on main. So I 
believe it's a bug in Jetty 10, which Jetty 12 has solved. So we are probably 
fine just fixing this part on branch_9x and branch_9_9, and leaving the request 
cancellation enabled on main (10.x).

Amazingly, when beasting, there is a big difference in whether the non-existent 
endpoint is put first or last in the list of shards. The failure rate is much 
higher when the bad shard is the first listed rather the last one listed.

I've create another ticket (SOLR-17819) for the cancellation bug. But I'll 
create another PR to fix what this PR broke (the first part of this long 
comment).


was (Author: houston):
Copying from the PR:

So this broke {{DistributedDebugComponentTest.testTolerantSearch}} and 
{{{}TestJsonFacets.testTolerant{}}}. The reason is that originally 
{{HttpShardHandler.take()}} would return on an error, cancelling all other 
responses. Now the logic is that on an error, outstanding requests will be 
cancelled, but if any responses come in in the meantime, they will be 
processed. However, the code currently expects that the last response will have 
the error. So since the last response might not have the error anymore, we have 
to change the code that checks the responses for errors.

However, after fixing that and beasting the tests, there is a really weird 
error around cancelling requests. The 
{{DistributedDebugComponentTest.testTolerantSearch}} does a non-tolerant search 
then does a tolerant search. The error I described above was breaking the 
non-tolerant search. That is easily fixable. The second part of the test, 
testing tolerant search fails very occasionally (but only when the non-tolerant 
search is done first, when that is commented out, the tolerant search does not 
fail).

The tolerant search fails (occasionally) because all three shard requests fail 
instead of just 1 of the shard requests failing (because of a non-exisistant 
endpoint). the bad shard has the failure that the test expects, but the good 
shards both fail with {{java.io.IOException: 
cancel_stream_error/unexpected_data_frame}} meaning that the requests were 
cancelled, even thought the request is "tolerant". I did a lot of debugging 
here, and noticed that Solr is behaving correctly and we are not cancelling 
shard requests for tolerant solr requests. And the fact that if the 
"non-tolerant search" request case right before the tolerant search request is 
commented out, the failures stop, tell us that the cancellations from the 
non-tolerant request are bleeding into the tolerant request. This is bad. I 
also confirmed this by commenting out the line that actually cancels the HTTP 
requests: 
[https://github.com/apache/solr/blob/branch_9_9/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L570-L574]

This only happens on branch_9x (and presumable branch_9_9), not on main. So I 
believe it's a bug in Jetty 10, which Jetty 12 has solved. So we are probably 
fine just fixing this part on branch_9x and branch_9_9, and leaving the request 
cancellation enabled on main (10.x).

Amazingly, when beasting, there is a big difference in whether the non-existent 
endpoint is put first or last in the list of shards. The failure rate is much 
higher when the bad shard is the first listed rather the last one listed.

I'm going to create another ticket for the cancellation bug. But I'll create 
another PR to fix what this PR broke (the first part of this long comment).

> ParallelHttpShardHandler has massive performance issues.
> --------------------------------------------------------
>
>                 Key: SOLR-17792
>                 URL: https://issues.apache.org/jira/browse/SOLR-17792
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 9.8
>            Reporter: Houston Putman
>            Assignee: Houston Putman
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 9.9
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> SOLR-17158 changed the way that the HttpShardHandler (And 
> ParallelHttpShardHandler) did locking and concurrency. However, after 
> upgrading, and running distributed queries (at a relatively slow rate), I 
> noticed that there were 3 types of responses:
>  * QTimes between 3-6ms
>  * QTimes between 53-56 ms
>  * And requests that timed out
> Looking at the logic in HttpShardHandler, there is a poll(50ms) call that is 
> very suspicious, and likely the reason for the jump between 3-6 ms and 53-56 
> ms. I would also assume that this change in concurrency logic is the reason 
> that many requests started timing out. Changing to the 
> HttpShardHandlerFactory from the ParallellHttpShardHandlerFactory fixed these 
> issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SOLR-17792) ParallelHttpShardHandler has massive performance issues.

Reply via email to