[jira] [Comment Edited] (SOLR-17419) Improve HttpShardHandler performance in many-shard collections

Jason Gerlowski (Jira) Fri, 30 Aug 2024 06:46:04 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878128#comment-17878128
 ]


Jason Gerlowski edited comment on SOLR-17419 at 8/30/24 1:45 PM:
-----------------------------------------------------------------

I opened a PR for this ticket but accidentally swapped two digits in the JIRA 
number, so it hasn't been linked here yet. Would love reviews on the code 
[here|https://github.com/apache/solr/pull/2681] if anyone has a few moments.

Rather than trying to improve HttpShardHandler, the linked PR introduces an 
alternate implementation: ParallelHttpShardHandler[Factory]. The new 
implementation extends HttpShardHandler[Factory] and reuses most of its code. 
The only difference: ParallelHttpShardHandler.submit uses the executor to both 
send the request and await the response. This allows the "submit" call to 
return sooner, and for SearchHandler (or other callers) to iterate through a 
series of "submit" calls more quickly. The tradeoff of course is potentially 
higher CPU utilization, depending on the number of shards in play, etc.
----
I've run some perf tests (details below), and the results look pretty promising 
in certain scenarios!
!shardhandler-perf-graph.png|width=607,height=365!

To summarize:
 # With auth disabled HttpShardHandler outperforms the new implementation even 
in collections with many shards.
 # With auth enabled though, ParallelShardHandler outperforms HttpShardHandler 
by a factor of nearly "3x" regardless of the number of shards.

At first glance, it's surprising that "auth" has such a huge impact. The cause, 
according to some profiling, is PKI auth. Generating and signing the "PKI" 
header consumes a massive amount of CPU and drives the cost of 
HttpShardHandler.submit. Without PKI, request-sending is cheap enough that 
doing it serially outperforms the thread-creation overhead incurred by the 
"Parallel" implementation. As soon as PKI is needed though, it dwarfs any 
thread-creation overhead and ParallelHttpShardHandler wins out.

P.S: With PKI being so expensive, should we stress the security.json 
"forwardCredentials=true" option more strongly in the docs? Or make it a 
default in the AuthPlugins that support it?

*Appendix: Perf Test Details*

The results discussed above come from a series of perf experiments I ran on a 
local 4-node Solr cluster created via "bin/solr start -e cloud -m 2g". "Avg 
QTime" measurements were taken by (1) creating a collection with a particular 
'numShard' value and using a particular 'ShardHandlerFactory', (2) loading in 
some data, (3) warming the JVM and collection with an initial round of queries, 
and (4) running 'numQueries' from each of 'numThreads' QueryRunner threads.

As a data set, I used emails from our "dev@" list exported from 
"lists.apache.org". (I have a little script to export and clean up this data 
that I'm considering saving somewhere - it's been an interesting dataset to 
work with and might be useful in our benchmark module or elsewhere?) For 
queries, the perf-test code uses the "/terms" handler to fetch common terms and 
then creates single-term queries from those.

Other details:
 * *Solr Version:* 10.0.0-SNAPSHOT (a branch based off of 'main', with the 
commit: 1818841868d70adebde364165d60114f164952de)
 * *JVM:* openjdk version "17.0.8" 2023-07-18
 * *OS:* MacOS Sonoma 14.6.1
 * *Hardware:* 2017 iMac Pro, 2.5GHz 14-core Intel Xeon, 128gb ram


was (Author: gerlowskija):
I opened a PR for this ticket but accidentally swapped two digits in the JIRA 
number, so it hasn't been linked here yet.  Would love reviews on the code 
[here|https://github.com/apache/solr/pull/2681] if anyone has a few moments.

Rather than trying to improve HttpShardHandler, the linked PR introduces an 
alternate implementation: ParallelHttpShardHandler[Factory].  The new 
implementation extends HttpShardHandler[Factory] and reuses most of its code.  
The only difference: ParallelHttpShardHandler.submit uses the executor to both 
send the request and await the response.  This allows the "submit" call to 
return sooner, and for SearchHandler (or other callers) to iterate through a 
series of "submit" calls more quickly.  The tradeoff of course is potentially 
higher CPU utilization, depending on the number of shards in play, etc.

 ----

I've run some perf tests (details below), and the results look pretty promising 
in certain scenarios!
 !shardhandler-perf-graph.png! 

To summarize:
# With auth disabled HttpShardHandler outperforms the new implementation even 
in collections with many shards.
# With auth enabled though, ParallelShardHandler outperforms HttpShardHandler 
by a factor of nearly "3x" regardless of the number of shards.

At first glance, it's surprising that "auth" has such a huge impact.  The 
cause, according to some profiling, is PKI auth.  Generating and signing the 
"PKI" header consumes a massive amount of CPU and drives the cost of 
HttpShardHandler.submit.  Without PKI, request-sending is cheap enough that 
doing it serially outperforms the thread-creation overhead incurred by the 
"Parallel" implementation.  As soon as PKI is needed though, it dwarfs any 
thread-creation overhead and ParallelHttpShardHandler wins out.

P.S: With PKI being so expensive, should we stress the security.json 
"forwardCredentials=true" option more strongly in the docs?  Or make it a 
default in the AuthPlugins that support it?

*Appendix: Perf Test Details*

The results discussed above come from a series of perf experiments I ran on a 
local 4-node Solr cluster created via "bin/solr start -e cloud -m 2g".  "Avg 
QTime" measurements were taken by (1) creating a collection with a particular 
'numShard' value and using a particular 'ShardHandlerFactory', (2) loading in 
some data, (3) warming the JVM and collection with an initial round of queries, 
and (4) running 'numQueries' from each of 'numThreads' QueryRunner threads.

As a data set, I used emails from our "dev@" list exported from 
"lists.apache.org".  (I have a little script to export and clean up this data 
that I'm considering saving somewhere - it's been an interesting dataset to 
work with and might be useful in our benchmark module or elsewhere?) For 
queries, the perf-test code uses the "/terms" handler to fetch common terms and 
then creates single-term queries from those.

Other details:
* *Solr Version:* 10.0.0-SNAPSHOT (a branch based off of 'main', with the 
commit: 1818841868d70adebde364165d60114f164952de)
* *JVM:* openjdk version "17.0.8" 2023-07-18
* *OS:* MacOS Sonoma 14.6.1
* *Hardware:* 2017 iMac Pro, 2.5GHz 14-core Intel Xeon, 128gb ram

> Improve HttpShardHandler performance in many-shard collections
> --------------------------------------------------------------
>
>                 Key: SOLR-17419
>                 URL: https://issues.apache.org/jira/browse/SOLR-17419
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 9.0, 9.6.1
>            Reporter: Jason Gerlowski
>            Priority: Major
>         Attachments: shardhandler-perf-graph.png
>
>
> In Solr 8, HttpShardHandler sends shard-requests by submitting Callables to 
> an ExecutorService. As a result, both the "request-sending" and 
> "response-awaiting" happened asynchronous to the original request-thread.
> {code:java}
>   @Override
>   public void submit(final ShardRequest sreq, final String shard, final 
> ModifiableSolrParams params) {
>     ShardRequestor shardRequestor = new ShardRequestor(sreq, shard, params, 
> this); // Callable
>     try {
>       shardRequestor.init();
>       pending.add(completionService.submit(shardRequestor));
>     } finally {
>       shardRequestor.end();
>     }   
>   }
> {code}
> However, in Solr 9.x HttpShardHandler ditched the 
> ExecutorService/per-request-thread approach in favor of [sending all requests 
> serially using 
> "SolrClient.requestAsync"|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/component/HttpShardHandler.java#L163].
>  SOLR-14354, which made this change, did this in an effort to avoid 
> unnecessary thread and CPU context-switching. As Dat described in SOLR-14354:
> {quote}after sending a request that thread basically do nothing just waiting 
> for response from other side. That thread will be swapped out and CPU will 
> try to handle another thread (this is called context switch, CPU will save 
> the context of the current thread and switch to another one). When some data 
> (not all) come back, that thread will be called to parsing these data, then 
> it will wait until more data come back. So there will be lots of context 
> switching in CPU. That is quite inefficient
> {quote}
> This approach comes with a downside though - all the shard requests are sent 
> serially. If sending each request takes ~1ms, then a user is unlikely to 
> notice this in their collection with 5 or 10 shards.  But the cost here 
> scales linearly, so in *a collection with 50 shards - this approach would 
> bake a ~50ms delay into the critical path of every single query!*
> This issue is intended to reevaluate whether there's a better way to balance 
> these concerns. Ideally we can come up with an approach that improves all 
> scenarios. Lacking that, maybe Solr could choose between one of several 
> approaches semi-intelligently based on the number of shards or other factors?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SOLR-17419) Improve HttpShardHandler performance in many-shard collections

Reply via email to