Rafał Harabień created SOLR-17275:
-------------------------------------

             Summary: Major performance regression of CloudSolrClient in Solr 
9.6.0 when using aliases
                 Key: SOLR-17275
                 URL: https://issues.apache.org/jira/browse/SOLR-17275
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrJ
    Affects Versions: 9.6.0
         Environment: SolrJ 9.6.0, Ubuntu 22.04, Java 17
            Reporter: Rafał Harabień
         Attachments: image-2024-05-06-17-23-42-236.png

I see worse CloudSolrClient performance after upgrading from SolrJ 9.5.0 
to 9.6.0, especially at p99:

p99 jumped from ~25 ms to ~400 ms
p90 jumped from ~9.9 ms to ~22 ms
p75 jumped from ~7 ms to ~11 ms
p50 jumped from ~4.5 ms to ~7.5 ms

Screenshot from Grafana (the new version was deployed at ~14:30):

!image-2024-05-06-17-23-42-236.png!

I took a thread dump and can see many threads blocked in 
[ZkStateReader.forceUpdateCollection|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L503]:
{noformat}
Thread info: "suggest-solrThreadPool-thread-52" prio=5 Id=600 BLOCKED on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d owned by "suggest-solrThreadPool-thread-34" Id=582
        at app//org.apache.solr.common.cloud.ZkStateReader.forceUpdateCollection(ZkStateReader.java:506)
        -  blocked on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d
        at app//org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getState(ZkClientClusterStateProvider.java:155)
        at app//org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1207)
        at app//org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1099)
        at app//org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:892)
        at app//org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:820)
        at app//org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:255)
        at app//org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:927)
        ...
        Number of locked synchronizers = 1
        - java.util.concurrent.ThreadPoolExecutor$Worker@1beb7ed3
{noformat}
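
The dump above has the shape of a java.lang.management.ThreadInfo listing. As a rough sketch of how the same signal could be counted programmatically, the snippet below uses the standard ThreadMXBean API to count threads that are BLOCKED on a monitor. The class name, the name-prefix filter, and the demo method are illustrative assumptions; they are not part of SolrJ or of the reporter's application.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CountDownLatch;

public class BlockedThreadsSketch {
    /** Counts live threads that are BLOCKED on a monitor and whose name starts with the prefix. */
    public static int countBlocked(String namePrefix) {
        int blocked = 0;
        // No lock details are needed here, so both flags are false.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null
                    && info.getThreadState() == Thread.State.BLOCKED
                    && info.getThreadName().startsWith(namePrefix)) {
                blocked++;
            }
        }
        return blocked;
    }

    /** Hypothetical demo: one thread holds a monitor while `workers` threads queue up on it. */
    public static int demo(int workers) {
        final Object monitor = new Object();
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        Thread owner = new Thread(() -> {
            synchronized (monitor) {
                held.countDown();
                try {
                    release.await(); // keep the monitor until the demo is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "demo-owner");
        owner.start();
        try {
            held.await();
            for (int i = 0; i < workers; i++) {
                new Thread(() -> {
                    synchronized (monitor) {
                        // Nothing to do; the point is the wait to get in.
                    }
                }, "demo-worker-" + i).start();
            }
            // Poll until all workers are blocked, or give up after five seconds.
            long deadline = System.nanoTime() + 5_000_000_000L;
            int seen = countBlocked("demo-worker-");
            while (seen < workers && System.nanoTime() < deadline) {
                Thread.sleep(10);
                seen = countBlocked("demo-worker-");
            }
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            release.countDown(); // let the owner exit so the workers can finish
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked workers: " + demo(3));
    }
}
```

In an application like the one described here, the prefix would be the pool's thread-name prefix (for example "suggest-solrThreadPool-"), sampled periodically while under load.
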
At the same time, the QTime reported by Solr hasn't changed, so I'm pretty sure 
it's a client-side regression.

I tried reproducing it locally and can see the 
[forceUpdateCollection|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L503]
 method being called for every request in my application. 
[This commit|https://github.com/apache/solr/commit/8cf552aa3642be473c6a08ce44feceb9cbe396d7]
 changed the logic in ZkClientClusterStateProvider.getState so that 
forceUpdateCollection is called whenever clusterState.getCollectionRef [returns 
null|https://github.com/apache/solr/blob/f8e5a93c11267e13b7b43005a428bfb910ac6e57/solr/solrj-zookeeper/src/java/org/apache/solr/client/solrj/impl/ZkClientClusterStateProvider.java#L151].
 In 9.5.0 this was not the case (forceUpdateCollection was not called at this 
point). I can see in the debugger that getCollectionRef only knows about 
collections, not aliases (the collectionStates map contains only collections). 
In my application all collections are referenced through aliases, which would 
make getCollectionRef return null on every request; I guess that's why I see 
the regression in Solr response time.
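
The suspected mechanism can be illustrated with a toy model. This is not Solr code: the class, its fields, and the collection and alias names ("products_v1", "products") are all hypothetical, and the refresh method only counts calls instead of talking to ZooKeeper. It assumes, as described above, that the state map holds collections only and that a lookup miss triggers a synchronized forced refresh.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the lookup path described in this issue; NOT Solr code. */
public class AliasLookupSketch {
    // Stand-in for the collectionStates map: it holds collections only, never aliases.
    private final Map<String, String> collectionStates = new ConcurrentHashMap<>();

    // Counts how often the slow, synchronized refresh path runs.
    public final AtomicInteger forcedUpdates = new AtomicInteger();

    public AliasLookupSketch(String... collections) {
        for (String c : collections) {
            collectionStates.put(c, "state-of-" + c);
        }
    }

    // Stand-in for the forced refresh: every caller serializes on this object's monitor.
    // A real refresh would do a ZooKeeper round trip; an alias would still not
    // become a collection afterwards, so the map is left unchanged.
    private synchronized void forceUpdate(String name) {
        forcedUpdates.incrementAndGet();
    }

    // Stand-in for the 9.6.0-style getState: a lookup miss triggers a forced refresh.
    public String getState(String name) {
        String state = collectionStates.get(name);
        if (state == null) {
            forceUpdate(name);
            state = collectionStates.get(name);
        }
        return state;
    }

    public static void main(String[] args) {
        AliasLookupSketch provider = new AliasLookupSketch("products_v1");

        // Looking up a real collection name never reaches the synchronized path.
        for (int i = 0; i < 1000; i++) {
            provider.getState("products_v1");
        }
        System.out.println("forced updates after collection lookups: " + provider.forcedUpdates.get());

        // Looking up an alias misses every time, so every request takes the lock.
        for (int i = 0; i < 1000; i++) {
            provider.getState("products");
        }
        System.out.println("forced updates after alias lookups: " + provider.forcedUpdates.get());
    }
}
```

In this model a miss is never cached, so with many request threads each alias lookup queues on the same monitor, which matches the BLOCKED threads in the dump.
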

I'm not familiar enough with the code to prepare a PR, but I hope this insight 
is enough to fix the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
