ycallea opened a new pull request, #2420: URL: https://github.com/apache/solr/pull/2420
https://issues.apache.org/jira/browse/SOLR-XXXXX

# Description

When using either the Minimize Cores or the Affinity placement strategy with Solr 9, positioning a new replica becomes very inefficient past a certain collection count. At Salesforce, we operate clusters packed with tens of thousands of collections, and at that scale each replica placement operation takes several seconds to complete.

This pull request brings a first, simple change to improve the performance of replica placement when operating with a large number of collections. It will eventually be followed by additional changes as we progress towards solutions for these scalability issues.

# Solution

When fetching the cluster metrics and information needed to compute the optimal position of the replica being placed, we call a seemingly harmless `cloudManager.getNodeStateProvider()` method `2 * number_of_nodes` times in [AttributeFetcherImpl.java#L144-L209](https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/cluster/placement/impl/AttributeFetcherImpl.java#L144-L209).
However, this method instantiates a new [SolrClientNodeStateProvider](https://github.com/apache/solr/blob/main/solr/solrj-zookeeper/src/java/org/apache/solr/client/solrj/impl/SolrClientNodeStateProvider.java) every time it is called, and that constructor performs an expensive operation to build a `nodeVsCollectionVsShardVsReplicaInfo` data structure filled with information about all existing collections and replicas:

```java
protected final Map<String, Map<String, Map<String, List<Replica>>>>
    nodeVsCollectionVsShardVsReplicaInfo = new HashMap<>();

public SolrClientNodeStateProvider(CloudLegacySolrClient solrClient) {
  this.solrClient = solrClient;
  try {
    readReplicaDetails();
  } catch (IOException e) {
    throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, e);
  }
}

protected void readReplicaDetails() throws IOException {
  ClusterStateProvider clusterStateProvider = getClusterStateProvider();
  ClusterState clusterState = clusterStateProvider.getClusterState();
  if (clusterState == null) { // zkStateReader still initializing
    return;
  }
  Map<String, ClusterState.CollectionRef> all =
      clusterStateProvider.getClusterState().getCollectionStates();
  all.forEach(
      (collName, ref) -> {
        DocCollection coll = ref.get();
        if (coll == null) return;
        coll.forEachReplica(
            (shard, replica) -> {
              Map<String, Map<String, List<Replica>>> nodeData =
                  nodeVsCollectionVsShardVsReplicaInfo.computeIfAbsent(
                      replica.getNodeName(), k -> new HashMap<>());
              Map<String, List<Replica>> collData =
                  nodeData.computeIfAbsent(collName, k -> new HashMap<>());
              List<Replica> replicas =
                  collData.computeIfAbsent(shard, k -> new ArrayList<>());
              replicas.add((Replica) replica.clone());
            });
      });
}
```

This operation has been observed to take several seconds, even on clusters with a moderate number of small collections (~1000).
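The essence of the fix can be sketched with a small, self-contained example. The class and method names below are illustrative stand-ins, not Solr's actual API: `NodeStateProvider` here plays the role of `SolrClientNodeStateProvider`, whose constructor does the costly `readReplicaDetails()` work. Hoisting the construction out of the per-node loop turns `2 * number_of_nodes` expensive constructions into a single one.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExpensiveProviderDemo {

  /** Stand-in for SolrClientNodeStateProvider: the constructor does the costly work. */
  static class NodeStateProvider {
    static int constructions = 0;
    final Map<String, List<String>> replicaInfo = new HashMap<>();

    NodeStateProvider() {
      constructions++; // in Solr, this is where readReplicaDetails() walks every collection
    }

    Map<String, Object> getMetrics(String node) {
      return Map.of("node", node);
    }
  }

  /** Before the fix: a fresh provider per lookup, twice per node. */
  static int countNaive(int numNodes) {
    NodeStateProvider.constructions = 0;
    for (int i = 0; i < numNodes; i++) {
      new NodeStateProvider().getMetrics("node" + i); // first attribute-fetch pass
      new NodeStateProvider().getMetrics("node" + i); // second attribute-fetch pass
    }
    return NodeStateProvider.constructions;
  }

  /** After the fix: construct once, reuse for every node. */
  static int countCached(int numNodes) {
    NodeStateProvider.constructions = 0;
    NodeStateProvider provider = new NodeStateProvider(); // hoisted out of the loop
    for (int i = 0; i < numNodes; i++) {
      provider.getMetrics("node" + i);
      provider.getMetrics("node" + i);
    }
    return NodeStateProvider.constructions;
  }

  public static void main(String[] args) {
    System.out.println("naive:  " + countNaive(4));  // 2 * number_of_nodes constructions
    System.out.println("cached: " + countCached(4)); // a single construction
  }
}
```

The reuse is safe in this context because the replica details are read once per placement computation anyway; the provider does not need to be refreshed between the per-node lookups of a single placement request.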
The simple change proposed in this pull request reduces the number of times this expensive operation is performed per replica placement operation from `2 * number_of_nodes` to just once. While the underlying algorithm will in all likelihood require further optimizations to deliver acceptable execution times on large clusters with tens of thousands of collections, this first fix already yields a noticeable improvement.

# Tests

Changes have been validated as part of pre-production tests at Salesforce.

# Checklist

Please review the following and check all that apply:

- [X] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/solr/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [X] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [X] I have developed this patch against the `main` branch.
- [X] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the [Reference Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide)