Paul McArthur created SOLR-17198:
------------------------------------

             Summary: Affinity Placement Plugin can fail when getting metrics, if multiple replicas claim shard leadership
                 Key: SOLR-17198
                 URL: https://issues.apache.org/jira/browse/SOLR-17198
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SolrCloud
    Affects Versions: 9.4
            Reporter: Paul McArthur


Using Solr 9.4 on a 16-node cluster, I observe that about 25% of our Split 
Shard requests fail. The error is a RuntimeException raised by the 
AttributeFetcher as it compiles the metrics that the placement plugin will use.

The AttributeFetcher makes an /admin/metrics request to each node, and it 
currently expects the responses to yield a consistent view of shard 
leadership across the cluster.

However, we see this exception:

 
{code:java}
Caused by: java.lang.RuntimeException: two replicas claim to be the shard leader! existing=org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ReplicaMetricsBuilder@56e219b9 and current org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ReplicaMetricsBuilder@406bcfd8
        at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ShardMetricsBuilder.lambda$build$0(CollectionMetricsBuilder.java:84)
        at java.base/java.util.HashMap.forEach(HashMap.java:1429)
        at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ShardMetricsBuilder.build(CollectionMetricsBuilder.java:76)
        at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder.lambda$build$0(CollectionMetricsBuilder.java:39)
        at java.base/java.util.HashMap.forEach(HashMap.java:1429)
        at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder.build(CollectionMetricsBuilder.java:39)
        at org.apache.solr.cluster.placement.impl.AttributeFetcherImpl.lambda$fetchAttributes$17(AttributeFetcherImpl.java:213)
        at java.base/java.util.HashMap.forEach(HashMap.java:1429)
        at org.apache.solr.cluster.placement.impl.AttributeFetcherImpl.fetchAttributes(AttributeFetcherImpl.java:212)
        at org.apache.solr.cluster.placement.plugins.AffinityPlacementFactory$AffinityPlacementPlugin.getBaseWeightedNodes(AffinityPlacementFactory.java:284)
        at org.apache.solr.cluster.placement.plugins.OrderedNodePlacementPlugin.getWeightedNodes(OrderedNodePlacementPlugin.java:311)
        at org.apache.solr.cluster.placement.plugins.OrderedNodePlacementPlugin.computePlacements(OrderedNodePlacementPlugin.java:85)
        at org.apache.solr.cluster.placement.impl.PlacementPluginAssignStrategy.assign(PlacementPluginAssignStrategy.java:84)
        at org.apache.solr.cloud.api.collections.Assign$AssignStrategy.assign(Assign.java:446)
        at org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:689)
{code}
 

 

This indicates that more than one replica of a given shard responded with 
leader=true in its replica metrics.
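
To make the failure mode concrete, here is a minimal sketch of a strict 
single-leader check of the kind the exception message suggests. It is not the 
actual Solr CollectionMetricsBuilder code: the class name StrictLeaderCheck, 
the method findLeader, and the replica names are hypothetical, and the 
per-replica metrics are reduced to a single leader flag.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in; NOT the actual Solr implementation.
class StrictLeaderCheck {

  /**
   * Returns the name of the single replica whose flag is true, or null if no
   * replica claims leadership. Throws if more than one replica claims it.
   */
  static String findLeader(Map<String, Boolean> leaderFlagByReplica) {
    String leader = null;
    for (Map.Entry<String, Boolean> entry : leaderFlagByReplica.entrySet()) {
      if (!Boolean.TRUE.equals(entry.getValue())) {
        continue; // this replica does not claim leadership
      }
      if (leader != null) {
        // A second claim makes the whole metrics build fail.
        throw new RuntimeException(
            "two replicas claim to be the shard leader! existing=" + leader
                + " and current " + entry.getKey());
      }
      leader = entry.getKey();
    }
    return leader;
  }

  public static void main(String[] args) {
    // Illustrative replica names; both report leader=true.
    Map<String, Boolean> flags = new LinkedHashMap<>();
    flags.put("core_node1", true);
    flags.put("core_node2", true);
    try {
      findLeader(flags);
    } catch (RuntimeException ex) {
      System.out.println("failed: " + ex.getMessage());
    }
  }
}
```

With a check like this, a single stale leader=true response is enough to 
abort the whole placement computation, and with it the Split Shard request.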

I think there are legitimate reasons this can occur:

1. It may be fundamentally impossible to always build a consistent view of 
shard leadership by querying a set of distributed nodes.
2. /admin/metrics requests are sent sequentially, one node at a time. Shard 
leadership may change between the requests to the different nodes that host 
replicas of a shard.
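
The second reason can be illustrated with a small simulation. This is only a 
sketch under stated assumptions: SequentialPollRace, Shard, and poll are 
hypothetical names, the replicas are polled in list order, and a leadership 
change is modelled as a single assignment that happens between two requests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; class, method and replica names are illustrative.
class SequentialPollRace {

  /** A toy shard whose current leader can change between requests. */
  static class Shard {
    String leader;

    Shard(String leader) {
      this.leader = leader;
    }
  }

  /**
   * Polls the replicas one at a time and records which of them claim to be the
   * leader at the moment they are asked. Leadership moves to newLeader right
   * after the replica at index switchAfter is polled (pass -1 for no change).
   */
  static List<String> poll(
      Shard shard, List<String> replicas, int switchAfter, String newLeader) {
    List<String> claims = new ArrayList<>();
    for (int i = 0; i < replicas.size(); i++) {
      String replica = replicas.get(i);
      if (replica.equals(shard.leader)) {
        claims.add(replica); // this response would carry leader=true
      }
      if (i == switchAfter) {
        shard.leader = newLeader; // leadership changes between two requests
      }
    }
    return claims;
  }

  public static void main(String[] args) {
    List<String> replicas = List.of("replicaA", "replicaB");
    // replicaA is asked while it is still leader, then replicaB takes over
    // before it is asked, so both responses claim leadership.
    System.out.println(poll(new Shard("replicaA"), replicas, 0, "replicaB"));
  }
}
```

In this sketch each response is correct at the time it is produced, yet the 
combined snapshot contains two leader claims for the same shard.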


