[jira] [Commented] (SOLR-17198) Affinity Placement Plugin can fail when getting metrics, if multiple replicas claim shard leadership

Paul McArthur (Jira) Wed, 06 Mar 2024 11:11:37 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824141#comment-17824141
 ]


Paul McArthur commented on SOLR-17198:
--------------------------------------

I think there is a somewhat trivial solution for this issue. The metric that 
the placement plugin is interested in is the size of the index, and I don't 
think it matters if the Attribute Fetcher observes multiple leaders for a Shard.

It could just pick any replica that claims leadership, because all of them are 
able to provide the required index size metric.
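
A minimal sketch of that idea in Java follows. The ReplicaMetrics class and
the pickLeader method are hypothetical stand-ins for illustration, not the
actual Solr classes or the eventual patch. It assumes that every replica
reporting leader=true carries a usable index size:

```java
import java.util.List;
import java.util.Optional;

final class LeaderPick {
    // Hypothetical stand-in for the per-replica metrics gathered from /admin/metrics.
    static final class ReplicaMetrics {
        final String replicaName;
        final boolean leader;
        final long indexSizeBytes;

        ReplicaMetrics(String replicaName, boolean leader, long indexSizeBytes) {
            this.replicaName = replicaName;
            this.leader = leader;
            this.indexSizeBytes = indexSizeBytes;
        }
    }

    // Accept the first replica that claims leadership instead of throwing
    // when a second claimant shows up.
    static Optional<ReplicaMetrics> pickLeader(List<ReplicaMetrics> replicas) {
        return replicas.stream().filter(r -> r.leader).findFirst();
    }

    public static void main(String[] args) {
        List<ReplicaMetrics> replicas = List.of(
            new ReplicaMetrics("core_node1", false, 100L),
            new ReplicaMetrics("core_node2", true, 120L),
            new ReplicaMetrics("core_node3", true, 118L)); // second claimant is tolerated
        pickLeader(replicas).ifPresent(r ->
            System.out.println(r.replicaName + " " + r.indexSizeBytes));
    }
}
```

With two claimants the sketch simply uses the first one it sees. An empty
Optional covers the case where no replica reports leader=true.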

I am not sure whether the Attribute Fetcher serves other use cases where this 
would be a problem; I haven't found any yet.

 

I will put together a PR with a proposed solution.

> Affinity Placement Plugin can fail when getting metrics, if multiple replicas 
> claim shard leadership 
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-17198
>                 URL: https://issues.apache.org/jira/browse/SOLR-17198
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 9.4
>            Reporter: Paul McArthur
>            Priority: Minor
>
> Using Solr 9.4 with 16 nodes, I observe that about 25% of our Split Shard 
> requests are failing. The error is a RuntimeException that is raised by the 
> AttributeFetcher as it compiles the metrics that will be used by the plugin.
>  
> The AttributeFetcher makes /admin/metrics requests to each node, and 
> currently it expects to be able to establish a consistent view of shard 
> leadership across the cluster from the responses.
> However, we see this exception:
>  
> {code:java}
> Caused by: java.lang.RuntimeException: two replicas claim to be the shard leader! existing=org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ReplicaMetricsBuilder@56e219b9 and current org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ReplicaMetricsBuilder@406bcfd8
>       at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ShardMetricsBuilder.lambda$build$0(CollectionMetricsBuilder.java:84)
>       at java.base/java.util.HashMap.forEach(HashMap.java:1429)
>       at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ShardMetricsBuilder.build(CollectionMetricsBuilder.java:76)
>       at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder.lambda$build$0(CollectionMetricsBuilder.java:39)
>       at java.base/java.util.HashMap.forEach(HashMap.java:1429)
>       at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder.build(CollectionMetricsBuilder.java:39)
>       at org.apache.solr.cluster.placement.impl.AttributeFetcherImpl.lambda$fetchAttributes$17(AttributeFetcherImpl.java:213)
>       at java.base/java.util.HashMap.forEach(HashMap.java:1429)
>       at org.apache.solr.cluster.placement.impl.AttributeFetcherImpl.fetchAttributes(AttributeFetcherImpl.java:212)
>       at org.apache.solr.cluster.placement.plugins.AffinityPlacementFactory$AffinityPlacementPlugin.getBaseWeightedNodes(AffinityPlacementFactory.java:284)
>       at org.apache.solr.cluster.placement.plugins.OrderedNodePlacementPlugin.getWeightedNodes(OrderedNodePlacementPlugin.java:311)
>       at org.apache.solr.cluster.placement.plugins.OrderedNodePlacementPlugin.computePlacements(OrderedNodePlacementPlugin.java:85)
>       at org.apache.solr.cluster.placement.impl.PlacementPluginAssignStrategy.assign(PlacementPluginAssignStrategy.java:84)
>       at org.apache.solr.cloud.api.collections.Assign$AssignStrategy.assign(Assign.java:446)
>       at org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:689)
> {code}
>  
>  
> This indicates that more than one replica for a given Shard has responded 
> with leader=true in the replica metrics.
> I think there are legitimate reasons this can occur:
> 1. It may be fundamentally impossible to always build a consistent view of 
> shard leadership by querying a set of distributed nodes.
> 2. /admin/metrics requests are sent to each node sequentially, so shard 
> leadership may change between the requests made to different nodes that 
> host replicas of a shard.
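
The second cause can be illustrated with a toy model in Java. This is not
Solr code: the class, the method names and the single-shard setup are all
hypothetical, and the leadership change is injected at a fixed step purely
to make the effect reproducible. Each node is polled in turn and reports
whether its replica is the leader at the moment it is asked:

```java
import java.util.ArrayList;
import java.util.List;

final class SequentialPollDemo {
    // Poll nodes 0..nodeCount-1 in order. Leadership moves from leaderBefore
    // to leaderAfter just before the node at index changeAtStep is polled.
    static List<Boolean> pollSequentially(
            int nodeCount, int leaderBefore, int leaderAfter, int changeAtStep) {
        List<Boolean> claims = new ArrayList<>();
        int leader = leaderBefore;
        for (int node = 0; node < nodeCount; node++) {
            if (node == changeAtStep) {
                leader = leaderAfter; // leadership changes between two requests
            }
            claims.add(node == leader); // this node's leader=true/false answer
        }
        return claims;
    }

    // Number of nodes that answered leader=true.
    static long countLeaders(List<Boolean> claims) {
        return claims.stream().filter(Boolean::booleanValue).count();
    }

    public static void main(String[] args) {
        // Node 0 answers while it is still leader; node 2 answers after taking over.
        List<Boolean> claims = pollSequentially(3, 0, 2, 1);
        System.out.println("leader claims observed: " + countLeaders(claims));
        // prints "leader claims observed: 2"
    }
}
```

Both answers were true at the time they were given, yet the combined view
contains two leaders for the same shard.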



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
