[ https://issues.apache.org/jira/browse/SOLR-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824141#comment-17824141 ]
Paul McArthur commented on SOLR-17198:
--------------------------------------

I think there is a fairly trivial solution for this issue. The metric that the placement plugin is interested in is the size of the index, and I don't think it matters if the Attribute Fetcher observes multiple leaders for a shard. It could just pick any replica that claims leadership, because all of them are able to provide the required index size metric.

I am not sure whether the Attribute Fetcher serves other use cases where this would be a problem; I haven't found any yet.

I will put together a PR with a proposed solution.

> Affinity Placement Plugin can fail when getting metrics, if multiple replicas claim shard leadership
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-17198
>                 URL: https://issues.apache.org/jira/browse/SOLR-17198
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 9.4
>            Reporter: Paul McArthur
>            Priority: Minor
>
> Using Solr 9.4 with 16 nodes, I observe that about 25% of our Split Shard requests are failing. The error is a RuntimeException raised by the AttributeFetcher as it compiles the metrics that will be used by the plugin.
>
> The AttributeFetcher makes /admin/metrics requests to each node, and currently it expects to be able to establish a consistent view of shard leadership across the cluster from the responses.
> However, we see this exception:
>
> {code:java}
> Caused by: java.lang.RuntimeException: two replicas claim to be the shard leader!
> existing=org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ReplicaMetricsBuilder@56e219b9 and current org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ReplicaMetricsBuilder@406bcfd8
>     at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ShardMetricsBuilder.lambda$build$0(CollectionMetricsBuilder.java:84)
>     at java.base/java.util.HashMap.forEach(HashMap.java:1429)
>     at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder$ShardMetricsBuilder.build(CollectionMetricsBuilder.java:76)
>     at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder.lambda$build$0(CollectionMetricsBuilder.java:39)
>     at java.base/java.util.HashMap.forEach(HashMap.java:1429)
>     at org.apache.solr.cluster.placement.impl.CollectionMetricsBuilder.build(CollectionMetricsBuilder.java:39)
>     at org.apache.solr.cluster.placement.impl.AttributeFetcherImpl.lambda$fetchAttributes$17(AttributeFetcherImpl.java:213)
>     at java.base/java.util.HashMap.forEach(HashMap.java:1429)
>     at org.apache.solr.cluster.placement.impl.AttributeFetcherImpl.fetchAttributes(AttributeFetcherImpl.java:212)
>     at org.apache.solr.cluster.placement.plugins.AffinityPlacementFactory$AffinityPlacementPlugin.getBaseWeightedNodes(AffinityPlacementFactory.java:284)
>     at org.apache.solr.cluster.placement.plugins.OrderedNodePlacementPlugin.getWeightedNodes(OrderedNodePlacementPlugin.java:311)
>     at org.apache.solr.cluster.placement.plugins.OrderedNodePlacementPlugin.computePlacements(OrderedNodePlacementPlugin.java:85)
>     at org.apache.solr.cluster.placement.impl.PlacementPluginAssignStrategy.assign(PlacementPluginAssignStrategy.java:84)
>     at org.apache.solr.cloud.api.collections.Assign$AssignStrategy.assign(Assign.java:446)
>     at org.apache.solr.cloud.api.collections.SplitShardCmd.split(SplitShardCmd.java:689)
> {code}
>
> This indicates that more than one replica for a given Shard has responded with leader=true in the replica metrics.
> I think there are legitimate reasons this can occur:
> 1. It may be fundamentally impossible to always build a consistent view of shard leadership by querying a set of distributed nodes.
> 2. /admin/metrics requests are sent sequentially to each node in turn. Shard leadership may change between the requests made to the different nodes that host replicas of a shard.
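
For illustration only, here is a minimal sketch of the approach suggested in the comment above: instead of throwing when a second replica reports leader=true, keep the first claim seen and carry on. The names ReplicaSnapshot, LeaderSelectionSketch and pickLeader are hypothetical and are not part of Solr's CollectionMetricsBuilder API; the replica names and index sizes are made-up values. It rests on the assumption stated in the comment, that any replica claiming leadership can supply the index size metric.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for one replica's metrics; not Solr's ReplicaMetricsBuilder. */
final class ReplicaSnapshot {
  final String name;
  final boolean claimsLeader; // the leader flag as reported by the hosting node
  final long indexSizeBytes;  // illustrative index size metric

  ReplicaSnapshot(String name, boolean claimsLeader, long indexSizeBytes) {
    this.name = name;
    this.claimsLeader = claimsLeader;
    this.indexSizeBytes = indexSizeBytes;
  }
}

final class LeaderSelectionSketch {

  /**
   * Returns the first replica (in map iteration order) that claims leadership.
   * Additional claims are tolerated rather than treated as an error.
   * Returns an empty Optional if no replica claims leadership.
   */
  static Optional<ReplicaSnapshot> pickLeader(Map<String, ReplicaSnapshot> replicas) {
    ReplicaSnapshot leader = null;
    for (ReplicaSnapshot replica : replicas.values()) {
      if (replica.claimsLeader && leader == null) {
        leader = replica; // keep the first claim; ignore later ones
      }
    }
    return Optional.ofNullable(leader);
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so the result is deterministic here.
    Map<String, ReplicaSnapshot> shard = new LinkedHashMap<>();
    shard.put("core_node1", new ReplicaSnapshot("core_node1", true, 1_000L));
    shard.put("core_node2", new ReplicaSnapshot("core_node2", true, 1_010L));
    shard.put("core_node3", new ReplicaSnapshot("core_node3", false, 990L));

    // Two replicas claim leadership; the first one is used for the size metric.
    pickLeader(shard).ifPresent(r -> System.out.println(r.name + " " + r.indexSizeBytes));
  }
}
```

With a plain HashMap the replica chosen would depend on iteration order, which is acceptable only if, as argued above, every claimant reports a usable index size.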