[ https://issues.apache.org/jira/browse/SOLR-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000279#comment-17000279 ]
Andrzej Bialecki commented on SOLR-14123:
-----------------------------------------
Also, {{autoAddReplicas}} is a reactive measure that responds to {{nodeLost}}
events; it is not a proactive measure that monitors the number of replicas. In
some situations autoscaling events may be lost, or their execution interrupted,
without the failure triggering a retry.
For example, if a lost replica is being re-created on a node that itself goes
down during replica creation, the replica may stay lost: it is not yet
visible on the target node (it's still being created), but at this point the
autoscaling trigger considers the original {{nodeLost}} event "handled", so it
does not re-create the event. See SOLR-12749, and the description in SOLR-13828
(it was only partially fixed).
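
To illustrate the difference, here is a minimal sketch of what a proactive check could look like: instead of reacting to {{nodeLost}} events, it compares the number of healthy replicas per shard against the desired count. It assumes the JSON shape returned by the Collections API CLUSTERSTATUS action ({{cluster.live_nodes}} and {{cluster.collections.<name>.shards.<shard>.replicas}}) and treats {{replicationFactor}} as the desired count. The function name {{find_under_replicated}} is hypothetical and not part of Solr.

```python
from typing import Any, Dict, List, Tuple


def find_under_replicated(
    cluster_status: Dict[str, Any],
) -> List[Tuple[str, str, int, int]]:
    """Return (collection, shard, healthy, expected) for every shard that has
    fewer active replicas on live nodes than its replicationFactor.

    Assumption: `cluster_status` is the parsed JSON of a CLUSTERSTATUS response.
    """
    cluster = cluster_status.get("cluster", {})
    live_nodes = set(cluster.get("live_nodes", []))
    shortfalls = []
    for coll_name, coll in sorted(cluster.get("collections", {}).items()):
        # replicationFactor may be reported as a string, so normalise it.
        expected = int(coll.get("replicationFactor", 0))
        for shard_name, shard in sorted(coll.get("shards", {}).items()):
            # A replica counts only if it is active AND its node is still live;
            # a replica on a dead node can keep a stale "active" state.
            healthy = sum(
                1
                for replica in shard.get("replicas", {}).values()
                if replica.get("state") == "active"
                and replica.get("node_name") in live_nodes
            )
            if healthy < expected:
                shortfalls.append((coll_name, shard_name, healthy, expected))
    return shortfalls


if __name__ == "__main__":
    # Hand-written sample data (not real cluster output): node n2 is down.
    sample = {
        "cluster": {
            "live_nodes": ["n1:8983_solr"],
            "collections": {
                "c1": {
                    "replicationFactor": "2",
                    "shards": {
                        "shard1": {
                            "replicas": {
                                "core_node1": {"state": "active", "node_name": "n1:8983_solr"},
                                "core_node2": {"state": "active", "node_name": "n2:8983_solr"},
                            }
                        }
                    },
                }
            },
        }
    }
    for coll, shard, healthy, expected in find_under_replicated(sample):
        print(f"{coll}/{shard}: {healthy} of {expected} replicas healthy")
```

Run periodically, a check like this would still notice the missing replica in the scenario above, because it looks at the current state rather than at whether an event was marked "handled".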
> autoAddReplicas is not reliable when multiple nodes go down.
> ------------------------------------------------------------
>
> Key: SOLR-14123
> URL: https://issues.apache.org/jira/browse/SOLR-14123
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: AutoScaling
> Affects Versions: 8.3
> Reporter: David Hunt
> Priority: Major
> Labels: autoscale
>
> I started noticing problems in our production environment with indexing being
> blocked due to a minimum replication factor not being met. We have
> autoAddReplicas triggers in place to add replicas when nodes are lost, but they
> don't seem to add all of the replicas that were lost with those nodes. I’ve
> been able to reproduce this behavior consistently in a development
> environment.
> Repro:
> # Set up a 10-node SolrCloud cluster.
> # Create an autoAddReplicas trigger on nodeLost with waitFor set to 10
> minutes.
> # Create 15 collections with 2 shards and 4 replicas.
> # Kill 3 Solr nodes.
> # 15 minutes later kill 1 more Solr node.
> Results:
> Monitor your shards/replicas. You’ll see some replicas added to make up for
> the lost replicas but not all. An hour later many shards are still missing
> replicas.
> Expected:
> All lost replicas should be added on the 6 remaining healthy nodes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)