[ 
https://issues.apache.org/jira/browse/SOLR-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Houston Putman reassigned SOLR-17331:
-------------------------------------

    Assignee: Houston Putman

> MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget is flaky
> -------------------------------------------------------------------
>
>                 Key: SOLR-17331
>                 URL: https://issues.apache.org/jira/browse/SOLR-17331
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Yohann Callea
>            Assignee: Houston Putman
>            Priority: Minor
>
> The test *_MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget_* is 
> sometimes (< 3% failure rate) failing on its last assertion, as shows the 
> [trend history of test 
> failures|#series/org.apache.solr.cloud.MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget].
>  
> This test spins off a 5 nodes cluster, creates a collection with 3 shards and 
> a replication factor of 2.
> It then vacate 2 randomly chosen nodes using the Migrate Replicas command 
> and, after the migration completion, expect the vacated node to be assigned 
> no replicas and the 6 replicas to be evenly spread across the 3 non-vacated 
> nodes (i.e., 2 replicas positioned on each node).
> However, this last assertion happen to fail as the replicas are sometimes not 
> evenly spread over the 3 non-vacated nodes.
> {code:java}
> The non-source node '127.0.0.1:36007_solr' has the wrong number of replicas 
> after the migration expected:<2> but was:<1> {code}
>  
> If we analyse more in detail a failure situation, it appears that this test 
> is inherently expected to fail under some circumstances, given how the 
> Migrate Replicas command operate.
> When migrating replicas, the new position of the replicas to be moved are 
> calculated sequentially and, for every consecutive move, the position is 
> decided according to the logic implemented by the replica placement plugin 
> currently configured.
> We can therefore end up in the following situation.
> h2. Failing scenario
> Note that this test always uses the default replica placement strategy, which 
> is Simple as of today.
> Let's assume the following initial state, after the collection creation.
> {code:java}
>         |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
> --------+---------+---------+---------+---------+---------+
> SHARD_1 |    X    |         |         |    X    |         |
> SHARD_2 |         |    X    |         |    X    |         |
> SHARD_3 |         |         |    X    |         |    X    | {code}
> The test now runs the migrate command to vacate *_NODE_3_* and 
> {*}_NODE_4_{*}. It therefore needs to go through 3 replica movements for 
> emptying these two nodes.
> h4. Move 1
> We are moving the replica of *_SHARD_1_* positioned on {*}_NODE_3_{*}.
> _*NODE_0*_ is not an eligible destination for this replica as this node is 
> already assigned a replica of {*}_SHARD_1_{*}, and both *_NODE_1_* and 
> _*NODE_2*_ can be chosen as they host the same number of replicas.
> *_NODE_1_* is arbitrarily chosen amongst the two best candidate destination 
> nodes.
> {code:java}
>         |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
> --------+---------+---------+---------+---------+---------+
> SHARD_1 |    X    |    X    |         |         |         |
> SHARD_2 |         |    X    |         |    X    |         |
> SHARD_3 |         |         |    X    |         |    X    | {code}
> h4. Move 2
> We are moving the replica of *_SHARD_2_* positioned on {*}_NODE_3_{*}.
> _*NODE_1*_ is not an eligible destination for this replica as this node is 
> already assigned a replica of {*}_SHARD_2_{*}, and both *_NODE_0_* and 
> _*NODE_2*_ can be chosen as they host the same number of replicas.
> *_NODE_0_* is arbitrarily chosen amongst the two best candidate destination 
> nodes.
> {code:java}
>         |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
> --------+---------+---------+---------+---------+---------+
> SHARD_1 |    X    |    X    |         |         |         |
> SHARD_2 |    X    |    X    |         |         |         |
> SHARD_3 |         |         |    X    |         |    X    |{code}
> h4. Move 3
> We are moving the replica of *_SHARD_3_* positioned on {*}_NODE_4_{*}.
> _*NODE_2*_ is not an eligible destination for this replica as this node is 
> already assigned a replica of {*}_SHARD_3_{*}, and both *_NODE_0_* and 
> _*NODE_1*_ can be chosen as they host the same number of replicas.
> *_NODE_1_* is arbitrarily chosen amongst the two best candidate destination 
> nodes.
> {code:java}
>         |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
> --------+---------+---------+---------+---------+---------+
> SHARD_1 |    X    |    X    |         |         |         |
> SHARD_2 |    X    |    X    |         |         |         |
> SHARD_3 |         |    X    |    X    |         |         |{code}
>  
> The test will then fail as the replicas are not evenly positioned across the 
> non-vacated nodes, while it is arguably the expected outcome in the current 
> situation given the Simple placement strategy implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to