[ https://issues.apache.org/jira/browse/SOLR-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018818#comment-18018818 ]
Andrzej Bialecki commented on SOLR-12729:
-----------------------------------------
Re. the existing design: yes, that's the way it was supposed to work, and I
agree that it doesn't work well in these edge cases.

Re. your proposed change: at the moment I can't find any reason why it
shouldn't work :) However, it's been 7 years since I worked on this code, and
I still remember quite a few nasty surprises regarding the proper ordering of
steps, rolling back while new requests may be coming in, etc., so I can't
guarantee that something else won't break.

Maybe a simpler fix would be to recognize these additional cases and do a
re-lock + cleanup, or reject the new request?
> SplitShardCmd should lock the parent shard to prevent parallel splitting
> requests
> ---------------------------------------------------------------------------------
>
> Key: SOLR-12729
> URL: https://issues.apache.org/jira/browse/SOLR-12729
> Project: Solr
> Issue Type: Bug
> Components: AutoScaling
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Priority: Major
> Fix For: 7.6, 8.0
>
>
> This scenario was discovered by the simulation framework, but it exists also
> in the non-simulated code.
> When {{IndexSizeTrigger}} requests SPLITSHARD, which is then successfully
> started and “completed” from the point of view of {{ExecutePlanAction}}, it
> can still take a significant amount of time until the new replicas fully
> recover and trigger the switch of shard states (parent to INACTIVE, child
> from RECOVERY to ACTIVE).
> If this time is longer than the trigger's {{waitFor}}, the trigger will issue
> the same SPLITSHARD request again. {{SplitShardCmd}} doesn't prevent this new
> request from being processed because the parent shard is still ACTIVE.
> However, a section of the code in {{SplitShardCmd}} will realize that
> sub-slices with the target names already exist and they are not active, at
> which point it will delete the new sub-slices ({{SplitShardCmd:182}}).
> The end result is an infinite loop, where {{IndexSizeTrigger}} will keep
> generating SPLITSHARD, and {{SplitShardCmd}} will keep deleting the
> recovering sub-slices created by the previous command.
> A simple solution is for the parent shard to be marked to indicate that it’s
> in the process of splitting, so that no other split is attempted on the same
> shard. Furthermore, {{IndexSizeTrigger}} could temporarily exclude such
> shards from monitoring.
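
To make the proposal in the description concrete, here is a minimal sketch of the "mark the parent shard as splitting" idea. It is not the actual Solr code: the class SplitLockRegistry and its methods are hypothetical names, and it keeps the markers in memory only, whereas a real cluster would need the marker in shared cluster state so that all nodes see it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical in-memory registry of parent shards that have a split in
 * progress. Illustration only; not part of Solr.
 */
final class SplitLockRegistry {
    // Thread-safe set of "collection/shard" keys that are currently locked.
    private final Set<String> splitting = ConcurrentHashMap.newKeySet();

    // Assumes collection names do not contain '/'.
    private static String key(String collection, String shard) {
        return collection + "/" + shard;
    }

    /** Returns true if the lock was acquired, false if a split is already running. */
    boolean tryLock(String collection, String shard) {
        // Set.add is atomic here: only one concurrent caller gets 'true'.
        return splitting.add(key(collection, shard));
    }

    /** Releases the lock once the split has finished or has been rolled back. */
    void unlock(String collection, String shard) {
        splitting.remove(key(collection, shard));
    }

    /** Lets a monitor (for example a trigger) skip shards that are being split. */
    boolean isSplitting(String collection, String shard) {
        return splitting.contains(key(collection, shard));
    }
}
```

With something like this, a second SPLITSHARD request for the same parent would be rejected when tryLock returns false, instead of deleting the recovering sub-slices, and the trigger could call isSplitting to exclude the shard from monitoring. The hard part mentioned in the comment above (when to unlock, and how to clean up after a failed split) is deliberately left out of the sketch.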