[ 
https://issues.apache.org/jira/browse/SOLR-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated SOLR-18277:
--------------------------------
    Attachment: improve_ShardSplitTest_robustness.patch

> SplitShard cleanupAfterFailure race flaw
> ----------------------------------------
>
>                 Key: SOLR-18277
>                 URL: https://issues.apache.org/jira/browse/SOLR-18277
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: David Smiley
>            Priority: Major
>         Attachments: 
> OUTPUT-org.apache.solr.cloud.api.collections.ShardSplitTest.txt, 
> improve_ShardSplitTest_robustness.patch
>
>
> {{testSplitAfterFailedSplit2}} fails because the parent shard (shard1) is 
> permanently stuck in INACTIVE state after a failed split attempt, preventing 
> the retry split from succeeding.
> _Disclaimer: issue is AI generated_
> h3. Root Cause
> There is a race condition in {{SplitShardCmd.cleanupAfterFailure()}}:
>  # The normal split flow queues an Overseer state update (message 1): 
> {{shard1→inactive, shard1_0→active, shard1_1→active}}
>  # {{cleanupAfterFailure()}} calls {{forceUpdateCollection()}}, but reads 
> the collection state *before* the Overseer has processed message 1
>  # Cleanup sees shard1 still as ACTIVE, so it does *not* include 
> {{shard1→active}} in its corrective state update
>  # Cleanup queues its own update (message 2): {{shard1_0→construction, 
> shard1_1→construction}}
>  # Overseer processes message 1: shard1 goes INACTIVE
>  # Overseer processes message 2: sub-shards go to CONSTRUCTION (no fix for 
> shard1)
>  # Sub-shards are then deleted. shard1 is permanently stuck INACTIVE with no 
> sub-shards.
> h3. Impact
> The retry split fails with: {{Parent slice is not active: collection1/ 
> shard1, state=inactive}}
> h3. Suggested Fix
> {{cleanupAfterFailure()}} should unconditionally include 
> {{parentShard→active}} in its state update propMap (or re-read the state 
> after ensuring the Overseer queue is drained), rather than relying on a 
> point-in-time read that may be stale because the Overseer is processing 
> updates concurrently.
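
The ordering problem in the Root Cause steps, and the effect of the suggested fix, can be replayed with a small in-memory model. This is only a sketch and not Solr code: the class SplitRaceSketch, its State enum, the replay() method and the alwaysRestoreParent flag are hypothetical names, and a FIFO queue of maps stands in for the Overseer state update queue. The real Overseer runs asynchronously; here the interleaving is fixed so that cleanup always reads the state before message 1 is applied.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of the race described above. This is NOT Solr code: the class,
// the enum, and replay() are hypothetical names used only for illustration.
class SplitRaceSketch {
    enum State { ACTIVE, INACTIVE, CONSTRUCTION }

    // Replays steps 1-7 against an in-memory map and returns the final states.
    // alwaysRestoreParent=false models the conditional behavior in the report;
    // alwaysRestoreParent=true models the suggested fix.
    static Map<String, State> replay(boolean alwaysRestoreParent) {
        Map<String, State> cluster = new LinkedHashMap<>();
        cluster.put("shard1", State.ACTIVE);
        cluster.put("shard1_0", State.CONSTRUCTION);
        cluster.put("shard1_1", State.CONSTRUCTION);

        // Stand-in for the Overseer state update queue (processed in FIFO order).
        Queue<Map<String, State>> overseerQueue = new ArrayDeque<>();

        // Step 1: the normal split flow queues message 1.
        overseerQueue.add(Map.of(
                "shard1", State.INACTIVE,
                "shard1_0", State.ACTIVE,
                "shard1_1", State.ACTIVE));

        // Steps 2-3: cleanup reads the state before message 1 has been applied,
        // so the parent still looks ACTIVE at this point.
        Map<String, State> cleanup = new LinkedHashMap<>();
        boolean parentLooksInactive = cluster.get("shard1") != State.ACTIVE;
        if (alwaysRestoreParent || parentLooksInactive) {
            cleanup.put("shard1", State.ACTIVE);
        }

        // Step 4: cleanup queues message 2.
        cleanup.put("shard1_0", State.CONSTRUCTION);
        cleanup.put("shard1_1", State.CONSTRUCTION);
        overseerQueue.add(cleanup);

        // Steps 5-6: the Overseer applies message 1, then message 2.
        while (!overseerQueue.isEmpty()) {
            cluster.putAll(overseerQueue.poll());
        }

        // Step 7: the sub-shards are deleted.
        cluster.remove("shard1_0");
        cluster.remove("shard1_1");
        return cluster;
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // prints {shard1=INACTIVE}
        System.out.println(replay(true));  // prints {shard1=ACTIVE}
    }
}
```

With the conditional check, the stale read leaves shard1 INACTIVE after both messages are applied. Including the parent unconditionally makes message 2 override message 1 regardless of when the read happened, which is why that variant does not depend on draining the queue first.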



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
