David Smiley created SOLR-18277:
-----------------------------------

             Summary: SplitShardCmd cleanupAfterFailure race flaw
                 Key: SOLR-18277
                 URL: https://issues.apache.org/jira/browse/SOLR-18277
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
            Reporter: David Smiley


{{testSplitAfterFailedSplit2}} fails because the parent shard (shard1) is 
permanently stuck in INACTIVE state after a failed split attempt, preventing 
the retry split from succeeding.

_Disclaimer: issue is AI generated_

h3. Root Cause

There is a race condition in {{SplitShardCmd.cleanupAfterFailure()}}:
 # The normal split flow queues an Overseer state update: {{shard1→inactive, 
shard1_0→active, shard1_1→active}}
 # {{cleanupAfterFailure()}} calls {{forceUpdateCollection()}}, but reads the 
collection state *before* the Overseer has processed the update from step 1
 # Cleanup sees shard1 still as ACTIVE, so it does *not* include 
{{shard1→active}} in its corrective state update
 # Cleanup queues: {{shard1_0→construction, shard1_1→construction}}
 # Overseer processes the update from step 1: shard1 goes INACTIVE
 # Overseer processes the cleanup update from step 4: sub-shards go to 
CONSTRUCTION (nothing restores shard1)
 # Sub-shards are then deleted. shard1 is permanently stuck INACTIVE with no 
sub-shards.
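
The ordering above can be replayed with a toy model. This is only a sketch, not Solr code: the class {{SplitRaceSketch}}, the {{simulate()}} method, the in-memory queue, and the initial sub-shard states are all hypothetical stand-ins chosen to show why a stale read leaves shard1 INACTIVE.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the race. None of these names are Solr APIs. */
public class SplitRaceSketch {

  /** Replays steps 1-7 in order and returns the final shard states. */
  public static Map<String, String> simulate() {
    // Published state before the Overseer has processed anything (assumed values).
    Map<String, String> state = new LinkedHashMap<>();
    state.put("shard1", "active");
    state.put("shard1_0", "construction");
    state.put("shard1_1", "construction");

    // Stand-in for the Overseer's FIFO state update queue.
    Queue<Map<String, String>> overseerQueue = new ArrayDeque<>();

    // Step 1: the normal split flow queues the switch-over.
    overseerQueue.add(
        Map.of("shard1", "inactive", "shard1_0", "active", "shard1_1", "active"));

    // Steps 2-3: cleanup reads the state before the queue is drained,
    // still sees shard1 as active, and so skips restoring it.
    Map<String, String> cleanupUpdate = new LinkedHashMap<>();
    if (!"active".equals(state.get("shard1"))) {
      cleanupUpdate.put("shard1", "active");
    }

    // Step 4: cleanup queues only the sub-shard changes.
    cleanupUpdate.put("shard1_0", "construction");
    cleanupUpdate.put("shard1_1", "construction");
    overseerQueue.add(cleanupUpdate);

    // Steps 5-6: the Overseer applies both updates in queue order.
    while (!overseerQueue.isEmpty()) {
      state.putAll(overseerQueue.poll());
    }

    // Step 7: the sub-shards are deleted.
    state.remove("shard1_0");
    state.remove("shard1_1");
    return state;
  }

  public static void main(String[] args) {
    System.out.println(simulate()); // prints {shard1=inactive}
  }
}
```

In this model the conditional in steps 2-3 is the whole bug: the decision is made against a snapshot that the queued update from step 1 later invalidates.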

h3. Impact

The retry split fails with: {{Parent slice is not active: collection1/ shard1, 
state=inactive}}

h3. Suggested Fix

{{cleanupAfterFailure()}} should unconditionally include {{parentShard→active}} 
in its state update propMap (or re-read state after ensuring the Overseer queue 
is drained), rather than relying on a point-in-time read that may be stale due 
to the concurrent Overseer processing.
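
A minimal sketch of the first option, assuming the cleanup update can be modeled as a map of shard name to target state. {{CleanupFixSketch}}, {{buildCleanupUpdate}}, and {{replay}} are hypothetical names for illustration and do not reflect the actual {{SplitShardCmd}} code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the suggested fix. None of these names are Solr APIs. */
public class CleanupFixSketch {

  /** Builds the cleanup update without consulting any (possibly stale) state read. */
  public static Map<String, String> buildCleanupUpdate(
      String parentShard, List<String> subShards) {
    Map<String, String> propMap = new LinkedHashMap<>();
    // Always restore the parent, whatever a point-in-time read reported.
    propMap.put(parentShard, "active");
    for (String subShard : subShards) {
      propMap.put(subShard, "construction");
    }
    return propMap;
  }

  /** Applies queued updates in order, as a FIFO state update queue would. */
  public static Map<String, String> replay(
      Map<String, String> initial, List<Map<String, String>> updates) {
    Map<String, String> state = new LinkedHashMap<>(initial);
    for (Map<String, String> update : updates) {
      state.putAll(update);
    }
    return state;
  }

  public static void main(String[] args) {
    // The split flow's update is already queued ahead of the cleanup update.
    Map<String, String> splitUpdate =
        Map.of("shard1", "inactive", "shard1_0", "active", "shard1_1", "active");
    Map<String, String> cleanupUpdate =
        buildCleanupUpdate("shard1", List.of("shard1_0", "shard1_1"));

    Map<String, String> result =
        replay(Map.of("shard1", "active"), List.of(splitUpdate, cleanupUpdate));
    System.out.println(result.get("shard1")); // prints active
  }
}
```

Because the cleanup update is queued after the split flow's update, writing the parent state unconditionally means the later write wins regardless of when the state was read.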



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
