David Smiley created SOLR-18277:
-----------------------------------
Summary: SplitShardCmd cleanupAfterFailure race flaw
Key: SOLR-18277
URL: https://issues.apache.org/jira/browse/SOLR-18277
Project: Solr
Issue Type: Bug
Components: SolrCloud
Reporter: David Smiley
{{testSplitAfterFailedSplit2}} fails because the parent shard (shard1) is
permanently stuck in INACTIVE state after a failed split attempt, preventing
the retry split from succeeding.
_Disclaimer: issue is AI generated_
h3. Root Cause
There is a race condition in {{SplitShardCmd.cleanupAfterFailure()}}:
# The normal split flow queues an Overseer state update (message 1):
{{shard1→inactive, shard1_0→active, shard1_1→active}}
# {{cleanupAfterFailure()}} calls {{forceUpdateCollection()}} — but reads the
collection state *before* the Overseer has processed message 1
# Cleanup sees shard1 still as ACTIVE, so it does *not* include
{{shard1→active}} in its corrective state update
# Cleanup queues message 2: {{shard1_0→construction, shard1_1→construction}}
# Overseer processes message 1: shard1 goes INACTIVE
# Overseer processes message 2: sub-shards go to CONSTRUCTION (no fix for
shard1)
# Sub-shards are then deleted. shard1 is permanently stuck INACTIVE with no
sub-shards.
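The interleaving above can be replayed deterministically with a toy model. This is only a sketch: the class {{SplitCleanupRaceSketch}}, the {{replay()}} method, the in-memory FIFO queue, and the assumed starting states of the sub-shards are all hypothetical stand-ins for illustration, not Solr's actual Overseer or {{SplitShardCmd}} code.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

public class SplitCleanupRaceSketch {
    public enum State { ACTIVE, INACTIVE, CONSTRUCTION }

    /** Replays the interleaving described above and returns the final shard states. */
    public static Map<String, State> replay() {
        // Assumed starting point: parent active, sub-shards still under construction.
        Map<String, State> cluster = new LinkedHashMap<>();
        cluster.put("shard1", State.ACTIVE);
        cluster.put("shard1_0", State.CONSTRUCTION);
        cluster.put("shard1_1", State.CONSTRUCTION);

        // Hypothetical stand-in for the Overseer state update queue (FIFO).
        Queue<Map<String, State>> overseerQueue = new ArrayDeque<>();

        // Step 1: the split flow queues message 1.
        overseerQueue.add(Map.of(
            "shard1", State.INACTIVE,
            "shard1_0", State.ACTIVE,
            "shard1_1", State.ACTIVE));

        // Steps 2-3: cleanup reads the state before message 1 has been applied,
        // so the parent still looks ACTIVE and is left out of the corrective update.
        State parentSeen = cluster.get("shard1");
        Map<String, State> cleanup = new LinkedHashMap<>();
        if (parentSeen != State.ACTIVE) {
            cleanup.put("shard1", State.ACTIVE);
        }

        // Step 4: cleanup queues message 2.
        cleanup.put("shard1_0", State.CONSTRUCTION);
        cleanup.put("shard1_1", State.CONSTRUCTION);
        overseerQueue.add(cleanup);

        // Steps 5-6: the queue is drained in order.
        while (!overseerQueue.isEmpty()) {
            cluster.putAll(overseerQueue.poll());
        }

        // Step 7: the sub-shards are deleted.
        cluster.remove("shard1_0");
        cluster.remove("shard1_1");
        return cluster;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints {shard1=INACTIVE}
    }
}
```

In this model the parent ends up INACTIVE with no sub-shards, which matches the stuck state described in step 7.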
h3. Impact
The retry split fails with: {{Parent slice is not active: collection1/ shard1,
state=inactive}}
h3. Suggested Fix
{{cleanupAfterFailure()}} should unconditionally include {{parentShard→active}}
in its state update {{propMap}} (or re-read state after ensuring the Overseer
queue is drained), rather than relying on a point-in-time read that may be
stale due to concurrent Overseer processing.
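A minimal sketch of the first option, under the same toy-model assumptions: the class {{UnconditionalParentResetSketch}}, the helper {{cleanupProps()}}, and the plain string map are hypothetical and only illustrate the idea of not consulting the possibly stale parent state. They are not the real {{propMap}} construction in {{SplitShardCmd}}.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class UnconditionalParentResetSketch {

    /** Builds the corrective update without looking at the current parent state. */
    public static Map<String, String> cleanupProps(String parentShard, List<String> subShards) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(parentShard, "active"); // always included, even if the parent looks active
        for (String sub : subShards) {
            props.put(sub, "construction");
        }
        return props;
    }

    /** Replays the same message ordering as the failure, using the unconditional update. */
    public static Map<String, String> replay() {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("shard1", "active");
        cluster.put("shard1_0", "construction");
        cluster.put("shard1_1", "construction");

        // Hypothetical stand-in for the Overseer queue, processed in FIFO order.
        Queue<Map<String, String>> overseerQueue = new ArrayDeque<>();
        overseerQueue.add(Map.of(
            "shard1", "inactive",
            "shard1_0", "active",
            "shard1_1", "active"));
        overseerQueue.add(cleanupProps("shard1", List.of("shard1_0", "shard1_1")));

        while (!overseerQueue.isEmpty()) {
            cluster.putAll(overseerQueue.poll());
        }

        // Sub-shards are deleted after cleanup, as in the original flow.
        cluster.remove("shard1_0");
        cluster.remove("shard1_1");
        return cluster;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints {shard1=active}
    }
}
```

Because the corrective update is queued after the split's own update and always carries the parent entry, the outcome in this model no longer depends on when the cleanup read the state.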