Paulo Motta created CASSANDRA-21115:
---------------------------------------
Summary: Auto-repair skips incomplete first repair after node
restart due to ordering of checks
Key: CASSANDRA-21115
URL: https://issues.apache.org/jira/browse/CASSANDRA-21115
Project: Apache Cassandra
Issue Type: Bug
Reporter: Paulo Motta
When a node crashes during its very first auto-repair, the repair is not
resumed after restart. Instead, the "too soon to repair" check skips it for
up to 24 hours.
*What happens*
1. Node joins the cluster, no repair history exists yet
2. insertNewRepairHistory() creates a record with both repair_start_ts and
repair_finish_ts set to the current time (let's call it T1)
3. When repair actually starts, only repair_start_ts gets updated to T2
4. Node crashes mid-repair
5. On restart, tooSoonToRunRepair() is called before myTurnToRunRepair()
6. It queries repair_finish_ts which is still T1 (the record creation time,
not an actual repair completion)
7. If less than 24h (the min_repair_interval in this scenario) has passed
since T1, the check returns "too soon" and bails out
8. The logic in myTurnToRunRepair() that detects ongoing repairs
(repair_start_ts > repair_finish_ts) never gets a chance to run
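
The sequence above can be modelled with a small, self-contained sketch. It is not Cassandra's code: the History class, markRepairStarted(), repairInProgress() and decideAsReported() are hypothetical stand-ins, the method signatures are assumptions, and the 24-hour interval is hard-coded to match the scenario in this report. It only illustrates why the ordering of the two checks hides the unfinished repair.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified stand-in for the scheduler logic; not Cassandra's actual classes or signatures.
final class RepairOrderingSketch {
    // Assumed interval, matching the 24 hours mentioned in the report.
    static final Duration MIN_REPAIR_INTERVAL = Duration.ofHours(24);

    // Hypothetical in-memory view of one node's repair-history row.
    static final class History {
        final Instant repairStartTs;
        final Instant repairFinishTs;

        History(Instant repairStartTs, Instant repairFinishTs) {
            this.repairStartTs = repairStartTs;
            this.repairFinishTs = repairFinishTs;
        }
    }

    // Mirrors step 2: a new record gets both timestamps set to the creation time.
    static History insertNewRepairHistory(Instant now) {
        return new History(now, now);
    }

    // Mirrors step 3: starting a repair only moves repair_start_ts forward.
    static History markRepairStarted(History h, Instant now) {
        return new History(now, h.repairFinishTs);
    }

    // Mirrors steps 6-7: only repair_finish_ts is consulted.
    static boolean tooSoonToRunRepair(History h, Instant now) {
        return Duration.between(h.repairFinishTs, now).compareTo(MIN_REPAIR_INTERVAL) < 0;
    }

    // Mirrors step 8: the condition that marks an unfinished repair.
    static boolean repairInProgress(History h) {
        return h.repairStartTs.isAfter(h.repairFinishTs);
    }

    // Check order as reported: the interval check returns before the in-progress check runs.
    static String decideAsReported(History h, Instant now) {
        if (tooSoonToRunRepair(h, now)) return "SKIP_TOO_SOON";
        return repairInProgress(h) ? "RESUME" : "RUN_TURN_LOGIC";
    }

    public static void main(String[] args) {
        Instant t1 = Instant.parse("2025-01-01T00:00:00Z");
        History h = insertNewRepairHistory(t1);                    // step 2, T1
        h = markRepairStarted(h, t1.plus(Duration.ofMinutes(5)));  // step 3, T2
        Instant restart = t1.plus(Duration.ofHours(2));            // steps 4-5
        System.out.println("in progress: " + repairInProgress(h));
        System.out.println("decision: " + decideAsReported(h, restart));
    }
}
```

Running main prints "in progress: true" followed by "decision: SKIP_TOO_SOON": the record clearly describes an unfinished repair, yet the interval check returns first.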
*Expected behavior*
A repair that was in progress should be resumed after restart, regardless of
the min_repair_interval setting. The "too soon" check should not apply to
incomplete repairs.
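
One possible way to express this expectation is to test for an incomplete repair before applying the interval check. The sketch below is an illustration under assumptions, not the actual patch: the Decision enum, the decide() method and its parameters are hypothetical, and timestamps are plain epoch milliseconds for brevity.

```java
// Illustrative ordering only; names and signatures are hypothetical, not Cassandra's API.
final class ResumeFirstSketch {
    enum Decision { RESUME_INCOMPLETE, SKIP_TOO_SOON, PROCEED_TO_TURN_CHECK }

    static Decision decide(long repairStartTs, long repairFinishTs,
                           long nowMillis, long minRepairIntervalMillis) {
        // An unfinished repair (start newer than finish) is resumed regardless of the interval.
        if (repairStartTs > repairFinishTs) return Decision.RESUME_INCOMPLETE;
        // Only completed repairs are subject to the "too soon" check.
        if (nowMillis - repairFinishTs < minRepairIntervalMillis) return Decision.SKIP_TOO_SOON;
        return Decision.PROCEED_TO_TURN_CHECK;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long hour = 60 * minute;
        long t1 = 0L;           // record creation time
        long t2 = 5 * minute;   // repair start
        long t3 = 30 * minute;  // repair finish (completed case only)

        // Interrupted first repair: start = T2, finish still T1, restart two hours later.
        System.out.println(decide(t2, t1, 2 * hour, 24 * hour));
        // Completed repair: start = T2, finish = T3, checked again two hours after T1.
        System.out.println(decide(t2, t3, 2 * hour, 24 * hour));
    }
}
```

With these inputs the interrupted repair yields RESUME_INCOMPLETE and the completed one yields SKIP_TOO_SOON, so the interval still throttles finished repairs.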
*How to reproduce*
1. Set up a fresh node with auto-repair enabled
2. Wait for the first repair to start
3. Kill the node before repair completes
4. Restart the node within 24 hours
5. Observe that repair is skipped with "Too soon to run repair" in the logs