[ 
https://issues.apache.org/jira/browse/CASSANDRA-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaydeepkumar Chovatia updated CASSANDRA-21115:
----------------------------------------------
    Reviewers: Jaydeepkumar Chovatia, Jaydeepkumar Chovatia
               Jaydeepkumar Chovatia, Jaydeepkumar Chovatia  (was: Jaydeepkumar 
Chovatia)
       Status: Review In Progress  (was: Patch Available)

> Auto-repair skips incomplete first repair after node restart due to ordering 
> of checks
> --------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21115
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21115
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Normal
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When a node starts its very first auto-repair and crashes before completing 
> it, the repair won't be resumed properly after restart. Instead, it gets 
> skipped by the "too soon to repair" check for up to 24 hours.
>
> *What happens*
>   1. Node joins the cluster, no repair history exists yet
>   2. insertNewRepairHistory() creates a record with both repair_start_ts and 
> repair_finish_ts set to the current time (let's call it T1)
>   3. When the repair actually starts, only repair_start_ts is updated (to 
> T2); repair_finish_ts stays at T1
>   4. Node crashes mid-repair
>   5. On restart, tooSoonToRunRepair() is called before myTurnToRunRepair()
>   6. It queries repair_finish_ts which is still T1 (the record creation time, 
> not an actual repair completion)
>   7. If fewer than 24 hours have passed since T1, the check returns "too 
> soon" and bails out
>   8. The logic in myTurnToRunRepair() that detects ongoing repairs 
> (repair_start_ts > repair_finish_ts) never gets a chance to run
>
> *Expected behavior*
>   A repair that was in progress should be resumed after restart, regardless 
> of the min_repair_interval setting. The "too soon" check should not apply to 
> incomplete repairs.
>
> *How to reproduce*
>   1. Set up a fresh node with auto-repair enabled
>   2. Wait for the first repair to start
>   3. Kill the node before repair completes
>   4. Restart the node within 24 hours
>   5. Observe that repair is skipped with "Too soon to run repair" in the logs
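
The ordering problem described in the report can be modelled in a few lines of Java. This is only a simplified sketch, not the actual Cassandra code: the class AutoRepairOrderingSketch, the RepairHistory type, the Decision enum, and the methods reportedOrdering() and inProgressFirst() are hypothetical names introduced for illustration. They stand in for the tooSoonToRunRepair() and myTurnToRunRepair() checks, and the minimum repair interval is passed in explicitly rather than read from configuration.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical model of the check ordering in the report; not Cassandra's code. */
final class AutoRepairOrderingSketch {

    /** Simplified stand-in for one node's repair history record. */
    static final class RepairHistory {
        final Instant repairStartTs;
        final Instant repairFinishTs;

        RepairHistory(Instant repairStartTs, Instant repairFinishTs) {
            this.repairStartTs = repairStartTs;
            this.repairFinishTs = repairFinishTs;
        }

        /** A repair counts as in progress when it started after the last finish. */
        boolean repairInProgress() {
            return repairStartTs.isAfter(repairFinishTs);
        }
    }

    enum Decision { SKIP_TOO_SOON, RESUME_REPAIR, START_REPAIR }

    /** True when less than minInterval has elapsed since repair_finish_ts. */
    static boolean tooSoon(RepairHistory h, Instant now, Duration minInterval) {
        return Duration.between(h.repairFinishTs, now).compareTo(minInterval) < 0;
    }

    /** Ordering from the report: the interval check runs before anything else. */
    static Decision reportedOrdering(RepairHistory h, Instant now, Duration minInterval) {
        if (tooSoon(h, now, minInterval)) {
            return Decision.SKIP_TOO_SOON; // in-progress detection is never reached
        }
        return h.repairInProgress() ? Decision.RESUME_REPAIR : Decision.START_REPAIR;
    }

    /** One possible reordering: an incomplete repair bypasses the interval check. */
    static Decision inProgressFirst(RepairHistory h, Instant now, Duration minInterval) {
        if (h.repairInProgress()) {
            return Decision.RESUME_REPAIR;
        }
        return tooSoon(h, now, minInterval) ? Decision.SKIP_TOO_SOON : Decision.START_REPAIR;
    }
}
```

With repair_finish_ts = T1, repair_start_ts = T2 (a few minutes after T1), a restart one hour after T1, and a 24-hour interval, reportedOrdering() yields SKIP_TOO_SOON while inProgressFirst() yields RESUME_REPAIR. For a repair that genuinely completed an hour ago, both orderings still skip, so the interval check keeps its purpose for finished repairs.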



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
