[jira] [Commented] (CASSANDRA-11264) Repair scheduling - Failure handling and retry

Marcus Olsson (JIRA) Thu, 14 Apr 2016 10:35:02 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241559#comment-15241559
 ]


Marcus Olsson commented on CASSANDRA-11264:
-------------------------------------------

bq. After having a look at your original patch I saw that a failed task will be 
re-prioritized against other scheduled jobs/tasks with a high priority (given 
its last run time will not be updated), so that's already a retry mechanism in 
itself.
While this is true, I believe that this part should probably be reworked a bit. 
If one particular job always fails, we end up in a loop where that job is 
retried constantly, which starves the other jobs. One option is to keep it 
simple and only run it once (by removing the retry logic) and also add a flag 
for the job which is used to determine when the job is allowed to run again. 
Something like:
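{code}
void execute()
{
    runTasks();
    if (allTasksWereSuccessful())
    {
        nextRun = -1;        // no wait pending, job is scheduled as usual
        lastRunTime = now;   // only a successful run updates the last run time
    }
    else
    {
        // lastRunTime stays unchanged; the job may not run again before nextRun
        nextRun = now + defaultWaitTime;
    }
}
{code}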
Then that flag ({{nextRun}}) would be used to keep the failing job from being 
prioritized over the other jobs until the {{defaultWaitTime}} has elapsed. This 
flag could also work nicely with the rejection policies (assuming that they 
estimate the time until the job can actually be run), especially if we were 
able to reject repairs on a specific table rather than all tables. WDYT?
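
To make the idea a bit more concrete, here is a rough sketch of how the 
selection side could use the flag. It is not taken from the patch: the class 
name, {{Job}}, {{isRunnable}}, {{recordResult}} and {{pickNext}} are all 
hypothetical names, and times are plain longs for simplicity. Jobs whose 
{{nextRun}} lies in the future are skipped, and among the rest the one with the 
oldest {{lastRunTime}} wins.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, not the scheduler from the patch.
class BackoffSchedulerSketch
{
    static final class Job
    {
        final String name;
        long lastRunTime;
        long nextRun = -1; // -1 means "no wait pending"

        Job(String name, long lastRunTime)
        {
            this.name = name;
            this.lastRunTime = lastRunTime;
        }

        // A job is eligible when it has no pending wait or the wait has elapsed.
        boolean isRunnable(long now)
        {
            return nextRun == -1 || now >= nextRun;
        }

        // Mirrors the execute() outline: success clears the wait and updates
        // lastRunTime, failure only sets the earliest time for the next run.
        void recordResult(boolean success, long now, long defaultWaitTime)
        {
            if (success)
            {
                nextRun = -1;
                lastRunTime = now;
            }
            else
            {
                nextRun = now + defaultWaitTime;
            }
        }
    }

    // Picks the eligible job that has gone longest without a successful run.
    static Optional<Job> pickNext(List<Job> jobs, long now)
    {
        return jobs.stream()
                   .filter(job -> job.isRunnable(now))
                   .min(Comparator.comparingLong((Job job) -> job.lastRunTime));
    }

    public static void main(String[] args)
    {
        Job a = new Job("a", 100L);
        Job b = new Job("b", 200L);
        List<Job> jobs = Arrays.asList(a, b);

        a.recordResult(false, 1000L, 500L); // "a" fails and must wait until 1500
        System.out.println(pickNext(jobs, 1001L).map(job -> job.name).orElse("none"));
    }
}
```

With this shape the failing job keeps its old {{lastRunTime}}, so it gets its 
high priority back once the wait is over, but it cannot starve the other jobs 
in between.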

bq. Rather than cluttering the scheduled repair mechanism with retry logic, I 
think that it's better to add a retry option to (non-scheduled) repair job, and 
do more fine grained retry on individual steps such as validation and sync, 
since this will be more effective against transient failures rather than 
retrying the whole task and potentially losing work of non-failed tasks.
Great idea! If e.g. a validation fails on one node, would the resources on 
that node be cleaned up by CASSANDRA-11190, so that we can safely retry it, or 
would we need a separate way of doing that?
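
For discussion, a rough sketch of what a per-step retry with cleanup could look 
like. This is only an illustration under my own assumptions: 
{{runWithRetry}}, the {{cleanup}} callback and {{maxAttempts}} are hypothetical 
and not existing repair APIs, and logging is reduced to stderr. It retries a 
single step (e.g. a validation or a sync), cleans up after every failed 
attempt, warns per re-attempt and reports an error once the attempts are used 
up.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch, not an existing Cassandra class.
class StepRetrySketch
{
    static <T> T runWithRetry(String stepName, Callable<T> step, Runnable cleanup, int maxAttempts) throws Exception
    {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be at least 1");

        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return step.call();
            }
            catch (Exception e)
            {
                last = e;
                // Release whatever the failed attempt left behind so that a
                // retry starts from a clean state.
                cleanup.run();
                if (attempt < maxAttempts)
                    System.err.println("WARN: " + stepName + " failed (attempt " + attempt + " of " + maxAttempts + "), retrying");
            }
        }
        System.err.println("ERROR: " + stepName + " failed after " + maxAttempts + " attempts");
        throw last;
    }

    public static void main(String[] args) throws Exception
    {
        int[] calls = { 0 };
        // Simulated step that fails twice before succeeding.
        String result = runWithRetry("validation", () -> {
            if (++calls[0] < 3)
                throw new IllegalStateException("transient failure");
            return "ok";
        }, () -> {}, 3);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Only the failed step is repeated here, so the work of the steps that already 
succeeded is kept.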

bq. We can of course log warns and gather statistics when a scheduled task 
fails, but I think we should add retry support to repair independently of this. 
WDYT?
Sounds good!

> Repair scheduling - Failure handling and retry
> ----------------------------------------------
>
>                 Key: CASSANDRA-11264
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11264
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>
> Make it possible for repairs to be run again if they fail and clean up the 
> associated resources (validations and streaming sessions) before retrying. 
> Log a warning for each re-attempt and an error if it can't complete in X 
> times. The number of retries before considering the repair a failure could be 
> configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
