[jira] [Created] (CASSANDRA-5396) Repair process is a joke leading to a downward spiralling and eventually unusable cluster

David Berkman (JIRA) Wed, 27 Mar 2013 15:19:16 -0700

David Berkman created CASSANDRA-5396:
----------------------------------------


             Summary: Repair process is a joke leading to a downward spiralling 
and eventually unusable cluster
                 Key: CASSANDRA-5396
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5396
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.2.3
         Environment: all
            Reporter: David Berkman
            Priority: Critical
             Fix For: 2.1


Let's review the repair process...

1) It's mandatory to run repair.
2) Repair has a high impact and can take hours.
3) Repair provides no estimation of completion time and no progress indicator.
4) Repair is extremely fragile, and can fail to complete, or become stuck quite 
easily in real operating environments.
5) When repair fails, it provides no feedback whatsoever about the problem or 
a possible resolution.
6) A failed repair operation saddles the affected nodes with a huge amount of 
extra data (judging from node size).
7) There is no way to rid the node of the extra data associated with a failed 
repair short of completely rebuilding the node.
8) The extra data from a failed repair makes any subsequent repair take longer 
and increases the likelihood that it will simply become stuck or fail, leading 
to yet more node corruption.
9) Eventually no repair operation will complete successfully, and normal node 
operations become impacted, leading to a failing cluster.

Who would design such a system for a service meant to operate as a 
fault-tolerant clustered data store running on a lot of commodity hardware?

Solution...

1) Repair must be robust.
2) Repair must *never* become 'stuck'.
3) Failure to complete must result in reasonable feedback.
4) Failure to complete must not result in a node whose state is worse than 
before the operation began.
5) Repair must provide some means of determining completion percentage.
6) It would be nice if repair could estimate its run time, even if it could do 
so only based upon previous runs.
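
To make points 5 and 6 concrete, here is a minimal sketch of the kind of 
bookkeeping being asked for. It is not Cassandra code and does not reflect any 
existing API: the names RepairProgress and estimate_remaining_seconds are 
hypothetical, and it assumes a repair can be split into a known number of 
equally sized units of work (for example, token ranges).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RepairProgress:
    """Hypothetical tracker: counts finished units of work in one repair."""
    total_ranges: int
    completed_ranges: int = 0

    def mark_range_done(self) -> None:
        # Refuse to count more units than the session was planned with.
        if self.completed_ranges >= self.total_ranges:
            raise ValueError("all ranges already completed")
        self.completed_ranges += 1

    def percent_complete(self) -> float:
        # An empty session is trivially complete.
        if self.total_ranges == 0:
            return 100.0
        return 100.0 * self.completed_ranges / self.total_ranges


def estimate_remaining_seconds(
    progress: RepairProgress,
    elapsed_seconds: float,
    previous_run_seconds: List[float],
) -> Optional[float]:
    """Estimate the time left, or return None if there is nothing to go on.

    Once at least one unit has finished, extrapolate from the current run.
    Before that, fall back to the mean duration of previous runs (point 6).
    """
    if progress.completed_ranges > 0:
        per_range = elapsed_seconds / progress.completed_ranges
        remaining = progress.total_ranges - progress.completed_ranges
        return per_range * remaining
    if previous_run_seconds:
        return sum(previous_run_seconds) / len(previous_run_seconds)
    return None


if __name__ == "__main__":
    progress = RepairProgress(total_ranges=4)
    # No unit finished yet: the estimate comes from the two earlier runs.
    print(estimate_remaining_seconds(progress, 0.0, [100.0, 140.0]))  # 120.0
    progress.mark_range_done()
    print(progress.percent_complete())                                # 25.0
    # One of four units took 30 s, so three more should take about 90 s.
    print(estimate_remaining_seconds(progress, 30.0, []))             # 90.0
```

The linear extrapolation is deliberately crude. Real units of work are 
unlikely to be equal in size, so a real implementation would probably weight 
them by data volume, but even this level of reporting would tell an operator 
whether a repair is advancing or stuck.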

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
