Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds and 
scheduled repairs
-------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-3620
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 1.0.5
            Reporter: Dominic Williams
             Fix For: 1.1


Here is a proposal for an improved system for handling distributed deletes.

*** The Problem ***

Repair has issues:

-- Repair is expensive even in the best case

-- Repair jobs are often made more expensive than they should be by other 
issues (nodes dropping requests, hinted handoff not working, downtime, etc.)

-- Repair itself can often fail and need restarting, especially in cloud 
environments, where a network issue might make a node disappear 
from the ring for a brief moment

-- When you fail to run repair before GCSeconds expires, whether through 
operator error or because of issues with Cassandra, deleted data can reappear 

-- If you cannot run repair and have to increase GCSeconds, tombstones can 
start overloading your system

Because of the foregoing, in high-throughput environments you often cannot make 
repair a cron job. You prefer to keep a terminal open and run repair jobs one 
by one, making sure they succeed and keeping an eye on overall load so you 
don't impact your system. This isn't great, and it is made worse where you have 
lots of column families or where you have to run a low GCSeconds on a column 
family to reduce tombstone load. You know that if you don't manage to run 
repair within the GCSeconds window, you are going to hit problems, and this is 
the Sword of Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM 
reads will always receive data successfully written with QUORUM.
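The guarantee behind that claim is the standard quorum-overlap property: any write quorum and any read quorum must share at least one replica. A minimal sketch of the arithmetic (the class name and the RF range chosen here are illustrative, not from any Cassandra source):

```java
public class QuorumOverlap {
    // For replication factor n, QUORUM is floor(n/2) + 1 replicas.
    static int quorum(int n) {
        return n / 2 + 1;
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 7; rf++) {
            int w = quorum(rf); // replicas a QUORUM write must reach
            int r = quorum(rf); // replicas a QUORUM read must contact
            // w + r > rf means the read set and write set always intersect,
            // so a QUORUM read sees at least one copy of a QUORUM write.
            assert w + r > rf : "no overlap at RF=" + rf;
        }
        System.out.println("QUORUM read and write sets always intersect");
    }
}
```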

Ideally there should be no ongoing requirement to run repair to avoid data 
loss, and no GCSeconds. Repair should be an optional maintenance utility used 
in special cases, or to ensure ONE reads get consistent data. 

*** Proposed "Reaper Model" ***

1. Tombstones do not expire, and there is no GCSeconds. 

2. Tombstones have associated ACK lists, which record the replicas that have 
acknowledged them.

3. Tombstones are only deleted (or marked for compaction) when they have been 
acknowledged by all replicas.

4. If a cf/key/name is deleted and there is a preexisting tombstone, its ACK 
list is simply reset.

5. Background "reaper" threads constantly stream ACK requests and ACKs to and 
from other replicas, and delete tombstones that have received all their ACKs.

A number of mechanisms could be used to maintain synchronization while nodes 
are added/removed; these can be discussed in a separate Jira issue.
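As a concrete illustration, the five steps above might look something like the following. This is a hypothetical sketch only: the `Tombstone` class, its methods, and the use of replica names as strings are all invented for illustration and are not part of any proposed patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a reaper-model tombstone: it carries an ACK list
// (step 2) and becomes reapable only once every replica has ACKed (step 3).
// Note there is no time-based expiry anywhere, i.e. no GCSeconds (step 1).
public class Tombstone {
    final long deletionTime;
    private final Set<String> acks = new HashSet<>();

    public Tombstone(long deletionTime) {
        this.deletionTime = deletionTime;
    }

    // Step 5: a background reaper records ACKs streamed from other replicas.
    public void ack(String replica) {
        acks.add(replica);
    }

    // Step 4: a re-delete of the same cf/key/name resets the ACK list by
    // replacing the tombstone with a fresh one that has no ACKs.
    public Tombstone reset(long newDeletionTime) {
        return new Tombstone(newDeletionTime);
    }

    // Step 3: eligible for deletion (or marking for compaction) only when
    // every replica in the set has acknowledged.
    public boolean reapable(Set<String> allReplicas) {
        return acks.containsAll(allReplicas);
    }
}
```

A reaper thread would then periodically scan local tombstones, request missing ACKs from peers, and drop any tombstone for which `reapable` returns true.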

** Advantages **

-- The labour/administration overhead associated with running repair will be 
removed

-- The reapers can utilize "spare" cycles and run constantly to prevent the 
load spikes and performance issues associated with repair

-- There will no longer be a risk of data loss if repair can't be run for 
some reason (for example because of a new adopter's lack of Cassandra 
expertise, a cron script failing, or Cassandra bugs preventing repair from 
being run)

-- Reducing the number of tombstones the database carries will improve 
performance, sometimes *dramatically*



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
