[ 
https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Williams updated CASSANDRA-3620:
----------------------------------------

    Description: 
Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:

* Repair is expensive even in the best case
* Repair jobs are often made more expensive than they should be by other issues 
(nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair can often itself fail and need restarting, especially in cloud 
environments where a network issue might make a node disappear 
from the ring for a brief moment
* When you fail to run repair within GCSeconds, either because you are dumb or 
because of issues with Cassandra, deleted data can reappear 
* If you cannot run repair and have to increase GCSeconds, tombstones can start 
overloading your system

Because of the foregoing, in high-throughput environments you often cannot make 
repair a cron job. You prefer to keep a terminal open and run repair jobs one 
by one, making sure they succeed and keeping an eye on overall load so you 
don't impact your system. This isn't great, and it is made worse if you have 
lots of column families or have to run a low GCSeconds on a column family to 
reduce tombstone load. You know that if you don't manage to run repair within 
the GCSeconds window, you are going to hit problems, and this is the Sword of 
Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM 
reads will always receive data successfully written at QUORUM (any two quorums 
of replicas overlap in at least one node, so every quorum read touches a replica 
that holds the write).

Ideally there should be no ongoing requirement to run repair to avoid data 
loss, and no GCSeconds. Repair should be an optional maintenance utility used 
in special cases, or to ensure ONE reads get consistent data. 

h2. "Reaper Model" Proposal

# Tombstones do not expire, and there is no GCSeconds. 
# Tombstones have associated ACK lists, which record the replicas that have 
acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been 
acknowledged by all replicas
# New tombstones replace old tombstones and always start with an empty ACK list
# When a tombstone is deleted (reaped), it is written to a "relic" list, which is 
scavenged after some configurable period, thereby allowing already-deleted 
tombstones to still be acknowledged (the writer acknowledges this has some of 
the drawbacks of GCSeconds)
# Background "reaper" threads constantly exchange ACK requests and ACKs with 
other replicas and delete tombstones that have received all their ACKs
# If a reaper receives a request to ACK a missing tombstone, it creates the 
tombstone, adds an ACK for the requestor, and replies with an ACK

A number of systems could be used to maintain synchronization while cluster 
nodes are added/removed.
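
To make the numbered steps above concrete, here is a minimal Java sketch of how 
the per-tombstone ACK tracking, the "relic" list, and a reaper's handling of ACK 
requests could look. Everything in it (class names such as AckedTombstone and 
ReaperSketch, the String row key, the AckResponse message) is a hypothetical 
illustration of this proposal, not an existing Cassandra API.

{code:java}
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch only: a tombstone that tracks per-replica ACKs, plus the
 * reap rule described above. Names and structures are assumptions, not
 * Cassandra internals.
 */
class AckedTombstone {
    final long markedForDeleteAt;                   // deletion timestamp
    final Set<InetAddress> acks = new HashSet<>();  // replicas that have ACKed this tombstone

    AckedTombstone(long markedForDeleteAt) {
        this.markedForDeleteAt = markedForDeleteAt;
    }

    /** Record an ACK from a replica; true once every replica has ACKed (step 3). */
    synchronized boolean ack(InetAddress replica, Set<InetAddress> allReplicas) {
        acks.add(replica);
        return acks.containsAll(allReplicas);
    }
}

class ReaperSketch {
    // Live tombstones, keyed here by a simple String key for brevity.
    final Map<String, AckedTombstone> tombstones = new ConcurrentHashMap<>();
    // "Relic" list: already-reaped tombstones kept for a configurable period
    // so that late ACK requests can still be answered (step 5).
    final Map<String, AckedTombstone> relics = new ConcurrentHashMap<>();

    /** A new delete always replaces any older tombstone with an empty ACK list (step 4). */
    void onLocalDelete(String key, long deleteTimestamp) {
        tombstones.put(key, new AckedTombstone(deleteTimestamp));
        relics.remove(key);
    }

    /** Handle an ACK request streamed from another replica's reaper (steps 6 and 7). */
    AckResponse onAckRequest(String key, long deleteTimestamp, InetAddress requestor,
                             Set<InetAddress> allReplicas, InetAddress self) {
        AckedTombstone t = tombstones.get(key);
        if (t == null && !relics.containsKey(key)) {
            // Missing tombstone: create it and record the requestor's ACK (step 7).
            t = new AckedTombstone(deleteTimestamp);
            tombstones.put(key, t);
        }
        if (t != null && t.ack(requestor, allReplicas)) {
            reap(key, t);   // every replica has ACKed: safe to drop at compaction
        }
        return new AckResponse(key, self);  // always reply with our own ACK
    }

    /** Move a fully acknowledged tombstone onto the relic list (steps 3 and 5). */
    private void reap(String key, AckedTombstone t) {
        tombstones.remove(key);
        relics.put(key, t);  // a background scavenger would purge relics after the relic period
    }

    record AckResponse(String key, InetAddress ackedBy) {}
}
{code}

A real implementation would also have to persist the ACK sets alongside the 
tombstones in SSTables and cope with the replica set changing underneath it, 
which is the synchronization problem mentioned above.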

h3. Benefits

* The labour/administration overhead associated with running repair will be 
removed
* The reapers can utilize "spare" cycles and run constantly to prevent the load 
spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some 
reason (for example because of a new adopter's lack of Cassandra expertise, a 
cron script failing, or Cassandra bugs preventing repair from being run, etc.)
* Reducing the average number of tombstones that databases carry will improve 
performance, sometimes dramatically



> Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds 
> and scheduled repairs
> -------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.5
>            Reporter: Dominic Williams
>              Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair
>             Fix For: 1.1
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>