[ 
https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Williams updated CASSANDRA-3620:
----------------------------------------

    Description: 
Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:

* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues 
(nodes dropping requests, hinted handoff not working, downtime etc)
* Repair can often itself fail and need restarting, especially in cloud 
environments where a network issue might make a node disappear 
from the ring for a brief moment
* When you fail to run repair within GCSeconds, either through operator error 
or because of issues with Cassandra, deleted data can reappear 
* If you cannot run repair and have to increase GCSeconds, tombstones can start 
overloading your system

Because of the foregoing, in high throughput environments you often cannot make 
repair a cron job. You prefer to keep a terminal open and run repair jobs one 
by one, making sure they succeed and keeping an eye on overall load so you 
don't impact your system. This isn't great, and it is made worse if you have 
lots of column families or have to run a low GCSeconds on a column family to 
reduce tombstone load. You know that if you don't manage to run repair within 
the GCSeconds window, you are going to hit problems, and this is the Sword of 
Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM 
reads will always receive data successfully written with QUORUM.
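
As a minimal illustration (not Cassandra code) of why this holds: with 
replication factor N, a quorum is N/2 + 1 replicas (integer division), so any 
write quorum and any read quorum must overlap in at least one replica.

{code:java}
// Minimal sketch: quorum(rf) = rf/2 + 1, so two quorums out of rf replicas
// always intersect, because quorum(rf) + quorum(rf) = 2*(rf/2) + 2 >= rf + 1 > rf.
public class QuorumOverlap {
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 7; rf++) {
            int q = quorum(rf);
            // a write quorum and a read quorum share at least 2q - rf >= 1 replicas
            System.out.printf("RF=%d  QUORUM=%d  guaranteed overlap >= %d%n", rf, q, 2 * q - rf);
        }
    }
}
{code}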

Ideally there should be no ongoing requirement to run repair to avoid data 
loss, and no GCSeconds. Repair should be an optional maintenance utility used 
in special cases, or to ensure ONE reads get consistent data. 

h2. "Reaper Model" Proposal

# Tombstones do not expire, and there is no GCSeconds
# Tombstones have associated ACK lists, which record the replicas that have 
acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been 
acknowledged by all replicas
# When a tombstone is deleted, it is added to a fast "relic" index of MD5 
hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for 
a reaper to acknowledge a tombstone even after it has been deleted (see the 
sketch after this list)
# Background "reaper" threads constantly stream ACK requests to other nodes, 
and stream ACK responses back for requests they have received (throttling 
their use of CPU and bandwidth so as not to affect performance)
# The relic index is scavenged according to some configurable period
# If a reaper receives a request to ACK a tombstone that does not exist, it 
creates the tombstone and adds an ACK for the requestor, and replies with an 
ACK 
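
Below is a minimal, hypothetical sketch of the data structures implied by 
steps 2, 4 and 6 (Java, since that is Cassandra's language; class names such 
as AckedTombstone and RelicIndex are illustrative only, not existing Cassandra 
classes):

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// A tombstone carries the set of replicas that have acknowledged it (step 2).
class AckedTombstone {
    final String cf, key, name;               // column family / row key / column name
    final Set<String> acks = new TreeSet<>(); // ACKing replica endpoints, sorted for stable hashing

    AckedTombstone(String cf, String key, String name) {
        this.cf = cf; this.key = key; this.name = name;
    }

    // Step 3: deletable once every replica has acknowledged
    boolean fullyAcked(Set<String> allReplicas) {
        return acks.containsAll(allReplicas);
    }

    // Step 4: MD5 over cf-key-name-ackList, kept in the relic index after deletion
    byte[] relicHash() throws Exception {
        String material = cf + "-" + key + "-" + name + "-" + String.join(",", acks);
        return MessageDigest.getInstance("MD5").digest(material.getBytes(StandardCharsets.UTF_8));
    }
}

// Step 6: the relic index maps each MD5 hash to its creation time and is
// scavenged after a configurable period.
class RelicIndex {
    private final ConcurrentHashMap<String, Long> relics = new ConcurrentHashMap<>();

    void add(byte[] md5)         { relics.put(hex(md5), System.currentTimeMillis()); }
    boolean contains(byte[] md5) { return relics.containsKey(hex(md5)); }

    void scavenge(long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        relics.values().removeIf(created -> created < cutoff);
    }

    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
{code}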

NOTES

* The existence of entries in the relic index does not affect normal query 
performance
* If a node goes down, and comes up after the configurable relic entry timeout, 
the worst that can happen is that a tombstone that hasn't received all its 
acknowledgements is re-created across the replicas (which is no big deal, 
since it does not corrupt data; see the sketch below)
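
A similarly hypothetical sketch of step 7 and the recovery scenario above, 
reusing the AckedTombstone class from the previous sketch (ReaperService and 
handleAckRequest are illustrative names, not existing Cassandra API):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Step 7: an ACK request for a tombstone this node no longer knows about
// (e.g. its relic entry was scavenged while the node was down) simply
// re-creates the tombstone, records the requestor's ACK, and replies with an
// ACK. Re-creation is safe because a tombstone never corrupts data.
class ReaperService {
    private final Map<String, AckedTombstone> tombstones = new ConcurrentHashMap<>();
    private final String localEndpoint;

    ReaperService(String localEndpoint) {
        this.localEndpoint = localEndpoint;
    }

    /** Handle an ACK request from another replica; returning true signals our ACK. */
    boolean handleAckRequest(String requestor, String cf, String key, String name) {
        String id = cf + "/" + key + "/" + name;
        AckedTombstone t = tombstones.computeIfAbsent(id, ignored -> new AckedTombstone(cf, key, name));
        t.acks.add(requestor);      // record the requestor's acknowledgement
        t.acks.add(localEndpoint);  // this node now acknowledges it too
        return true;                // reply with an ACK
    }
}
{code}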

h3. Benefits

* The labour/administration overhead associated with running repair will be 
removed
* The reapers can utilize "spare" cycles and run constantly to prevent the load 
spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some 
reason (for example because of a new adopter's lack of Cassandra expertise, a 
cron script failing, or Cassandra bugs preventing repair from being run)
* Reducing the average number of tombstones databases carry will improve 
performance, sometimes dramatically



  was:
Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

Repair has issues:

* Repair is expensive anyway
* Repair jobs are often made more expensive than they should be by other issues 
(nodes dropping requests, hinted handoff not working, downtime etc)
* Repair can often itself fail and need restarting, especially in cloud 
environments where a network issue might make a node disappear 
from the ring for a brief moment
* When you fail to run repair within GCSeconds, either through operator error 
or because of issues with Cassandra, deleted data can reappear 
* If you cannot run repair and have to increase GCSeconds, tombstones can start 
overloading your system

Because of the foregoing, in high throughput environments you often cannot make 
repair a cron job. You prefer to keep a terminal open and run repair jobs one 
by one, making sure they succeed and keeping an eye on overall load so you 
don't impact your system. This isn't great, and it is made worse if you have 
lots of column families or have to run a low GCSeconds on a column family to 
reduce tombstone load. You know that if you don't manage to run repair within 
the GCSeconds window, you are going to hit problems, and this is the Sword of 
Damocles over your head.

Running repair to deal with missing writes isn't so important, since QUORUM 
reads will always receive data successfully written with QUORUM.

Ideally there should be no ongoing requirement to run repair to avoid data 
loss, and no GCSeconds. Repair should be an optional maintenance utility used 
in special cases, or to ensure ONE reads get consistent data. 

h2. "Reaper Model" Proposal

# Tombstones do not expire, and there is no GCSeconds
# Tombstones have associated ACK lists, which record the replicas that have 
acknowledged them
# Tombstones are only deleted (or marked for compaction) when
## They have been acknowledged by all replicas
## All replicas have acknowledged receiving all acknowledgements
# Background "reaper" threads constantly stream ACK requests to other nodes, 
and stream ACK responses back for requests they have received (throttling 
their use of CPU and bandwidth so as not to affect performance)
# Once a tombstone has been acknowledged by all replicas, after a configurable 
period, the reaper asks the replicas to acknowledge that the others have 
received all their acknowledgements
## If a node is down or otherwise can't reply, this is retried after a back-off 
period
## If a node is asked to fully acknowledge a tombstone, and it is not ready to 
do so, it may try to receive outstanding acknowledgements so that it can reply 
with an ACK
# When a tombstone is deleted, it is added to a fast "relic" index, comprised 
of MD5 hashes calculated from cf-key-name[-subName]-ackList. The relic index 
makes it possible for a reaper to acknowledge that it has received all 
acknowledgements after it has deleted a tombstone
# The relic index is scavenged according to some configurable period
# If a reaper receives a request to ACK a tombstone that does not exist, it 
creates the tombstone and adds an ACK for the requestor, and replies with an 
ACK 

h3. Benefits

* The labour/administration overhead associated with running repair will be 
removed
* The reapers can utilize "spare" cycles and run constantly to prevent the load 
spikes and performance issues associated with repair
* There will no longer be the risk of data loss if repair can't be run for some 
reason (for example because of a new adopter's lack of Cassandra expertise, a 
cron script failing, or Cassandra bugs preventing repair from being run)
* Reducing the average number of tombstones databases carry will improve 
performance, sometimes dramatically



    
> Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds 
> and scheduled repairs
> -------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.5
>            Reporter: Dominic Williams
>              Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair
>             Fix For: 1.1
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
