[ 
https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Williams updated CASSANDRA-3620:
----------------------------------------

    Description: 
Here is a proposal for an improved system for handling distributed deletes.

h2. The Problem

There are various issues with repair:

* Repair is expensive to begin with
* Repair jobs are often made more expensive than they should be by other issues 
(nodes dropping requests, hinted handoff not working, downtime, etc.)
* Repair processes often fail and need restarting, for example in cloud 
environments where network issues can make a node briefly disappear from the 
ring
* When you fail to run repair within GCSeconds, whether through operator error 
or because of issues with Cassandra, deleted data can reappear 
* If you cannot run repair and have to increase GCSeconds to prevent deleted 
data reappearing, the growing tombstone overhead can in some cases 
significantly degrade performance

Because of the foregoing, in high-throughput environments it can be very 
difficult to make repair a cron job. It can be preferable to keep a terminal 
open and run repair jobs one by one, making sure they succeed and keeping an 
eye on overall load to reduce system impact. This isn't desirable, and the 
problem is made worse when there are many column families in a database, or 
when a column family must be run with a low GCSeconds to reduce tombstone 
load. The database owner must run repair within the GCSeconds window, or 
increase GCSeconds, to avoid losing delete operations. 

Running repair to deal with missing writes is less critical, since QUORUM 
reads will always see data successfully written at QUORUM: with a replication 
factor of 3, a QUORUM write reaches at least 2 replicas and a QUORUM read 
consults at least 2, so the two sets must overlap.

It would be much better if there were no ongoing requirement to run repair to 
avoid data loss (or rather the potential for deleted data to reappear), and no 
GCSeconds window. Ideally, repair would be an optional maintenance utility used 
in special cases, or to ensure that ONE reads get consistent data. 

h2. "Reaper Model" Proposal

# Tombstones do not expire, and there is no GCSeconds
# Tombstones have associated ACK lists, which record the replicas that have 
acknowledged them
# Tombstones are only deleted (or marked for compaction) when they have been 
acknowledged by all replicas
# When a tombstone is deleted, it is added to a fast "relic" index of MD5 
hashes of cf-key-name[-subName]-ackList. The relic index makes it possible for 
a reaper to acknowledge a tombstone after it has been deleted (see the first 
sketch after this list)
# Background "reaper" threads constantly stream ACK requests to other nodes, 
and stream ACK responses back for the requests they receive (throttling 
their usage of CPU and bandwidth so as not to affect performance)
# If a reaper receives a request to ACK a tombstone that does not exist, it 
creates the tombstone, adds an ACK for the requestor, and replies with an 
ACK (see the second sketch, after the notes below)
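
To make the bookkeeping concrete, here is a minimal sketch in Java of a 
tombstone carrying its ACK list and producing its relic index entry (items 2-4 
above). It is an illustration only, not Cassandra's actual internals; the 
class and method names are invented for the sketch.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a tombstone carrying its ACK list (item 2).
class AckedTombstone {
    final String columnFamily;
    final String key;
    final String name;  // column name; a subName could be appended for super columns
    // Replicas that have acknowledged this tombstone; sorted so the relic hash is deterministic.
    final Set<String> acks = new TreeSet<>();

    AckedTombstone(String columnFamily, String key, String name) {
        this.columnFamily = columnFamily;
        this.key = key;
        this.name = name;
    }

    // Item 3: the tombstone may only be purged once every replica has ACKed it.
    boolean purgeable(Set<String> allReplicas) {
        return acks.containsAll(allReplicas);
    }

    // Item 4: on purge, an MD5 of cf-key-name[-subName]-ackList goes into the relic index.
    String relicHash() throws NoSuchAlgorithmException {
        String material = columnFamily + "-" + key + "-" + name + "-" + String.join(",", acks);
        byte[] md5 = MessageDigest.getInstance("MD5").digest(material.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : md5) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
{code}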

NOTES

* The existence of entries in the relic index does not affect normal query 
performance
* If a node goes down and comes back up after a configurable relic-entry 
timeout, the worst that can happen is that a tombstone that hasn't received 
all its acknowledgements is re-created across the replicas when the reaper 
requests their acknowledgements (which is no big deal, since this does not 
corrupt data)
* Since early removal of entries in the relic index does not cause data loss, 
the index can be kept small, or even held in memory
* The scheme is simple to implement and behaves predictably
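
A second minimal sketch, under the same assumptions, shows how a reaper might 
answer one incoming ACK request: the relic-index lookup lets it acknowledge a 
tombstone it has already purged, and an unknown tombstone is recreated before 
acknowledging (item 6 above). It reuses the hypothetical AckedTombstone class 
from the first sketch.

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical reaper-side handling of one incoming ACK request (items 5 and 6).
// The in-memory stores are illustrative stand-ins, not Cassandra internals.
class Reaper {
    // Live tombstones keyed by "cf-key-name"; relics are hashes of already-purged tombstones.
    private final Map<String, AckedTombstone> live = new ConcurrentHashMap<>();
    private final Set<String> relicHashes = ConcurrentHashMap.newKeySet();

    /** Handles one ACK request from another replica; the reply itself is the ACK. */
    boolean handleAckRequest(String cf, String key, String name,
                             String requestor, String requestRelicHash) {
        if (relicHashes.contains(requestRelicHash)) {
            // The tombstone was already purged here; the relic entry lets us still ACK it.
            return true;
        }
        // Item 6: if the tombstone does not exist locally, recreate it before acknowledging.
        AckedTombstone ts = live.computeIfAbsent(
                cf + "-" + key + "-" + name,
                ignored -> new AckedTombstone(cf, key, name));
        ts.acks.add(requestor);  // record the requestor's acknowledgement
        return true;
    }
}
{code}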

h3. Planned Benefits

* Operations are fine-grained (interrupting a reaper is not an issue)
* The labour and administration overhead associated with running repair can be 
removed
* Reapers can utilize "spare" cycles and run constantly in the background, 
preventing the load spikes and performance issues associated with repair
* There is no longer a threat of deleted data reappearing if repair can't be 
run for some reason (for example because of a new adopter's lack of Cassandra 
expertise, a failing cron script, or Cassandra bugs preventing repair being 
run)
* Deleting tombstones earlier, thereby reducing the number involved in query 
processing, will often dramatically improve performance



> Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds 
> and scheduled repairs
> -------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.5
>            Reporter: Dominic Williams
>              Labels: GCSeconds, deletes, distributed_deletes, 
> merkle_trees, repair
>             Fix For: 1.1
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
