[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168855#comment-13168855 ]
Dominic Williams commented on CASSANDRA-3620: --------------------------------------------- Make it optional per column family? Repair would still need to exist anyway so could fall back to that for cases like this. > Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds > and scheduled repairs > ------------------------------------------------------------------------------------------------- > > Key: CASSANDRA-3620 > URL: https://issues.apache.org/jira/browse/CASSANDRA-3620 > Project: Cassandra > Issue Type: Improvement > Components: Core > Reporter: Dominic Williams > Labels: GCSeconds,, deletes,, distributed_deletes,, > merkle_trees, repair, > Original Estimate: 504h > Remaining Estimate: 504h > > Here is a proposal for an improved system for handling distributed deletes. > h2. The Problem > There are various issues with repair: > * Repair is expensive anyway > * Repair jobs are often made more expensive than they should be by other > issues (nodes dropping requests, hinted handoff not working, downtime etc) > * Repair processes can often fail and need restarting, for example in cloud > environments where network issues make a node disappear > from the ring for a brief moment > * When you fail to run repair within GCSeconds, either by error or because of > issues with Cassandra, data written to a node that did not see a later delete > can reappear (and a node might miss a delete for several reasons including > being down or simply dropping requests during load shedding) > * If you cannot run repair and have to increase GCSeconds to prevent deleted > data reappearing, in some cases the growing tombstone overhead can > significantly degrade performance > Because of the foregoing, in high throughput environments it can be very > difficult to make repair a cron job. It can be preferable to keep a terminal > open and run repair jobs one by one, making sure they succeed and keeping and > eye on overall load to reduce system impact. This isn't desirable, and > problems are exacerbated when there are lots of column families in a database > or it is necessary to run a column family with a low GCSeconds to reduce > tombstone load (because there are many write/deletes to that column family). > The database owner must run repair within the GCSeconds window, or increase > GCSeconds, to avoid potentially losing delete operations. > It would be much better if there was no ongoing requirement to run repair to > ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be > an optional maintenance utility used in special cases, or to ensure ONE reads > get consistent data. > h2. "Reaper Model" Proposal > # Tombstones do not expire, and there is no GCSeconds > # Tombstones have associated ACK lists, which record the replicas that have > acknowledged them > # Tombstones are only deleted (or marked for compaction) when they have been > acknowledged by all replicas > # When a tombstone is deleted, it is added to a fast "relic" index of MD5 > hashes of cf-key-name[-subName]-ackList. The relic index makes it possible > for a reaper to acknowledge a tombstone after it is deleted > # Background "reaper" threads constantly stream ACK requests to other nodes, > and stream back ACK responses back to requests they have received (throttling > their usage of CPU and bandwidth so as not to affect performance) > # If a reaper receives a request to ACK a tombstone that does not exist, it > creates the tombstone and adds an ACK for the requestor, and replies with an > ACK > NOTES > * The existence of entries in the relic index do not affect normal query > performance > * If a node goes down, and comes up after a configurable relic entry timeout, > the worst that can happen is that a tombstone that hasn't received all its > acknowledgements is re-created across the replicas when the reaper requests > their acknowledgements (which is no big deal since this does not corrupt data) > * Since early removal of entries in the relic index does not cause > corruption, it can be kept small, or even kept in memory > * Simple to implement and predictable > h3. Planned Benefits > * Operations are finely grained (reaper interruption is not an issue) > * The labour & administration overhead associated with running repair can be > removed > * Reapers can utilize "spare" cycles and run constantly in background to > prevent the load spikes and performance issues associated with repair > * There will no longer be the threat of corruption if repair can't be run for > some reason (for example because of a new adopter's lack of Cassandra > expertise, a cron script failing, or Cassandra bugs preventing repair being > run etc) > * Deleting tombstones earlier, thereby reducing the number involved in query > processing, will often dramatically improve performance -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira