[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594353#comment-13594353 ]

Aleksey Yeschenko commented on CASSANDRA-3620:
----------------------------------------------

I've been trying to find a workable solution, but now I'm almost certain that it can't be done safely - unless you disable HH, that is. Otherwise a delivered hint can undo a deletion if we gc the tombstone away too fast. So this would only work safely if the commitlog mode is sync AND HH is disabled, which doesn't seem like a common configuration. Or am I missing something obvious?
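
To make the race above concrete, here is a minimal, self-contained sketch of the interleaving: an old write is hinted while a replica is down, the delete that supersedes it lands directly, the tombstone is acknowledged everywhere and reaped, and only then is the hint replayed. All names (Replica, Cell, and so on) are hypothetical, and the model is a deliberately simplified last-write-wins map over a single key, not Cassandra's actual hint-delivery or commitlog code.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of the race described in the comment above: a hint carrying an
 * old write is delivered AFTER the tombstone that supersedes it has been
 * acknowledged by every replica and purged, so the deleted value reappears.
 * Hypothetical names throughout; this is not Cassandra code.
 */
public class HintResurrectionSketch {

    /** A cell is a value (null = tombstone) plus its write timestamp. */
    record Cell(String value, long timestamp) {
        boolean isTombstone() { return value == null; }
    }

    /** A toy replica with last-write-wins reconciliation on timestamp. */
    static class Replica {
        final Map<String, Cell> data = new HashMap<>();

        void apply(String key, Cell incoming) {
            Cell existing = data.get(key);
            // An older update is suppressed only while the newer cell
            // (possibly a tombstone) is still present.
            if (existing == null || incoming.timestamp() > existing.timestamp()) {
                data.put(key, incoming);
            }
        }

        /** What "fully acknowledged, reap it" would do to this key's tombstone. */
        void purgeTombstone(String key) {
            Cell c = data.get(key);
            if (c != null && c.isTombstone()) {
                data.remove(key);
            }
        }

        String read(String key) {
            Cell c = data.get(key);
            return (c == null || c.isTombstone()) ? null : c.value();
        }
    }

    public static void main(String[] args) {
        Replica b = new Replica();

        // t=1: a write destined for replica B is missed (B is down); the
        // coordinator keeps it as a hint to replay later.
        Cell hintedWrite = new Cell("v1", 1);

        // t=2: B is back up and receives the delete directly.
        b.apply("k", new Cell(null, 2));
        System.out.println("after delete:      " + b.read("k")); // null - correct

        // Every replica (including B) has acknowledged the tombstone, so it is reaped.
        b.purgeTombstone("k");

        // The hint is finally delivered. With the tombstone gone there is
        // nothing left to suppress the stale write, and the value reappears.
        b.apply("k", hintedWrite);
        System.out.println("after hint replay: " + b.read("k")); // "v1" - resurrected
    }
}
{code}

The second line of output shows the key reading back as "v1" once the stale hint lands, which is exactly the resurrection the comment is worried about.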

> Proposal for distributed deletes - fully automatic "Reaper Model" rather than GCSeconds and manual repairs
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Dominic Williams
>            Assignee: Aleksey Yeschenko
>              Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair
>             Fix For: 2.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Proposal for an improved system for handling distributed deletes, which removes the requirement to regularly run repair processes to maintain performance and data integrity.
>
> h2. The Problem
>
> There are various issues with repair:
> * Repair is expensive to run
> * Repair jobs are often made more expensive than they should be by other issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
> * Repair processes can often fail and need restarting, for example in cloud environments where network issues make a node disappear from the ring for a brief moment
> * When you fail to run repair within GCSeconds, either by error or because of issues with Cassandra, data written to a node that did not see a later delete can reappear (and a node might miss a delete for several reasons, including being down or simply dropping requests during load shedding)
> * If you cannot run repair and have to increase GCSeconds to prevent deleted data reappearing, in some cases the growing tombstone overhead can significantly degrade performance
>
> Because of the foregoing, in high-throughput environments it can be very difficult to make repair a cron job. It can be preferable to keep a terminal open and run repair jobs one by one, making sure they succeed and keeping an eye on overall load to reduce system impact. This isn't desirable, and problems are exacerbated when there are lots of column families in a database, or when it is necessary to run a column family with a low GCSeconds to reduce tombstone load (because there are many writes/deletes to that column family). The database owner must run repair within the GCSeconds window, or increase GCSeconds, to avoid potentially losing delete operations.
>
> It would be much better if there were no ongoing requirement to run repair to ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be an optional maintenance utility used in special cases, or to ensure ONE reads get consistent data.
>
> h2. "Reaper Model" Proposal
>
> # Tombstones do not expire, and there is no GCSeconds
> # Tombstones have associated ACK lists, which record the replicas that have acknowledged them
> # Tombstones are deleted (or marked for compaction) when they have been acknowledged by all replicas
> # When a tombstone is deleted, it is added to a "relic" index. The relic index makes it possible for a reaper to acknowledge a tombstone after it is deleted
> # The ACK lists and relic index are held in memory for speed
> # Background "reaper" threads constantly stream ACK requests to other nodes, and stream ACK responses back for the requests they have received (throttling their usage of CPU and bandwidth so as not to affect performance)
> # If a reaper receives a request to ACK a tombstone that does not exist, it creates the tombstone, adds an ACK for the requestor, and replies with an ACK. This is the worst that can happen, and does not cause data corruption.
>
> ADDENDUM
>
> The proposal to hold the ACK and relic lists in memory was added after the first posting. Please see the comments for the full reasons. Furthermore, a proposal for enhancements to repair was posted to the comments, which would cause tombstones to be scavenged when repair completes (the author had assumed this was the case anyway, but it seems at the time of writing they are only scavenged during compaction on GCSeconds timeout). The proposals are not exclusive, and this proposal is extended to include the possible enhancements to repair described.
>
> NOTES
>
> * If a node goes down for a prolonged period, the worst that can happen is that some tombstones are recreated across the cluster when it restarts, which does not corrupt data (and this will only occur with a very small number of tombstones)
> * The system is simple to implement and predictable
> * With the reaper model, repair would become an optional process for optimizing the database to increase the consistency seen by ConsistencyLevel.ONE reads, and for fixing up nodes, for example after an SSTable was lost
>
> h3. Planned Benefits
>
> * Reaper threads can utilize "spare" cycles to constantly scavenge tombstones in the background, thereby greatly reducing tombstone load, improving query performance, reducing the system resources needed by processes such as compaction, and making performance generally more predictable
> * The reaper model means that GCSeconds is no longer necessary, which removes the threat of data corruption if repair can't be run successfully within that period (for example if repair can't be run because of a new adopter's lack of Cassandra expertise, a cron script failing, or Cassandra bugs or other technical issues)
> * Reaper threads are fully automatic, work in the background, and perform fine-grained operations where interruption has little effect. This is much better for database administrators than having to manually run and manage repair, whether for the purposes of preventing data corruption or for optimizing performance, which in addition to wasting operator time also often creates load spikes and has to be restarted after failure.
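
To make the bookkeeping in the quoted "Reaper Model" list concrete, here is a minimal, hypothetical sketch of the state a single node might keep per tombstone: an ACK set, reaping once every replica has acknowledged, a relic index that lets late ACK requests still be answered, and the "create the tombstone and ACK anyway" behaviour for unknown keys. The class and method names are invented for illustration; persistence, the in-memory representation, and the throttled streaming of requests are deliberately left out.

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the per-tombstone bookkeeping described in the
 * "Reaper Model" proposal quoted above. Illustration only, not an
 * implementation proposal in itself.
 */
public class ReaperBookkeepingSketch {

    private final Set<String> allReplicas;                              // replicas for the key's range
    private final Map<String, Set<String>> ackLists = new HashMap<>();  // key -> replicas that have ACKed
    private final Set<String> relics = new HashSet<>();                 // keys whose tombstones were reaped

    ReaperBookkeepingSketch(Set<String> allReplicas) {
        this.allReplicas = allReplicas;
    }

    /** A local delete creates a tombstone with an (initially empty) ACK list. */
    void recordTombstone(String key) {
        ackLists.putIfAbsent(key, new HashSet<>());
    }

    /**
     * Handle an ACK request streamed from another node's reaper. If the
     * tombstone was already reaped here, the relic index lets us ACK it
     * anyway; if we have never seen it, we create it and record the
     * requestor's ACK. Either way we reply with our own ACK.
     */
    boolean onAckRequest(String key, String fromReplica) {
        if (relics.contains(key)) {
            return true;                       // already reaped; the relic index remembers it
        }
        recordTombstone(key);                  // creates the tombstone if it did not exist
        recordAck(key, fromReplica);
        return true;
    }

    /** Record an ACK; once every replica has ACKed, reap the tombstone and keep a relic. */
    void recordAck(String key, String replica) {
        Set<String> acks = ackLists.get(key);
        if (acks == null) {
            return;                            // already reaped
        }
        acks.add(replica);
        if (acks.containsAll(allReplicas)) {
            ackLists.remove(key);              // drop the tombstone itself
            relics.add(key);                   // remember it in the relic index
        }
    }

    public static void main(String[] args) {
        ReaperBookkeepingSketch nodeA = new ReaperBookkeepingSketch(Set.of("A", "B", "C"));

        nodeA.recordTombstone("k");
        nodeA.recordAck("k", "A");             // local acknowledgement
        nodeA.recordAck("k", "B");
        nodeA.recordAck("k", "C");             // last ACK -> tombstone reaped

        System.out.println(nodeA.ackLists.containsKey("k")); // false: tombstone gone
        System.out.println(nodeA.relics.contains("k"));      // true: relic remembered

        // A late ACK request, e.g. from a node that was down, still succeeds.
        System.out.println(nodeA.onAckRequest("k", "B"));    // true
    }
}
{code}

The relic set is what lets the late request at the end succeed without recreating the tombstone; without it, the node would have to recreate the tombstone and start collecting ACKs again, which is the fallback the last item of the quoted list describes.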