[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595268#comment-13595268 ]

Jonathan Ellis commented on CASSANDRA-3620:
-------------------------------------------

Right.
                
> Proposal for distributed deletes - fully automatic "Reaper Model" rather than 
> GCSeconds and manual repairs
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Dominic Williams
>            Assignee: Aleksey Yeschenko
>              Labels: GCSeconds, deletes, distributed_deletes, merkle_trees, repair
>             Fix For: 2.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Proposal for an improved system for handling distributed deletes, which 
> removes the requirement to regularly run repair processes to maintain 
> performance and data integrity. 
> h2. The Problem
> There are various issues with repair:
> * Repair is expensive to run
> * Repair jobs are often made more expensive than they should be by other 
> issues (nodes dropping requests, hinted handoff not working, downtime, etc.)
> * Repair processes can often fail and need restarting, for example in cloud 
> environments where network issues make a node disappear from the ring for a 
> brief moment
> * When you fail to run repair within GCSeconds, either by error or because of 
> issues with Cassandra, data written to a node that did not see a later delete 
> can reappear (a node might miss a delete for several reasons, including being 
> down or simply dropping requests during load shedding); the sketch after this 
> list illustrates the purge check that GCSeconds controls
> * If you cannot run repair and have to increase GCSeconds to prevent deleted 
> data reappearing, in some cases the growing tombstone overhead can 
> significantly degrade performance
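> As referenced above, the following is a minimal sketch of the purge rule that 
> GCSeconds controls, under simplified assumptions: the class and method names 
> are illustrative rather than Cassandra's actual internals. The point is that a 
> tombstone becomes eligible for purging during compaction once it is older than 
> GCSeconds, so if a replica missed the delete and repair has not run by then, 
> the deleted data can be resurrected from that replica.
> {code:java}
> // Illustrative sketch only: simplified names, not Cassandra's real classes.
> public final class GcGracePurgeSketch {
>     /** Simplified tombstone: records when this replica applied the delete, in seconds. */
>     static final class Tombstone {
>         final long localDeletionTime;
>         Tombstone(long localDeletionTime) { this.localDeletionTime = localDeletionTime; }
>     }
>     /**
>      * Simplified purge rule: during compaction a tombstone may be dropped once it is
>      * older than gcGraceSeconds. If a replica never received the delete and repair
>      * does not run before this point, the deleted data can reappear.
>      */
>     static boolean purgeable(Tombstone t, long nowSeconds, long gcGraceSeconds) {
>         long gcBefore = nowSeconds - gcGraceSeconds;
>         return t.localDeletionTime < gcBefore;
>     }
>     public static void main(String[] args) {
>         long now = System.currentTimeMillis() / 1000;
>         Tombstone deletedFifteenDaysAgo = new Tombstone(now - 15L * 24 * 3600);
>         long gcGrace = 10L * 24 * 3600;  // ten days, the default gc_grace_seconds
>         // Prints true: purgeable, even if some replica never saw the delete.
>         System.out.println(purgeable(deletedFifteenDaysAgo, now, gcGrace));
>     }
> }
> {code}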
> Because of the foregoing, in high-throughput environments it can be very 
> difficult to make repair a cron job. It can be preferable to keep a terminal 
> open and run repair jobs one by one, making sure they succeed and keeping an 
> eye on overall load to reduce system impact. This isn't desirable, and 
> problems are exacerbated when there are lots of column families in a database 
> or it is necessary to run a column family with a low GCSeconds to reduce 
> tombstone load (because there are many writes/deletes to that column family). 
> The database owner must run repair within the GCSeconds window, or increase 
> GCSeconds, to avoid potentially losing delete operations. 
> It would be much better if there were no ongoing requirement to run repair to 
> ensure deletes aren't lost, and no GCSeconds window. Ideally, repair would be 
> an optional maintenance utility used in special cases, or to ensure that ONE 
> reads get consistent data. 
> h2. "Reaper Model" Proposal
> # Tombstones do not expire, and there is no GCSeconds
> # Tombstones have associated ACK lists, which record the replicas that have 
> acknowledged them
> # Tombstones are deleted (or marked for compaction) when they have been 
> acknowledged by all replicas
> # When a tombstone is deleted, it is added to a "relic" index. The relic 
> index makes it possible for a reaper to acknowledge a tombstone after it is 
> deleted
> # The ACK lists and relic index are held in memory for speed
> # Background "reaper" threads constantly stream ACK requests to other nodes, 
> and stream ACK responses back for the requests they have received (throttling 
> their usage of CPU and bandwidth so as not to affect performance)
> # If a reaper receives a request to ACK a tombstone that does not exist, it 
> creates the tombstone, adds an ACK for the requestor, and replies with an ACK. 
> This is the worst that can happen, and it does not cause data corruption (a 
> sketch of these structures follows this list). 
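> As a concrete illustration of the above, here is a rough sketch of the 
> proposed in-memory structures and acknowledgement rules. Every name is 
> invented for illustration; this is a sketch of the idea, not an 
> implementation plan.
> {code:java}
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
> // Illustrative sketch of the proposed Reaper Model; all names are invented.
> public final class ReaperModelSketch {
>     /** A tombstone plus the replicas that have acknowledged it (item 2). */
>     static final class TrackedTombstone {
>         final Set<String> acks = ConcurrentHashMap.newKeySet();
>     }
>     private final Set<String> allReplicas;                                        // replicas for this range
>     private final Map<String, TrackedTombstone> live = new ConcurrentHashMap<>(); // keyed per deleted column
>     private final Set<String> relicIndex = ConcurrentHashMap.newKeySet();         // item 4: reaped tombstones
>     ReaperModelSketch(Set<String> allReplicas) {
>         this.allReplicas = allReplicas;
>     }
>     /** Record an ACK; once every replica has ACKed, the tombstone is reaped (items 3 and 4). */
>     void onAck(String key, String replica) {
>         TrackedTombstone t = live.get(key);
>         if (t == null)
>             return;                      // already reaped; the relic index remembers it
>         t.acks.add(replica);
>         if (t.acks.containsAll(allReplicas)) {
>             live.remove(key);            // safe to delete (or mark for compaction)
>             relicIndex.add(key);         // so later ACK requests can still be answered
>         }
>     }
>     /** Handle an ACK request streamed from another node's reaper (items 6 and 7). */
>     boolean onAckRequest(String key, String requestor) {
>         if (live.containsKey(key) || relicIndex.contains(key)) {
>             onAck(key, requestor);       // normal case: record the requestor's ACK
>             return true;                 // and reply with an ACK
>         }
>         // Worst case (item 7): unknown tombstone. Recreate it, record the requestor's ACK,
>         // and still reply with an ACK. Nothing is corrupted; a tombstone is merely recreated.
>         TrackedTombstone recreated = new TrackedTombstone();
>         recreated.acks.add(requestor);
>         live.put(key, recreated);
>         return true;
>     }
> }
> {code}
> The relic index is what lets a node acknowledge a tombstone it has already 
> reaped, which is why recreating a tombstone (item 7) is the worst case rather 
> than the common one.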
> ADDENDUM
> The proposal to hold the ACK and relic lists in memory was added after the 
> first posting. Please see the comments for the full reasoning. Furthermore, a 
> proposal for enhancements to repair was posted in the comments, which would 
> cause tombstones to be scavenged when repair completes (the author had 
> assumed this was the case anyway, but it seems that at the time of writing 
> they are only scavenged during compaction on GCSeconds timeout). The 
> proposals are not mutually exclusive, and this proposal is extended to 
> include the possible enhancements to repair described.
> NOTES
> * If a node goes down for a prolonged period, the worst that can happen is 
> that some tombstones are recreated across the cluster when the node restarts, 
> which does not corrupt data (and this will only occur for a very small number 
> of tombstones)
> * The system is simple to implement and predictable 
> * With the reaper model, repair would become an optional process for 
> optimizing the database to increase the consistency seen by 
> ConsistencyLevel.ONE reads, and for fixing up nodes, for example after an 
> sstable was lost
> h3. Planned Benefits
> * Reaper threads can utilize "spare" cycles to constantly scavenge tombstones 
> in the background thereby greatly reducing tombstone load, improving query 
> performance, reducing the system resources needed by processes such as 
> compaction, and making performance generally more predictable 
> * The reaper model means that GCSeconds is no longer necessary, which removes 
> the threat of data corruption when repair can't be run successfully within 
> that period (for example because of a new adopter's lack of Cassandra 
> expertise, a cron script failing, or Cassandra bugs or other technical 
> issues)
> * Reaper threads are fully automatic, work in the background, and perform 
> fine-grained operations where interruption has little effect. This is much 
> better for database administrators than having to manually run and manage 
> repair, whether for the purpose of preventing data corruption or of 
> optimizing performance; manual repair wastes operator time, often creates 
> load spikes, and has to be restarted after failure.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
