[ https://issues.apache.org/jira/browse/CASSANDRA-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169478#comment-13169478 ]

Dominic Williams edited comment on CASSANDRA-3620 at 12/14/11 4:03 PM:
-----------------------------------------------------------------------

OK... the complete solution: the whole tombstone reaping process could be 
performed in memory, because it fails safe.

PROPOSAL ADJUSTMENTS

* The tombstone acknowledgements and the relic list are held in memory
* A node's reaper thread only requests tombstone acknowledgements when it can 
see all replicas in the ring
* The reaper works within a configurable memory limit, and if there is a 
problem getting a tombstone acknowledgement, for example because a replica 
goes offline or a Cassandra exception occurs, it simply kicks the entry out 
of memory (see the sketch below)
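
To illustrate the fail-safe book-keeping, here is a minimal sketch (the class 
and method names are hypothetical, not an existing Cassandra API):

{code:java}
// Hypothetical sketch of the in-memory reaper book-keeping described above.
// Fail safe: dropping an entry never loses data, because the tombstone itself
// survives and its acknowledgements are simply requested again on a later pass.
import java.net.InetAddress;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ReaperState
{
    private final int maxTrackedTombstones; // configurable memory limit

    // tombstone id -> replicas that have acknowledged it so far
    private final Map<String, Set<InetAddress>> pendingAcks = new ConcurrentHashMap<>();

    public ReaperState(int maxTrackedTombstones)
    {
        this.maxTrackedTombstones = maxTrackedTombstones;
    }

    // Start tracking a tombstone; refuse (rather than block) when over the limit.
    public boolean track(String tombstoneId)
    {
        if (pendingAcks.size() >= maxTrackedTombstones)
            return false; // over budget: skip for now, retry on a later pass
        pendingAcks.putIfAbsent(tombstoneId, ConcurrentHashMap.newKeySet());
        return true;
    }

    // Record an ACK; returns true once every replica has acknowledged,
    // at which point the tombstone can safely be reaped.
    public boolean recordAck(String tombstoneId, InetAddress replica, Set<InetAddress> allReplicas)
    {
        Set<InetAddress> acks = pendingAcks.get(tombstoneId);
        if (acks == null)
            return false; // already evicted; will be retried later
        acks.add(replica);
        return acks.containsAll(allReplicas);
    }

    // On any problem (replica offline, exception), just kick the entry out.
    public void evict(String tombstoneId)
    {
        pendingAcks.remove(tombstoneId);
    }
}
{code}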


NOTES

* The reaping process now has no disk/storage overhead
* The memory and CPU savings achieved by not having to include tombstones in 
query processing, compaction etc. will greatly exceed the reaper's overhead
* The bandwidth savings achieved by nodes not having to send each other 
tombstones to calculate query results will greatly exceed the reaper's 
overhead (requesting/sending ACKs)

SPECIAL CASES

Where there is a large replication factor, for example RF=9, savings will 
still likely predominate. Even as regards overall bandwidth consumed, the 
requirement to request/send ACKs is probably more than offset by the need to 
share tombstones amongst the quorum of nodes (five, when RF=9) to process 
QUORUM reads. Furthermore, reaper bandwidth overhead needn't impede query 
processing, whereas the sharing of tombstones as part of query processing 
always does. 
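
For concreteness, the quorum arithmetic behind the RF=9 example (the standard 
majority formula, shown purely for illustration):

{code:java}
public class QuorumMath
{
    public static void main(String[] args)
    {
        int rf = 9;
        int quorum = rf / 2 + 1; // floor(9/2) + 1 = 5
        System.out.println("RF=" + rf + ": QUORUM reads touch " + quorum + " replicas");
        // Every QUORUM read must reconcile tombstones across those 5 replicas
        // on the query path; the reaper's ACK exchange happens once per
        // tombstone and can be throttled off the query path.
    }
}
{code}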

Also, this needn't be an either/or situation. If they want, administrators 
could simply turn off reaping and fall back to using the repair process (the 
Sword of Damocles, cough).

For the majority, tombstone reaping should:
* Dramatically improve query performance
* Greatly reduce administration overhead and complexity, and remove the 
biggest driver of data corruption
* Reduce memory and processor pressure by preventing tombstone buildup, thus 
indirectly reducing other issues
* Avoid the load spikes and associated problems caused by running repair.

> Proposal for distributed deletes - use "Reaper Model" rather than GCSeconds 
> and scheduled repairs
> -------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3620
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Dominic Williams
>              Labels: GCSeconds, deletes, distributed_deletes, 
> merkle_trees, repair
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Here is a proposal for an improved system for handling distributed deletes.
> h2. The Problem
> There are various issues with repair:
> * Repair is expensive anyway
> * Repair jobs are often made more expensive than they should be by other 
> issues (nodes dropping requests, hinted handoff not working, downtime etc)
> * Repair processes can often fail and need restarting, for example in cloud 
> environments where network issues make a node disappear 
> from the ring for a brief moment
> * When you fail to run repair within GCSeconds, either by error or because of 
> issues with Cassandra, data written to a node that did not see a later delete 
> can reappear (and a node might miss a delete for several reasons including 
> being down or simply dropping requests during load shedding)
> * If you cannot run repair and have to increase GCSeconds to prevent deleted 
> data reappearing, in some cases the growing tombstone overhead can 
> significantly degrade performance
> Because of the foregoing, in high throughput environments it can be very 
> difficult to make repair a cron job. It can be preferable to keep a terminal 
> open and run repair jobs one by one, making sure they succeed and keeping an 
> eye on overall load to reduce system impact. This isn't desirable, and 
> problems are exacerbated when there are lots of column families in a database 
> or it is necessary to run a column family with a low GCSeconds to reduce 
> tombstone load (because there are many write/deletes to that column family). 
> The database owner must run repair within the GCSeconds window, or increase 
> GCSeconds, to avoid potentially losing delete operations. 
> It would be much better if there was no ongoing requirement to run repair to 
> ensure deletes aren't lost, and no GCSeconds window. Ideally repair would be 
> an optional maintenance utility used in special cases, or to ensure ONE reads 
> get consistent data. 
> h2. "Reaper Model" Proposal
> # Tombstones do not expire, and there is no GCSeconds
> # Tombstones have associated ACK lists, which record the replicas that have 
> acknowledged them
> # Tombstones are only deleted (or marked for compaction) when they have been 
> acknowledged by all replicas
> # When a tombstone is deleted, it is added to a fast "relic" index of MD5 
> hashes of cf-key-name[-subName]-ackList. The relic index makes it possible 
> for a reaper to acknowledge a tombstone after it has been deleted (see the 
> sketch after this list)
> # Background "reaper" threads constantly stream ACK requests to other nodes, 
> and stream ACK responses back to requests they have received (throttling 
> their usage of CPU and bandwidth so as not to affect performance)
> # If a reaper receives a request to ACK a tombstone that does not exist, it 
> creates the tombstone and adds an ACK for the requestor, and replies with an 
> ACK 
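> As a minimal sketch of step 4 (illustrative only; the helper below is 
> hypothetical, not existing Cassandra code), the relic entry could be 
> computed like this:
> {code:java}
> // Hypothetical helper for step 4: the MD5 of cf-key-name[-subName]-ackList
> // identifies an already-deleted tombstone in the relic index.
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> import java.security.NoSuchAlgorithmException;
> 
> public class RelicIndex
> {
>     public static byte[] relicHash(String cf, String key, String name,
>                                    String subName, String ackList)
>         throws NoSuchAlgorithmException
>     {
>         StringBuilder id = new StringBuilder()
>             .append(cf).append('-').append(key).append('-').append(name);
>         if (subName != null)
>             id.append('-').append(subName);
>         id.append('-').append(ackList);
>         return MessageDigest.getInstance("MD5")
>                             .digest(id.toString().getBytes(StandardCharsets.UTF_8));
>     }
> }
> {code}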
> NOTES
> * The existence of entries in the relic index does not affect normal query 
> performance
> * If a node goes down, and comes up after a configurable relic entry timeout, 
> the worst that can happen is that a tombstone that hasn't received all its 
> acknowledgements is re-created across the replicas when the reaper requests 
> their acknowledgements (which is no big deal since this does not corrupt data)
> * Since early removal of entries in the relic index does not cause 
> corruption, it can be kept small, or even kept in memory
> * Simple to implement and predictable 
> h3. Planned Benefits
> * Operations are finely grained (reaper interruption is not an issue)
> * The labour & administration overhead associated with running repair can be 
> removed
> * Reapers can utilize "spare" cycles and run constantly in background to 
> prevent the load spikes and performance issues associated with repair
> * There will no longer be the threat of corruption if repair can't be run for 
> some reason (for example because of a new adopter's lack of Cassandra 
> expertise, a cron script failing, or Cassandra bugs preventing repair being 
> run, etc.)
> * Deleting tombstones earlier, thereby reducing the number involved in query 
> processing, will often dramatically improve performance
