[ 
https://issues.apache.org/jira/browse/CASSANDRA-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381562#comment-16381562
 ] 

Jeff Jirsa commented on CASSANDRA-14279:
----------------------------------------

Relevant / related mailing list post: 
https://lists.apache.org/thread.html/34e980c8e1ad6c06e28f99139f9bdec9878eb004da056a17774d0ad3@%3Cdev.cassandra.apache.org%3E


> Row Tombstones in separate sstables / separate compaction path
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-14279
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14279
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction, Local Write-Read Paths, Repair
>            Reporter: Constance Eustace
>            Priority: Major
>
> In my experience, if data is not well organized into time-windowed sstables, 
> Cassandra has enormous difficulty in actually deleting data if the data has a 
> "medium term" lifetime and is commingled with data that isn't marked for 
> death, as would happen with compactions or intermingled write patterns. Or 
> for example, you might have an active working set and be archiving "unused" 
> data to other tables or clusters. Or you may be purging data. Or you may be 
> migrating/sharding/restructuring data. Whatever the case, you want that disk 
> space back, and you might not be able to truncate.
> In STCS and LCS, row tombstones are intermingled with column data and column 
> tombstones. But a row tombstone represents a significant event in the data 
> lifecycle: it marks large amounts of data as "droppable" during compaction and 
> offers a shortcut that avoids reading data from other sstables. It could also 
> allow writes to be discarded in rare data patterns if the row tombstone is 
> ahead in time. 
> I am wondering whether, if row tombstones were isolated in their own sstables 
> and separately compacted and merged, compaction might work more efficiently: 
> * Reads can prioritize bloom filter lookups that indicate a row tombstone, 
>   getting the timestamp of the deletion first, then use that timestamp against 
>   the data sstables to filter data, or short-circuit the read if the row data 
>   has an overall "most recent data timestamp". 
> * Compaction could be forced to reference all the row tombstone sstables, so 
>   that every time two or more "data" sstables are compacted, they must 
>   consult the row tombstones to purge data. 
> In LCS, this would be particularly useful in getting data out of the upper 
> levels without having to wait for data to trickle up the tree. The row 
> tombstones, being read-only inputs into the data sstable compactions, can be 
> referenced in each of the LCS levels' parallel compactors. 
> Based on discussions in the dev list, this would appear to require some sort 
> of customization to the memtable->sstable flushing process, and perhaps a 
> different set of bloom filters. 
> Since the row tombstone sstables contain only <rowkey>,<tombstone timestamp> 
> pairs, they should be comparatively small and take less time to compact. They could be 
> aggressively compacted on a different schedule than "data" sstables. 
> In addition, it may be easier to repair/synchronize row tombstones across the 
> cluster if they have already been separated into their own sstables.
> Column/range tombstones may also benefit from a similar separation, but my 
> guess is that those are so much more numerous, large, and fine-grained that 
> they might as well coexist with the data.
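
The proposal quoted above could be modelled roughly as follows. This is only a toy sketch in Python, not Cassandra code: Cell, RowTombstones, merge_tombstones, compact and read_row are hypothetical names invented for illustration, the "sstables" are plain in-memory lists and dicts, and the sketch assumes that a row tombstone shadows any cell whose timestamp is less than or equal to the deletion timestamp. It ignores gc_grace_seconds, bloom filters, repair and the question of when the tombstones themselves may be dropped.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Cell:
    """One column value of a row (hypothetical, simplified model)."""
    row_key: str
    column: str
    value: str
    timestamp: int  # write time


# Hypothetical stand-in for a "row tombstone sstable":
# row key -> deletion timestamp.
RowTombstones = Dict[str, int]


def merge_tombstones(tables: Iterable[RowTombstones]) -> RowTombstones:
    """Compact tombstone tables, keeping the newest deletion per row key."""
    merged: RowTombstones = {}
    for table in tables:
        for key, deleted_at in table.items():
            if key not in merged or deleted_at > merged[key]:
                merged[key] = deleted_at
    return merged


def is_shadowed(cell: Cell, tombstones: RowTombstones) -> bool:
    """Assumption: a deletion shadows cells written at or before it."""
    deleted_at = tombstones.get(cell.row_key)
    return deleted_at is not None and cell.timestamp <= deleted_at


def compact(data_tables: Iterable[List[Cell]],
            tombstones: RowTombstones) -> List[Cell]:
    """Merge data tables, purging shadowed cells.

    The tombstones are a read-only input, so several compactions
    (for example one per LCS level) could consult the same table.
    """
    newest: Dict[Tuple[str, str], Cell] = {}
    for table in data_tables:
        for cell in table:
            if is_shadowed(cell, tombstones):
                continue  # droppable: covered by a row tombstone
            key = (cell.row_key, cell.column)
            if key not in newest or cell.timestamp > newest[key].timestamp:
                newest[key] = cell
    return sorted(newest.values(), key=lambda c: (c.row_key, c.column))


def read_row(row_key: str,
             data_tables: Iterable[List[Cell]],
             tombstones: RowTombstones) -> Dict[str, str]:
    """Consult the tombstones first, then filter the data tables."""
    live = compact(
        ([c for c in table if c.row_key == row_key] for table in data_tables),
        tombstones,
    )
    return {cell.column: cell.value for cell in live}
```

In this model, a row deleted at timestamp 20 loses a cell written at 15 during any compaction that sees the tombstone table, while a cell written at 25 survives, regardless of which data table or level each cell lives in.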



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

