[ https://issues.apache.org/jira/browse/CASSANDRA-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381562#comment-16381562 ]
Jeff Jirsa commented on CASSANDRA-14279:
----------------------------------------

Relevant / related mailing list post:
https://lists.apache.org/thread.html/34e980c8e1ad6c06e28f99139f9bdec9878eb004da056a17774d0ad3@%3Cdev.cassandra.apache.org%3E

> Row Tombstones in separate sstables / separate compaction path
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-14279
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14279
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction, Local Write-Read Paths, Repair
>            Reporter: Constance Eustace
>            Priority: Major
>
> In my experience if data is not well organized into time windowed sstables,
> Cassandra has enormous difficulty in actually deleting data if the data has a
> "medium term" lifetime and is commingled with data that isn't marked for
> death, as would happen with compactions or intermingled write patterns. Or
> for example, you might have an active working set and be archiving "unused"
> data to other tables or clusters. Or you may be purging data. Or you may be
> migrating/sharding/restructuring data. Whatever the case, you want that disk
> space back, and you might not be able to truncate.
>
> In STCS and LCS, row tombstones are intermingled with column data and column
> tombstones. But a row tombstone represents a significant event in data
> lifecycle: large amounts of "droppable" data during compaction and a shortcut
> from reading data from other sstables. It could also enable writes to be
> discarded in rare data patterns if the row tombstone is ahead in time.
>
> I am wondering whether isolating row tombstones in their own sstables,
> separately compacted and merged, might enable reads and compaction to work
> more efficiently:
>
> - reads can prioritize bloom filter lookups that indicate a row tombstone,
>   getting the timestamp of the deletion first, then can use that in the data
>   sstables to filter data, or short-circuit the read if the row data had an
>   overall "most recent data timestamp".
> - compaction could be forced to reference all the row tombstone sstables,
>   such that every time two or more "data" sstables are compacted, they must
>   reference the row tombstones to purge data.
>
> In LCS, this would be particularly useful in getting data out of the upper
> levels without having to wait for data to trickle up the tree. The row
> tombstones, being read-only inputs into the data sstable compactions, can be
> referenced in each of the LCS levels' parallel compactors.
>
> Based on discussions in the dev list, this would appear to require some sort
> of customization to the memtable->sstable flushing process, and perhaps a
> different set of bloom filters.
>
> Since the row tombstone sstables are all <rowkey>,<tombstone timestamp>, they
> should be comparatively smaller and take less time to compact. They could be
> aggressively compacted on a different schedule than "data" sstables.
>
> In addition, it may be easier to repair/synchronize row tombstones across the
> cluster if they have already been separated into their own sstables.
>
> Column/range tombstones may also benefit from a similar separation, but my
> guess is that those are so much more numerous, large, and fine-grained that
> they might as well coexist with the data.
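
The proposal in the description can be illustrated with a small toy model. This is only a sketch of the idea, not Cassandra code: `Cell`, `merge_tombstones`, and `compact` are hypothetical names, sstables are modelled as plain Python lists and dicts, and the model assumes a cell is shadowed when its write timestamp is less than or equal to the row deletion timestamp. It ignores gc_grace_seconds, repair state, and bloom filters entirely.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Cell(NamedTuple):
    """One column value in a toy "data" sstable (hypothetical model)."""
    row_key: str
    column: str
    value: str
    timestamp: int  # write timestamp


def merge_tombstones(tables: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Compact row-tombstone tables of <rowkey>,<tombstone timestamp> pairs,
    keeping only the newest deletion timestamp per row key."""
    merged: Dict[str, int] = {}
    for table in tables:
        for key, deleted_at in table.items():
            if key not in merged or deleted_at > merged[key]:
                merged[key] = deleted_at
    return merged


def compact(data_tables: Iterable[List[Cell]],
            tombstones: Dict[str, int]) -> List[Cell]:
    """Merge data tables, dropping cells shadowed by a row tombstone.

    The tombstone index is a read-only input and is never rewritten here,
    so several data compactions could consult the same index.
    """
    newest: Dict[Tuple[str, str], Cell] = {}
    for table in data_tables:
        for cell in table:
            deleted_at = tombstones.get(cell.row_key)
            # Assumption: a deletion shadows writes at or before its timestamp.
            if deleted_at is not None and cell.timestamp <= deleted_at:
                continue
            ident = (cell.row_key, cell.column)
            current = newest.get(ident)
            if current is None or cell.timestamp > current.timestamp:
                newest[ident] = cell
    return sorted(newest.values(), key=lambda c: (c.row_key, c.column))


if __name__ == "__main__":
    index = merge_tombstones([{"a": 10}, {"a": 20, "b": 5}])
    old = [Cell("a", "x", "1", 15), Cell("b", "x", "1", 7)]
    new = [Cell("a", "x", "2", 25), Cell("b", "y", "1", 5)]
    for cell in compact([old, new], index):
        print(cell)
```

In this model the tombstone tables stay small because each holds a single timestamp per row key, which is the property the description relies on when it suggests compacting them on a more aggressive schedule than the data sstables.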