Constance Eustace created CASSANDRA-14279:
---------------------------------------------

             Summary: Row Tombstones in separate sstables / separate compaction path
                 Key: CASSANDRA-14279
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14279
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Constance Eustace


In my experience, if data is not well organized into time-windowed sstables,
Cassandra has enormous difficulty actually deleting data that has a
"medium-term" lifetime. For example, you might have an active working set
and be archiving "unused" data to other tables or clusters. Or you may be
purging data. Or you may be migrating/sharding data. Whatever the case, you
want that disk space back.

In STCS and LCS, row tombstones are intermingled with column data and column
tombstones. But a row tombstone represents a big event: large amounts of
"droppable" data in an sstable, or even a shortcut that avoids reading data
from other sstables.

I am wondering whether isolating row tombstones in their own sstables,
compacted and merged separately, might enable compaction to work more
efficiently:

Reads can prioritize bloom filter lookups that indicate a row tombstone,
getting the timestamp of the deletion first, and can then use it in the data
sstables to filter data, or skip an sstable's copy of the row entirely if the
row data carries an overall "most recent data timestamp" that is no newer than
the deletion.

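A rough sketch of that read path, in Python rather than Cassandra's Java. All
names here (DataSSTable, max_ts, read_row) are hypothetical, plain dicts stand
in for sstables and bloom filters, and the sketch assumes a cell is shadowed
when its timestamp is less than or equal to the row deletion timestamp.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, str, int]          # (column, value, write timestamp)
TombstoneSSTable = Dict[str, int]    # row key -> row deletion timestamp


class DataSSTable:
    """Hypothetical data sstable that tracks a per-row max cell timestamp."""

    def __init__(self, rows: Dict[str, List[Cell]]):
        self.rows = rows
        # The "most recent data timestamp" for each row in this sstable.
        self.max_ts = {k: max((c[2] for c in cells), default=-1)
                       for k, cells in rows.items()}
        self.rows_scanned = 0        # counts rows whose cells were actually read


def latest_row_deletion(key: str,
                        tombstones: List[TombstoneSSTable]) -> Optional[int]:
    """Consult the tombstone sstables first; return the newest deletion."""
    stamps = [t[key] for t in tombstones if key in t]
    return max(stamps) if stamps else None


def read_row(key: str,
             tombstones: List[TombstoneSSTable],
             data_sstables: List[DataSSTable]) -> Dict[str, str]:
    deleted_at = latest_row_deletion(key, tombstones)
    newest: Dict[str, Tuple[str, int]] = {}
    for sst in data_sstables:
        if key not in sst.rows:
            continue
        # Short-circuit: everything this sstable holds for the row is shadowed.
        if deleted_at is not None and sst.max_ts[key] <= deleted_at:
            continue
        sst.rows_scanned += 1
        for column, value, ts in sst.rows[key]:
            if deleted_at is not None and ts <= deleted_at:
                continue             # filter individual shadowed cells
            if column not in newest or ts > newest[column][1]:
                newest[column] = (value, ts)
    return {column: value for column, (value, _) in newest.items()}
```

With a row deletion at timestamp 30, an sstable whose newest cell for that row
is at 20 is never scanned, while a cell written at 40 is still returned.
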
Compaction could be forced to reference all the row tombstone sstables, so
that every time two or more "data" sstables are compacted, they must consult
the row tombstones to purge data.
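
To make the idea concrete, here is a minimal sketch of a data compaction that
takes the row tombstone sstables as read-only inputs. It is an illustration
under stated assumptions, not Cassandra's compaction code: the function name
compact and the dict-based sstables are hypothetical, and it ignores
gc_grace_seconds, column/range tombstones, and on-disk format.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str, int]  # (column, value, write timestamp)


def compact(data_sstables: List[Dict[str, List[Cell]]],
            tombstone_sstables: List[Dict[str, int]]) -> Dict[str, List[Cell]]:
    """Merge data sstables, dropping cells shadowed by any row tombstone."""
    # Collapse all tombstone sstables to the newest deletion per row key.
    # The tombstone sstables themselves are only read, never rewritten here.
    deleted_at: Dict[str, int] = {}
    for tombstones in tombstone_sstables:
        for key, ts in tombstones.items():
            deleted_at[key] = max(ts, deleted_at.get(key, ts))

    merged: Dict[str, Dict[str, Cell]] = {}
    for sstable in data_sstables:
        for key, cells in sstable.items():
            cutoff = deleted_at.get(key)
            for cell in cells:
                column, _, ts = cell
                if cutoff is not None and ts <= cutoff:
                    continue  # purge: shadowed by the row tombstone
                kept = merged.setdefault(key, {})
                # Usual last-write-wins reconciliation between inputs.
                if column not in kept or ts > kept[column][2]:
                    kept[column] = cell
    return {key: sorted(cols.values()) for key, cols in merged.items()}
```

Rows whose cells are all shadowed simply do not appear in the output, which is
where the disk space would come back.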

In LCS, this would be particularly useful in getting data out of the upper 
levels without having to wait for data to trickle up the tree. The row 
tombstones, being read-only inputs into the data sstable compactions, can be 
referenced in each of the LCS levels' parallel compactors. 

Based on discussions in the dev list, this would appear to require some sort of 
customization to the memtable->sstable flushing process, and perhaps a 
different set of bloom filters. 
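
One way to picture the flush customization is a flush that writes two outputs
instead of one. This is only a sketch: MemtableRow and flush are hypothetical
names, dicts stand in for sstables, and the separate bloom filter is not
modeled (the key set of the tombstone dict plays that role).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, str, int]  # (column, value, write timestamp)


@dataclass
class MemtableRow:
    cells: List[Cell] = field(default_factory=list)
    row_deleted_at: Optional[int] = None  # row tombstone timestamp, if any


def flush(memtable: Dict[str, MemtableRow]
          ) -> Tuple[Dict[str, List[Cell]], Dict[str, int]]:
    """Split one memtable into a data sstable and a row tombstone sstable."""
    data: Dict[str, List[Cell]] = {}
    tombstones: Dict[str, int] = {}
    for key in sorted(memtable):          # sstables are written in key order
        row = memtable[key]
        if row.row_deleted_at is not None:
            tombstones[key] = row.row_deleted_at
        # Cells written after the deletion (or with no deletion) stay live.
        live = [c for c in row.cells
                if row.row_deleted_at is None or c[2] > row.row_deleted_at]
        if live:
            data[key] = live
    return data, tombstones
```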

Since the row tombstone sstables contain only <rowkey>,<tombstone timestamp>
pairs, they should be comparatively small and take less time to compact. They
could be aggressively compacted on a different schedule than "data" sstables.
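
Compacting such sstables could be little more than a sorted merge that keeps
the newest deletion per key. A minimal sketch, assuming each tombstone sstable
is a list of (rowkey, timestamp) pairs already sorted by key; the function name
compact_tombstones is hypothetical:

```python
import heapq
from typing import List, Tuple

Entry = Tuple[str, int]  # (row key, tombstone timestamp)


def compact_tombstones(runs: List[List[Entry]]) -> List[Entry]:
    """Merge sorted tombstone runs, keeping the newest timestamp per key."""
    out: List[Entry] = []
    for key, ts in heapq.merge(*runs):    # streaming k-way merge of sorted runs
        if out and out[-1][0] == key:
            out[-1] = (key, max(out[-1][1], ts))
        else:
            out.append((key, ts))
    return out
```

Because the merge streams through its inputs, the cost is roughly linear in
the number of tombstone entries, independent of how much data they shadow.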

In addition, it may be easier to repair/synchronize row tombstones across the 
cluster if they have already been separated into their own sstables.

Column/range tombstones may also benefit from a similar separation, but my
guess is that those are so much more numerous, larger, and finer-grained that
they might as well coexist with the data.



