[ 
https://issues.apache.org/jira/browse/CASSANDRA-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Constance Eustace updated CASSANDRA-14279:
------------------------------------------
    Description: 
In my experience, if data is not well organized into time-windowed sstables, 
Cassandra has enormous difficulty actually deleting data when that data has a 
"medium-term" lifetime and is commingled with data that isn't marked for death, 
as happens with compaction or intermingled write patterns. For example, you 
might have an active working set and be archiving "unused" data to other tables 
or clusters. Or you may be purging data, or migrating/sharding/restructuring 
it. Whatever the case, you want that disk space back, and you might not be able 
to truncate.

In STCS and LCS, row tombstones are intermingled with column data and column 
tombstones. But a row tombstone represents a significant event in the data 
lifecycle: it marks large amounts of "droppable" data during compaction and 
offers a shortcut that avoids reading data from other sstables. In rare write 
patterns it could even allow writes to be discarded outright, if the row 
tombstone's timestamp is ahead of them. 

I am wondering whether, if row tombstones were isolated in their own sstables 
and separately compacted and merged, compaction might work more efficiently: 

Reads could prioritize bloom filter lookups that indicate a row tombstone, 
obtaining the deletion timestamp first, then use that timestamp against the 
data sstables to filter cells, or to short-circuit an sstable entirely if it 
tracks an overall "most recent data timestamp". 
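
The read path described above can be sketched roughly as follows. This is an 
illustrative Python sketch, not Cassandra internals; every structure and method 
name here (bloom_might_contain, max_timestamp, etc.) is a hypothetical 
stand-in:

```python
# Sketch: consult row-tombstone sstables first, then use the deletion
# timestamp to skip data sstables whose newest data predates the delete.
# All classes/methods are hypothetical stand-ins, not Cassandra internals.

def read_row(key, tombstone_sstables, data_sstables):
    # 1. Find the newest row-tombstone timestamp for this key, if any.
    delete_ts = None
    for ts_table in tombstone_sstables:
        if ts_table.bloom_might_contain(key):
            t = ts_table.get(key)  # deletion timestamp or None
            if t is not None and (delete_ts is None or t > delete_ts):
                delete_ts = t

    # 2. Read data, short-circuiting any sstable whose "most recent data
    #    timestamp" is older than the deletion (all its cells are shadowed).
    cells = []
    for sstable in data_sstables:
        if delete_ts is not None and sstable.max_timestamp <= delete_ts:
            continue  # whole sstable shadowed by the row tombstone
        for cell in sstable.get_cells(key):
            if delete_ts is None or cell.timestamp > delete_ts:
                cells.append(cell)
    return cells
```

The point of the sketch is ordering: the tombstone lookup is cheap (bloom 
filter plus a small sstable), and its result can prune whole data sstables 
before any of them are read.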

Compaction could be required to reference all the row tombstone sstables, such 
that every time two or more "data" sstables are compacted, they consult the 
row tombstones to purge data. 
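
A minimal sketch of that compaction rule, assuming (hypothetically) that the 
merged data rows can be streamed and that the tombstone sstables expose a 
per-key deletion-timestamp lookup:

```python
# Sketch: during a data-sstable compaction, consult the row-tombstone
# sstables (read-only inputs) and drop every cell shadowed by a row delete.
# tombstone_lookup is a hypothetical stand-in, not a real Cassandra API.

def compact_with_tombstones(merged_rows, tombstone_lookup):
    """merged_rows: iterable of (key, cells) merged from the data sstables
    being compacted; tombstone_lookup(key) -> deletion timestamp or None."""
    for key, cells in merged_rows:
        delete_ts = tombstone_lookup(key)
        if delete_ts is not None:
            # Purge every cell older than the row tombstone.
            cells = [c for c in cells if c.timestamp > delete_ts]
        if cells:
            yield key, cells
        # Rows with no surviving cells are dropped outright, which is
        # where the reclaimed disk space comes from.
```

Because the tombstone sstables are read-only inputs here, any number of data 
compactions could consult them concurrently.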

In LCS, this would be particularly useful for getting data out of the upper 
levels without having to wait for it to trickle up the tree. The row 
tombstones, being read-only inputs to the data sstable compactions, could be 
referenced by each LCS level's parallel compactors. 

Based on discussions on the dev list, this would appear to require some 
customization of the memtable->sstable flushing process, and perhaps a 
different set of bloom filters. 

Since the row tombstone sstables contain only <rowkey>,<tombstone timestamp> 
pairs, they should be comparatively small and take less time to compact. They 
could be aggressively compacted on a different schedule than the "data" 
sstables. 
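
Because each entry is just a key and a timestamp, compacting tombstone 
sstables reduces to a k-way merge that keeps the newest deletion per key. A 
hypothetical sketch (assuming each sstable is a key-sorted stream of pairs):

```python
# Sketch: compacting row-tombstone sstables is a k-way merge of
# (rowkey, deletion_timestamp) pairs, keeping the newest timestamp per key.
import heapq

def merge_tombstone_sstables(sstables):
    """Each sstable is a key-sorted iterable of (rowkey, timestamp) pairs;
    yields one merged, sorted stream with a single entry per key."""
    merged = heapq.merge(*sstables)  # inputs stay sorted by rowkey
    current_key, current_ts = None, None
    for key, ts in merged:
        if key == current_key:
            current_ts = max(current_ts, ts)  # newest deletion wins
        else:
            if current_key is not None:
                yield current_key, current_ts
            current_key, current_ts = key, ts
    if current_key is not None:
        yield current_key, current_ts
```

Nothing here is proportional to the amount of deleted data, only to the number 
of deleted keys, which is why these compactions should stay cheap.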

In addition, it may be easier to repair/synchronize row tombstones across the 
cluster if they have already been separated into their own sstables.

Column/range tombstones might also benefit from a similar separation, but my 
guess is that they are so much more numerous, large, and fine-grained that 
they might as well coexist with the data.

> Row Tombstones in separate sstables / separate compaction path
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-14279
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14279
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction, Local Write-Read Paths, Repair
>            Reporter: Constance Eustace
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
