[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333894#comment-16333894 ]

Lerh Chuan Low commented on CASSANDRA-8460:
-------------------------------------------

Thinking about this further, it looks like this will be (reasonably) complex. 

The main issue is that by introducing an archival directory, we now have 
multiple data directories, which is effectively a JBOD setup. 
https://issues.apache.org/jira/browse/CASSANDRA-6696 (Partition SSTables by 
token range) seeks to prevent resurrected tombstones - the scenario in which 
tombstones can be resurrected is described here: 
https://www.datastax.com/dev/blog/improving-jbod. 

However, with an archiving directory, we can no longer guarantee that a single 
token range (or vnode) will live in one directory (unless I'm missing 
something): archiving is based on SSTable age, so it doesn't know anything 
about tokens.
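
To make that contrast concrete, here is a rough, self-contained sketch (not 
Cassandra's actual placement code - tokens and disk boundaries are simplified 
to plain longs, and the names are made up) of token-range placement as in 
CASSANDRA-6696 versus age-based archiving:

{code:java}
import java.util.List;

// Contrast of the two placement rules (illustrative only).
public class PlacementSketch
{
    // Token-range placement (CASSANDRA-6696 style): directory i owns tokens <= boundaries.get(i).
    static int directoryByToken(long token, List<Long> boundaries)
    {
        for (int i = 0; i < boundaries.size(); i++)
            if (token <= boundaries.get(i))
                return i;
        return boundaries.size() - 1;
    }

    // Age-based archiving: directory depends only on how old the data is. 0 = hot, 1 = cold.
    static int directoryByAge(long maxTimestampMillis, long archiveAfterMillis, long nowMillis)
    {
        return (nowMillis - maxTimestampMillis) >= archiveAfterMillis ? 1 : 0;
    }

    public static void main(String[] args)
    {
        List<Long> boundaries = List.of(0L, Long.MAX_VALUE); // two disks, split at token 0
        long tokenOfX = 42L;

        // The same key always maps to the same disk under token placement...
        System.out.println("token placement for x: disk " + directoryByToken(tokenOfX, boundaries));

        // ...but under age placement, old data for x is cold while a fresh tomb(x) is hot.
        long now = System.currentTimeMillis();
        long tenDays = 10L * 24 * 60 * 60 * 1000;
        System.out.println("old data for x: disk " + directoryByAge(now - tenDays, tenDays, now));
        System.out.println("fresh tomb(x):  disk " + directoryByAge(now, tenDays, now));
    }
}
{code}

Under token placement, x and tomb(x) always land on the same disk; under age 
placement they can end up on different disks, which is exactly the scenario 
below.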

At a high level, the situation goes like this: 

1. You have an SSD and an HDD. 
2. Key x is written to the SSD. 
3. After some time, x passes the archive days and ends up on the HDD. 
4. For some reason that isn't quite clear, the user decides to write a 
tombstone for x (they shouldn't with TWCS). So we now have tomb(x) on the SSD. 

At this point, we must keep in mind that there are 3 separate 
{{CompactionStrategy}} (CS) instances running per data directory - one each for 
repaired, unrepaired and pending-repair SSTables - so there are 3 on the SSD 
and 3 on the HDD. These CS instances cannot see each other's candidates; when 
considering candidates for compaction, they see only the SSTables in their own 
directory. 
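
To illustrate why that matters, here is a toy model (not the real 
{{CompactionStrategyManager}}; names and layout are invented) of per-directory 
strategy instances that only ever see their own SSTables:

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of per-directory strategy instances; not the real CompactionStrategyManager.
public class PerDirectoryStrategies
{
    enum RepairStatus { REPAIRED, UNREPAIRED, PENDING_REPAIR }

    static class StrategyInstance
    {
        final String directory;
        final RepairStatus status;
        final Set<String> sstables = new HashSet<>(); // the only SSTables this instance can see

        StrategyInstance(String directory, RepairStatus status)
        {
            this.directory = directory;
            this.status = status;
        }

        // Candidate selection never looks outside this instance's own set.
        Collection<String> compactionCandidates()
        {
            return sstables;
        }
    }

    public static void main(String[] args)
    {
        List<StrategyInstance> instances = new ArrayList<>();
        for (String dir : List.of("/ssd/data", "/hdd/archive"))
            for (RepairStatus status : RepairStatus.values())
                instances.add(new StrategyInstance(dir, status)); // 3 per directory, 6 total

        // x lives on the HDD, tomb(x) on the SSD: no single instance ever sees both,
        // so no compaction can merge them and purge the pair together.
        instances.get(3).sstables.add("x-data");  // /hdd/archive, REPAIRED
        instances.get(0).sstables.add("tomb(x)"); // /ssd/data, REPAIRED

        for (StrategyInstance s : instances)
            System.out.println(s.directory + " " + s.status + " -> " + s.compactionCandidates());
    }
}
{code}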

5. gc_grace_seconds passes and tomb(x) is compacted away, so now x is 
resurrected. In an actual JBOD setup this can't happen, because a single token 
range or vnode can only live in one directory. That can't be guaranteed with an 
archiving setup. 

We can solve this issue by introducing a new flag that only allows a tombstone 
to be dropped if it lives in the archiving directory. Enforcing 
{{gc_grace > archive_days}} alone is not sufficient, because the node can 
always be taken offline, compactions disabled, or similar. 
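
A minimal sketch of what that gate could look like, assuming a hypothetical 
setting (every name below is made up, not an existing Cassandra option):

{code:java}
import java.nio.file.Path;

// Hypothetical gate for the proposed flag; the setting name is invented.
public class ColdOnlyTombstonePurge
{
    // Hypothetical flag: only purge tombstones from SSTables under a cold directory.
    static boolean onlyPurgeTombstonesInColdDirectory = true;

    static boolean purgeable(long localDeletionTime, long gcBefore, Path sstableDirectory, Path coldDirectory)
    {
        boolean gcExpired = localDeletionTime < gcBefore;
        if (!onlyPurgeTombstonesInColdDirectory)
            return gcExpired;
        // gc_grace > archive_days alone isn't enough (the node could have been down,
        // compactions disabled, ...), so also require the tombstone to physically
        // live under the cold directory before it is dropped.
        return gcExpired && sstableDirectory.startsWith(coldDirectory);
    }

    public static void main(String[] args)
    {
        Path cold = Path.of("/hdd/cold");
        long nowSeconds = System.currentTimeMillis() / 1000;
        long gcBefore = nowSeconds - 864_000;        // gc_grace_seconds of 10 days
        long oldDeletion = nowSeconds - 2_000_000;   // tombstone written ~23 days ago

        System.out.println(purgeable(oldDeletion, gcBefore, Path.of("/ssd/data/ks/tbl"), cold)); // false: still hot
        System.out.println(purgeable(oldDeletion, gcBefore, Path.of("/hdd/cold/ks/tbl"), cold)); // true: cold and past gc
    }
}
{code}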

Consider also the case where: 

6. The SSD is corrupted and needs to be replaced. In this case, the fix would 
be to replace the entire node, not just the SSD. This prevents tombstone 
resurrection, and the system tables are gone anyway (system tables live on the 
SSD), so a full replace is needed. 

This is the high-level design we came up with: 
* In the typical TTL use case, the TTL should always be greater than the 
archive days. 
* Introduce a new YAML setting; call it cold_data_directories, possibly. This 
is to signal that 'archive' doesn't mean we can just forget data there; 
compactions still need to happen in that directory, for joining nodes, 
streaming nodes, and keeping disk usage low. 
* An option on TWCS to specify moving SSTables to the cold directory after a 
certain number of days. 
* A new flag to handle the situation described above - tombstones cannot be 
dropped unless they are in the cold directory. This also implies that we can't 
drop data using tombstones on the non-archived data. That pretty much means we 
can't use manual deletions on the table, and we should only use this when 
TTLing everything, writing once, and with read repair turned off. 
* Separate compaction throughput and concurrent compactors settings for the 
cold directory (a rough sketch of how these settings could hang together 
follows this list). 
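
Here is that sketch. Every name in it ({{cold_data_directories}}, 
{{cold_after_days}}, {{cold_compaction_throughput_mb_per_sec}}, 
{{cold_concurrent_compactors}}) is hypothetical, just to make the list above 
concrete:

{code:java}
import java.util.Map;

// Illustrative option handling only; every setting name here is hypothetical.
public class ColdStorageOptionsSketch
{
    final String[] coldDataDirectories;          // new yaml setting: where archived SSTables go
    final int coldAfterDays;                     // new TWCS table option: age threshold for archiving
    final int coldCompactionThroughputMbPerSec;  // separate throughput for cold-directory compactions
    final int coldConcurrentCompactors;          // separate compactor count for the cold directories

    ColdStorageOptionsSketch(Map<String, String> yaml, Map<String, String> tableOptions)
    {
        this.coldDataDirectories = yaml.getOrDefault("cold_data_directories", "").split(",");
        this.coldAfterDays = Integer.parseInt(tableOptions.getOrDefault("cold_after_days", "0"));
        this.coldCompactionThroughputMbPerSec = Integer.parseInt(yaml.getOrDefault("cold_compaction_throughput_mb_per_sec", "16"));
        this.coldConcurrentCompactors = Integer.parseInt(yaml.getOrDefault("cold_concurrent_compactors", "1"));
    }

    // Intended usage pattern: default TTL comfortably above the archive threshold,
    // write-once data, no manual deletions, read repair off.
    boolean ttlIsSane(int defaultTimeToLiveSeconds)
    {
        return defaultTimeToLiveSeconds > coldAfterDays * 86400;
    }

    public static void main(String[] args)
    {
        ColdStorageOptionsSketch opts = new ColdStorageOptionsSketch(
                Map.of("cold_data_directories", "/hdd/cold",
                       "cold_compaction_throughput_mb_per_sec", "8"),
                Map.of("cold_after_days", "30"));

        System.out.println("archive after days: " + opts.coldAfterDays);
        System.out.println("60-day TTL ok? " + opts.ttlIsSane(60 * 86400));
    }
}
{code}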

Caveats with changes to flags/properties:
* Removing the cold directories from the YAML means we've lost the data in 
those directories.
* Removing the cold option from the table only means data will no longer be 
archived to cold storage. Existing SSTables in the cold directory should still 
be loaded in; however, if compacted, they move back to hot storage.
* Reducing the archive time on the table will just cause more data to be moved 
to the cold directory.
* Increasing the archive time means existing data that should no longer be 
archived could go back to the live set if compacted; otherwise it will stay in 
cold storage with no negative impact.
* When promoting data to the cold directory, we need to check that there isn't 
an overlapping SSTable with a max timestamp greater than the candidate's 
minimum timestamp, the same check as TWCS expiry (sketched below).
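
And a sketch of that last overlap check, written as stated in the caveat above 
(token ranges and timestamps are simplified to longs, the class names are 
invented, and none of this is existing Cassandra code):

{code:java}
import java.util.List;

// Sketch of the pre-promotion overlap check described in the caveat above.
public class PromotionOverlapCheck
{
    static class SSTableStats
    {
        final long minToken, maxToken;          // covered token range, simplified to longs
        final long minTimestamp, maxTimestamp;  // write timestamps

        SSTableStats(long minToken, long maxToken, long minTimestamp, long maxTimestamp)
        {
            this.minToken = minToken;
            this.maxToken = maxToken;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }

        boolean overlaps(SSTableStats other)
        {
            return minToken <= other.maxToken && other.minToken <= maxToken;
        }
    }

    static boolean safeToPromote(SSTableStats candidate, List<SSTableStats> liveSet)
    {
        for (SSTableStats other : liveSet)
        {
            if (other == candidate || !candidate.overlaps(other))
                continue;
            // An overlapping SSTable holds writes newer than the candidate's oldest data:
            // hold off on archiving, just as TWCS holds off on dropping expired SSTables.
            if (other.maxTimestamp > candidate.minTimestamp)
                return false;
        }
        return true;
    }

    public static void main(String[] args)
    {
        SSTableStats oldTable = new SSTableStats(0, 100, 1_000, 2_000);
        SSTableStats newerOverlapping = new SSTableStats(50, 150, 5_000, 6_000);

        System.out.println(safeToPromote(oldTable, List.of(oldTable, newerOverlapping))); // false
        System.out.println(safeToPromote(oldTable, List.of(oldTable)));                   // true
    }
}
{code}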

There will still be significant I/O when it comes to 
compacting/repairing/streaming the SSTables in the cold directory, and it adds 
a fair amount of complexity to the code base. It's not trivial to reason about 
either; it took my colleagues and me 3 hours. The only leftover question we 
had: when changing the table-level property, will Cassandra need to be 
restarted for it to take effect, or is there a hook/property that is checked 
continually?

Did anybody notice anything we missed, or have any thoughts so far on the 
feature itself and the value it adds for the complexity introduced (if you have 
time), before we go ahead with it? It would be really appreciated! [~krummas] 
[~bdeggleston] [~jjirsa] [~stone] 




> Make it possible to move non-compacting sstables to slow/big storage in DTCS
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8460
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8460
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Eriksson
>            Priority: Major
>              Labels: doc-impacting, dtcs
>             Fix For: 4.x
>
>
> It would be nice if we could configure DTCS to have a set of extra data 
> directories where we move the sstables once they are older than 
> max_sstable_age_days. 
> This would enable users to have a quick, small SSD for hot, new data, and big 
> spinning disks for data that is rarely read and never compacted.


