[ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333894#comment-16333894 ]
Lerh Chuan Low commented on CASSANDRA-8460:
-------------------------------------------

Thinking about this further, it looks like this will be (reasonably) complex. The main issue is that by introducing an archival directory we now have multiple data directories, which is effectively a JBOD setup. https://issues.apache.org/jira/browse/CASSANDRA-6696 (Partition SSTables by token range) seeks to prevent data resurrection; the scenario where deleted data can come back is described here: https://www.datastax.com/dev/blog/improving-jbod. With an archiving directory, however, we can no longer guarantee that a single token range (or vnode) lives in one directory (unless I'm missing something - archiving is based on SSTable age and knows nothing about tokens).

At a high level, the situation goes like this:

1. You have an SSD and an HDD.
2. Key x is written to the SSD.
3. After some time, x passes the archive age and ends up on the HDD.
4. For some reason the user decides to write a tombstone for x (they shouldn't with TWCS). We now have tomb(x) on the SSD. Keep in mind that there are 3 separate {{CompactionStrategy}} (CS) instances running per directory - one each for repaired, unrepaired and pending-repair SSTables - so 3 on the SSD and 3 on the HDD. These CS instances cannot see each other's candidates; when considering candidates for compaction, they only see the SSTables in their own directories.
5. gc_grace_seconds passes and tomb(x) is compacted away. x is now resurrected.

In an actual JBOD setup this can't happen, because a single token range or vnode only ever lives in one directory; that can't be guaranteed with an archiving setup. We can solve this by introducing a new flag so that a tombstone is only dropped if it lives in the archiving directory. Enforcing {{gc_grace > archive_days}} is not sufficient, because the node can always be taken offline, compactions disabled, and so on.

Consider also the case where:

6. The SSD is corrupted and needs to be replaced.

In that case the fix would be to replace the entire node, not just the SSD. This prevents data resurrection, and in any case the system tables (which live on the SSD) are gone, so a full replace is needed anyway.

This is the high-level design we came up with:

* In the typical TTL use case, TTL should always be greater than archive days.
* Introduce a new YAML setting; possibly call it cold_data_directories. Calling it 'cold' rather than 'archive' signals that we can't just forget data there: compactions still need to happen in that directory, for joining nodes, streaming nodes, and to keep disk usage low.
* An option on TWCS to specify that SSTables move to the cold directory after a certain number of days.
* A new flag to handle the situation described above: tombstones cannot be dropped unless they are in the cold directory (see the sketch after this list). This implies we can't drop data via tombstones in the non-archived data, which pretty much means no manual deletions on the table; we should only use this when TTLing everything, writing once, and with read repair turned off.
* A separate compaction throughput and concurrent compactors setting for the cold directory.
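To make the tombstone rule concrete, below is a minimal, self-contained sketch of the guard logic, written against plain java.nio paths rather than the real compaction internals. The class name, method names and the coldDataDirectories list are hypothetical placeholders for illustration, not existing Cassandra APIs:

{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: tombstones may only be purged from SSTables that
// already live under one of the configured cold data directories.
public final class ColdDirectoryTombstoneGuard
{
    private final List<Path> coldDataDirectories;

    public ColdDirectoryTombstoneGuard(List<String> coldDataDirectories)
    {
        this.coldDataDirectories = coldDataDirectories.stream()
                                                      .map(Paths::get)
                                                      .map(Path::toAbsolutePath)
                                                      .map(Path::normalize)
                                                      .collect(Collectors.toList());
    }

    // True if the SSTable's path sits under one of the cold directories.
    public boolean isInColdDirectory(Path sstablePath)
    {
        Path normalized = sstablePath.toAbsolutePath().normalize();
        for (Path cold : coldDataDirectories)
            if (normalized.startsWith(cold))
                return true;
        return false;
    }

    // Regardless of gc_grace, tombstones in hot SSTables are never purgeable,
    // so tomb(x) on the SSD cannot be dropped while x still sits on the HDD.
    public boolean canDropTombstone(Path sstablePath, int localDeletionTime, int gcBefore)
    {
        return isInColdDirectory(sstablePath) && localDeletionTime < gcBefore;
    }
}
{code}

Because the check is purely path-based, each per-directory CS instance can evaluate it against only its own SSTables; the hot-side instances simply never purge tombstones, which is what closes the hole in steps 1-5 above.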
Caveats with changes to flags/properties:

* Removing the cold directories from the YAML means we have lost the data in those directories.
* Removing the cold option from the table only means data will no longer be archived to cold. Existing SSTables in the cold directory should still be loaded; however, once compacted they move back to hot storage.
* Reducing the archive time on the table just causes more data to be moved to the cold directory.
* Increasing the archive time means existing data that should no longer be archived could come back to the live set when compacted; otherwise it stays in the cold directory with no negative impact.
* When promoting an SSTable to the cold directory, we need to check that its max timestamp is not greater than the minimum timestamp of any overlapping SSTable - the same check TWCS uses for fully expired SSTables (a rough sketch follows at the end of this message).

There will still be significant I/O when it comes to compacting/repairing/streaming the SSTables in the cold directory, and it adds reasonable complexity to the code base. It's not trivial to reason about either - it took my colleagues and me 3 hours. The only leftover question we had: when the table-level property is changed, does Cassandra need to be restarted for it to take effect, or is there a hook/property that is checked continuously?

If you have time, has anybody noticed anything we missed, or do you have any thoughts so far on the feature itself and the value it adds for the complexity introduced, before we go ahead with it? It will be really appreciated! [~krummas] [~bdeggleston] [~jjirsa] [~stone]

> Make it possible to move non-compacting sstables to slow/big storage in DTCS
> -----------------------------------------------------------------------------
>
>          Key: CASSANDRA-8460
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-8460
>      Project: Cassandra
>   Issue Type: Improvement
>     Reporter: Marcus Eriksson
>     Priority: Major
>       Labels: doc-impacting, dtcs
>      Fix For: 4.x
>
> It would be nice if we could configure DTCS to have a set of extra data directories where we move the sstables once they are older than max_sstable_age_days.
> This would enable users to have a quick, small SSD for hot, new data, and big spinning disks for data that is rarely read and never compacted.
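Regarding the promotion check in the last caveat above, here is a rough sketch of what that eligibility test could look like, mirroring the fully-expired check in TWCS. SSTableInfo and every field and parameter name in it are hypothetical stand-ins for the real SSTable metadata, and the exact timestamp semantics would have to follow whatever the compaction code already uses:

{code:java}
import java.util.Collection;

// Hypothetical sketch of the "can this SSTable move to cold storage?" test.
public final class ColdPromotionCheck
{
    // Minimal stand-in for the per-SSTable metadata the check needs;
    // not a real Cassandra class.
    public static final class SSTableInfo
    {
        final long minTimestampMicros;
        final long maxTimestampMicros;
        final long newestDataWrittenAtMillis;

        SSTableInfo(long minTimestampMicros, long maxTimestampMicros, long newestDataWrittenAtMillis)
        {
            this.minTimestampMicros = minTimestampMicros;
            this.maxTimestampMicros = maxTimestampMicros;
            this.newestDataWrittenAtMillis = newestDataWrittenAtMillis;
        }
    }

    // A candidate may move to the cold directory only if it is older than the
    // archive threshold and every overlapping SSTable left behind contains
    // strictly newer data, mirroring the TWCS fully-expired check.
    public static boolean canPromote(SSTableInfo candidate,
                                     Collection<SSTableInfo> overlapping,
                                     long archiveAgeMillis,
                                     long nowMillis)
    {
        // Age gate: the proposed "move to cold after N days" table option.
        if (nowMillis - candidate.newestDataWrittenAtMillis < archiveAgeMillis)
            return false;

        // Overlap gate: if the candidate's max timestamp is greater than or equal
        // to the minimum timestamp of any overlapping SSTable, defer promotion.
        for (SSTableInfo other : overlapping)
            if (candidate.maxTimestampMicros >= other.minTimestampMicros)
                return false;

        return true;
    }
}
{code}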