[ 
https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346255#comment-16346255
 ] 

Lerh Chuan Low commented on CASSANDRA-8460:
-------------------------------------------

I've tentatively started work on this, and it's turning out to be a bigger 
code change than I originally expected, so I'd really appreciate feedback from 
those in the community who know this code better (and a review of my initial 
patches). 

{{CompactionAwareWriter}}, {{DiskBoundaryManager}}, {{Directories}} and 
{{CompactionStrategyManager}} all need to know about archives. I've gone ahead 
and created a new enum, {{DirectoryType}}, that can be either ARCHIVE or 
STANDARD. 

{{CompactionAwareWriter}} always calls {{maybeSwitchWriter(DecoratedKey)}} 
before calling {{realAppend}}. This handles the JBOD case: 
{{maybeSwitchWriter}} makes the writer write to the right location for the 
key, so that keys do not overlap across directories. The writer therefore 
needs to know which {{diskBoundaries}} it is actually using; otherwise it 
can't differentiate between an archive disk and a regular JBOD disk. 
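
To make the boundary lookup concrete, here is a minimal, self-contained sketch 
of a writer that switches directories per key, with boundaries tracked 
separately per directory type. Everything in it ({{BoundarySketch}}, 
{{SwitchingWriterSketch}}, tokens as plain {{long}} values) is a hypothetical 
stand-in for illustration, not Cassandra's actual classes or the code in the 
patch.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not Cassandra's real classes.
enum DirectoryType { STANDARD, ARCHIVE }

final class BoundarySketch {
    // For each directory type: inclusive upper token bound of each directory, ascending.
    private final Map<DirectoryType, long[]> upperBounds = new EnumMap<>(DirectoryType.class);

    void setBoundaries(DirectoryType type, long... bounds) {
        upperBounds.put(type, bounds.clone());
    }

    // Index of the directory (within the given type) that owns the token.
    int directoryIndexFor(DirectoryType type, long token) {
        long[] bounds = upperBounds.get(type);
        if (bounds == null || bounds.length == 0)
            throw new IllegalStateException("No directories configured for " + type);
        for (int i = 0; i < bounds.length; i++)
            if (token <= bounds[i])
                return i;
        return bounds.length - 1; // defensive: the last bound normally covers the max token
    }
}

final class SwitchingWriterSketch {
    private final BoundarySketch boundaries;
    private final DirectoryType type;   // the writer knows which set of boundaries it uses
    private int currentIndex = -1;
    final List<String> log = new ArrayList<>();

    SwitchingWriterSketch(BoundarySketch boundaries, DirectoryType type) {
        this.boundaries = boundaries;
        this.type = type;
    }

    // Mirrors the "maybeSwitchWriter, then append" ordering described above.
    void append(long token) {
        maybeSwitchWriter(token);
        log.add(type + "[" + currentIndex + "] <- " + token);
    }

    private void maybeSwitchWriter(long token) {
        int index = boundaries.directoryIndexFor(type, token);
        if (index != currentIndex)
            currentIndex = index; // a real writer would open a new SSTable writer here
    }
}
```

Because the writer is constructed with a {{DirectoryType}}, the same token 
resolves against a different set of directories depending on whether the 
writer targets archive or standard storage.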

It would be wise to re-use the logic in {{diskBoundaries}} to also handle the 
case when the archive directory has been configured as JBOD, so 
{{DiskBoundaryManager}} now also needs to know about archive directories. When 
it tries to {{getWriteableLocations}} or generate disk boundaries, it should be 
able to differentiate between archive and non-archive. 

The same goes for {{CompactionStrategyManager}}. We still need to be able to 
run separate compaction strategy instances in the archive directory to handle 
the case of repairs and streaming (so archived SSTables don't just accumulate 
indefinitely). This is where I'm not sure how to proceed. 

Option 1: 
Have it so that {{ColumnFamilyStore}} still maintains only one 
{{CompactionStrategyManager}} (CSM), one {{DiskBoundaryManager}} (DBM) and one 
{{Directories}}. CSM, DBM and {{Directories}} all become aware of the archive 
directory; this can be either an extra field or an EnumMap:

{code}
new EnumMap<Directories.DirectoryType, DiskBoundaries>(Directories.DirectoryType.class) {{
    put(Directories.DirectoryType.STANDARD, cfs.getDiskBoundaries(Directories.DirectoryType.STANDARD));
    put(Directories.DirectoryType.ARCHIVE, cfs.getDiskBoundaries(Directories.DirectoryType.ARCHIVE));
}}
{code}

My worry here is that some things may subtly break even as I fix up 
everything else that gets logged as errors. The CSM's own internal fields 
{{repaired}}, {{unrepaired}} and {{pendingRepaired}} will also need to become 
maps; otherwise the individual strategy instances will again be unable to 
differentiate between a regular JBOD disk and an archive disk. Some of the 
APIs (e.g. reload, shutdown, enable) will need to know which directory type 
is meant, although in some cases it won't matter. Every consumer of these 
APIs will also need to be updated. 
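
As a rough illustration of what "fields become maps" could look like, the 
sketch below keys the per-state lists by directory type. The class name 
{{PerTypeStrategiesSketch}}, the use of plain strings for SSTables, and the 
method names are all assumptions made for the example; the real CSM fields 
hold compaction strategy instances, not strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Option 1: one manager whose state is keyed by directory type.
final class PerTypeStrategiesSketch {
    enum DirectoryType { STANDARD, ARCHIVE }

    // Formerly single lists; now one list per directory type.
    private final Map<DirectoryType, List<String>> repaired = new EnumMap<>(DirectoryType.class);
    private final Map<DirectoryType, List<String>> unrepaired = new EnumMap<>(DirectoryType.class);

    PerTypeStrategiesSketch() {
        // Populate in the constructor rather than via double-brace initialization,
        // which would create an anonymous EnumMap subclass.
        for (DirectoryType type : DirectoryType.values()) {
            repaired.put(type, new ArrayList<>());
            unrepaired.put(type, new ArrayList<>());
        }
    }

    // APIs that care about the directory type take it as an explicit argument.
    void addSSTable(DirectoryType type, boolean isRepaired, String name) {
        (isRepaired ? repaired : unrepaired).get(type).add(name);
    }

    List<String> sstables(DirectoryType type, boolean isRepaired) {
        return Collections.unmodifiableList((isRepaired ? repaired : unrepaired).get(type));
    }

    // APIs where the type does not matter simply cover every type.
    int totalSSTables() {
        int total = 0;
        for (DirectoryType type : DirectoryType.values())
            total += repaired.get(type).size() + unrepaired.get(type).size();
        return total;
    }
}
```

The cost shows up at the call sites: every caller of a type-sensitive method 
now has to decide which {{DirectoryType}} it means.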

Here's what it looks like in a first pass: 
https://github.com/apache/cassandra/compare/trunk...juiceblender:cassandra-8460?expand=1

Option 2:
Have it so that {{ColumnFamilyStore}} keeps 2 CSMs and 2 DBMs, of which the 
archiving equivalents are {{null}} if not applicable/reloaded. In this case 
there's a reasonable level of confidence that each CSM and DBM will just 'do 
the right thing', regardless of whether it's managing an archive or not. The 
downside is that every call site that gets the DBM or CSM (and there are a 
lot for the CSM) will need to be evaluated and checked. 
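
A minimal sketch of this second shape, assuming a hypothetical 
{{TwoManagersSketch}} in place of {{ColumnFamilyStore}} and a trivial 
{{Manager}} in place of the CSM. It wraps the nullable archive manager in an 
{{Optional}} so that each call site is forced to handle the "no archive" case 
explicitly; that wrapping is a choice made for the example, not something 
taken from the patch.

```java
import java.util.Optional;

// Hypothetical sketch of Option 2: two independent managers, the archive one optional.
final class TwoManagersSketch {
    enum DirectoryType { STANDARD, ARCHIVE }

    // Stand-in for a manager that only ever sees one kind of directory.
    static final class Manager {
        final DirectoryType type;
        boolean enabled = true;

        Manager(DirectoryType type) {
            this.type = type;
        }
    }

    private final Manager standard = new Manager(DirectoryType.STANDARD);
    private final Manager archive; // null when archiving is not configured

    TwoManagersSketch(boolean archivingConfigured) {
        this.archive = archivingConfigured ? new Manager(DirectoryType.ARCHIVE) : null;
    }

    Manager standardManager() {
        return standard;
    }

    // Call sites must decide what to do when there is no archive manager.
    Optional<Manager> archiveManager() {
        return Optional.ofNullable(archive);
    }

    // Example of an operation that has to be applied to both managers.
    void disableAll() {
        standard.enabled = false;
        archiveManager().ifPresent(m -> m.enabled = false);
    }
}
```

Each manager stays simple because it manages a single directory type; the 
complexity moves to the callers, which must be audited one by one.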

Here's what it looks like in a first pass: 
https://github.com/apache/cassandra/compare/trunk...juiceblender:cassandra-8460-single-csm?expand=1

Both still need work (Scrubber, relocating SSTables, what happens when 
archiving is turned off, etc.), but before I continue down this track, I'm 
wondering if anyone can point out which way is better, or whether this is all 
misguided. And if these are the changes that need to happen (I can't seem to 
find a way for just TWCS to be aware that there's an archive directory; CFS 
needs to know as well), is this still worth the complexity introduced? 

[~pavel.trukhanov] Re "Why can't we simply allow a CS instance to spread across 
two disks - SSD and corresponding archival HDD" -> I think in this case you're 
back in the situation where data can be resurrected. Other replicas can 
compact away tombstones (because the CS can see both directories), and then 
the last remaining replica, before it manages to do the same, has the SSD 
holding the tombstone become corrupted. Upon replacing the SSD with a new one 
and issuing repair, the deleted data that the tombstone was shadowing is 
resurrected. Of course, this can be mitigated by making it clear to operators 
that every time there's a corrupt disk, every single disk needs to be 
replaced. 

Even if we did so, there would still be large code changes to let CSM and DBM 
tell whether the other directory they are managing is really a JBOD disk or 
an archive disk. 

> Make it possible to move non-compacting sstables to slow/big storage in DTCS
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8460
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8460
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Eriksson
>            Priority: Major
>              Labels: doc-impacting, dtcs
>             Fix For: 4.x
>
>
> It would be nice if we could configure DTCS to have a set of extra data 
> directories where we move the sstables once they are older than 
> max_sstable_age_days. 
> This would enable users to have a quick, small SSD for hot, new data, and big 
> spinning disks for data that is rarely read and never compacted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
