[jira] [Commented] (CASSANDRA-15379) Make it possible to flush with a different compression strategy than we compact with

Joey Lynch (Jira) Mon, 04 Nov 2019 11:25:31 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966938#comment-16966938
 ]


Joey Lynch commented on CASSANDRA-15379:
----------------------------------------

My rationale for the {{EnumSet}} over a boolean member function is:
 # Versus the boolean function idea, it doesn't break the {{ICompressor}} 
abstraction by letting compressors know that flushes exist. An {{ICompressor}} 
author can easily claim to be good at {{FAST_COMPRESSION}}, but probably 
can't make the call on whether that should be used in flushes or other 
situations. I could have an {{isFastCompressor}} boolean function, but given 
that {{ICompressor}} is a public API interface I think sets of capabilities 
will be more maintainable than a collection of boolean functions going 
forward, especially if we start adding more capabilities (see #2).
 # If we go down the path of _not_ adding more knobs and instead have the 
database figure out the best way to compress data for users, this is easier to 
maintain long term, since compressors can offer multiple types of hints to the 
database. For example, the database might refuse to use slow compressors in 
flushes, commitlogs, etc., or compaction strategies might opt into higher-ratio 
compression in higher "levels". If we do go down this path there are 
fewer interface changes (instead of adding and removing functions we just add 
{{ICompressor.Uses}} hints).
 # Versus the set of strings idea, it has compile-time checks, which are useful 
(the lack of them is the primary argument against sets of strings afaik).
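
A minimal sketch of the capability-set idea follows. {{HintedCompressor}} is a hypothetical stand-in, not the real {{ICompressor}} interface, and the {{GENERAL}} constant is an assumption made for illustration; only {{FAST_COMPRESSION}}, {{Uses}} and {{suitableUses}} are names taken from this discussion.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for ICompressor, reduced to the hint method only.
interface HintedCompressor {
    // Capabilities a compressor can advertise. GENERAL is an assumed catch-all.
    enum Uses { GENERAL, FAST_COMPRESSION }

    // By default a compressor claims nothing beyond general-purpose use.
    default Set<Uses> suitableUses() {
        return EnumSet.of(Uses.GENERAL);
    }
}

// A compressor whose author claims it compresses quickly.
class FastSketchCompressor implements HintedCompressor {
    @Override
    public Set<Uses> suitableUses() {
        return EnumSet.of(Uses.GENERAL, Uses.FAST_COMPRESSION);
    }
}

// A compressor that keeps the default hints.
class SlowSketchCompressor implements HintedCompressor { }

class CapabilityDemo {
    public static void main(String[] args) {
        HintedCompressor fast = new FastSketchCompressor();
        HintedCompressor slow = new SlowSketchCompressor();
        // prints "true"
        System.out.println(fast.suitableUses().contains(HintedCompressor.Uses.FAST_COMPRESSION));
        // prints "false"
        System.out.println(slow.suitableUses().contains(HintedCompressor.Uses.FAST_COMPRESSION));
    }
}
```

Adding a new capability in this shape means adding an enum constant rather than another method on the public interface, and a misspelled capability fails at compile time.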

After thinking about this problem space more, I'm no longer convinced that 
giving general users more knobs here (the table properties) is the right 
choice. By using a {{suitableUses}} hint the database can internally 
optimize:
 * Flushes: "get this data off my heap as fast as possible". We don't care 
about ratio (since the products will be re-compacted shortly) or decompression 
speed; we only care about compression speed.
 * Commitlog: "some compression is nice but get this data off my heap fast". We 
mostly care about compression speed, and only slightly about ratio.
 * Compaction: "the older the data, the more compressed it should be". We care a 
lot about decompression speed and ratio, but don't want to pick expensive 
compressors at the high-churn points (L0 in LCS, small tables in STCS, before 
the time window bucket in TWCS).
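
One way the database could act on these hints is sketched below. Everything here is hypothetical: the {{Use}}, {{Operation}}, {{CompressorInfo}} and {{CompressorSelector}} names are invented for this sketch, the "LZ4" and "Zstd" labels are plain strings rather than real compressor classes, and the policy is deliberately simplistic (it ignores compaction levels and treats flush and commitlog the same way).

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical capability hints; only FAST_COMPRESSION is named in the discussion.
enum Use { GENERAL, FAST_COMPRESSION }

// Hypothetical operations the database compresses data for.
enum Operation { FLUSH, COMMITLOG, COMPACTION }

// Minimal stand-in for a compressor: a label plus the hints it advertises.
final class CompressorInfo {
    final String name;
    final Set<Use> suitableUses;

    CompressorInfo(String name, Set<Use> suitableUses) {
        this.name = name;
        this.suitableUses = suitableUses;
    }
}

final class CompressorSelector {
    // Use the configured compressor unless the operation is speed-sensitive and
    // the compressor does not claim FAST_COMPRESSION; then use the fast default.
    static CompressorInfo choose(Operation op, CompressorInfo configured, CompressorInfo fastDefault) {
        boolean speedSensitive = op == Operation.FLUSH || op == Operation.COMMITLOG;
        if (speedSensitive && !configured.suitableUses.contains(Use.FAST_COMPRESSION)) {
            return fastDefault;
        }
        return configured;
    }

    public static void main(String[] args) {
        CompressorInfo lz4 = new CompressorInfo("LZ4", EnumSet.of(Use.GENERAL, Use.FAST_COMPRESSION));
        CompressorInfo zstd = new CompressorInfo("Zstd", EnumSet.of(Use.GENERAL));
        // The table is configured with the slower compressor.
        System.out.println(choose(Operation.FLUSH, zstd, lz4).name);      // prints "LZ4"
        System.out.println(choose(Operation.COMPACTION, zstd, lz4).name); // prints "Zstd"
    }
}
```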

The interface still gives advanced users a backdoor: they can extend the 
compressor whose behavior they want to change and override the capabilities it 
offers.
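
A rough sketch of that backdoor, under the assumption that hints are exposed through an overridable {{suitableUses}} method. {{Hint}}, {{BaseSketchCompressor}} and {{AlwaysFastCompressor}} are hypothetical names used only for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical capability hints.
enum Hint { GENERAL, FAST_COMPRESSION }

// Hypothetical compressor that does not claim to be fast.
class BaseSketchCompressor {
    Set<Hint> suitableUses() {
        return EnumSet.of(Hint.GENERAL);
    }
}

// An advanced user opts the same compressor into speed-sensitive paths by
// widening the hints it advertises; the compression logic itself is inherited.
class AlwaysFastCompressor extends BaseSketchCompressor {
    @Override
    Set<Hint> suitableUses() {
        Set<Hint> uses = EnumSet.copyOf(super.suitableUses());
        uses.add(Hint.FAST_COMPRESSION);
        return uses;
    }
}
```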

> Make it possible to flush with a different compression strategy than we 
> compact with
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15379
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15379
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction, Local/Config, Local/Memtable
>            Reporter: Joey Lynch
>            Assignee: Joey Lynch
>            Priority: Normal
>
> [~josnyder] and I have been testing out CASSANDRA-14482 (Zstd compression) on 
> some of our most dense clusters and have been observing close to 50% 
> reduction in footprint with Zstd on some of our workloads! Unfortunately 
> though we have been running into an issue where the flush might take so long 
> (Zstd is slower to compress than LZ4) that we can actually block the next 
> flush and cause instability.
> Internally we are working around this with a very simple patch which flushes 
> SSTables as the default compression strategy (LZ4) regardless of the table 
> params. This is a simple solution, but I think the ideal solution might 
> be for the flush compression strategy to be configurable separately from the 
> table compression strategy (while defaulting to the same thing). Instead of 
> adding yet another compression option to the yaml (like hints and commitlog) 
> I was thinking of just adding it to the table parameters and then adding a 
> {{default_table_parameters}} yaml option like:
> {noformat}
> # Default table properties to apply on freshly created tables. The currently
> # supported defaults are:
> # * compression       : How are SSTables compressed in general (flush,
> #                       compaction, etc ...)
> # * flush_compression : How are SSTables compressed as they flush
> default_table_parameters:
>   compression:
>     class_name: 'LZ4Compressor'
>     parameters:
>       chunk_length_in_kb: 16
>   flush_compression:
>     class_name: 'LZ4Compressor'
>     parameters:
>       chunk_length_in_kb: 4
> {noformat}
> This would also have the nice effect of giving our configuration a path 
> toward user-specified defaults for table creation (so e.g. if a 
> particular user wanted to use a different default chunk_length_in_kb they can 
> do that).
> So the proposed (~mandatory) scope is:
> * Flush with a faster compression strategy
> I'd like to implement the following at the same time:
> * Per table flush compression configuration
> * Ability to default the table flush and compaction compression in the yaml.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
