[jira] [Comment Edited] (CASSANDRA-15379) Make it possible to flush with a different compression strategy than we compact with

2020-04-21 Thread Joey Lynch (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088266#comment-17088266 ]

Joey Lynch edited comment on CASSANDRA-15379 at 4/21/20, 11:21 PM:
---

*Zstd Defaults Benchmark:*
 * Load pattern: 1.2k wps and 1.2k rps at LOCAL_ONE consistency with a random 
load pattern.
 * Data sizing: ~100 million partitions with 2 rows each of 10 columns, total 
size per partition of about 4 KiB of random data. ~120 GiB per node data size 
(replicated 6 ways)
 * Compaction settings: LCS with size=320MiB, fanout=20
 * Compression: Zstd with 16 KiB block size

I had to tweak some settings to make compaction a smaller fraction of the overall 
trace (it was 50% or more of the traces), since it was hiding the flush behavior. 
Specifically, I increased the size of the memtable before flush by raising the 
{{memtable_cleanup_threshold}} setting from 0.11 to 0.5 (yaml sketch below), which 
allowed flushes to get up to 1.4 GiB, and by setting compaction to defer the 
L0 -> L1 transition as long as possible:
{noformat}
compaction = {'class': 'LeveledCompactionStrategy', 'fanout_size': '20', 
'max_threshold': '128', 'min_threshold': '32', 'sstable_size_in_mb': '320'}
compression = {'chunk_length_in_kb': '16', 'class': 
'org.apache.cassandra.io.compress.ZstdCompressor'}
{noformat}
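For reference, the memtable side of that tweak is a one-line cassandra.yaml change, 
sketched below (0.11 is the computed default of 1 / (memtable_flush_writers + 1), 
i.e. presumably 8 flush writers on these 8-vCPU nodes):
{noformat}
# cassandra.yaml (sketch)
# computed default is 1 / (memtable_flush_writers + 1), ~0.11 here
memtable_cleanup_threshold: 0.5
{noformat}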
I would have preferred to raise fanout_size even further to defer compactions more, 
but with the larger memtable size plus the increased sstable size and fanout I was 
able to reduce the compaction load to the point where the cluster was stable 
(pending compactions not growing without bound) on both baseline and candidate.

*Zstd Defaults Benchmark Results*:

Candidate flushes were spaced about 4 minutes apart and took about 8 seconds to 
flush 1.4 GiB. Flamegraphs show 50% of on-CPU time in the flush writer and ~45% in 
compression. [^15379_candidate_flush_trace.png]

Baseline flushes were spaced about 4 minutes apart and took about 22 seconds to 
flush 1.4 GiB (roughly 2.75x slower than the candidate). Flamegraphs show 20% of 
on-CPU time in the flush writer and ~75% in compression. 
[^15379_baseline_flush_trace.png]

No significant change in coordinator-level or replica-level latency or in system 
metrics. Some latencies were better on the candidate, some worse. 
[^15379_system_zstd_defaults.png] [^15379_coordinator_zstd_defaults.png] 
[^15379_replica_zstd_defaults.png]

I think the main finding here is that, even with the cheapest zstd level, we are 
already running closer to the flush interval than I'd like (if a flush takes longer 
than the interval until the next flush, it's bad news bears for the cluster), and 
this is with a relatively small number of writes per second (~400 coordinator 
writes per second per node).

*Next steps:*

I've published a final squashed commit to:
||trunk||
|[657c39d4|https://github.com/jolynch/cassandra/commit/657c39d4aba0888c6db6a46d1b1febf899de9578]|
|[branch|https://github.com/apache/cassandra/compare/trunk...jolynch:CASSANDRA-15379-final]|
|[!https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-15379-final.png?circle-token=1102a59698d04899ec971dd36e925928f7b521f5!|https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-15379-final]|

There appear to be a lot of failures in java8 runs that I'm pretty sure are 
unrelated to my change (unit tests and in-jvm dtests passed, along with long 
unit tests). I'll look into all the failures and make sure they're unrelated 
(on a related note I'm :( that trunk is so red again).

I am now running a test with Zstd compression set to a block size of 256 KiB 
and level 10, which is how we typically run it in production for write-mostly, 
read-rarely datasets such as trace data (for the significant reduction in disk 
space).
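
For that run the table compression parameters look roughly like the following 
(a sketch; {{ZstdCompressor}} takes a {{compression_level}} option, and the chunk 
length is bumped to 256 KiB as described above):
{noformat}
compression = {'chunk_length_in_kb': '256', 'class': 
'org.apache.cassandra.io.compress.ZstdCompressor', 'compression_level': '10'}
{noformat}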


[jira] [Comment Edited] (CASSANDRA-15379) Make it possible to flush with a different compression strategy than we compact with

2020-04-19 Thread Joey Lynch (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087187#comment-17087187 ]

Joey Lynch edited comment on CASSANDRA-15379 at 4/19/20, 9:44 PM:
--

Alright, I finally fixed our internal trunk build so we can do performance 
validations again. I ran the following performance benchmark and the results 
are essentially identical for the default configuration (so this tests _just_ the 
addition of the NoopCompressor on the megamorphic call sites).

*Experimental Setup:*

A baseline and candidate cluster of EC2 machines running the following:
 * C* cluster: 3x3 (us-east-1 and eu-west-1) i3.2xlarge
 * Load cluster: 3 m5.2xlarge nodes running ndbench in us-east-1, generating a 
consistent load against the cluster
 * Baseline C* version: Latest trunk (b05fe7ab)
 * Candidate C* version: The proposed patch applied to the same version of trunk
 * Relevant system configuration: Ubuntu xenial running Linux 4.15, with the kyber 
IO scheduler (vs noop), 32 KiB readahead (vs 128 KiB), and the tc-fq network qdisc 
(vs pfifo_fast)
 * Relevant JVM configuration: 12 GiB heap size

In all cases load is applied and then we wait for metrics to settle, especially 
things like pending compactions, read/write latencies, p99 latencies, etc ...

*Defaults Benchmark:*
 * Load pattern: 1.2k wps and 1.2k rps at LOCAL_ONE consistency with a random 
load pattern.
 * Data sizing: 10 million partitions with 2 rows each of 10 columns, total 
size per partition of about 10 KiB of random data. ~100 GiB per node data size 
(replicated 6 ways)
 * Compaction settings: LCS with size=256MiB, fanout=20
 * Compression: LZ4 with 16 KiB block size (table parameters sketched below)
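
In table-parameter terms the benchmark table is configured roughly as follows 
(a sketch assembled from the bullets above):
{noformat}
compaction = {'class': 'LeveledCompactionStrategy', 'fanout_size': '20', 
'sstable_size_in_mb': '256'}
compression = {'chunk_length_in_kb': '16', 'class': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
{noformat}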

*Defaults Benchmark Results:*

We do not have data to support the hypothesis that the megamorphic call sites 
have become more expensive due to the addition of the NoopCompressor.

1. No significant change at the coordinator level (least relevant metric): 
[^15379_coordinator_defaults.png]
2. No significant change at the replica level (most relevant metric): 
[^15379_replica_defaults.png]
3. No significant change at the system resource level (second most relevant 
metrics): [^15379_system_defaults.png]

Our external flamegraph exports appear to be broken, but I looked at them and 
they also show no noticeable difference (I'll work with our performance team to 
fix the exports so I can share the data here).

*Next steps for me:*
 * Squash, rebase, and re-run unit and dtests with latest trunk in preparation 
for commit
 * Run a benchmark of {{ZstdCompressor}} with and without the patch; we expect to 
see reduced CPU usage during flushes. I will likely have to reduce the 
read/write throughput due to compactions taking a crazy amount of our on-CPU 
time with this configuration.



[jira] [Comment Edited] (CASSANDRA-15379) Make it possible to flush with a different compression strategy than we compact with

2020-03-14 Thread Joey Lynch (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059531#comment-17059531 ]

Joey Lynch edited comment on CASSANDRA-15379 at 3/15/20, 2:03 AM:
--

Cool, took your changes and [rebased on trunk with a few 
fixups|https://github.com/apache/cassandra/compare/trunk...jolynch:CASSANDRA-15379].
 Tests are running now.

I am having some trouble with our performance integration suite for trunk right 
now, but should hopefully be able to run those performance tests on Monday.

Just to confirm, you would like performance numbers for a write-heavy test for the 
baseline (trunk without my patch):

* No compressor
* LZ4 Compressor
* Zstd Compressor

And the following candidates (per-variant table compression settings are sketched 
below):

* No compressor
* Noop compressor
* LZ4 compressor
* Zstd compressor
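
For concreteness, each variant corresponds roughly to the following table 
compression settings (a sketch; the {{NoopCompressor}} class name assumes the 
package used on my branch):
{noformat}
compression = {'enabled': 'false'}  -- no compressor
compression = {'class': 'org.apache.cassandra.io.compress.NoopCompressor'}  -- candidate only
compression = {'class': 'org.apache.cassandra.io.compress.LZ4Compressor', 'chunk_length_in_kb': '16'}
compression = {'class': 'org.apache.cassandra.io.compress.ZstdCompressor', 'chunk_length_in_kb': '16'}
{noformat}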


> Make it possible to flush with a different compression strategy than we 
> compact with
> 
>
> Key: CASSANDRA-15379
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15379
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Compaction, Local/Config, Local/Memtable
>Reporter: Joey Lynch
>Assignee: Joey Lynch
>Priority: Normal
> Fix For: 4.0-alpha
>
>
> [~josnyder] and I have been testing out CASSANDRA-14482 (Zstd compression) on 
> some of our most dense clusters and have been observing close to 50% 
> reduction in footprint with Zstd on some of our workloads! Unfortunately 
> though we have been running into an issue where the flush might take so long 
> (Zstd is slower to compress than LZ4) that we can actually block the next 
> flush and cause instability.
> Internally we are working around this with a very simple patch which flushes 
> SSTables with the default compression strategy (LZ4) regardless of the table 
> params. This is a simple solution, but I think the ideal solution might 
> be for the flush compression strategy to be configurable separately from the 
> table compression strategy (while defaulting to the same thing). Instead of 
> adding yet another compression option to the yaml (like hints and commitlog) 
> I was thinking of just adding it to the table parameters and then adding a 
> {{default_table_parameters}} yaml option like:
> {noformat}
> # Default table properties to apply on freshly created tables. The currently 
> supported defaults are:
> # * compression   : How are SSTables compressed in general (flush, 
> compaction, etc ...)
> # * flush_compression : How are SSTables compressed as they flush
> # supported
> default_table_parameters:
>   compression:
> class_name: 'LZ4Compressor'
> parameters:
>   chunk_length_in_kb: 16
>   flush_compression:
> class_name: 'LZ4Compressor'
> parameters:
>   chunk_length_in_kb: 4
> {noformat}
> This would have the nice effect as well of giving our configuration a path 
> forward to providing user specified defaults for table creation (so e.g. if a 
> particular user wanted to use a different default chunk_length_in_kb they can 
> do that).
> So the proposed (~mandatory) scope is:
> * Flush with a faster compression strategy
> I'd like to implement the following at the same time:
> * Per table flush compression configuration
> * Ability to default the table flush and compaction compression in the yaml.






[jira] [Comment Edited] (CASSANDRA-15379) Make it possible to flush with a different compression strategy than we compact with

2019-11-09 Thread Joey Lynch (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971010#comment-16971010 ]

Joey Lynch edited comment on CASSANDRA-15379 at 11/10/19 4:03 AM:
--

[~djoshi] per your feedback in slack I've added the ability for the user to 
control the flush via a yaml option while doing the right thing by default.
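Roughly, the knob looks like the following (a sketch; the exact option name and 
values are whatever is on the branch below, the idea being a yaml setting that 
controls how SSTables are compressed as they are flushed):
{noformat}
# cassandra.yaml (sketch of the flush knob)
#   none  - flush without compression (NoopCompressor, still block-checksummed)
#   fast  - flush with a fast compressor (LZ4) even if the table uses a heavier one
#   table - flush with the table's configured compression
flush_compression: fast
{noformat}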

||trunk||
|[branch|https://github.com/apache/cassandra/compare/trunk...jolynch:CASSANDRA-15379]|
|[!https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-15379.png?circle-token=1102a59698d04899ec971dd36e925928f7b521f5!|https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-15379]|

In order to implement the "don't compress during the flush" [option you 
suggested|https://the-asf.slack.com/archives/CK23JSY2K/p1572905922120300?thread_ts=1572905763.117000=CK23JSY2K]
 I figured that the easiest way was to just implement the simple 
[NoopCompressor|https://github.com/apache/cassandra/commit/9030d8abcf593c06e85f549947ad41621d4776d1]
 everyone has been mentioning for years. I was having a hard time turning off 
compression at the level of abstraction BigTableWriter operates at, since it 
doesn't control whether e.g. the compression offsets file gets written. This way, 
even if you select "none", your flush is still protected by block-level 
checksums. Separately, it gives us a good path forward for mitigating 
CASSANDRA-12682 and CASSANDRA-9264 if we want it to, I think.



[jira] [Comment Edited] (CASSANDRA-15379) Make it possible to flush with a different compression strategy than we compact with

2019-11-04 Thread Joey Lynch (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-15379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966938#comment-16966938 ]

Joey Lynch edited comment on CASSANDRA-15379 at 11/4/19 7:30 PM:
-

My rationale for the {{EnumSet}} over a boolean member function is:
 # Versus the boolean function idea, it doesn't break the {{ICompressor}} 
abstraction by letting compressors know that flushes exist. As in, it is very easy 
for an {{ICompressor}} author to claim to be good at {{FAST_COMPRESSION}}, but they 
probably can't make the call about whether that should be used for flushes or other 
situations. I could have an {{isFastCompressor}} boolean function, but given that 
{{ICompressor}} is a public API interface I think sets of capabilities will be 
more maintainable than a collection of boolean functions going forward, 
especially if we start adding more capabilities (see #2).
 # If we go down the path of _not_ making more knobs and instead just try to have 
the database figure out the best way to compress data for users, this is easier to 
maintain long term since compressors can offer multiple types of hints to the 
database. For example the database might refuse to use slow compressors in 
flushes, commitlogs, etc, or have compaction strategies opt into higher-ratio 
compression strategies in higher "levels". If we do go down this path there are 
fewer interface changes (instead of adding and removing functions we just add 
{{ICompressor.Uses}} hints).
 # Versus the set-of-strings idea, it has compile-time checks that are useful 
(which is the primary argument against sets of strings afaik).

After thinking about this problem space more I'm no longer convinced that 
giving general users more knobs here (the table properties) is the right choice. 
By using a {{suitableUses}} hint the database can, in future 4.x releases, 
internally optimize:
 * Flushes: "get this data off my heap as fast as possible". We don't care 
about ratio (since the products will be re-compacted shortly) or decompression 
speed; we only care about compression speed.
 * Commitlog: "some compression is nice, but get this data off my heap fast". We 
mostly care about compression speed, and only slightly about ratio.
 * Compaction: "the older the data, the more compressed it should be". We care a 
lot about decompression speed and ratio, but don't want to pick expensive 
compressors at the high-churn points (L0 in LCS, small tables in STCS, before 
the time window bucket in TWCS).

The interface still gives advanced users a backdoor (they can extend the compressor 
whose behavior they want to change and override what capabilities it offers).

edit: I pinged this ticket into 
[slack|https://the-asf.slack.com/archives/CK23JSY2K/p1572881897039500] to seek 
more feedback.

