[ https://issues.apache.org/jira/browse/CASSANDRA-15452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844835#comment-17844835 ]
Jon Haddad edited comment on CASSANDRA-15452 at 5/9/24 1:42 AM:
----------------------------------------------------------------

I should share some additional information and point out some things that might not be immediately obvious in the graphs about the EBS test.

* The GP3 EBS volume was configured with 3K IOPS and 256 MB/s of throughput.
* All tests were run with an empty page cache by doing {{echo 3 | sudo tee /proc/sys/vm/drop_caches}}, then a nodetool compact.
* Compaction was not throttled.
* Read ahead was set to 4KB.
* The EBS volume was, as expected, completely limited by IOPS: it peaks right at 3K, with read throughput on the drive around 18-20 MB/s.
* The spikes in writes, presumably from fsync, cause the dips in read performance.
* The patched version showed a significant reduction in IOPS usage, using only ~500 IOPS to achieve around 65 MB/s: *more than a 3x* improvement in throughput and a *6x reduction in IOPS*. Because compaction I/O competes with reads for disk resources, in a production environment this should lead to more predictable read latencies.
* Multiple compaction threads should be able to achieve an even more impressive improvement, since we haven't yet reached the resource limit of the underlying device.
* This patch should significantly improve node density when using EBS, as compaction is a significant bottleneck in that environment.

Some other general information:

* The patch does not yet take advantage of the fadvise call, which should help a production environment significantly by preventing compaction from polluting the page cache with pages that are about to be removed from the filesystem. A sketch of that call appears after this comment.

Overall I am very excited by the results we're seeing here and will follow up with additional testing.
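For context on that last bullet, here is a minimal sketch of how the fadvise call can be reached from Java via JNA direct mapping. This is illustrative only, not the patch's code: the class and method names are hypothetical, and obtaining the raw file descriptor from a FileChannel (typically via reflection or a native handle) is omitted.

{code:java}
import com.sun.jna.Native;

// Minimal sketch (not the patch): hint the kernel to evict pages that
// compaction has finished reading, via posix_fadvise(POSIX_FADV_DONTNEED).
// Assumes Linux and JNA on the classpath; the names here are illustrative.
public final class PageCacheAdvisor
{
    private static final int POSIX_FADV_DONTNEED = 4; // Linux value from <fcntl.h>

    static
    {
        Native.register("c"); // bind the native method below against libc
    }

    // int posix_fadvise(int fd, off_t offset, off_t len, int advice);
    // off_t is 64-bit on 64-bit Linux, hence the long parameters.
    private static native int posix_fadvise(int fd, long offset, long len, int advice);

    /**
     * Advise the kernel that [offset, offset + len) of the given file
     * descriptor will not be read again, e.g. a chunk of an SSTable that
     * compaction has consumed and is about to delete.
     */
    public static void dontNeed(int fd, long offset, long len)
    {
        int rc = posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
        if (rc != 0) // posix_fadvise returns the error number directly
            throw new IllegalStateException("posix_fadvise failed: errno " + rc);
    }
}
{code}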
> Improve disk access patterns during compaction and streaming
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-15452
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15452
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Local Write-Read Paths, Local/Compaction
>            Reporter: Jon Haddad
>            Assignee: Jordan West
>            Priority: Normal
>         Attachments: everyfs.txt, iostat-5.0-head.output, iostat-5.0-patched.output, iostat-ebs-15452.png, iostat-ebs-head.png, iostat-instance-15452.png, iostat-instance-head.png, results.txt, sequential.fio
>
> On read heavy workloads Cassandra performs much better when using a low read ahead setting. In my tests I've seen a 5x improvement in throughput and more than a 50% reduction in latency. However, I've also observed that it can have a negative impact on compaction and streaming throughput. It especially negatively impacts cloud environments where small reads incur high costs in IOPS due to tiny requests.
> # We should investigate using POSIX_FADV_DONTNEED on files we're compacting to see if we can improve performance and reduce page faults.
> # This should be combined with an internal read ahead style buffer that Cassandra manages, similar to a BufferedInputStream but with our own machinery. This buffer should read fairly large blocks of data off disk at a time. EBS, for example, allows 1 IOP to be up to 256KB. A considerable amount of time is spent in blocking I/O during compaction and streaming. Reducing the frequency we read from disk should speed up all sequential I/O operations. (A sketch of such a reader follows below.)
> # We can reduce system calls by buffering writes as well, but I think it will have less of an impact than the reads.
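To make point 2 concrete, here is a minimal sketch of a read-ahead-style sequential reader: it issues one large positional read per system call so that a single EBS IOP can carry a full 256KB block, regardless of the kernel read-ahead setting. This is not the committed implementation; the class name, block size, and API are assumptions for illustration.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch (not the committed implementation): a sequential reader
// that fetches one large block per system call, so a single EBS IOP can
// carry up to 256KB instead of a read-ahead-sized request.
public final class LargeBlockReader implements AutoCloseable
{
    private static final int BLOCK_SIZE = 256 * 1024; // one full-size EBS IOP

    private final FileChannel channel;
    private final ByteBuffer block = ByteBuffer.allocateDirect(BLOCK_SIZE);
    private long position; // file offset of the next block to fetch

    public LargeBlockReader(Path path) throws IOException
    {
        channel = FileChannel.open(path, StandardOpenOption.READ);
        block.limit(0); // start empty so the first read() triggers a fill
    }

    /** Copies up to dst.remaining() bytes; returns -1 at end of file. */
    public int read(ByteBuffer dst) throws IOException
    {
        if (!block.hasRemaining() && !fill())
            return -1;
        int copied = 0;
        while (dst.hasRemaining())
        {
            if (!block.hasRemaining() && !fill())
                break; // end of file; return what was copied so far
            int n = Math.min(dst.remaining(), block.remaining());
            ByteBuffer chunk = block.duplicate();
            chunk.limit(chunk.position() + n);
            dst.put(chunk);
            block.position(block.position() + n);
            copied += n;
        }
        return copied;
    }

    // Refill the block with one large positional read; false at end of file.
    private boolean fill() throws IOException
    {
        block.clear();
        int n = channel.read(block, position);
        if (n <= 0)
            return false;
        position += n;
        block.flip();
        return true;
    }

    @Override
    public void close() throws IOException
    {
        channel.close();
    }
}
{code}

A compaction-style scan would then loop over read(dst) with whatever destination buffer it already uses, while the underlying device is only hit once per 256KB block.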