[ https://issues.apache.org/jira/browse/CASSANDRA-15452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823374#comment-17823374 ]
Jon Haddad edited comment on CASSANDRA-15452 at 3/4/24 11:04 PM:
-----------------------------------------------------------------
I took a look at the I/O operations hitting the filesystem on the HEAD of the 5.0 branch, and I found our reads are fairly wasteful. I loaded a single node up with some data, started monitoring every operation, and then ran `nodetool compact`.

I monitored the I/O using this command:
{noformat}
sudo /usr/share/bcc/tools/xfsslower 0 -p 26988 | awk '$4 == "R" { print $0 }'
{noformat}
which let me watch every read at the filesystem level. Here's a sample of the output. I've added the headers back in for convenience, and I've also included some of the later I/O operations from the end of the compaction.
{noformat}
Tracing XFS operations
TIME     COMM           PID    T BYTES  OFF_KB  LAT(ms) FILENAME
22:27:38 CompactionExec 26988  R 4096   0       0.01    nb-7-big-Statistics.db
22:27:38 CompactionExec 26988  R 4096   4       0.00    nb-7-big-Statistics.db
22:27:38 CompactionExec 26988  R 2062   8       0.00    nb-7-big-Statistics.db
22:27:38 CompactionExec 26988  R 14907  0       0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14924  14      0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14896  29      0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14844  43      0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14923  58      0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14931  72      0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14905  87      0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14891  101     0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14919  116     0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14965  130     0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14918  145     0.01    nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14930  160     0.01    nb-7-big-Data.db
.....
22:27:39 CompactionExec 26988  R 4096   0       0.01    nb-7-big-Statistics.db
22:27:39 CompactionExec 26988  R 98     0       0.01    nb-9-big-Data.db
22:27:39 CompactionExec 26988  R 4096   0       0.01    nb-13-big-Statistics.db
22:27:39 CompactionExec 26988  R 4096   0       0.01    nb-4-big-Statistics.db
22:27:39 CompactionExec 26988  R 911    4       0.00    nb-4-big-Statistics.db
22:27:39 CompactionExec 26988  R 115    0       0.01    nb-4-big-Data.db
22:27:39 CompactionExec 26988  R 51     0       0.01    nb-9-big-Data.db
{noformat}
This table was using the default compression with a 16KB chunk length.
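For context, here's a minimal sketch of the access pattern the trace shows: one positioned read per compressed chunk, so a sequential scan of a Data.db file becomes a long run of roughly chunk-sized syscalls. This is illustrative only; the class and method names are made up rather than Cassandra's actual reader, and real chunks compress to slightly under the configured 16KB chunk length, which is why the trace shows ~14.9KB reads.
{noformat}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of chunk-at-a-time reading, not Cassandra's code.
public class ChunkAtATimeScan
{
    // Scan a data file by issuing one positioned read per compressed chunk.
    // Every iteration is a separate syscall, and on a cold cache a separate IOP.
    public static void scan(Path dataFile, long[] chunkOffsets, int maxChunkLength) throws IOException
    {
        try (FileChannel channel = FileChannel.open(dataFile, StandardOpenOption.READ))
        {
            ByteBuffer chunk = ByteBuffer.allocateDirect(maxChunkLength);
            for (long offset : chunkOffsets)
            {
                chunk.clear();
                channel.read(chunk, offset); // one read per chunk: this is what xfsslower sees
                chunk.flip();
                // decompress and merge the chunk here
            }
        }
    }
}
{noformat}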
When I altered the table to use a 4KB chunk length and reran the same commands, I saw this:
{noformat}
22:37:56 CompactionExec 26988  R 4096   0       0.01    nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 816    4       0.00    nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 102    0       0.01    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 4096   0       0.01    nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 4096   4       0.00    nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 2062   8       0.00    nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 3763   0       0.01    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3758   3       0.00    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3748   7       0.00    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3783   11      0.00    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3770   14      0.00    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3750   18      0.00    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3794   22      0.00    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3793   25      0.00    nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3748   29      0.00    nb-13-big-Data.db
{noformat}
Reading only one chunk at a time instead of a larger buffer means our reliance on the page cache for sequential reads is at odds with our need to minimize read amplification during read-heavy workloads. This includes LWT and counter workloads, since they perform a read before each write. With Accord coming up, we're going to want to ensure our users don't need to overpay for storage just for the sake of running compaction.

Here's a sample of what the I/O looks like at the block device level when running the compaction:
{noformat}
ubuntu@ip-172-31-38-58:~$ sudo /usr/share/bcc/tools/biosnoop -d xvdb | awk '$5 == "R" { print $0 }'
{noformat}
{noformat}
TIME(s)   COMM            PID    DISK  T SECTOR    BYTES  LAT(ms)
0.000000  CompactionExec  26988  xvdb  R 48340000  16384  0.26
0.002350  CompactionExec  26988  xvdb  R 48339842  512    0.23
0.003560  CompactionExec  26988  xvdb  R 48339872  4096   0.22
0.003788  CompactionExec  26988  xvdb  R 48339864  4096   0.18
0.004550  CompactionExec  26988  xvdb  R 48339841  512    0.16
0.004719  CompactionExec  26988  xvdb  R 48339843  512    0.14
0.004906  CompactionExec  26988  xvdb  R 48339856  4096   0.15
0.005077  CompactionExec  26988  xvdb  R 48339848  4096   0.14
0.005304  CompactionExec  26988  xvdb  R 48340288  16384  0.17
0.009510  CompactionExec  26988  xvdb  R 48340320  16384  0.19
0.016991  NonPeriodicTas  26988  xvdb  R 32258112  16384  0.20
0.028200  CompactionExec  26988  xvdb  R 32262176  16384  0.27
0.029466  CompactionExec  26988  xvdb  R 32258944  16384  0.26
0.031511  CompactionExec  26988  xvdb  R 32226562  512    0.21
0.038502  CompactionExec  26988  xvdb  R 32226592  4096   0.22
0.038848  CompactionExec  26988  xvdb  R 32226584  4096   0.24
0.039842  CompactionExec  26988  xvdb  R 32226561  512    0.14
0.040079  CompactionExec  26988  xvdb  R 32226563  512    0.14
{noformat}
Since every one of these disk operations counts as an IOP against the quota in EBS (or any cloud provider's disaggregated storage), this will burn through the allocated quota significantly faster than buffered reads would. Even in the best case of reading 16KB at a time, we'd use 16x the optimal number of IOPS, since a single EBS IOP can cover up to 256KB. This makes the cost of storage on cloud providers significantly higher than it needs to be.
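The direction proposed in this ticket's description is a read-ahead buffer that Cassandra manages itself. Here's a rough sketch of the idea, sized to the 256KB EBS IOP limit mentioned below. The names and structure are mine, it assumes a chunk never straddles a window boundary, and it glosses over short reads and EOF, all of which real code would need to handle.
{noformat}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch of a coarse-grained read-ahead buffer, not a patch.
public class ReadAheadChannel
{
    private static final int WINDOW_BYTES = 256 * 1024; // one EBS IOP can be up to 256KB

    private final FileChannel channel;
    private final ByteBuffer window = ByteBuffer.allocateDirect(WINDOW_BYTES);
    private long windowStart = -1; // file offset of window[0]; -1 means empty
    private int windowLength = 0;

    public ReadAheadChannel(FileChannel channel)
    {
        this.channel = channel;
    }

    // Serve a chunk-sized request from the buffered window, refilling with a
    // single large read only when the request falls outside the current window.
    public void read(ByteBuffer dst, long offset, int length) throws IOException
    {
        if (windowStart < 0 || offset < windowStart || offset + length > windowStart + windowLength)
            fill(offset);
        ByteBuffer view = window.duplicate();
        view.position((int) (offset - windowStart));
        view.limit(view.position() + length);
        dst.put(view); // memory copy, no syscall
    }

    private void fill(long offset) throws IOException
    {
        window.clear();
        windowLength = channel.read(window, offset); // one large syscall / IOP
        windowStart = offset;
    }
}
{noformat}
With 16KB chunks, each 256KB window turns 16 chunk reads into a single IOP, which is where the 16x figure above comes from.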
> Improve disk access patterns during compaction and streaming
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-15452
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15452
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Local Write-Read Paths, Local/Compaction
>            Reporter: Jon Haddad
>            Priority: Normal
>     Attachments: results.txt, sequential.fio
>
> On read-heavy workloads Cassandra performs much better when using a low read ahead setting. In my tests I've seen a 5x improvement in throughput and more than a 50% reduction in latency. However, I've also observed that it can have a negative impact on compaction and streaming throughput. It especially hurts cloud environments, where small reads incur high costs in IOPS due to tiny requests.
> # We should investigate using POSIX_FADV_DONTNEED on files we're compacting to see if we can improve performance and reduce page faults (a sketch follows below).
> # This should be combined with an internal read ahead style buffer that Cassandra manages, similar to a BufferedInputStream but with our own machinery. This buffer should read fairly large blocks of data off disk at a time. EBS, for example, allows 1 IOP to be up to 256KB. A considerable amount of time is spent in blocking I/O during compaction and streaming. Reducing the frequency we read from disk should speed up all sequential I/O operations.
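To make point 1 above concrete, here's a hedged sketch of calling posix_fadvise(POSIX_FADV_DONTNEED) from Java via JNA. This is not the proposed patch: the class and method names are made up, the constant is Linux's value, the off_t-to-long mapping assumes a 64-bit platform, and extracting the raw fd from a FileChannel is deliberately left out.
{noformat}
import com.sun.jna.Native;
import com.sun.jna.Platform;

// Hypothetical sketch only: drop already-compacted ranges from the page cache.
public final class PageCacheAdvisor
{
    static
    {
        Native.register(Platform.C_LIBRARY_NAME); // bind the native method below to libc
    }

    private static final int POSIX_FADV_DONTNEED = 4; // Linux value; not portable

    // int posix_fadvise(int fd, off_t offset, off_t len, int advice)
    // off_t mapped to long, which assumes a 64-bit Linux build.
    private static native int posix_fadvise(int fd, long offset, long len, int advice);

    // Hint that [offset, offset + len) of this fd won't be read again, so the
    // kernel can evict those pages instead of letting compaction churn the cache.
    public static void dontNeed(int fd, long offset, long len)
    {
        int rc = posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
        if (rc != 0)
            System.err.println("posix_fadvise returned " + rc); // returns the error code directly
    }
}
{noformat}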