[ https://issues.apache.org/jira/browse/CASSANDRA-15452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823374#comment-17823374 ]

Jon Haddad commented on CASSANDRA-15452:
----------------------------------------

I took a look at the I/O operations hitting the filesystem at the HEAD of the 
5.0 branch, and I found our reads are fairly wasteful.  I loaded a single node 
up with some data, started monitoring every operation, and then ran `nodetool 
compact`.

I monitored the I/O using this command:
{noformat}
sudo /usr/share/bcc/tools/xfsslower 0 -p 26988 | awk '$4 == "R" { print $0 }'
{noformat}
which allowed me to monitor all reads at the filesystem level.

Here's a sample of the output.  I've added the headers back in for convenience, 
and I've taken the liberty of showing some of the later I/O operations that 
occur at the end.

{noformat}
Tracing XFS operations
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
22:27:38 CompactionExec 26988  R 4096    0           0.01 nb-7-big-Statistics.db
22:27:38 CompactionExec 26988  R 4096    4           0.00 nb-7-big-Statistics.db
22:27:38 CompactionExec 26988  R 2062    8           0.00 nb-7-big-Statistics.db
22:27:38 CompactionExec 26988  R 14907   0           0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14924   14          0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14896   29          0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14844   43          0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14923   58          0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14931   72          0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14905   87          0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14891   101         0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14919   116         0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14965   130         0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14918   145         0.01 nb-7-big-Data.db
22:27:38 CompactionExec 26988  R 14930   160         0.01 nb-7-big-Data.db
.....
22:27:39 CompactionExec 26988  R 4096    0           0.01 nb-7-big-Statistics.db
22:27:39 CompactionExec 26988  R 98      0           0.01 nb-9-big-Data.db
22:27:39 CompactionExec 26988  R 4096    0           0.01 nb-13-big-Statistics.db
22:27:39 CompactionExec 26988  R 4096    0           0.01 nb-4-big-Statistics.db
22:27:39 CompactionExec 26988  R 911     4           0.00 nb-4-big-Statistics.db
22:27:39 CompactionExec 26988  R 115     0           0.01 nb-4-big-Data.db
22:27:39 CompactionExec 26988  R 51      0           0.01 nb-9-big-Data.db
{noformat}
This table was using the default compression with a 16KB chunk length.  When I 
altered the table to use a 4KB chunk length and ran the same commands, I saw 
this:

{noformat}
22:37:56 CompactionExec 26988  R 4096    0           0.01 nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 816     4           0.00 nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 102     0           0.01 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 4096    0           0.01 nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 4096    4           0.00 nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 2062    8           0.00 nb-13-big-Statistics.db
22:37:56 CompactionExec 26988  R 3763    0           0.01 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3758    3           0.00 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3748    7           0.00 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3783    11          0.00 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3770    14          0.00 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3750    18          0.00 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3794    22          0.00 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3793    25          0.00 nb-13-big-Data.db
22:37:56 CompactionExec 26988  R 3748    29          0.00 nb-13-big-Data.db
{noformat}
Reading in only one chunk at a time instead of buffering ahead means our 
reliance on the page cache for sequential reads is at odds with our need to 
minimize read amplification during read-heavy workloads.  This includes LWT 
and Counter workloads, since they read before they write.  With Accord coming 
up, we're going to want to ensure our users don't need to overpay for storage 
just for the sake of running compaction.
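
To make the idea concrete, here's a minimal sketch (not what's in the tree; 
the class name and sizes are illustrative) of an internally managed read-ahead 
buffer of the kind this ticket describes: issue one large sequential read per 
syscall and serve the small chunk-sized reads from memory.  With 16KB chunks, 
a 256KB window turns 16 filesystem reads into one.
{noformat}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: one large positional read per refill, small reads
// served from the in-memory window.
final class ReadAheadChannel implements AutoCloseable
{
    private static final int READ_AHEAD_SIZE = 256 * 1024; // sized to EBS's 256KB-per-IOP ceiling

    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(READ_AHEAD_SIZE);
    private long bufferStart = -1; // file offset of buffer[0]; -1 means empty

    ReadAheadChannel(Path path) throws IOException
    {
        channel = FileChannel.open(path, StandardOpenOption.READ);
        buffer.limit(0); // start empty so the first read triggers a refill
    }

    // Copy bytes at `position` into dst, refilling with one large sequential
    // read whenever the requested offset falls outside the buffered window.
    int read(ByteBuffer dst, long position) throws IOException
    {
        if (bufferStart < 0 || position < bufferStart || position >= bufferStart + buffer.limit())
        {
            buffer.clear();
            int n = channel.read(buffer, position); // the only actual disk I/O
            if (n < 0)
                return -1; // EOF
            buffer.flip();
            bufferStart = position;
        }
        ByteBuffer window = buffer.duplicate();
        window.position((int) (position - bufferStart));
        int copied = Math.min(window.remaining(), dst.remaining());
        window.limit(window.position() + copied);
        dst.put(window);
        return copied;
    }

    @Override
    public void close() throws IOException
    {
        channel.close();
    }
}
{noformat}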

Here's a sample of what the I/O looks like at the block device level when 
running compact:
{noformat}
ubuntu@ip-172-31-38-58:~$ sudo /usr/share/bcc/tools/biosnoop -d xvdb | awk '$5 == "R" { print $0 }'
{noformat}
{noformat}
TIME(s)     COMM           PID     DISK      T SECTOR     BYTES  LAT(ms)
0.000000    CompactionExec 26988   xvdb      R 48340000   16384     0.26
0.002350    CompactionExec 26988   xvdb      R 48339842   512       0.23
0.003560    CompactionExec 26988   xvdb      R 48339872   4096      0.22
0.003788    CompactionExec 26988   xvdb      R 48339864   4096      0.18
0.004550    CompactionExec 26988   xvdb      R 48339841   512       0.16
0.004719    CompactionExec 26988   xvdb      R 48339843   512       0.14
0.004906    CompactionExec 26988   xvdb      R 48339856   4096      0.15
0.005077    CompactionExec 26988   xvdb      R 48339848   4096      0.14
0.005304    CompactionExec 26988   xvdb      R 48340288   16384     0.17
0.009510    CompactionExec 26988   xvdb      R 48340320   16384     0.19
0.016991    NonPeriodicTas 26988   xvdb      R 32258112   16384     0.20
0.028200    CompactionExec 26988   xvdb      R 32262176   16384     0.27
0.029466    CompactionExec 26988   xvdb      R 32258944   16384     0.26
0.031511    CompactionExec 26988   xvdb      R 32226562   512       0.21
0.038502    CompactionExec 26988   xvdb      R 32226592   4096      0.22
0.038848    CompactionExec 26988   xvdb      R 32226584   4096      0.24
0.039842    CompactionExec 26988   xvdb      R 32226561   512       0.14
0.040079    CompactionExec 26988   xvdb      R 32226563   512       0.14
{noformat}
Since every one of these disk operations counts as an IOP toward the quota in 
EBS (or any cloud provider's disaggregated storage), this will run through the 
allocated quota significantly faster than buffered reads would.  EBS counts a 
read of up to 256KB as a single IOP, so even in the best case scenario of 
reading 16KB at a time, we'd use 16x the optimal number of IOPS.  This makes 
the cost of storage on cloud providers significantly higher than it needs to 
be.
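
To put numbers on that (a toy calculation assuming EBS's up-to-256KB-per-IOP 
accounting; a real scan also reads indexes and metadata):
{noformat}
public final class IopsAmplification
{
    // IOPS consumed to sequentially scan totalBytes with uniform reads of
    // readSize bytes, where any read up to 256KB bills as one I/O operation.
    static long iopsForScan(long totalBytes, int readSize)
    {
        return (totalBytes + readSize - 1) / readSize; // one IOP per read
    }

    public static void main(String[] args)
    {
        long oneGiB = 1L << 30;
        long chunked  = iopsForScan(oneGiB, 16 * 1024);  // 65536 IOPS with 16KB chunk reads
        long buffered = iopsForScan(oneGiB, 256 * 1024); // 4096 IOPS with 256KB buffered reads
        System.out.println("amplification: " + (chunked / buffered) + "x"); // prints 16x
    }
}
{noformat}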

> Improve disk access patterns during compaction and streaming
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-15452
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15452
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Local Write-Read Paths, Local/Compaction
>            Reporter: Jon Haddad
>            Priority: Normal
>         Attachments: results.txt, sequential.fio
>
>
> On read heavy workloads Cassandra performs much better when using a low read 
> ahead setting.  In my tests I've seen a 5x improvement in throughput and 
> more than a 50% reduction in latency.  However, I've also observed that it 
> can have a negative impact on compaction and streaming throughput.  It 
> especially negatively impacts cloud environments where small reads incur high 
> costs in IOPS due to tiny requests.
>  # We should investigate using POSIX_FADV_DONTNEED on files we're compacting 
> to see if we can improve performance and reduce page faults. 
>  # This should be combined with an internal read ahead style buffer that 
> Cassandra manages, similar to a BufferedInputStream but with our own 
> machinery.  This buffer should read fairly large blocks of data off disk at 
> a time.  EBS, for example, allows 1 IOP to be up to 256KB.  A considerable 
> amount of time is spent in blocking I/O during compaction and streaming. 
> Reducing the frequency we read from disk should speed up all sequential I/O 
> operations.
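
On item #1 above, a hedged sketch of what calling posix_fadvise(2) from Java 
via JNA direct mapping could look like (the constant and error handling are 
illustrative, and obtaining the raw file descriptor is left out):
{noformat}
import com.sun.jna.Native;

final class PageCacheAdvisor
{
    static
    {
        Native.register("c"); // bind the native method below against libc
    }

    // int posix_fadvise(int fd, off_t offset, off_t len, int advice);
    private static native int posix_fadvise(int fd, long offset, long len, int advice);

    private static final int POSIX_FADV_DONTNEED = 4; // Linux value

    // Hint that we won't re-read [offset, offset + len) of fd, making those
    // pages immediate candidates for eviction from the page cache.
    static void dontNeed(int fd, long offset, long len)
    {
        int rc = posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
        if (rc != 0) // posix_fadvise returns the error number directly
            throw new RuntimeException("posix_fadvise failed: " + rc);
    }
}
{noformat}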



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
