You are right, Kurt, that's exactly what I was trying to do: lowering the compression chunk size and the device read-ahead.
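For the record, the chunk size change itself was a plain ALTER TABLE, applied via cqlsh (keyspace/table names below are placeholders for my real ones):

    # keyspace/table names are placeholders
    cqlsh -e "ALTER TABLE my_ks.my_table WITH compression =
        {'sstable_compression': 'org.apache.cassandra.io.compress.SnappyCompressor',
         'chunk_length_kb': '16'};"

Note that only SSTables written after the change use the new chunk size; existing SSTables keep the old one until they are rewritten (e.g. via "nodetool upgradesstables -a").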
The resulting column-family settings:

    compression = {'chunk_length_kb': '16',
                   'sstable_compression': 'org.apache.cassandra.io.compress.SnappyCompressor'}

Device read-ahead:

    blockdev --setra 8 ....

I had to fall back to the default RA of 256, and after that I got large merged reads and low IOPS with good MB/s. I believe this is not caused by C* settings; it's something filesystem- or IO-related in the kernel settings (or is it by design?).

I tried to emulate C* reads during compactions with dd:

****** RA=8 (4k):

    # blockdev --setra 8 /dev/xvdb
    # dd if=/dev/zero of=/data/ZZZ
    ^C16980952+0 records in
    16980951+0 records out
    8694246912 bytes (8.7 GB, 8.1 GiB) copied, 36.4651 s, 238 MB/s

    # sync
    # echo 3 > /proc/sys/vm/drop_caches
    # dd if=/data/ZZZ of=/dev/null
    ^C846513+0 records in
    846512+0 records out
    433414144 bytes (433 MB, 413 MiB) copied, 21.4604 s, 20.2 MB/s   <<<<< high IOPS, io size = 4k

What's interesting: setting bs=128k in dd didn't decrease the IOPS; the io size was still 4k.

****** RA=256 (128k):

    # blockdev --setra 256 /dev/xvdb
    # echo 3 > /proc/sys/vm/drop_caches
    # dd if=/data/ZZZ of=/dev/null
    ^C15123937+0 records in
    15123936+0 records out
    7743455232 bytes (7.7 GB, 7.2 GiB) copied, 60.8407 s, 127 MB/s   <<<<<< io size = 128k, low IOPS, good throughput (limited by EBS bandwidth)

Writes were fine in both cases: io size 128k, good throughput limited only by EBS bandwidth.

Is the situation above typical for a small read-ahead (the "price for small fast reads"), or is something wrong with my setup?

[This is not the XFS mailing list, but as somebody here may know this:] Why, with a small RA, are even large reads (bs=128k) converted into multiple small reads?

Regards,
Kyrill

________________________________
From: kurt greaves <k...@instaclustr.com>
Sent: Tuesday, May 8, 2018 2:12:40 AM
To: User
Subject: Re: compaction: huge number of random reads

If you've got small partitions/small reads you should test lowering your compression chunk size on the table and disabling read-ahead. This sounds like it might just be a case of read amplification.

On Tue., 8 May 2018, 05:43 Kyrylo Lebediev, <kyrylo_lebed...@epam.com> wrote:

Dear Experts,

I'm observing strange behavior on a 2.1.20 cluster during compactions.

My setup is:
    12 nodes, m4.2xlarge (8 vCPU, 32G RAM), Ubuntu 16.04, 2T EBS gp2
    Filesystem: XFS, block size 4k, device read-ahead 4k
    /sys/block/xvdb/queue/nomerges = 0
    SizeTieredCompactionStrategy

After data loads, when effectively nothing else is talking to the cluster and compaction is the only activity, I see something like this:

$ iostat -dkx 1
...
Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
xvda       0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
xvdb       0.00    0.00 4769.00  213.00  19076.00  26820.00     18.42      7.95   1.17     1.06     3.76   0.20 100.00

Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
xvda       0.00    0.00    0.00    0.00      0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
xvdb       0.00    0.00 6098.00  177.00  24392.00  22076.00     14.81      6.46   1.36     0.96    15.16   0.16 100.00

Writes are fine: 177 writes/sec <-> ~22 MB/s, i.e. roughly 128k per write. But for some reason compactions generate a huge number of small reads: 6098 reads/sec <-> ~24 MB/s, i.e. 24392 kB / 6098 reads ≈ 4 kB per read.

===> Read size is 4k

Why am I getting a huge number of 4k reads instead of a much smaller number of large reads? What could be the reason?

Thanks,
Kyrill
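P.S. A follow-up test I still plan to run, to check whether the 4k splitting happens in the buffered-read / readahead path rather than at the device level (just a sketch; assumes GNU dd with O_DIRECT support, and numbers will of course differ on other setups):

    # O_DIRECT bypasses the page cache, so dd submits bs-sized requests
    # straight to the block layer instead of page-by-page buffered reads
    dd if=/data/ZZZ of=/dev/null bs=128k iflag=direct

    # the largest single request the kernel will issue is capped here,
    # worth checking too:
    cat /sys/block/xvdb/queue/max_sectors_kb

If the direct reads show up in iostat as ~128k requests even with RA=8, then the splitting is coming from the buffered-read path (the readahead window), not from the device or driver.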