[ https://issues.apache.org/jira/browse/CASSANDRA-21134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063567#comment-18063567 ]

Sam Lightfoot edited comment on CASSANDRA-21134 at 3/6/26 3:07 PM:
-------------------------------------------------------------------

h3. Block I/O Latency: Compaction Writes at the Device Level

To understand _why_ DIO improves read tail latency, we captured block I/O 
latency histograms using {{biolatency-bpfcc}} (BPF) during compaction. This 
traces every I/O at the NVMe device driver, below the page cache and the filesystem.

*Setup:* 2 × 65 GB SSTables, major compaction at 128 MiB/s, 10K reads/s, 12 GB 
cgroup, RAID1 NVMe, chunk_length_kb=4. 30-second capture during steady-state 
compaction.
h4. Buffered compaction write I/Os (writeback)

With buffered I/O, compaction writes enter the page cache and are flushed to 
disk asynchronously by the kernel's writeback daemon. These appear as {{Write}} 
flag I/Os:
{noformat}
     usecs               : count     distribution
         8 -> 15         :    134    |*                                       |
        16 -> 31         :    276    |****                                    |
        32 -> 63         :    199    |**                                      |
        64 -> 127        :    223    |***                                     |
       128 -> 255        :    344    |*****                                   |
       256 -> 511        :    486    |*******                                 |
       512 -> 1023       :    928    |*************                           |
      1024 -> 2047       :  1,110   |****************                         |
      2048 -> 4095       :  1,608   |***********************                  |
      4096 -> 8191       :  2,476   |************************************     |
      8192 -> 16383      :  1,528   |**********************                   |
     16384 -> 32767      :  1,809   |**************************               |
     32768 -> 65535      :  2,722   |****************************************|  <-- mode
     65536 -> 131071     :    738   |**********                               |
{noformat}
*14,581 writeback I/Os. Mode at 32–65 ms. 82% exceed 1 ms. Spread across four 
orders of magnitude.*
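For context, the buffered path measured above corresponds to an ordinary {{FileChannel}} write: the syscall returns as soon as the pages are dirtied in the page cache, and the kernel's writeback threads issue the actual device I/O later, on their own schedule. A minimal, hypothetical sketch (not Cassandra's actual writer):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BufferedWriteDemo
{
    // Writes `len` bytes through the page cache. The write() call below
    // completes once the pages are dirty in memory; the device I/O is
    // issued later by kernel writeback (the {{Write}}-flag I/Os in the
    // histogram above).
    static long writeBuffered(Path path, int len) throws IOException
    {
        ByteBuffer buf = ByteBuffer.allocate(len);
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE))
        {
            long written = ch.write(buf);
            // ch.force(false) would flush eagerly; without it, writeback
            // timing is left to the kernel (dirty_expire_centisecs etc.).
            return written;
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path p = Files.createTempFile("buffered-demo", ".bin");
        System.out.println("wrote " + writeBuffered(p, 4096) + " bytes via page cache");
        Files.delete(p);
    }
}
```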
h4. DIO compaction write I/Os (O_DIRECT)

With DIO, compaction writes go directly to the device as {{Sync-Write}} I/Os, 
bypassing the page cache entirely:
{noformat}
     usecs               : count     distribution
         8 -> 15         :     20   |                                        |
        16 -> 31         :     72   |                                        |
        32 -> 63         :    984   |*                                       |
         64 -> 127        : 31,424   |****************************************|  <-- mode
      1024 -> 2047       :      1   |                                        |
      4096 -> 8191       :      1   |                                        |
{noformat}
*32,502 Sync-Write I/Os. Mode at 64–127 us (~500x faster). 97% fall in the 
64–127 us bucket; all but two complete within 127 us.*
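The Sync-Write path corresponds to opening the channel with O_DIRECT, which the JDK exposes as {{com.sun.nio.file.ExtendedOpenOption.DIRECT}} (the same option the compaction-read work in CASSANDRA-19987 builds on). O_DIRECT requires the buffer address, file position, and transfer length to all be aligned to the device's logical block size. A hedged sketch assuming 4 KiB alignment, not the patch's actual code:

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectWriteDemo
{
    static final int ALIGN = 4096; // assumed logical block size

    // Writes `len` bytes with O_DIRECT: the write() call submits the I/O
    // straight to the device, bypassing the page cache (the Sync-Write
    // I/Os in the histogram above). The file position, the length, and
    // the buffer's memory address must all be multiples of ALIGN.
    static long writeDirect(Path path, int len) throws IOException
    {
        // Over-allocate, then take an ALIGN-aligned slice of the buffer.
        ByteBuffer buf = ByteBuffer.allocateDirect(len + ALIGN).alignedSlice(ALIGN);
        buf.limit(len);
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                ExtendedOpenOption.DIRECT))
        {
            return ch.write(buf);
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path p = Files.createTempFile("direct-demo", ".bin");
        try
        {
            System.out.println("wrote " + writeDirect(p, 4096) + " bytes with O_DIRECT");
        }
        catch (IOException e)
        {
            // Some filesystems (e.g. tmpfs) reject O_DIRECT with EINVAL.
            System.out.println("O_DIRECT unsupported here: " + e);
        }
        finally
        {
            Files.delete(p);
        }
    }
}
```

The ~500x mode shift above falls out of this difference: each Sync-Write is one bounded device round-trip, while a writeback flush competes with everything else the daemon has accumulated.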
h4. Impact on reads

The buffered writeback I/Os (mode 32–65 ms) saturate the device's write 
bandwidth, causing user read I/Os to queue behind them. During the 30-second 
capture:
 - *Buffered:* 5,729 read I/Os exceeded 2 ms (3.8% of all reads reaching disk), 
max ~32 ms
 - *DIO:* 8 read I/Os exceeded 2 ms (0.04%), max ~16 ms

This writeback-induced read queueing is the primary mechanism behind the p99 
latency difference observed in the application-level results.



> Direct IO support for compaction writes
> ---------------------------------------
>
>                 Key: CASSANDRA-21134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21134
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: image-2026-02-11-17-22-58-361.png, 
> image-2026-02-11-17-25-58-329.png
>
>
> Follow-up from the implementation for compaction reads (CASSANDRA-19987).
> Notable points:
>  * Update the start-up check that impacts DIO writes 
> ({_}checkKernelBug1057843{_})
>  * RocksDB uses a 1 MB flush buffer. This should be configurable and 
> performance tested (256 KB vs 1 MB)
>  * Introduce compaction_write_disk_access_mode / 
> background_write_disk_access_mode
>  * Support for the compressed path would be most beneficial



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
