On Wed, Mar 11, 2026 at 9:37 AM Xuneng Zhou <[email protected]> wrote:
>
> Hi Andres,
>
> On Wed, Mar 11, 2026 at 7:04 AM Andres Freund <[email protected]> wrote:
> >
> > Hi,
> >
> > On 2026-03-10 19:28:29 +0900, Michael Paquier wrote:
> > > On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> > > > Here’s v5 of the patchset. The wal_logging_large patch has been
> > > > removed, as no performance gains were observed in the benchmark runs.
> > >
> > > Looking at the numbers you are posting, it is harder to get excited
> > > about the hash, gin, bloom_vacuum and wal_logging.
> >
> > It's perhaps worth emphasizing that, to allow real world usage of direct
> > IO, we'll need streaming implementation for most of these. Also, on
> > windows the OS provided readahead is ... not aggressive, so you'll hit IO
> > stalls much more frequently than you'd on linux (and some of the BSDs).
> >
> > It might be a good idea to run the benchmarks with debug_io_direct=data.
> > That'll make them very slow, since the write side doesn't yet use AIO and
> > thus will do a lot of synchronous writes, but it should still allow to
> > evaluate the gains from using read stream.
> >
> >
> > The other thing that's kinda important to evaluate read streams is to test
> > on higher latency storage, even without direct IO. Many workloads are not
> > at all benefiting from AIO when run on a local NVMe SSD with < 10us
> > latency, but are severely IO bound when run on a cloud storage disk with
> > 0.5ms - 4ms latency.
> >
> >
> > To be able to test such higher latencies locally, I've found it quite
> > useful to use dm_delay above a fast disk. See [1].
>
> Thanks for the tips! I currently don’t have access to a machine or
> cloud instance with slower SSDs or HDDs that have higher latency. I’ll
> try running the benchmark with debug_io_direct=data and dm_delay, as
> you suggested, to see if the results vary.
>
> >
> > > The worker method seems more efficient, may show that we are out of noise
> > > level.
> >
> > I think that's more likely to show that memory bandwidth, probably due to
> > checksum computations, is a factor. The memory copy (from the kernel page
> > cache, with buffered IO) and the checksum computations (when checksums are
> > enabled) are parallelized by worker, but not by io_uring.
> >
> >
> > Greetings,
> >
> > Andres Freund
> >
> >
> > [1]
> >
> > https://docs.kernel.org/admin-guide/device-mapper/delay.html
> >
> > Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
> > introduced for it:
> >
> > umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0 && mount /dev/mapper/delayed /srv/
> >
> > To update the amount of delay to 3ms the following can be used:
> > dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed
> >
> > (I will often just update the delay to 0 for comparison runs, as that
> > doesn't require remounting)
>
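For anyone reproducing the setup quoted above: the only part of the dm-delay table that changes between runs is the delay value, so it can help to build the table string in one place. This is a minimal sketch under assumptions, not a tested recipe. `delay_table` is a hypothetical helper name (not part of dmsetup), and the device and mapping names are simply the ones from the quoted example.

```shell
#!/bin/sh
# Hypothetical helper (not part of dmsetup): print a dm-delay table line.
# Arguments: <size in 512-byte sectors> <delay in ms> <backing device>
# Table layout: <start> <length> delay <device> <offset> <delay_ms>
delay_table() {
    sectors=$1
    delay_ms=$2
    dev=$3
    printf '0 %s delay %s 0 %s\n' "$sectors" "$dev" "$delay_ms"
}

# Example use (needs root; assumes an existing mapping named "delayed"
# on top of /dev/md0, as in the quoted commands):
#
#   size=$(blockdev --getsz /dev/md0)
#   dmsetup suspend delayed
#   dmsetup reload delayed --table "$(delay_table "$size" 3 /dev/md0)"
#   dmsetup resume delayed
```

Passing 0 as the delay gives the comparison configuration mentioned in the quote, without remounting.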
With debug_io_direct=data and dm_delay, the results look quite promising!

medium size / io_uring
gin_vacuum_medium  base= 1619.9ms  patch= 301.8ms  5.37x ( 81.4%)  (reads=1571→947, io_time=1524.86→207.48ms)

The average runtime increases significantly after adding the manual
device delay, so it will take some time to complete all the test runs.
I was also busy with something else today... Once the runs are
finished, I’ll share the results and the script to reproduce them.

--
Best,
Xuneng
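
P.S. For reading the result line above: the "5.37x" and "81.4%" figures are consistent with speedup = base / patch and reduction = (base - patch) / base. A small sketch of that arithmetic follows. `speedup` is a hypothetical helper written for this note, not part of the benchmark script.

```shell
#!/bin/sh
# Hypothetical helper: print the speedup factor and the runtime reduction
# for a base/patched pair of timings given in milliseconds.
speedup() {
    awk -v base="$1" -v patch="$2" 'BEGIN {
        printf "%.2fx %.1f%%\n", base / patch, (base - patch) / base * 100
    }'
}

# gin_vacuum_medium, io_uring, timings from the result line above
speedup 1619.9 301.8    # prints: 5.37x 81.4%
```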
