
Over the last ~year I spent a lot of time trying to figure out how we could
add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
there are still a *lot* of open questions, I think I now have a decent handle
on most of the bigger architectural questions.  Thus this long email.

Just to be clear: I don't expect the current design to survive as-is. If
there are a few sentences below that sound a bit like describing the new
world, that's because they're from the README.md in the patch series...

## Why Direct / unbuffered IO?

The main reasons to want to use Direct IO are:

- Lower CPU usage / higher throughput. Particularly on modern storage,
  buffered writes are bottlenecked by the operating system having to
  copy data between postgres' buffer pool and the kernel's page cache
  using the CPU. Direct IO, in contrast, can often move the data
  directly between the storage devices and postgres' buffer cache,
  using DMA. While that transfer is ongoing, the CPU is free to perform
  other work, with little impact.
- Avoiding double buffering between operating system cache and
  postgres' shared_buffers.
- Better control over the timing and pace of dirty data writeback.
- Potential for concurrent WAL writes (via O_DIRECT | O_DSYNC writes)

The main reasons *not* to use Direct IO are:

- Without AIO, Direct IO is unusably slow for most purposes.
- Even with AIO, many parts of postgres need to be modified to perform
  explicit prefetching.
- In situations where shared_buffers cannot be set appropriately
  large, e.g. because there are many different postgres instances
  hosted on shared hardware, performance will often be worse than when
  using buffered IO.

## Why Asynchronous IO

- Without AIO we cannot use DIO

- Without asynchronous IO (AIO) PG has to rely on the operating system
  to hide the cost of synchronous IO from Postgres. While this works
  surprisingly well in a lot of workloads, it does not do as good a job
  on prefetching and controlled writeback as we would like.
- There are important expensive operations like fdatasync() where the
  operating system cannot hide the storage latency. This is particularly
  important for WAL writes, where the ability to asynchronously issue
  fdatasync() or O_DSYNC writes can yield significantly higher
  throughput.
- Fetching data into shared buffers asynchronously and concurrently with query
  execution means there is more CPU time for query execution.

## High level difficulties adding AIO/DIO support

- Optionally using AIO leads to convoluted and / or duplicated code.

- Platform dependency: The common AIO APIs are typically specific to one
  platform (linux AIO, linux io_uring, windows IOCP, windows overlapped IO) or
  a few platforms (posix AIO, but there are many differences).

- There are a lot of separate places doing IO in PG. Moving all of these to
  efficiently use AIO is an, um, large undertaking.

- Nothing in the buffer management APIs expects more than one IO to be in
  progress at the same time - which is required to do AIO.

## Portability & Duplication

To avoid needing non-AIO codepaths to support platforms without native AIO
support, a worker process based AIO implementation exists (and is currently
the default). This is also convenient for checking whether a problem is
related to the native IO implementation or not.

Thanks to Thomas Munro for helping a *lot* around this area. He wrote
the worker mode, the posix aio mode, added CI, did a lot of other
testing, listened to me...

## Deadlock and Starvation Dangers due to AIO

Using AIO in a naive way can easily lead to deadlocks in an environment where
the source/target of AIO are shared resources, like pages in postgres'
shared_buffers.

Consider one backend performing readahead on a table, initiating IO for a
number of buffers ahead of the current "scan position". If that backend then
performs some operation that blocks, or even just is slow, the IO completion
for the asynchronously initiated read may not be processed.

This AIO implementation solves this problem by requiring that AIO methods
either allow AIO completions to be processed by any backend in the system
(e.g. io_uring, and indirectly posix, via signal handlers), or guarantee
that AIO processing will happen even when the issuing backend is blocked
(e.g. worker mode, which offloads completion processing to the AIO workers).

## AIO API overview

The main steps to use AIO (without higher level helpers) are:

1) acquire an "unused" AIO: pgaio_io_get()

2) start some IO; this is done by functions like
   pgaio_io_start_read_smgr() (shown further below).

   The (read|write|fsync|flush_range) indicates the operation, whereas
   (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.

   (see below for more details about this design choice - it might or might
   not be right)

3) optionally: assign a backend-local completion callback to the IO

4) Step 2) alone does *not* cause the IO to be submitted to the kernel;
   instead the IO is put on a per-backend list of pending IOs. The pending
   IOs can be explicitly flushed with pgaio_submit_pending(), but will also
   be submitted if the pending list gets too large, or if the current
   backend waits for the IO.

   There are two main reasons not to submit the IO immediately:
   - If adjacent, we can merge several IOs into one "kernel level" IO during
     submission. Larger IOs are considerably more efficient.
   - Several AIO APIs allow submitting a batch of IOs in one system call.

5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
   backend. When other backends may need to wait for an IO to finish,
   pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
   BufferDesc), which can be waited for using pgaio_io_wait_ref().

6) Process the results of the request. If a callback was registered in 3),
   this isn't always necessary. The results of AIO can be accessed using
   pgaio_io_result() which returns an integer where negative numbers are
   -errno, and positive numbers are the [partial] success conditions
   (e.g. potentially indicating a short read).

7) release ownership of the io (pgaio_io_release()) or reuse the IO for
   another operation (pgaio_io_recycle())
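
To make the "merge adjacent IOs at submission time" idea from step 4) a bit
more concrete, here is a minimal, self-contained sketch. It is *not* code from
the patch series: ToyIo and toy_merge_pending() are made-up names, and the
sketch only looks at file offsets. A real implementation would presumably also
have to care about whether the memory buffers are adjacent (or use vectored
IO), and about IO types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one pending IO; not the patch's PgAioInProgress. */
typedef struct ToyIo
{
    int      fd;      /* file the IO targets */
    uint64_t offset;  /* byte offset of the first byte */
    uint32_t nbytes;  /* length in bytes */
} ToyIo;

/*
 * Walk the pending list in order and merge each IO into the previous one if
 * it targets the same file and starts exactly where the previous one ends.
 * out[] must have room for n entries. Returns the number of "kernel level"
 * IOs that would be submitted.
 */
static size_t
toy_merge_pending(const ToyIo *pending, size_t n, ToyIo *out)
{
    size_t nout = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (nout > 0 &&
            out[nout - 1].fd == pending[i].fd &&
            out[nout - 1].offset + out[nout - 1].nbytes == pending[i].offset)
        {
            /* Directly follows the previous IO: extend that one instead. */
            out[nout - 1].nbytes += pending[i].nbytes;
        }
        else
            out[nout++] = pending[i];
    }

    return nout;
}
```

With three consecutive 8kB reads followed by two unrelated ones on the pending
list, this would submit three IOs instead of five, the first one being 24kB
large.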

Most places that want to use AIO shouldn't themselves need to care about
managing the number of writes in flight, or the readahead distance. To help
with that there are two helper utilities, a "streaming read" and a "streaming
write".

The "streaming read" helper uses a callback to determine which blocks to
prefetch - that allows to do readahead in a sequential fashion but importantly
also allows to asynchronously "read ahead" non-sequential blocks.
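
As a rough illustration of the callback-driven approach, here is a toy model
of such a helper. None of this is the patch's API: ToyStream,
toy_stream_next(), the fixed window of 4 IOs and the block-list callback are
all assumptions made up for the sketch, and the places where a real helper
would start or wait for an asynchronous read are only marked by comments.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX
#define TOY_MAX_INFLIGHT  4     /* assumed readahead window */

/* Returns the next block to read, or TOY_INVALID_BLOCK when done. */
typedef uint32_t (*toy_next_block_cb) (void *cb_state);

typedef struct ToyStream
{
    toy_next_block_cb next_block;
    void       *cb_state;
    uint32_t    inflight[TOY_MAX_INFLIGHT]; /* ring of blocks with IO started */
    int         head;       /* index of the oldest in-flight block */
    int         count;      /* number of in-flight blocks */
    int         started;    /* total IOs started, for illustration */
    int         exhausted;  /* callback reported the end */
} ToyStream;

/* Start IOs until the window is full or the callback runs dry. */
static void
toy_stream_fill(ToyStream *s)
{
    while (!s->exhausted && s->count < TOY_MAX_INFLIGHT)
    {
        uint32_t blk = s->next_block(s->cb_state);

        if (blk == TOY_INVALID_BLOCK)
        {
            s->exhausted = 1;
            break;
        }
        /* A real helper would start an asynchronous read of blk here. */
        s->inflight[(s->head + s->count) % TOY_MAX_INFLIGHT] = blk;
        s->count++;
        s->started++;
    }
}

/* Hand out blocks in callback order, keeping the window topped up. */
static uint32_t
toy_stream_next(ToyStream *s)
{
    uint32_t blk;

    toy_stream_fill(s);
    if (s->count == 0)
        return TOY_INVALID_BLOCK;

    /* A real helper would wait for the oldest IO to complete here. */
    blk = s->inflight[s->head];
    s->head = (s->head + 1) % TOY_MAX_INFLIGHT;
    s->count--;
    toy_stream_fill(s);
    return blk;
}

/* Example callback: walk an arbitrary, non-sequential list of blocks. */
typedef struct ToyBlockList
{
    const uint32_t *blocks;
    int         nblocks;
    int         pos;
} ToyBlockList;

static uint32_t
toy_list_next(void *cb_state)
{
    ToyBlockList *l = cb_state;

    return l->pos < l->nblocks ? l->blocks[l->pos++] : TOY_INVALID_BLOCK;
}
```

The point of the callback is that the consumer of toy_stream_next() never
deals with the readahead distance itself, and that the block sequence can be
anything the callback can compute.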

E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
uses the tids in LVDeadTuples to figure out which blocks are going to be
needed. Here's the latter as an example:

## IO initialization layering

One difficulty I had in this process was how to initialize IOs in light of the
layering (from bufmgr.c over smgr.c and md.c to fd.c and back, but also
e.g. xlog.c). Sometimes AIO needs to be initialized on the bufmgr.c level,
sometimes on the md.c level, sometimes on the level of fd.c. But to be able to
react to the completion of any such IO, metadata about the operation is needed.

Early on fd.c initialized IOs, and the context information was just passed
through to fd.c. But that seems quite wrong - fd.c shouldn't have to know
which Buffer an IO is about. But higher levels shouldn't know which files an
operation resides in either, so they can't do all the work.

To avoid that, I ended up splitting the "start an AIO" operation in two. The
higher level part, e.g. pgaio_io_start_read_smgr(), doesn't know which smgr
implementation is in use, and thus also not what file/offset we're dealing
with. It calls into smgr->md->fd to actually "prepare" the IO (i.e. figure
out file / offset).  This currently looks like:

void
pgaio_io_start_read_smgr(PgAioInProgress *io, struct SMgrRelationData *smgr,
                         ForkNumber forknum, BlockNumber blocknum,
                         char *bufdata)
{
        pgaio_io_prepare(io, PGAIO_OP_READ);

        /* smgr -> md -> fd figure out the file / offset */
        smgrstartread(io, smgr, forknum, blocknum, bufdata);

        /* metadata needed to react to the completion of the IO */
        io->scb_data.read_smgr.tag = (AioBufferTag){
                .rnode = smgr->smgr_rnode,
                .forkNum = forknum,
                .blockNum = blocknum
        };

        pgaio_io_stage(io, PGAIO_SCB_READ_SMGR);
}

Once this reaches the fd.c layer the new FileStartRead() function calls
pgaio_io_prep_read() on the IO - but doesn't need to know anything about weird
higher level stuff like relfilenodes.
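
To illustrate the split in isolation, here is a small self-contained toy
version of the same layering. All names (ToyLayeredIo, toy_*) are invented
for this sketch, the block size of 8192 bytes and the 131072 blocks per
segment file are assumed defaults, and the "fd" layer just records a segment
number instead of dealing with a real file.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLCKSZ      8192u    /* assumed block size in bytes */
#define TOY_RELSEG_SIZE 131072u  /* assumed blocks per segment file */

typedef struct ToyLayeredIo
{
    /* Set by the low-level layers: where the bytes live. */
    uint32_t    segno;
    uint64_t    offset;
    /* Set by the high-level layer: what the IO means once it completes. */
    uint32_t    tag_relnode;
    uint32_t    tag_blocknum;
    int         staged;
} ToyLayeredIo;

/* "fd" layer: only knows about a file (here: segment number) and an offset. */
static void
toy_fd_start_read(ToyLayeredIo *io, uint32_t segno, uint64_t offset)
{
    io->segno = segno;
    io->offset = offset;
}

/* "md" layer: maps a block number to a segment file and an offset in it. */
static void
toy_md_start_read(ToyLayeredIo *io, uint32_t blocknum)
{
    toy_fd_start_read(io,
                      blocknum / TOY_RELSEG_SIZE,
                      (uint64_t) (blocknum % TOY_RELSEG_SIZE) * TOY_BLCKSZ);
}

/* Top layer: never sees files or offsets, it only records the tag. */
static void
toy_start_read_smgr(ToyLayeredIo *io, uint32_t relnode, uint32_t blocknum)
{
    toy_md_start_read(io, blocknum);

    io->tag_relnode = relnode;
    io->tag_blocknum = blocknum;
    io->staged = 1;
}
```

Neither side needs the other's knowledge: the lower layers never look at the
tag, and the top layer never learns which file or offset was chosen.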

The _sb (_sb for shared_buffers) variant stores the Buffer, backend and mode
(as in ReadBufferMode).

I'm not sure this is the right design - but it seems a lot better than what I
had earlier...

## Callbacks

In the core AIO pieces there are currently two different types of callbacks:

Shared callbacks, which can be invoked by any backend (normally the issuing
backend / the AIO workers, but can be other backends if they are waiting for
the IO to complete). For operations on shared resources (e.g. shared buffer
reads/writes, or WAL writes) these shared callbacks need to transition the
state of the object the IO is being done for to completion. E.g. for a shared
buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS.

The main reason these callbacks exist is that they make it safe for a backend
to issue non-blocking IO on buffers (see the deadlock section above). As any
blocked backend can cause the IO to complete, the deadlock danger is gone.
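
A minimal sketch of what such a shared completion callback does to the buffer
state for a read, under stated assumptions: the TOY_BM_* flags and
ToyBufferDesc only mimic the real buffer header, TOY_BM_IO_ERROR is an
assumption about how failures might be recorded, and C11 atomics stand in for
postgres' own atomics and buffer header locking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_BM_VALID          (1u << 0)
#define TOY_BM_IO_IN_PROGRESS (1u << 1)
#define TOY_BM_IO_ERROR       (1u << 2)     /* assumed error marker */

typedef struct ToyBufferDesc
{
    _Atomic uint32_t state;
} ToyBufferDesc;

/* Issuer side: mark the buffer as having IO in progress. */
static void
toy_read_start(ToyBufferDesc *buf)
{
    atomic_fetch_or(&buf->state, TOY_BM_IO_IN_PROGRESS);
}

/*
 * Shared completion callback. It only touches shared state, so it does not
 * matter which backend runs it. "result" follows the convention described
 * above: negative is -errno, otherwise the number of bytes read.
 * Returns 1 if the buffer became valid, 0 otherwise.
 */
static int
toy_read_complete_shared(ToyBufferDesc *buf, int result, int expected)
{
    int         ok = (result == expected);
    uint32_t    oldstate = atomic_load(&buf->state);
    uint32_t    newstate;

    do
    {
        newstate = oldstate & ~TOY_BM_IO_IN_PROGRESS;
        newstate |= ok ? TOY_BM_VALID : TOY_BM_IO_ERROR;
    } while (!atomic_compare_exchange_weak(&buf->state, &oldstate, newstate));

    return ok;
}
```

Because the transition is a single atomic update on shared memory, a backend
that is merely waiting for the buffer can perform it on behalf of an issuer
that is blocked elsewhere.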

Local callbacks, one of which the issuer of an IO can associate with the
IO. These can be used to issue further readahead. I initially did not have
these, but I found it hard to have a controllable number of IOs in
flight. They are currently mainly used for error handling (e.g. erroring out
when XLogFileInit() cannot create the file due to ENOSPC), and to issue more
IO (e.g. readahead for heapam).

The local callback system isn't quite right, and there's

## AIO conversions

Currently the patch series converts a number of subsystems to AIO. They are of
very varying quality. I mainly did the conversions that I considered to be
either of interest architecturally, or that caused a fair bit of pain due to
slowness (e.g. VACUUMing without AIO is no fun at all when using DIO). Some
also for fun ;)

Most conversions are fairly simple. E.g. heap scans, checkpointer, bgwriter,
VACUUM are all not too complicated.

There are two conversions that are a good bit more complicated/experimental:

1) Asynchronous, concurrent, WAL writes. This is important because we right
   now are very bottlenecked by IO latency, because there effectively is only
   ever one WAL IO in flight at the same time. That is the case even though
   in many situations it is possible to issue a WAL write, have one [set of]
   backends wait for that write as its completion satisfies their XLogFlush()
   needs, but concurrently already issue the next WAL write(s) that other backends

   The code here is very crufty, but I think the concept is mostly right.

2) Asynchronous buffer replacement. Even with buffered IO we experience a lot
   of pain when ringbuffers need to write out data (VACUUM!). But with DIO the
   issue gets a lot worse - the kernel can't hide the write latency from us
   anymore.  This change makes each backend asynchronously clean out buffers
   that it will need soon. When a ringbuffer is in use this means cleaning out
   buffers in the ringbuffer, when not, performing the clock sweep and cleaning
   out victim buffers.  Due to 1) the XLogFlush() can also be done

There are a *lot* of places that haven't been converted to use AIO.

## Stats

There are two new views: pg_stat_aios showing AIOs that are currently
in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.

## Code:


I was not planning to attach all the patches on the way to AIO - it's too many
right now... I hope I can reduce the size of the series iteratively into
easier to look at chunks.

## TL;DR: Performance numbers

This is worth an email on its own, and it's pretty late here already and I
want to rerun benchmarks before posting more numbers. So here are just a few
that I could run before falling asleep.

1) 60s of parallel COPY BINARY of an 89MB file into separate tables (s_b = 96GB):

slow NVMe SSD
branch   dio  clients   tps/stddev      checkpoint write time
master   n    8         3.0/2296 ms     4.1s / 379647 buffers = 723MiB/s
aio      n    8         3.8/1985 ms     11.5s / 1028669 buffers = 698MiB/s
aio      y    8         4.7/204 ms      10.0s / 1164933 buffers = 910MiB/s

raid of 2 fast NVMe SSDs (on pcie3):
branch   dio  clients   tps/stddev      checkpoint write time
master   n    8         9.7/62 ms       7.6s / 1206376 buffers = 1240MiB/s
aio      n    8         11.4/82 ms      14.3s / 2838129 buffers = 1550MiB/s
aio      y    8         18.1/56 ms      8.9s / 4486170 buffers = 3938MiB/s
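
For reference, the bandwidth column in the two tables above is consistent
with simply dividing the number of buffers written by the checkpoint write
time, assuming the default buffer size of 8192 bytes. The helper below is a
made-up name used only to show the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLCKSZ 8192.0       /* assumed buffer size in bytes */

/* Convert "N buffers written in S seconds" into MiB/s. */
static double
toy_mib_per_sec(uint64_t buffers, double seconds)
{
    return (double) buffers * TOY_BLCKSZ / (1024.0 * 1024.0) / seconds;
}
```

E.g. for the first row, 379647 buffers in 4.1s is about 2966MiB in 4.1s,
i.e. roughly 723MiB/s.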

2) pg_prewarm speed

raid of 2 fast NVMe SSDs (on pcie3):

pg_prewarm(62GB, read)
branch   dio    time            bw
master   n      17.4s           3626MiB/s
aio      n      10.3s           6126MiB/s (higher cpu usage)
aio      y      9.8s            6438MiB/s

pg_prewarm(62GB, buffer)
branch   dio    time            bw
master   n      38.3s           1647MiB/s
aio      n      13.6s           4639MiB/s (higher cpu usage)
aio      y      10.7s           5897MiB/s

3) parallel sequential scan speed

parallel sequential scan + count(*) of 59GB table:

branch   dio        max_parallel        time
master   n          0                   40.5s
master   n          1                   22.6s
master   n          2                   16.4s
master   n          4                   10.9s
master   n          8                   9.3s

aio      y          0                   33.1s
aio      y          1                   17.2s
aio      y          2                   11.8s
aio      y          4                   9.0s
aio      y          8                   9.2s

On local SSDs there's some, but not a huge, performance advantage in most
transactional r/w workloads. But on cloud storage - which has a lot higher
latency - AIO can yield huge advantages. I've seen over 4x.

There are definitely also cases where AIO currently hurts - most of those I
just didn't get around to addressing.

There are a lot more cases in which DIO currently hurts - mostly because the
necessary smarts haven't yet been added.

Comments? Questions?

I plan to send separate emails about smaller chunks of this - the whole
topic is just too big. In particular I plan to send something around
buffer locking / state management - it's one of the core issues around
this imo.


