On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> There's plenty more to do, but I thought this would be a useful checkpoint.

I find patches 1-5 Ready for Committer.

> +typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else.  I recommend referring to it in README.md.  Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?

> +{
> +     /* not in use */
> +     AHS_IDLE = 0,
> +
> +     /* returned by pgaio_io_get() */
> +     AHS_HANDED_OUT,
> +
> +     /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> +     AHS_DEFINED,
> +
> +     /* subjects prepare() callback has been called */
> +     AHS_PREPARED,
> +
> +     /* IO is being executed */
> +     AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach.  For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

> +
> +     /* IO finished, but result has not yet been processed */
> +     AHS_REAPED,
> +
> +     /* IO completed, shared completion has been called */
> +     AHS_COMPLETED_SHARED,
> +
> +     /* IO completed, local completion has been called */
> +     AHS_COMPLETED_LOCAL,
> +} PgAioHandleState;
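
To make the README suggestion above concrete, roughly the kind of summary I'm
hoping for (transitions inferred from the comments in this enum; the bracketed
annotations are my guesses, and are exactly the part I'd like stated
authoritatively):

    AHS_IDLE             --pgaio_io_get()-->               AHS_HANDED_OUT       [defining backend]
    AHS_HANDED_OUT       --pgaio_io_prepare()-->           AHS_DEFINED          [defining backend]
    AHS_DEFINED          --subject prepare() callback-->   AHS_PREPARED         [defining backend]
    AHS_PREPARED         --pgaio_io_prepare_submit()-->    AHS_IN_FLIGHT        [defining backend?]
    AHS_IN_FLIGHT        --IO finishes-->                  AHS_REAPED           [any backend?]
    AHS_REAPED           --shared completion callback-->   AHS_COMPLETED_SHARED [any backend?]
    AHS_COMPLETED_SHARED --local completion callback-->    AHS_COMPLETED_LOCAL  [defining backend?]
    AHS_COMPLETED_LOCAL  --pgaio_io_reclaim()-->           AHS_IDLE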

> +void
> +pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
> +{
> +     PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
> +
> +     Assert(ioh->resowner);
> +
> +     ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
> +     ioh->resowner = NULL;
> +
> +     switch (ioh->state)
> +     {
> +             case AHS_IDLE:
> +                     elog(ERROR, "unexpected");
> +                     break;
> +             case AHS_HANDED_OUT:
> +                     Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
> +
> +                     if (ioh == my_aio->handed_out_io)
> +                     {
> +                             my_aio->handed_out_io = NULL;
> +                             if (!on_error)
> +                                     elog(WARNING, "leaked AIO handle");
> +                     }
> +
> +                     pgaio_io_reclaim(ioh);
> +                     break;
> +             case AHS_DEFINED:
> +             case AHS_PREPARED:
> +                     /* XXX: Should we warn about this when is_commit? */

Yes.
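
Perhaps just something like this ahead of the pgaio_submit_staged() call
(untested sketch; the message text, and keying off on_error rather than adding
an is_commit parameter, are my guesses):

    if (!on_error)
        elog(WARNING, "staged AIO handle was not submitted before release");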

> +                     pgaio_submit_staged();
> +                     break;
> +             case AHS_IN_FLIGHT:
> +             case AHS_REAPED:
> +             case AHS_COMPLETED_SHARED:
> +                     /* this is expected to happen */
> +                     break;
> +             case AHS_COMPLETED_LOCAL:
> +                     /* XXX: unclear if this ought to be possible? */
> +                     pgaio_io_reclaim(ioh);
> +                     break;
> +     }

> +void
> +pgaio_io_ref_wait(PgAioHandleRef *ior)
> +{
> +     uint64          ref_generation;
> +     PgAioHandleState state;
> +     bool            am_owner;
> +     PgAioHandle *ioh;
> +
> +     ioh = pgaio_io_from_ref(ior, &ref_generation);
> +
> +     am_owner = ioh->owner_procno == MyProcNumber;
> +
> +
> +     if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> +             return;
> +
> +     if (am_owner)
> +     {
> +             if (state == AHS_DEFINED || state == AHS_PREPARED)
> +             {
> +                     /* XXX: Arguably this should be prevented by callers? */
> +                     pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both.  AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to clean up.  Even so,
the next point might remove the need here:

> +void
> +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> +{
> +     Assert(ioh->state == AHS_HANDED_OUT);
> +     Assert(pgaio_io_has_subject(ioh));
> +
> +     ioh->op = op;
> +     ioh->state = AHS_DEFINED;
> +     ioh->result = 0;
> +
> +     /* allow a new IO to be staged */
> +     my_aio->handed_out_io = NULL;
> +
> +     pgaio_io_prepare_subject(ioh);
> +
> +     ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here.  This code already needs to be safe for that (per
README.md).  When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common().  I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.
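
In other words, roughly this placement (a sketch against the quoted function
above, not a tested diff):

    void
    pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
    {
        Assert(ioh->state == AHS_HANDED_OUT);
        Assert(pgaio_io_has_subject(ioh));

        /*
         * Defense in depth: an ERROR in a subject callback between AHS_DEFINED
         * and AHS_PREPARED could otherwise leak resources, e.g. the lwlock
         * disowned in shared_buffer_prepare_common().
         */
        START_CRIT_SECTION();

        ioh->op = op;
        ioh->state = AHS_DEFINED;
        ioh->result = 0;

        /* allow a new IO to be staged */
        my_aio->handed_out_io = NULL;

        pgaio_io_prepare_subject(ioh);

        ioh->state = AHS_PREPARED;

        END_CRIT_SECTION();
    }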

> +if test "$with_liburing" = yes; then
> +  PKG_CHECK_MODULES(LIBURING, liburing)
> +fi

I used the attached makefile patch to build w/ liburing.

> +pgaio_uring_shmem_init(bool first_time)
> +{
> +     uint32          TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
> +     bool            found;
> +
> +     aio_uring_contexts = (PgAioUringContext *)
> +             ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
> +
> +     if (found)
> +             return;
> +
> +     for (int contextno = 0; contextno < TotalProcs; contextno++)
> +     {
> +             PgAioUringContext *context = &aio_uring_contexts[contextno];
> +             int                     ret;
> +
> +             /*
> +              * XXX: Probably worth sharing the WQ between the different rings,
> +              * when supported by the kernel. Could also cause additional
> +              * contention, I guess?
> +              */
> +#if 0
> +             if (!AcquireExternalFD())
> +                     elog(ERROR, "No external FD available");
> +#endif
> +             ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);

With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:

2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG:  database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL:  the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG:  database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC:  failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC:  failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC:  unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG:  received fast shutdown request

I expect that's because io_uring_queue_init() stashes a file descriptor and an
mmap address in shared memory, neither of which is valid in EXEC_BACKEND
children.  Reattaching the descriptors and memory in each child may work, or
one could just block io_method=io_uring under EXEC_BACKEND.
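
If blocking it is the route you choose, the io_method check hook could do it.
A minimal sketch, assuming an enum GUC with a check hook named
check_io_method() and an IOMETHOD_IO_URING value (all of those names are
guesses; adjust to whatever the patch defines):

    #include "utils/guc.h"

    static bool
    check_io_method(int *newval, void **extra, GucSource source)
    {
    #ifdef EXEC_BACKEND
        /* io_uring state in shmem (ring fd, mmap'd queues) doesn't survive exec */
        if (*newval == IOMETHOD_IO_URING)
        {
            GUC_check_errdetail("io_method \"io_uring\" is not supported with EXEC_BACKEND.");
            return false;
        }
    #endif
        return true;
    }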

> +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> +{
> +     struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
> +
> +     Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
> +
> +     for (int i = 0; i < num_staged_ios; i++)
> +     {
> +             PgAioHandle *ioh = staged_ios[i];
> +             struct io_uring_sqe *sqe;
> +
> +             sqe = io_uring_get_sqe(uring_instance);
> +
> +             pgaio_io_prepare_submit(ioh);
> +             pgaio_uring_sq_from_io(ioh, sqe);
> +     }
> +
> +     while (true)
> +     {
> +             int                     ret;
> +
> +             pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
> +             ret = io_uring_submit(uring_instance);
> +             pgstat_report_wait_end();
> +
> +             if (ret == -EINTR)
> +             {
> +                     elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> +                     continue;
> +             }

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN.  "man io_uring_enter" has:

    EAGAIN The kernel was unable to allocate memory for the request, or
    otherwise ran out of resources to handle it. The application should wait
    for some completions and try again.
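
i.e., roughly (whether to simply retry, sleep briefly, or first wait for some
completions on EAGAIN is a separate judgment call):

    if (ret == -EINTR || ret == -EAGAIN)
    {
        elog(DEBUG3, "submit failed with %d, retrying, nios: %d",
             ret, num_staged_ios);
        continue;
    }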

> +FileStartWriteV(struct PgAioHandle *ioh, File file,
> +                             int iovcnt, off_t offset,
> +                             uint32 wait_event_info)
> +{
> +     int                     returnCode;
> +     Vfd                *vfdP;
> +
> +     Assert(FileIsValid(file));
> +
> +     DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
> +                        file, VfdCache[file].fileName,
> +                        (int64) offset,
> +                        iovcnt));
> +
> +     returnCode = FileAccess(file);
> +     if (returnCode < 0)
> +             return returnCode;
> +
> +     vfdP = &VfdCache[file];
> +
> +     /* FIXME: think about / reimplement  temp_file_limit */
> +
> +     pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
> +
> +     return 0;
> +}

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).


For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them.  Otherwise,
deadlocks like this would happen:

backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasons

If that's right, in worker mode, the IO worker resolves that deadlock.  What
resolves it under io_uring?  Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically.  Could you add a mention of "deadlock" in the
comment at whichever code achieves that?


I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations.  Is there any
specific decision you'd like to settle before patch 6 exits WIP?

Thanks,
nm
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 84302cc..b123fdc 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -43,9 +43,10 @@ OBJS = \
        $(top_builddir)/src/common/libpgcommon_srv.a \
        $(top_builddir)/src/port/libpgport_srv.a
 
-# We put libpgport and libpgcommon into OBJS, so remove it from LIBS; also add
-# libldap and ICU
-LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
+# We put libpgport and libpgcommon into OBJS, so remove it from LIBS.
+LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS))
+# The backend conditionally needs libraries that most executables don't need.
+LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS)
 
 # The backend doesn't need everything that's in LIBS, however
 LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
