On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> There's plenty more to do, but I thought this would be a useful checkpoint.
I find patches 1-5 are Ready for Committer.

> +typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else.  I recommend referring to it in README.md.  Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?

> +{
> +	/* not in use */
> +	AHS_IDLE = 0,
> +
> +	/* returned by pgaio_io_get() */
> +	AHS_HANDED_OUT,
> +
> +	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> +	AHS_DEFINED,
> +
> +	/* subjects prepare() callback has been called */
> +	AHS_PREPARED,
> +
> +	/* IO is being executed */
> +	AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach.  For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

> +
> +	/* IO finished, but result has not yet been processed */
> +	AHS_REAPED,
> +
> +	/* IO completed, shared completion has been called */
> +	AHS_COMPLETED_SHARED,
> +
> +	/* IO completed, local completion has been called */
> +	AHS_COMPLETED_LOCAL,
> +} PgAioHandleState;

> +void
> +pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
> +{
> +	PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
> +
> +	Assert(ioh->resowner);
> +
> +	ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
> +	ioh->resowner = NULL;
> +
> +	switch (ioh->state)
> +	{
> +		case AHS_IDLE:
> +			elog(ERROR, "unexpected");
> +			break;
> +		case AHS_HANDED_OUT:
> +			Assert(ioh == my_aio->handed_out_io ||
> +				   my_aio->handed_out_io == NULL);
> +
> +			if (ioh == my_aio->handed_out_io)
> +			{
> +				my_aio->handed_out_io = NULL;
> +				if (!on_error)
> +					elog(WARNING, "leaked AIO handle");
> +			}
> +
> +			pgaio_io_reclaim(ioh);
> +			break;
> +		case AHS_DEFINED:
> +		case AHS_PREPARED:
> +			/* XXX: Should we warn about this when is_commit? */

Yes.
> +			pgaio_submit_staged();
> +			break;
> +		case AHS_IN_FLIGHT:
> +		case AHS_REAPED:
> +		case AHS_COMPLETED_SHARED:
> +			/* this is expected to happen */
> +			break;
> +		case AHS_COMPLETED_LOCAL:
> +			/* XXX: unclear if this ought to be possible? */
> +			pgaio_io_reclaim(ioh);
> +			break;
> +	}

> +void
> +pgaio_io_ref_wait(PgAioHandleRef *ior)
> +{
> +	uint64		ref_generation;
> +	PgAioHandleState state;
> +	bool		am_owner;
> +	PgAioHandle *ioh;
> +
> +	ioh = pgaio_io_from_ref(ior, &ref_generation);
> +
> +	am_owner = ioh->owner_procno == MyProcNumber;
> +
> +
> +	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> +		return;
> +
> +	if (am_owner)
> +	{
> +		if (state == AHS_DEFINED || state == AHS_PREPARED)
> +		{
> +			/* XXX: Arguably this should be prevented by callers? */
> +			pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both.  AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup.  Even
so, the next point might remove the need here:

> +void
> +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> +{
> +	Assert(ioh->state == AHS_HANDED_OUT);
> +	Assert(pgaio_io_has_subject(ioh));
> +
> +	ioh->op = op;
> +	ioh->state = AHS_DEFINED;
> +	ioh->result = 0;
> +
> +	/* allow a new IO to be staged */
> +	my_aio->handed_out_io = NULL;
> +
> +	pgaio_io_prepare_subject(ioh);
> +
> +	ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here.  This code already needs to be safe for that (per
README.md).  When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common().
I doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.

> +if test "$with_liburing" = yes; then
> +  PKG_CHECK_MODULES(LIBURING, liburing)
> +fi

I used the attached makefile patch to build w/ liburing.
> +pgaio_uring_shmem_init(bool first_time)
> +{
> +	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
> +	bool		found;
> +
> +	aio_uring_contexts = (PgAioUringContext *)
> +		ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
> +
> +	if (found)
> +		return;
> +
> +	for (int contextno = 0; contextno < TotalProcs; contextno++)
> +	{
> +		PgAioUringContext *context = &aio_uring_contexts[contextno];
> +		int			ret;
> +
> +		/*
> +		 * XXX: Probably worth sharing the WQ between the different rings,
> +		 * when supported by the kernel. Could also cause additional
> +		 * contention, I guess?
> +		 */
> +#if 0
> +		if (!AcquireExternalFD())
> +			elog(ERROR, "No external FD available");
> +#endif
> +		ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);

With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:

2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG:  database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL:  the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG:  database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC:  failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC:  failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC:  unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG:  received fast shutdown request

I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

> +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> +{
> +	struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
> +
> +	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
> +
> +	for (int i = 0; i < num_staged_ios; i++)
> +	{
> +		PgAioHandle *ioh = staged_ios[i];
> +		struct io_uring_sqe *sqe;
> +
> +		sqe = io_uring_get_sqe(uring_instance);
> +
> +		pgaio_io_prepare_submit(ioh);
> +		pgaio_uring_sq_from_io(ioh, sqe);
> +	}
> +
> +	while (true)
> +	{
> +		int			ret;
> +
> +		pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
> +		ret = io_uring_submit(uring_instance);
> +		pgstat_report_wait_end();
> +
> +		if (ret == -EINTR)
> +		{
> +			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> +			continue;
> +		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should
also retry on EAGAIN.  "man io_uring_enter" has:

       EAGAIN The kernel was unable to allocate memory for the request, or
              otherwise ran out of resources to handle it. The application
              should wait for some completions and try again.

> +FileStartWriteV(struct PgAioHandle *ioh, File file,
> +				int iovcnt, off_t offset,
> +				uint32 wait_event_info)
> +{
> +	int			returnCode;
> +	Vfd		   *vfdP;
> +
> +	Assert(FileIsValid(file));
> +
> +	DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
> +			   file, VfdCache[file].fileName,
> +			   (int64) offset,
> +			   iovcnt));
> +
> +	returnCode = FileAccess(file);
> +	if (returnCode < 0)
> +		return returnCode;
> +
> +	vfdP = &VfdCache[file];
> +
> +	/* FIXME: think about / reimplement temp_file_limit */
> +
> +	pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
> +
> +	return 0;
> +}

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
For non-sync IO methods, I gather it's essential that a process other than
the IO definer be scanning for incomplete IOs and completing them.
Otherwise, deadlocks like this would happen:

  backend1 locks blk1 for non-IO reasons
  backend2 locks blk2, starts AIO write
  backend1 waits for lock on blk2 for non-IO reasons
  backend2 waits for lock on blk1 for non-IO reasons

If that's right, in worker mode, the IO worker resolves that deadlock.  What
resolves it under io_uring?  Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically.  Could you add a mention of "deadlock" in
the comment at whichever code achieves that?

I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations.  Is there any
specific decision you'd like to settle before patch 6 exits WIP?

Thanks,
nm
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 84302cc..b123fdc 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -43,9 +43,10 @@ OBJS = \
 	$(top_builddir)/src/common/libpgcommon_srv.a \
 	$(top_builddir)/src/port/libpgport_srv.a
 
-# We put libpgport and libpgcommon into OBJS, so remove it from LIBS; also add
-# libldap and ICU
-LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS)
+# We put libpgport and libpgcommon into OBJS, so remove it from LIBS.
+LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS))
+# The backend conditionally needs libraries that most executables don't need.
+LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS)
 # The backend doesn't need everything that's in LIBS, however
 LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))