On Wed, Apr 8, 2026 at 7:01 AM Andres Freund <[email protected]> wrote: > The if (worker == -1) is done for every to be submitted IO. If there are no > idle workers, we'd redo the pgaio_worker_choose_idle() every time. ISTM it > should just be: > > for (int i = 0; i < num_staged_ios; ++i) > { > > Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i])); > if > (!pgaio_worker_submission_queue_insert(staged_ios[i])) > { > /* > * Do the rest synchronously. If the queue is > full, give up > * and do the rest synchronously. We're > holding an exclusive > * lock on the queue so nothing can consume > entries. > */ > synchronous_ios = &staged_ios[i]; > nsync = (num_staged_ios - i); > > break; > } > } > > /* Choose one worker to wake for this batch. */ > if (worker == -1) > worker = pgaio_worker_choose_idle(-1);
Well I didn't want to wake a worker if we'd failed to enqueue anything. Ahh, I could put it there and test nsync. Or I guess I could just do it anyway. Considering that. > > > I think both 'wakeups" and "ios" are a bit too generically named. Based > > > on the > > > names I have no idea what this heuristic might be. > > > > I have struggled to name them. Does wakeup_count and io_count help? > > hist_wakeups, hist_ios? Thanks, that's a good name. > > No, we only set it if it isn't already set (like a latch), and only > > send a pmsignal when we set it (like a latch), and the postmaster only > > clears it if it can start a worker (unlike a latch). That applies in > > general, not just when we hit the cap of io_max_workers: while the > > postmaster is waiting for launch interval to expire, it will leave the > > flag set, suppressed for 100ms or whatever, and the in the special > > case of io_max_workers, for as long as the count remains that high. > > I'm quite certain that's not how it actually ended up working with the prior > version and the benchmark I showed, there indeed were a lot of requests to > postmaster. I think it's because pgaio_worker_cancel_grow() (forgot the old > name already) very frequently clears the flag, just for it to be immediately > set again. > > > Yep, still happens, does require the max to be smaller than 32 though. > > While a lot of IO is happening, no new connections being started, and with > 1781562 being postmaster's pid: > > perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1 > > Performance counter stats for process id '1781562': > > 2,790 raw_syscalls:sys_enter > > 1.001872667 seconds time elapsed > > 2,814 raw_syscalls:sys_enter > > 1.001983049 seconds time elapsed > > 3,036 raw_syscalls:sys_enter > > 1.001705850 seconds time elapsed > > 2,982 raw_syscalls:sys_enter > > 1.001881364 seconds time elapsed > > > I think it may need a timestamp in the shared state to not allow another > postmaster wake until some time has elapsed, or something. Hnng. Studying... > > I should have made it clearer that that's a secondary condition. The > > primary condition is: a worker wanted to wake another worker, but > > found that none were idle. Unfortunately the whole system is a bit > > too asynchronous for that to be a reliable cue on its own. So, I also > > check if the queue appears to be (1) obviously growing: that's clearly > > too long and must be introducing latency, or (2) varying "too much". > > Which I detect in exactly the same way. > > > > Imagine a histogram that look like this: > > > > LOG: depth 00: 7898 > > LOG: depth 01: 1630 > > LOG: depth 02: 308 > > LOG: depth 03: 93 > > LOG: depth 04: 40 > > LOG: depth 05: 19 > > LOG: depth 06: 6 > > LOG: depth 07: 4 > > LOG: depth 08: 0 > > LOG: depth 09: 1 > > LOG: depth 10: 1 > > LOG: depth 11: 0 > > LOG: depth 12: 0 > > LOG: depth 13: 0 > > > > If you're failing to find idle workers to wake up AND our managic > > threshold is hit by something in that long tail, then it'll call for > > backup. Of course I'm totally sidestepping a lot of queueing theory > > maths and just saying "I'd better be able to find an idle worker when > > I want to" and if not, "there had better not be any outliers that > > reach this far". > > > > I've written a longer explanation in a long comment. Including a > > little challenge for someone to do better with real science and maths. > > I hope it's a bit clearer at least. > > Definitely good to have that comment. Have to ponder it for a bit. Let me try again. Our goal is simple: process every IO immediately. We have immediate feedback that is simple: there's an IO in the queue and there is no idle worker. The only action we can take is simple: add one more worker. So we don't need to suffer through the maths required to figure out the ideal k for our M/G/k queue system (I think that's what we have?) or any of the inputs that would require*. The problem is that on its own, the test triggered far too easily because a worker that is not marked idle might in fact be just about to pick up that IO on the one the one hand, and because there might be rare spikes/clustering on the other, so I cooled it off a bit by additionally testing if the queue appears to be growing or spiking beyond some threshold. I think it's OK to let the queue grow a bit before we are triggered anyway, so the precise value used doesn't seem too critical. Someone might be able to come up with a more defensible value, but in the end I just wanted a value that isn't triggered by the outliers I see in real systems that are keeping up. We could tune it lower and overshoot more, but this setting seems to work pretty well. It doesn't seem likely that a real system could achieve a steady state that is introducing latency but isn't increasing over time, and pool size adjustments are bound to lag anyway. * It's probably quite hard for call centres to figure out the number of agents required to make you wait for a certain length of time, but it's easy to know if you had to wait and you wish they had more! > I've not again looked through the details, but based on a relatively short > experiment, the one problematic thing I see is the frequent postmaster > requests. Looking into that...
