On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> I've been annoyed for awhile because, while a parallel check-world
> run usually takes a bit over a minute on my machine, sometimes it
> takes between three and four minutes.  I was finally able to
> track down what is happening, and it's this: sometimes one or
> another of the src/test/subscription tests takes an extra three
> minutes because the logical replication launcher is sleeping
> instead of launching the next task.  It eventually reaches its
> hard-wired maximum wait of DEFAULT_NAPTIME_PER_CYCLE (3min),
> wakes up and notices it has something to do, and then we're
> on our merry way again.  I'm not sure how often this is a problem
> in the real world, but it happens often enough to be annoying
> during development.
>
> There are two distinct bugs involved:
>
> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> latch event so that it can keep waiting for the worker to launch.
> It neglects to set the latch again, allowing ApplyLauncherMain
> to miss events.
>
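To make the shape of the fix concrete, here is a minimal sketch of the
idea (heavily simplified from WaitForReplicationWorkerAttach in
src/backend/replication/logical/launcher.c; the dropped_latch flag and
the omitted dead-worker handling are illustrative assumptions, not
necessarily the actual patch): if we had to consume a latch event in
order to keep waiting, set the latch again on the way out so the
caller's next wait notices it.

/*
 * Sketch only: the real function also detects workers that died
 * before attaching (via GetBackgroundWorkerPid) and cleans them up.
 */
static bool
WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
                               uint16 generation,
                               BackgroundWorkerHandle *handle)
{
    bool        result = false;
    bool        dropped_latch = false;

    for (;;)
    {
        int         rc;

        CHECK_FOR_INTERRUPTS();

        /* Has the worker attached, or is its slot already free? */
        LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
        if (!worker->in_use || worker->proc)
        {
            /* Capture in_use while still holding the lock (cf. #3). */
            result = worker->in_use;
            LWLockRelease(LogicalRepWorkerLock);
            break;
        }
        LWLockRelease(LogicalRepWorkerLock);

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            /* This latch event may have been meant for our caller. */
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
            dropped_latch = true;
        }
    }

    /*
     * If we cleared a latch event while waiting, set the latch again
     * so ApplyLauncherMain re-checks its work instead of napping for
     * DEFAULT_NAPTIME_PER_CYCLE.
     */
    if (dropped_latch)
        SetLatch(MyLatch);

    return result;
}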
There was a previous discussion about fixing this behavior. Heikki
proposed a similar fix, but at the caller; see the patch attached in
email [1]. The comments in his patch are more explicit about the kinds
of events that can be missed. There has also been discussion of solving
this with two latches, one for worker exit and the other for
subscription changes. The slight advantage of the two-latch approach is
that it avoids extra looping over the subscriptions when nothing has
changed. Either your proposal or Heikki's seems good enough to fix this
issue.

> 2. ApplyLauncherMain ignores a failure return from
> logicalrep_worker_launch, which is bad because (unless it has
> another worker launch pending) it will then sleep for
> DEFAULT_NAPTIME_PER_CYCLE before reconsidering.  What it ought to do
> is try again after wal_retrieve_retry_interval.  This situation can
> arise in resource-exhaustion cases (cf. the early exits in
> logicalrep_worker_launch), but what's more likely in the regression
> tests is that the worker stops due to some error condition before
> WaitForReplicationWorkerAttach sees it attached, which is then duly
> reported as a failure.
>

LGTM.

> It's possible to make the test slowness extremely reproducible
> with this change, which widens the race condition window for
> both problems:
>
> diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
> index 1c3c051403d..724e82bcdc1 100644
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -214,7 +214,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
>            */
>           rc = WaitLatch(MyLatch,
>                          WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> -                        10L, WAIT_EVENT_BGWORKER_STARTUP);
> +                        1000L, WAIT_EVENT_BGWORKER_STARTUP);
>
>           if (rc & WL_LATCH_SET)
>           {
>
> I don't recommend that as a permanent change, but it's helpful
> for testing the attached patch.
>
> In the attached, I made two other non-cosmetic changes:
>
> 3. In WaitForReplicationWorkerAttach, capture worker->in_use
> before not after releasing LogicalRepWorkerLock.  Maybe there
> is a reason why that's not a dangerous race condition, but
> it sure is un-obvious to me.
>

I also can't think of a reason for the current coding. Your change
looks good to me.

> 4. In process_syncing_tables_for_apply (the other caller of
> logicalrep_worker_launch), it seems okay to ignore the
> result of logicalrep_worker_launch, but I think it should
> fill hentry->last_start_time before not after the call.

With this change, won't we end up retrying the launch sooner in cases
where launching the worker took a while and still failed?

[1] - https://www.postgresql.org/message-id/ff0663d9-8011-420f-a169-efbf57327cb5%40iki.fi

--
With Regards,
Amit Kapila.
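PS: For readers following the thread, here is a minimal sketch of the
retry behavior described in point 2, inside ApplyLauncherMain's loop
over subscriptions (simplified; variable names follow launcher.c, but
the surrounding declarations, the disabled-subscription and
worker-already-running checks are elided, and the exact
logicalrep_worker_launch argument list is an assumption, not the
committed patch):

    foreach(lc, sublist)
    {
        Subscription *sub = (Subscription *) lfirst(lc);
        TimestampTz last_start;
        TimestampTz now;
        long        elapsed;

        last_start = ApplyLauncherGetWorkerStartTime(sub->oid);
        now = GetCurrentTimestamp();
        elapsed = TimestampDifferenceMilliseconds(last_start, now);

        if (last_start == 0 || elapsed >= wal_retrieve_retry_interval)
        {
            ApplyLauncherSetWorkerStartTime(sub->oid, now);
            if (!logicalrep_worker_launch(WORKERTYPE_APPLY,
                                          sub->dbid, sub->oid, sub->name,
                                          sub->owner, InvalidOid,
                                          DSM_HANDLE_INVALID))
            {
                /*
                 * Launch failed (resource exhaustion, or the worker
                 * died before attaching).  Instead of napping for up
                 * to DEFAULT_NAPTIME_PER_CYCLE, arrange to retry after
                 * wal_retrieve_retry_interval.
                 */
                wait_time = Min(wait_time, wal_retrieve_retry_interval);
            }
        }
        else
        {
            /* Not yet eligible to restart; wake up as soon as it is. */
            wait_time = Min(wait_time,
                            wal_retrieve_retry_interval - elapsed);
        }
    }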