On Wed, Feb 14, 2024 at 03:31:16PM +0000, Bertrand Drouvot wrote:
> On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> > The 035_standby_logical_decoding.pl hang is
> > a race condition arising from an event sequence like this:
> >
> > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> > - Test script calls pg_log_standby_snapshot() on primary.  Emits
> >   XLOG_RUNNING_XACTS.
> > - checkpoint_timeout makes a primary checkpoint finish.  Emits
> >   XLOG_RUNNING_XACTS.
> > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits
> >   XLOG_RUNNING_XACTS.
> > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> >
> > Other test code already has a solution for this, so the attached patches
> > add a timeout and copy the existing solution.  I'm also attaching the hack
> > that makes it 100% reproducible.
>
> I did a few tests and confirm that the proposed solution fixes the corner
> case.

Thanks for reviewing.

> What about creating a sub, say wait_for_restart_lsn_calculation() in
> Cluster.pm and then make use of it in create_logical_slot_on_standby() and
> above? (something like wait_for_restart_lsn_calculation-v1.patch attached).

Waiting for restart_lsn is just a prerequisite for calling
pg_log_standby_snapshot(), so I wouldn't separate those two.  If we're
extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
and make the API like one of these:

  $standby->wait_for_subscription_starting_point($primary, $slot_name);
  $primary->log_standby_snapshot($standby, $slot_name);

Would you like to finish the patch in such a way?
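
For concreteness, a rough and untested sketch of the second shape follows.
It assumes the sub would live in Cluster.pm as a method on the primary node,
and that the existing poll_query_until() and safe_psql() helpers are enough.
The sub name is only the one floated above, not a committed API, and the
query text is an assumption about what the existing wait looks like.

```perl
use strict;
use warnings;

# Hypothetical helper: written as it might appear in Cluster.pm, where
# $self would be the primary node.  The name is illustrative only.
sub log_standby_snapshot
{
    my ($self, $standby, $slot_name) = @_;

    # The standby looks for a running-xacts record from the slot's
    # restart_lsn onwards, so wait until restart_lsn has been set.
    $standby->poll_query_until(
        'postgres', qq[
        SELECT restart_lsn IS NOT NULL
        FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
    ])
      or die
      "timed out waiting for logical slot to calculate its restart_lsn";

    # Only now emit the record that slot creation is waiting for.
    $self->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');

    return;
}
```

The point of folding both steps into one sub is that a caller cannot log the
snapshot before the slot has a restart_lsn, which is the ordering the race
above depends on.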