On Wed, Feb 14, 2024 at 03:31:16PM +0000, Bertrand Drouvot wrote:
> On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> > The 035_standby_logical_decoding.pl hang is
> > a race condition arising from an event sequence like this:
> > 
> > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> > - Test script calls pg_log_standby_snapshot() on primary.  Emits XLOG_RUNNING_XACTS.
> > - checkpoint_timeout makes a primary checkpoint finish.  Emits XLOG_RUNNING_XACTS.
> > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
> > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> > 
> > Other test code already has a solution for this, so the attached patches add a
> > timeout and copy the existing solution.  I'm also attaching the hack that
> > makes it 100% reproducible.

> I did a few tests and confirm that the proposed solution fixes the corner case.

Thanks for reviewing.

> What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> and then make use of it in create_logical_slot_on_standby() and above? (something
> like wait_for_restart_lsn_calculation-v1.patch attached).

Waiting for restart_lsn is just a prerequisite for calling
pg_log_standby_snapshot(), so I wouldn't separate those two.  If we're
extracting a sub, I would move the pg_log_standby_snapshot() call into the sub
and make the API like one of these:

  $standby->wait_for_subscription_starting_point($primary, $slot_name);
  $primary->log_standby_snapshot($standby, $slot_name);
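
For illustration only, the second form might look roughly like the sketch
below.  It is not taken from any attached patch.  The sub name and argument
order are the hypothetical ones proposed above, and the body assumes the
existing Cluster.pm methods poll_query_until() and safe_psql(), with the slot
visible through the 'postgres' database on the standby.

```perl
# Hypothetical addition to PostgreSQL::Test::Cluster.  The name and signature
# follow the second form proposed above; this is not an existing API.
sub log_standby_snapshot
{
    my ($self, $standby, $slot_name) = @_;

    # Prerequisite: wait until the standby has determined the slot's
    # restart_lsn.  The standby looks for an XLOG_RUNNING_XACTS record from
    # restart_lsn onward, so a snapshot logged before this point may not
    # help.
    $standby->poll_query_until(
        'postgres', qq[
        SELECT restart_lsn IS NOT NULL
        FROM pg_catalog.pg_replication_slots
        WHERE slot_name = '$slot_name'
    ])
      or die
      "timed out waiting for slot $slot_name to calculate its restart_lsn";

    # Now have the primary emit the XLOG_RUNNING_XACTS record that the
    # standby is waiting for.
    $self->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');

    return;
}
```

Keeping both steps in one sub means a caller cannot log the snapshot before
restart_lsn is known.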

Would you like to finish the patch in such a way?

