Hi, On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote: > Coincidentally, one of my buildfarm animals hanged several weeks in a > different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS > reduction was part of making it reproducible: > > On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote: > > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote: > > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: > > > > If you look at the buildfarm's failures page and filter down to > > > > just subscriptionCheck failures, what you find is that all of the > > > > last 6 such failures are in 031_column_list.pl: > > > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > > > reported a replacement BgWriterDelay value reproducing it. > > > > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to > > BgWriterDelay. > > I'm reusing this thread just in case there's overlap with the > 031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is > a race condition arising from an event sequence like this: > > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU. > - Test script calls pg_log_standby_snapshot() on primary. Emits > XLOG_RUNNING_XACTS. > - checkpoint_timeout makes a primary checkpoint finish. Emits > XLOG_RUNNING_XACTS. > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS. > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby. > > Other test code already has a solution for this, so the attached patches add a > timeout and copy the existing solution. I'm also attaching the hack that > makes it 100% reproducible.
Thanks! I did a few tests and confirm that the proposed solution fixes the corner case. standby-slot-test-1-timeout-v1.patch LGTM. Regarding standby-slot-test-2-race-v1.patch: > +# See corresponding create_logical_slot_on_standby() code. > +$node_standby->poll_query_until( > + 'postgres', qq[ > + SELECT restart_lsn IS NOT NULL > + FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' > + ]) or die "timed out waiting for logical slot to calculate its > restart_lsn"; > + What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm and then make use of it in create_logical_slot_on_standby() and above? (something like wait_for_restart_lsn_calculation-v1.patch attached). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
commit 14d988c748f3e500c44d65e073c276e6d8af6156 Author: Bertrand Drouvot <bertranddrouvot...@gmail.com> Date: Wed Feb 14 15:21:35 2024 +0000 wait_for_restart_lsn_calculation diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index e2e70d0dbf..21cf179db1 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -3174,6 +3174,28 @@ $SIG{TERM} = $SIG{INT} = sub { =pod +=item $node->wait_for_restart_lsn_calculation(self, slot_name) + +Create logical replication slot on given standby + +=cut + +sub wait_for_restart_lsn_calculation +{ + my ($self, $slot_name) = @_; + my ($stdout, $stderr); + + $self->poll_query_until( + 'postgres', qq[ + SELECT restart_lsn IS NOT NULL + FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name' + ]) + or die + "timed out waiting for logical slot to calculate its restart_lsn"; +} + +=pod + =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname) Create logical replication slot on given standby @@ -3203,13 +3225,7 @@ sub create_logical_slot_on_standby # xl_running_xacts WAL record from the restart_lsn onwards. First wait # until the slot restart_lsn is determined. - $self->poll_query_until( - 'postgres', qq[ - SELECT restart_lsn IS NOT NULL - FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name' - ]) - or die - "timed out waiting for logical slot to calculate its restart_lsn"; + $self->wait_for_restart_lsn_calculation($slot_name); # Then arrange for the xl_running_xacts record for which pg_recvlogical is # waiting. diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index f7b7fc7f9e..85330720c5 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -465,12 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n"; $psql_subscriber{run}->pump_nb(); -# See corresponding create_logical_slot_on_standby() code. -$node_standby->poll_query_until( - 'postgres', qq[ - SELECT restart_lsn IS NOT NULL - FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' - ]) or die "timed out waiting for logical slot to calculate its restart_lsn"; +# Wait for restart_lsn calculation +$node_standby->wait_for_restart_lsn_calculation('tap_sub'); # Speed up the subscription creation $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");