Re: 035_standby_logical_decoding unbounded hang
On Thu, Mar 07, 2024 at 02:46:55PM +0500, Andrey M. Borodin wrote: > I’m not sure if it is connected, but so far many patches in CFbot keep > hanging in this test. For example [0]. > [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292 Relevant part: [22:03:10.292] stderr: [22:03:10.292] # poll_query_until timed out executing this query: [22:03:10.292] # SELECT (SELECT catalog_xmin::text::int - 770 from pg_catalog.pg_replication_slots where slot_name = 'injection_activeslot') > 0 [22:03:10.292] # expecting this output: [22:03:10.292] # t [22:03:10.292] # last actual query output: [22:03:10.292] # f [22:03:10.292] # with stderr: [22:03:10.292] # Tests were run but no plan was declared and done_testing() was not seen. [22:03:10.292] # Looks like your test exited with 255 just after 57. The injection_activeslot test got added after $SUBJECT, in 08a52ab151 (2024-03-06). It's now reverted in 65db0cfb4c (2024-03-07).
Re: 035_standby_logical_decoding unbounded hang
> On 20 Feb 2024, at 04:09, Noah Misch wrote: > I’m not sure if it is connected, but so far many patches in CFbot keep hanging in this test. For example [0]. I haven’t done root cause analysis yet, but hangs may be related to this thread. Maybe someone more familiar with similar issues could take a look there? Thanks! Best regards, Andrey Borodin. [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292
Re: 035_standby_logical_decoding unbounded hang
On Fri, Feb 16, 2024 at 06:37:38AM +, Bertrand Drouvot wrote: > On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote: > > On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote: > > > What about creating a sub, say wait_for_restart_lsn_calculation() in > > > Cluster.pm > > > and then make use of it in create_logical_slot_on_standby() and above? > > > (something > > > like wait_for_restart_lsn_calculation-v1.patch attached). > > > > Waiting for restart_lsn is just a prerequisite for calling > > pg_log_standby_snapshot(), so I wouldn't separate those two. > > Yeah, makes sense. > > > If we're > > extracting a sub, I would move the pg_log_standby_snapshot() call into the > > sub > > and make the API like one of these: > > > > $standby->wait_for_subscription_starting_point($primary, $slot_name); > > $primary->log_standby_snapshot($standby, $slot_name); > > > > Would you like to finish the patch in such a way? > > Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used > log_standby_snapshot() as it seems more generic to me. Thanks. Pushed at commit 0e16281.
Re: 035_standby_logical_decoding unbounded hang
Hi, On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote: > On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote: > > What about creating a sub, say wait_for_restart_lsn_calculation() in > > Cluster.pm > > and then make use of it in create_logical_slot_on_standby() and above? > > (something > > like wait_for_restart_lsn_calculation-v1.patch attached). > > Waiting for restart_lsn is just a prerequisite for calling > pg_log_standby_snapshot(), so I wouldn't separate those two. Yeah, makes sense. > If we're > extracting a sub, I would move the pg_log_standby_snapshot() call into the sub > and make the API like one of these: > > $standby->wait_for_subscription_starting_point($primary, $slot_name); > $primary->log_standby_snapshot($standby, $slot_name); > > Would you like to finish the patch in such a way? Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used log_standby_snapshot() as it seems more generic to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index 07da74cf56..5d3174d0b5 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -3181,6 +3181,37 @@ $SIG{TERM} = $SIG{INT} = sub { =pod +=item $node->log_standby_snapshot(self, standby, slot_name) + +Log a standby snapshot on primary once the slot restart_lsn is determined on +the standby. + +=cut + +sub log_standby_snapshot +{ + my ($self, $standby, $slot_name) = @_; + my ($stdout, $stderr); + + # Once the slot's restart_lsn is determined, the standby looks for + # xl_running_xacts WAL record from the restart_lsn onwards. First wait + # until the slot restart_lsn is determined. + + $standby->poll_query_until( + 'postgres', qq[ + SELECT restart_lsn IS NOT NULL + FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name' + ]) + or die + "timed out waiting for logical slot to calculate its restart_lsn"; + + # Then arrange for the xl_running_xacts record for which the standby is + # waiting. + $self->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()'); +} + +=pod + =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname) Create logical replication slot on given standby @@ -3206,21 +3237,9 @@ sub create_logical_slot_on_standby '2>', \$stderr); - # Once the slot's restart_lsn is determined, the standby looks for - # xl_running_xacts WAL record from the restart_lsn onwards. First wait - # until the slot restart_lsn is determined. - - $self->poll_query_until( - 'postgres', qq[ - SELECT restart_lsn IS NOT NULL - FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name' - ]) - or die - "timed out waiting for logical slot to calculate its restart_lsn"; - - # Then arrange for the xl_running_xacts record for which pg_recvlogical is + # Arrange for the xl_running_xacts record for which pg_recvlogical is # waiting. - $primary->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()'); + $primary->log_standby_snapshot($self, $slot_name); $handle->finish(); diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index b020361b29..0710bccc17 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -465,8 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n"; $psql_subscriber{run}->pump_nb(); -# Speed up the subscription creation -$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); +# Log the standby snapshot to speed up the subscription creation +$node_primary->log_standby_snapshot($node_standby, 'tap_sub'); # Explicitly shut down psql instance gracefully - to avoid hangs # or worse on windows
Re: 035_standby_logical_decoding unbounded hang
On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote: > On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote: > > The 035_standby_logical_decoding.pl hang is > > a race condition arising from an event sequence like this: > > > > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU. > > - Test script calls pg_log_standby_snapshot() on primary. Emits > > XLOG_RUNNING_XACTS. > > - checkpoint_timeout makes a primary checkpoint finish. Emits > > XLOG_RUNNING_XACTS. > > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits > > XLOG_RUNNING_XACTS. > > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby. > > > > Other test code already has a solution for this, so the attached patches > > add a > > timeout and copy the existing solution. I'm also attaching the hack that > > makes it 100% reproducible. > I did a few tests and confirm that the proposed solution fixes the corner > case. Thanks for reviewing. > What about creating a sub, say wait_for_restart_lsn_calculation() in > Cluster.pm > and then make use of it in create_logical_slot_on_standby() and above? > (something > like wait_for_restart_lsn_calculation-v1.patch attached). Waiting for restart_lsn is just a prerequisite for calling pg_log_standby_snapshot(), so I wouldn't separate those two. If we're extracting a sub, I would move the pg_log_standby_snapshot() call into the sub and make the API like one of these: $standby->wait_for_subscription_starting_point($primary, $slot_name); $primary->log_standby_snapshot($standby, $slot_name); Would you like to finish the patch in such a way?
Re: 035_standby_logical_decoding unbounded hang
Hi, On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote: > Coincidentally, one of my buildfarm animals hanged several weeks in a > different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS > reduction was part of making it reproducible: > > On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote: > > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote: > > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: > > > > If you look at the buildfarm's failures page and filter down to > > > > just subscriptionCheck failures, what you find is that all of the > > > > last 6 such failures are in 031_column_list.pl: > > > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > > > reported a replacement BgWriterDelay value reproducing it. > > > > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to > > BgWriterDelay. > > I'm reusing this thread just in case there's overlap with the > 031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is > a race condition arising from an event sequence like this: > > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU. > - Test script calls pg_log_standby_snapshot() on primary. Emits > XLOG_RUNNING_XACTS. > - checkpoint_timeout makes a primary checkpoint finish. Emits > XLOG_RUNNING_XACTS. > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS. > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby. > > Other test code already has a solution for this, so the attached patches add a > timeout and copy the existing solution. I'm also attaching the hack that > makes it 100% reproducible. Thanks! I did a few tests and confirm that the proposed solution fixes the corner case. standby-slot-test-1-timeout-v1.patch LGTM. Regarding standby-slot-test-2-race-v1.patch: > +# See corresponding create_logical_slot_on_standby() code. > +$node_standby->poll_query_until( > + 'postgres', qq[ > + SELECT restart_lsn IS NOT NULL > + FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' > + ]) or die "timed out waiting for logical slot to calculate its > restart_lsn"; > + What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm and then make use of it in create_logical_slot_on_standby() and above? (something like wait_for_restart_lsn_calculation-v1.patch attached). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com commit 14d988c748f3e500c44d65e073c276e6d8af6156 Author: Bertrand Drouvot Date: Wed Feb 14 15:21:35 2024 + wait_for_restart_lsn_calculation diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index e2e70d0dbf..21cf179db1 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -3174,6 +3174,28 @@ $SIG{TERM} = $SIG{INT} = sub { =pod +=item $node->wait_for_restart_lsn_calculation(self, slot_name) + +Create logical replication slot on given standby + +=cut + +sub wait_for_restart_lsn_calculation +{ + my ($self, $slot_name) = @_; + my ($stdout, $stderr); + + $self->poll_query_until( + 'postgres', qq[ + SELECT restart_lsn IS NOT NULL + FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name' + ]) + or die + "timed out waiting for logical slot to calculate its restart_lsn"; +} + +=pod + =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname) Create logical replication slot on given standby @@ -3203,13 +3225,7 @@ sub create_logical_slot_on_standby # xl_running_xacts WAL record from the restart_lsn onwards. First wait # until the slot restart_lsn is determined. - $self->poll_query_until( - 'postgres', qq[ - SELECT restart_lsn IS NOT NULL - FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name' - ]) - or die - "timed out waiting for logical slot to calculate its restart_lsn"; + $self->wait_for_restart_lsn_calculation($slot_name); # Then arrange for the xl_running_xacts record for which pg_recvlogical is # waiting. diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index f7b7fc7f9e..85330720c5 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -465,12 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n"; $psql_subscriber{run}->pump_nb(); -# See corresponding create_logical_slot_on_standby() code. -$node_standby->poll_query_until( - 'postgres', qq[ - SELECT restart_lsn IS NOT NULL - FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' - ]) or die "timed out waiting for logical slot to calculate its restart_lsn"; +# Wait for restart_lsn calculation +$node_standby->wait_for_restart_lsn_calculation('tap_sub');
035_standby_logical_decoding unbounded hang
Coincidentally, one of my buildfarm animals hanged several weeks in a different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS reduction was part of making it reproducible: On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote: > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote: > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: > > > If you look at the buildfarm's failures page and filter down to > > > just subscriptionCheck failures, what you find is that all of the > > > last 6 such failures are in 031_column_list.pl: > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > > reported a replacement BgWriterDelay value reproducing it. > > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to > BgWriterDelay. I'm reusing this thread just in case there's overlap with the 031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is a race condition arising from an event sequence like this: - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU. - Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS. - checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS. - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS. - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby. Other test code already has a solution for this, so the attached patches add a timeout and copy the existing solution. I'm also attaching the hack that makes it 100% reproducible. Author: Noah Misch Commit: Noah Misch [not for commit] Demonstrate 035_standby_logical_decoding.pl hang. There are three way XLOG_RUNNING_XACTS can show up: - SELECT pg_log_standby_snapshot() - CHECKPOINT - bgwriter LOG_SNAPSHOT_INTERVAL_MS An idle primary won't do the second two. If you're unlucky enough for all three to happen before CREATE_REPLICATION_SLOT's DecodingContextFindStartpoint() call starts watching, then CREATE_REPLICATION_SLOT will hang indefinitely. diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index f11ce27..07f987c 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -70,7 +70,7 @@ int BgWriterDelay = 200; * Interval in which standby snapshots are logged into the WAL stream, in * milliseconds. */ -#define LOG_SNAPSHOT_INTERVAL_MS 15000 +#define LOG_SNAPSHOT_INTERVAL_MS 1 /* * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 9270d7b..0f170a4 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -1085,6 +1085,7 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname, appendStringInfoString(&cmd, " PHYSICAL RESERVE_WAL"); } + pg_usleep(1000 * 1000); res = libpqrcv_PQexec(conn->streamConn, cmd.data); pfree(cmd.data); diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index cebfa52..c371369 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -467,6 +467,7 @@ $psql_subscriber{run}->pump_nb(); # Speed up the subscription creation $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); +$node_primary->safe_psql('postgres', "CHECKPOINT"); # Explicitly shut down psql instance gracefully - to avoid hangs # or worse on windows Author: Noah Misch Commit: Noah Misch Bound waits in 035_standby_logical_decoding.pl. One IPC::Run::start() used an IPC::Run::timer() without checking for expiration. The other used no timeout or timer. Back-patch to v16, which introduced the test. Reviewed by FIXME. Discussion: https://postgr.es/m/FIXME diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index c371369..61da915 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -21,7 +21,6 @@ my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby'); my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); my $default_timeout = $PostgreSQL::Test::Utils::timeout_default; -my $psql_timeout = IPC::Run::timer($default_timeout); my $res; # Name for the physical slot on primary @@ -90,7 +89,8 @@ sub make_slot_active '>', $to_stdout, '2>', - $to_stderr); + $to_stderr, + IPC::Run::tim