On Wed, 30 Apr 2025 at 17:41, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Wed, Apr 30, 2025 at 11:22 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> >
> > Xuneng Zhou pointed out on Discord that the test case added by
> > 7c99dc587 has caused repeated failures in CI --- though oddly,
> > it's not failed in the buildfarm so far as I can find.  The
> > failures look like
> >
> > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped loading 
> > publication: tap_pub_3) at 
> > /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103.
> >
>
> I analyzed the relevant publisher-side CI Logs [1]:
> ...
> 2025-04-19 08:24:14.096 UTC [21961][client backend]
> [024_add_drop_pub.pl][7/4:0] LOG:  statement: INSERT INTO tab_3
> values(1)
> 2025-04-19 08:24:14.098 UTC [21961][client backend]
> [024_add_drop_pub.pl][:0] LOG:  disconnection: session time:
> 0:00:00.003 user=postgres database=postgres host=[local]
> 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG:
> released logical replication slot "tap_sub"
> 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG:
> disconnection: session time: 0:00:00.329 user=postgres
> database=postgres host=[local]
> 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0]
> LOG:  connection received: host=[local]
> 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> LOG:  connection authenticated: user="postgres" method=trust
> (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117)
> 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> LOG:  replication connection authorized: user=postgres
> application_name=tap_sub
> 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG:
> statement: SELECT pg_catalog.set_config('search_path', '', false);
> 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> received replication command: IDENTIFY_SYSTEM
> 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0]
> STATEMENT:  IDENTIFY_SYSTEM
> 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> 0/0 (proto_version '4', streaming 'parallel', origin 'any',
> publication_names '"tap_pub_
> ...
>
> This shows that walsender restarts after the "INSERT INTO tab_3
> values(1)" is processed by the previous walsender ("released logical
> replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)").
> So, it is possible that the old apply worker has sent the confirmation
> of WAL received location after the Insert (due to keep_alive message
> handling). So, after the restart, the new walsender will start
> processing WAL after the INSERT and wait for the skipped message LOG
> timed out.
>
> Considering the above theory is correct, after "ALTER SUBSCRIPTION
> tap_sub SET PUBLICATION tap_pub_3", we should wait for the new
> walsender to restart. We are already doing the same for a similar case
> in 001_rep_changes.pl (See "# check that change of connection string
> and/or publication list causes restart of subscription workers. We
> check the state along with application_name to ensure that the
> walsender is (re)started.).
>
> Unfortunately, I will be away for the rest of the week. In the
> meantime, if you or someone else is able to reproduce and fix it, then
> good; otherwise, I'll take care of it after I return.

I agree with your analysis. I was able to reproduce the issue by
delaying the invalidation of the subscription until the walsender
finished decoding the INSERT operation following the ALTER
SUBSCRIPTION through a debugger and using the lsn from the pg_waldump
of the INSERT after the ALTER SUBSCRIPTION.  In this scenario, the
confirmed_flush_lsn ends up pointing to a location after the INSERT.
When the invalidation is eventually received and the apply
worker/walsender is restarted, the restarted walsender begins decoding
from that LSN—after the INSERT—which means the "skipped loading
publication" warning is never triggered, causing the test to fail.

Attached is a patch that ensures the walsender process is properly
restarted after ALTER SUBSCRIPTION, preventing this race condition.

Regards,
Vignesh
From 26c8efbf59a456f4ad8a87b504180449efe1cd69 Mon Sep 17 00:00:00 2001
From: Vignesh <vignes...@gmail.com>
Date: Wed, 30 Apr 2025 21:38:33 +0530
Subject: [PATCH] Fix race condition after ALTER SUBSCRIPTION SET PUBLICATION

Previously, after executing ALTER SUBSCRIPTION tap_sub SET PUBLICATION, we
did not wait for the new walsender process to restart. As a result, an INSERT
executed immediately after the ALTER could be decoded and the confirmed flush
lsn is advanced. This could cause replication to resume from a point after the
INSERT. In such cases, we miss the expected warning about the missing
publication.

To fix this, we now ensure that the walsender has restarted before continuing
after ALTER SUBSCRIPTION.
---
 src/test/subscription/t/024_add_drop_pub.pl | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/test/subscription/t/024_add_drop_pub.pl b/src/test/subscription/t/024_add_drop_pub.pl
index b594941c7cb..e995d8b3839 100644
--- a/src/test/subscription/t/024_add_drop_pub.pl
+++ b/src/test/subscription/t/024_add_drop_pub.pl
@@ -91,10 +91,21 @@ is($result, qq(20|1|10), 'check initial data is copied to subscriber');
 $node_publisher->safe_psql('postgres', "CREATE TABLE tab_3 (a int)");
 $node_subscriber->safe_psql('postgres', "CREATE TABLE tab_3 (a int)");
 
+my $oldpid = $node_publisher->safe_psql('postgres',
+	"SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+);
+
 # Set the subscription with a missing publication
 $node_subscriber->safe_psql('postgres',
 	"ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_3");
 
+# Wait for the walsender to restart after altering the subscription
+$node_publisher->poll_query_until('postgres',
+	"SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub' AND state = 'streaming';"
+  )
+  or die
+  "Timed out while waiting for apply worker to restart after altering the subscription";
+
 my $offset = -s $node_publisher->logfile;
 
 $node_publisher->safe_psql('postgres',"INSERT INTO tab_3 values(1)");
-- 
2.43.0

Reply via email to