Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-28 Thread vignesh C
On Thu, 25 Apr 2024 at 16:20, Amit Kapila  wrote:
>
> On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Wednesday, April 24, 2024 6:29 PM vignesh C  wrote:
> > >
> > >
> > > The attached patch
> > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> > > applies for master and PG16 branch,
> > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> > > applies for PG15 branch.
> >
> > Thanks, I have verified that the patches can be applied cleanly and fix the
> > issue on each branch. The regression test also passes after applying the
> > patch on my machine.
> >
>
> Pushed.

Thanks for pushing this, I have marked the commitfest entry at [1] as committed.
[1] - https://commitfest.postgresql.org/48/4816/

Regards,
Vignesh




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-25 Thread vignesh C
On Wed, 24 Apr 2024 at 15:49, Amit Kapila  wrote:
>
> On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila  wrote:
> >
> > On Wed, Mar 13, 2024 at 11:59 AM vignesh C  wrote:
> > >
> > > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > >
> > > > For 0002, instead of avoiding resetting the latch, is it possible to
> > > > let the logical rep worker wake up the launcher once after attaching?
> > >
> > > Waking up the launcher process uses the same latch that is used for
> > > subscription creation/modification and apply worker process exit. As
> > > the handling of this latch for subscription creation/modification and
> > > worker process exit can be done only by ApplyLauncherMain, we will not
> > > be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> > > waking up the launcher process might not help in this case as
> > > currently we will not be able to differentiate between worker
> > > attached, subscription creation/modification and apply worker process
> > > exit.
> > >
> >
> > IIUC, even if we set the latch once the worker attaches, the other
> > set_latch by subscription creation/modification or apply_worker_exit
> > could also be consumed due to reset of latch in
> > WaitForReplicationWorkerAttach(). Is that understanding correct? If
> > so, can we use some other way to wake up
> > WaitForReplicationWorkerAttach() say condition variable?
> >
>
> The other possibility is to have a GUC launcher_retry_time or
> something like that instead of using a DEFAULT_NAPTIME_PER_CYCLE. This
> still may not resolve the issue if launcher_retry_time is longer but
> at least users would be able to configure it. I am not sure if this is
> a good idea or not but just trying to brainstorm different ideas to
> solve this problem.
>
> BTW, as far as I understand, this is an improvement in the existing
> code, so should be done only for HEAD (probably PG18) and should be
> discussed in a separate thread.

I have started a new thread at [1] to discuss this:
https://www.postgresql.org/message-id/CALDaNm01_KEgHM1tKtgXkCGLJ5209SMSmGw3UmhZbOz365_%3DeA%40mail.gmail.com

Regards,
Vignesh




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-25 Thread Amit Kapila
On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, April 24, 2024 6:29 PM vignesh C  wrote:
> >
> >
> > The attached patch
> > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> > applies for master and PG16 branch,
> > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> > applies for PG15 branch.
>
> Thanks, I have verified that the patches can be applied cleanly and fix the
> issue on each branch. The regression test also passes after applying the
> patch on my machine.
>

Pushed.

-- 
With Regards,
Amit Kapila.




RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-24 Thread Zhijie Hou (Fujitsu)
On Wednesday, April 24, 2024 6:29 PM vignesh C  wrote:
> 
> On Wed, 24 Apr 2024 at 11:59, Amit Kapila  wrote:
> >
> > On Wed, Mar 13, 2024 at 9:19 AM vignesh C  wrote:
> > >
> > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> > > >>
> > > >>
> > > >> Thanks, I have created the following Commitfest entry for this:
> > > >> https://commitfest.postgresql.org/47/4816/
> > > >>
> > > >> Regards,
> > > >> Vignesh
> > > >
> > > >
> > > > Thanks for the patch, I have verified that the fix works well by
> > > > following the steps mentioned to reproduce the problem.
> > > > Reviewing the patch, it seems good and is well documented. Just one
> > > > minor comment I had was probably to change the name of the variable
> > > > table_states_valid to table_states_validity. The current name made
> > > > sense when it was a bool, but now that it is a tri-state enum, it
> > > > doesn't fit well.
> > >
> > > Thanks for reviewing the patch, the attached v6 patch has the
> > > changes for the same.
> > >
> >
> > v6_0001* looks good to me. This should be backpatched unless you or
> > others think otherwise.
> 
> This patch needs to be committed in master, PG16 and PG15.
> This is not required for PG14 and earlier branches because they don't have
> the HasSubscriptionRelations call before updating table_states_valid:
> /*
>  * Does the subscription have tables?
>  *
>  * If there were not-READY relations found then we know it does. But
>  * if table_states_not_ready was empty we still need to check again to
>  * see if there are 0 tables.
>  */
> has_subrels = (table_states_not_ready != NIL) ||
>   HasSubscriptionRelations(MySubscription->oid);
> 
> So the invalidation function will not be called here.
> 
> Whereas for PG15 and newer versions this is applicable:
> HasSubscriptionRelations calls the table_open function, which can trigger
> the invalidation callback, as in:
> HasSubscriptionRelations -> table_open -> relation_open -> LockRelationOid
> -> AcceptInvalidationMessages -> ReceiveSharedInvalidMessages ->
> LocalExecuteInvalidationMessage -> CallSyscacheCallbacks ->
> invalidate_syncing_table_states
> 
> The attached patch
> v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> applies for master and PG16 branch,
> v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> applies for PG15 branch.

Thanks, I have verified that the patches can be applied cleanly and fix the
issue on each branch. The regression test also passes after applying the
patch on my machine.

Best Regards,
Hou zj


Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-24 Thread vignesh C
On Wed, 24 Apr 2024 at 11:59, Amit Kapila  wrote:
>
> On Wed, Mar 13, 2024 at 9:19 AM vignesh C  wrote:
> >
> > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> > >
> > >
> > >
> > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> > >>
> > >>
> > >> Thanks, I have created the following Commitfest entry for this:
> > >> https://commitfest.postgresql.org/47/4816/
> > >>
> > >> Regards,
> > >> Vignesh
> > >
> > >
> > > Thanks for the patch, I have verified that the fix works well by 
> > > following the steps mentioned to reproduce the problem.
> > > Reviewing the patch, it seems good and is well documented. Just one minor 
> > > comment I had was probably to change the name of the variable 
> > > table_states_valid to table_states_validity. The current name made sense 
> > > when it was a bool, but now that it is a tri-state enum, it doesn't fit 
> > > well.
> >
> > Thanks for reviewing the patch, the attached v6 patch has the changes
> > for the same.
> >
>
> v6_0001* looks good to me. This should be backpatched unless you or
> others think otherwise.

This patch needs to be committed in master, PG16 and PG15.
This is not required for PG14 and earlier branches because they don't have
the HasSubscriptionRelations call before updating table_states_valid:
/*
 * Does the subscription have tables?
 *
 * If there were not-READY relations found then we know it does. But
 * if table_states_not_ready was empty we still need to check again to
 * see if there are 0 tables.
 */
has_subrels = (table_states_not_ready != NIL) ||
  HasSubscriptionRelations(MySubscription->oid);

So the invalidation function will not be called here.

Whereas for PG15 and newer versions this is applicable:
HasSubscriptionRelations calls the table_open function, which can trigger
the invalidation callback, as in:
HasSubscriptionRelations -> table_open -> relation_open ->
LockRelationOid -> AcceptInvalidationMessages ->
ReceiveSharedInvalidMessages -> LocalExecuteInvalidationMessage ->
CallSyscacheCallbacks -> invalidate_syncing_table_states

The attached patch
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
applies for master and PG16 branch,
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
applies for PG15 branch.

Regards,
Vignesh
From 93116320bd9ffbcddc83a1fc253cba90c56cb47e Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v7] Table sync missed due to race condition in subscription
 cache handling.

Table sync was missed if an invalidation of the table sync state
occurred while the list of not-ready tables was being prepared. This
occurred because the table state was being set to valid at the end of
the not-ready table list preparation, irrespective of any invalidations
that occurred while preparing the list. Fixed it by changing the
boolean variable to a tri-state enum and by setting the table state to
valid only if no invalidations occurred while the list was being
prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +
 src/tools/pgindent/typedefs.list|  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 81fff1919d..7d4ee48206 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -122,7 +122,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -272,7 +279,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1556,13 +1563,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_validity != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_validity = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1596,7 +1605,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid infinite loops, but we leave the
+		 * table states marked as stale so that we'll rebuild 

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-24 Thread Amit Kapila
On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila  wrote:
>
> On Wed, Mar 13, 2024 at 11:59 AM vignesh C  wrote:
> >
> > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > >
> > > For 0002, instead of avoiding resetting the latch, is it possible to let the
> > > logical rep worker wake up the launcher once after attaching?
> >
> > Waking up the launcher process uses the same latch that is used for
> > subscription creation/modification and apply worker process exit. As
> > the handling of this latch for subscription creation/modification and
> > worker process exit can be done only by ApplyLauncherMain, we will not
> > be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> > waking up the launcher process might not help in this case as
> > currently we will not be able to differentiate between worker
> > attached, subscription creation/modification and apply worker process
> > exit.
> >
>
> IIUC, even if we set the latch once the worker attaches, the other
> set_latch by subscription creation/modification or apply_worker_exit
> could also be consumed due to reset of latch in
> WaitForReplicationWorkerAttach(). Is that understanding correct? If
> so, can we use some other way to wake up
> WaitForReplicationWorkerAttach() say condition variable?
>

The other possibility is to have a GUC launcher_retry_time or
something like that instead of using a DEFAULT_NAPTIME_PER_CYCLE. This
still may not resolve the issue if launcher_retry_time is longer but
at least users would be able to configure it. I am not sure if this is
a good idea or not but just trying to brainstorm different ideas to
solve this problem.

BTW, as far as I understand, this is an improvement in the existing
code, so should be done only for HEAD (probably PG18) and should be
discussed in a separate thread.

-- 
With Regards,
Amit Kapila.




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-24 Thread Amit Kapila
On Wed, Mar 13, 2024 at 9:19 AM vignesh C  wrote:
>
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> >>
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> >
> > Thanks for the patch, I have verified that the fix works well by following 
> > the steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor 
> > comment I had was probably to change the name of the variable 
> > table_states_valid to table_states_validity. The current name made sense 
> > when it was a bool, but now that it is a tri-state enum, it doesn't fit 
> > well.
>
> Thanks for reviewing the patch, the attached v6 patch has the changes
> for the same.
>

v6_0001* looks good to me. This should be backpatched unless you or
others think otherwise.

-- 
With Regards,
Amit Kapila.




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-23 Thread Amit Kapila
On Wed, Mar 13, 2024 at 11:59 AM vignesh C  wrote:
>
> On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
>  wrote:
> >
> >
> > For 0002, instead of avoiding resetting the latch, is it possible to let the
> > logical rep worker wake up the launcher once after attaching?
>
> Waking up the launcher process uses the same latch that is used for
> subscription creation/modification and apply worker process exit. As
> the handling of this latch for subscription creation/modification and
> worker process exit can be done only by ApplyLauncherMain, we will not
> be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> waking up the launcher process might not help in this case as
> currently we will not be able to differentiate between worker
> attached, subscription creation/modification and apply worker process
> exit.
>

IIUC, even if we set the latch once the worker attaches, the other
set_latch by subscription creation/modification or apply_worker_exit
could also be consumed due to reset of latch in
WaitForReplicationWorkerAttach(). Is that understanding correct? If
so, can we use some other way to wake up
WaitForReplicationWorkerAttach(), say a condition variable? The current
proposal can fix the issue but looks a bit ad hoc.

-- 
With Regards,
Amit Kapila.




RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-16 Thread Zhijie Hou (Fujitsu)
On Wednesday, March 13, 2024 11:49 AM vignesh C  wrote:
> 
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> >>
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> >
> > Thanks for the patch, I have verified that the fix works well by following 
> > the
> steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor
> comment I had was probably to change the name of the variable
> table_states_valid to table_states_validity. The current name made sense when
> it was a bool, but now that it is a tri-state enum, it doesn't fit well.
> 
> Thanks for reviewing the patch, the attached v6 patch has the changes for the
> same.

FYI, I noticed that there is one open item on
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items which is related to
the fix in this thread.

--
Intermittent failures in 040_standby_failover_slots_sync test
Possible solution in this thread: Race condition in FetchTableStates
--

AFAICS, the bug discussed here is not a new issue on
PG17, so I am thinking to move the item to the "Older bugs affecting stable
branches" section if no objections.

Best Regards,
Hou zj



Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-03-13 Thread vignesh C
On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, March 13, 2024 11:49 AM vignesh C  wrote:
> > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> > >
> > >
> > >
> > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> > >>
> > >>
> > >> Thanks, I have created the following Commitfest entry for this:
> > >> https://commitfest.postgresql.org/47/4816/
> > >>
> > >> Regards,
> > >> Vignesh
> > >
> > >
> > > Thanks for the patch, I have verified that the fix works well by
> > > following the steps mentioned to reproduce the problem.
> > > Reviewing the patch, it seems good and is well documented. Just one
> > > minor comment I had was probably to change the name of the variable
> > > table_states_valid to table_states_validity. The current name made
> > > sense when it was a bool, but now that it is a tri-state enum, it
> > > doesn't fit well.
> >
> > Thanks for reviewing the patch, the attached v6 patch has the changes
> > for the same.
>
> Thanks for the patches.
>
> I saw a recent similar BF error[1] which seems related to the issue that the
> 0001 patch is trying to solve, i.e. the table sync worker is somehow not
> started after refreshing the publication on the subscriber. I didn't see
> other related ERRORs in the log, so I think the reason is the same as the
> one being discussed in this thread: the table state invalidation got lost.
> The 0001 patch looks good to me.
>
> For 0002, instead of avoiding resetting the latch, is it possible to let the
> logical rep worker wake up the launcher once after attaching?

Waking up the launcher process uses the same latch that is used for
subscription creation/modification and apply worker process exit. As
the handling of this latch for subscription creation/modification and
worker process exit can be done only by ApplyLauncherMain, we will not
be able to reset the latch in WaitForReplicationWorkerAttach. I feel
waking up the launcher process might not help in this case as
currently we will not be able to differentiate between worker
attached, subscription creation/modification and apply worker process
exit.

Regards,
Vignesh




RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-03-12 Thread Zhijie Hou (Fujitsu)
On Wednesday, March 13, 2024 11:49 AM vignesh C  wrote:
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> >>
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> >
> > Thanks for the patch, I have verified that the fix works well by
> > following the steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one
> > minor comment I had was probably to change the name of the variable
> > table_states_valid to table_states_validity. The current name made sense
> > when it was a bool, but now that it is a tri-state enum, it doesn't fit
> > well.
> 
> Thanks for reviewing the patch, the attached v6 patch has the changes for the
> same.

Thanks for the patches.

I saw a recent similar BF error[1] which seems related to the issue that the
0001 patch is trying to solve, i.e. the table sync worker is somehow not
started after refreshing the publication on the subscriber. I didn't see
other related ERRORs in the log, so I think the reason is the same as the
one being discussed in this thread: the table state invalidation got lost.
The 0001 patch looks good to me.

For 0002, instead of avoiding resetting the latch, is it possible to let the
logical rep worker wake up the launcher once after attaching?

[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2024-03-11%2000%3A52%3A42

Best Regards,
Hou zj


Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-03-12 Thread vignesh C
On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
>
>
>
> On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
>>
>>
>> Thanks, I have created the following Commitfest entry for this:
>> https://commitfest.postgresql.org/47/4816/
>>
>> Regards,
>> Vignesh
>
>
> Thanks for the patch, I have verified that the fix works well by following 
> the steps mentioned to reproduce the problem.
> Reviewing the patch, it seems good and is well documented. Just one minor 
> comment I had was probably to change the name of the variable 
> table_states_valid to table_states_validity. The current name made sense when 
> it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes
for the same.

Regards,
Vignesh
From ab68e343b4b3f7ec09f758089a9f90a16f21de42 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v6 2/2] Apply worker will get started after 180 seconds by the
 launcher in case the apply worker exits immediately after startup.

The apply worker was getting started only after the launcher's
180 second timeout when the apply worker exited immediately after
startup. This was happening because the launcher process's latch was
getting reset after the apply worker was started, which caused the
launcher to wait for the next 180 second timeout before restarting the
apply worker. Fixed this issue by not resetting the latch, as this
latch will be set for subscription modifications and apply worker exit.
We should check whether a new worker needs to be started and reset the
latch in ApplyLauncherMain.
---
 src/backend/replication/logical/launcher.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 66070e9131..f2545482c8 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -185,7 +185,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 			   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -220,16 +219,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/*
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach.  But we don't expect to have to wait long.
+		 * Since this latch is also used for subscription creation/modification
+		 * and apply worker process exit signal handling, the latch must not be
+		 * reset here. We should check if the new worker needs to be started or
+		 * not and reset the latch in ApplyLauncherMain.
 		 */
-		rc = WaitLatch(MyLatch,
-	   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-	   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		(void) WaitLatch(MyLatch,
+		 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+		 10L, WAIT_EVENT_BGWORKER_STARTUP);
 	}
 }
 
-- 
2.34.1

From 47c6cd886902693074ece4b2a1188dce6f066bf8 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v6 1/2] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if an invalidation of the table sync state
occurred while the list of not-ready tables was being prepared. This
occurred because the table state was being set to valid at the end of
the not-ready table list preparation, irrespective of any invalidations
that occurred while preparing the list. Fixed it by changing the
boolean variable to a tri-state enum and by setting the table state to
valid only if no invalidations occurred while the list was being
prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +
 src/tools/pgindent/typedefs.list|  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ee06629088..eefe620e5e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if 

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-03-11 Thread Ajin Cherian
On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:

>
> Thanks, I have created the following Commitfest entry for this:
> https://commitfest.postgresql.org/47/4816/
>
> Regards,
> Vignesh
>

Thanks for the patch, I have verified that the fix works well by following
the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor
comment I had was probably to change the name of the variable
table_states_valid to table_states_validity. The current name made sense
when it was a bool, but now that it is a tri-state enum, it doesn't fit
well.

regards,
Ajin Cherian
Fujitsu Australia


Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-02-08 Thread vignesh C
On Thu, 8 Feb 2024 at 23:30, Alexander Lakhin  wrote:
>
> 08.02.2024 12:25, vignesh C wrote:
> > Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
> > getting missed in this case. The wakeup call can be sent during
> > subscription creation/modification and when the apply worker exits.
> > WaitForReplicationWorkerAttach should not reset the latch here, as that
> > would end up delaying the apply worker's start until the 180 second
> > timeout (DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not
> > reset the latch and lets ApplyLauncherMain reset the latch and check
> > whether any new or missing worker needs to be started.
>
> Thank you for the updated patch!
> I ran all the subscription tests in a loop (with the sleeps added as
> before) and observed no failures and 180+ seconds duration.

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-02-08 Thread Alexander Lakhin

08.02.2024 12:25, vignesh C wrote:

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as that
would end up delaying the apply worker's start until the 180 second
timeout (DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not
reset the latch and lets ApplyLauncherMain reset the latch and check
whether any new or missing worker needs to be started.


Thank you for the updated patch!
I ran all the subscription tests in a loop (with the sleeps added as
before) and observed no failures and 180+ seconds duration.

Best regards,
Alexander




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-02-08 Thread vignesh C
On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin  wrote:
>
> 05.02.2024 13:13, vignesh C wrote:
> > Thanks for the steps for the issue, I was able to reproduce this issue
> > in my environment with the steps provided. The attached patch has a
> > proposed fix where the latch will not be set in case of the apply
> > worker exiting immediately after starting.
>
> It looks like the proposed fix doesn't help when ApplyLauncherWakeup()
> called by a backend executing CREATE SUBSCRIPTION command.
> That is, with the v4-0002 patch applied and pg_usleep(30L); added
> just below
>  if (!worker_in_use)
>  return worker_in_use;
> I still observe the test 027_nosuperuser running for 3+ minutes:
> t/027_nosuperuser.pl .. ok
> All tests successful.
> Files=1, Tests=19, 187 wallclock secs ( 0.01 usr  0.00 sys +  4.82 cusr  4.47 
> csys =  9.30 CPU)
>
> IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
> regression_sub ...", gets missed when launcher waits for start of another
> worker (logical replication worker for subscription "admin_sub"), launched
> just before that command.

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as that
would end up delaying the apply worker's start until the 180 second
timeout (DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not
reset the latch and lets ApplyLauncherMain reset the latch and check
whether any new or missing worker needs to be started.

Regards,
Vignesh
From f04db050a583c9c01eb77766f830a0cf77b0a6c7 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v5 2/2] Apply worker will get started after 180 seconds by the
 launcher in case the apply worker exits immediately after startup.

The apply worker was getting started only after the launcher's
180 second timeout when the apply worker exited immediately after
startup. This was happening because the launcher process's latch was
getting reset after the apply worker was started, which caused the
launcher to wait for the next 180 second timeout before restarting the
apply worker. Fixed this issue by not resetting the latch, as this
latch will be set for subscription modifications and apply worker exit.
We should check whether a new worker needs to be started and reset the
latch in ApplyLauncherMain.
---
 src/backend/replication/logical/launcher.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..a754f2c757 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -191,7 +191,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 			   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -226,16 +225,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/*
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach.  But we don't expect to have to wait long.
+		 * Since this latch is also used for subscription creation/modification
+		 * and apply worker process exit signal handling, the latch must not be
+		 * reset here. We should check if the new worker needs to be started or
+		 * not and reset the latch in ApplyLauncherMain.
 		 */
-		rc = WaitLatch(MyLatch,
-	   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-	   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		(void) WaitLatch(MyLatch,
+		 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+		 10L, WAIT_EVENT_BGWORKER_STARTUP);
 	}
 }
 
-- 
2.34.1

From 385c96a71396ea8efa86d6136bbf0bfe5282f1d1 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v5 1/2] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if a table-sync invalidation occurred while the
list of non-ready tables was being prepared. This happened because the
table state was set to valid at the end of the non-ready table list
preparation, irrespective of any invalidations that occurred while
preparing the list. Fixed it by changing the boolean variable to a
tri-state enum and setting the table state to valid only if no
invalidations occurred while the list was being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +
 src/tools/pgindent/typedefs.list|  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-02-06 Thread Alexander Lakhin

05.02.2024 13:13, vignesh C wrote:

Thanks for the steps for the issue; I was able to reproduce it in my
environment with the steps provided. The attached patch has a proposed
fix where the latch will not be set in case the apply worker exits
immediately after starting.


It looks like the proposed fix doesn't help when ApplyLauncherWakeup() is
called by a backend executing a CREATE SUBSCRIPTION command.
That is, with the v4-0002 patch applied and pg_usleep(30L); added
just below
    if (!worker_in_use)
    return worker_in_use;
I still observe the test 027_nosuperuser running for 3+ minutes:
t/027_nosuperuser.pl .. ok
All tests successful.
Files=1, Tests=19, 187 wallclock secs ( 0.01 usr  0.00 sys +  4.82 cusr  4.47 
csys =  9.30 CPU)

IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
regression_sub ...", gets missed when launcher waits for start of another
worker (logical replication worker for subscription "admin_sub"), launched
just before that command.

Best regards,
Alexander




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-02-05 Thread vignesh C
On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin  wrote:
>
> Hello Vignesh and Hou-san,
>
> 01.02.2024 07:59, vignesh C wrote:
> > Here is an updated patch which changes the boolean variable to a
> > tri-state enum and sets the table state to valid only if no invalidations
> > occurred while the list is being prepared.
> >
>
> While testing the v3 patch, I observed another anomaly with the 014_binary
> test. When I run the test in parallel, I see that sometimes one of the test
> instances runs much longer than others. For example:
> 49  All tests successful.
> 49  Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys + 0.43 cusr  
> 0.30 csys =  0.77 CPU)
> 49  Result: PASS
> 12  All tests successful.
> 12  Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys + 0.46 cusr 
>  0.40 csys =  0.89 CPU)
> 12  Result: PASS
>
> As far as I can see, this happens due to another race condition, this time
> in launcher.c.
> For such a three-minute case I see in _subscriber.log:
> 2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation 
> "public.test_mismatching_types" does not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for 
> replication origin "pg_16398" during message
> type "INSERT" in transaction 757, finished at 0/153C838
> 2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target 
> relation "public.test_mismatching_types" does
> not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for 
> replication origin "pg_16398" during message
> type "INSERT" in transaction 757, finished at 0/153C838
> ...
> 2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE 
> TABLE public.test_mismatching_types (
>  a int PRIMARY KEY
>  );
> 2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker 
> "logical replication apply worker for
> subscription 16398"
> 2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical 
> replication apply worker" (PID 949255) exited with
> exit code 1
> ...
> 2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER 
> SUBSCRIPTION tsub REFRESH PUBLICATION;
> ...
> 2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication 
> worker for subscription "tsub"
> (there is no interesting activity between 14:33:13 and 14:36:13)
>
> So the logical replication apply worker exited because CREATE TABLE was not
> executed on the subscriber yet, and a new replication worker started because a
> timeout occurred in WaitLatch(..., wait_time, ...) inside
> ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE
> (180 sec)).
>
> But in a normal (fast) case the same WaitLatch exits due to MyLatch set as
> a result of:
> logical replication apply worker| logicalrep_worker_onexit() ->
>ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
> launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch)).
>
> In a bad case, I see that the SetLatch() called as well, but then the latch
> gets reset by the following code in WaitForReplicationWorkerAttach():
>  rc = WaitLatch(MyLatch,
> WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> 10L, WAIT_EVENT_BGWORKER_STARTUP);
>
>  if (rc & WL_LATCH_SET)
>  {
>  ResetLatch(MyLatch);
>  CHECK_FOR_INTERRUPTS();
>  }
>
> With pg_usleep(30); added just before ResetLatch and
>   $node_subscriber->safe_psql(
>   'postgres', qq(
> +SELECT pg_sleep(0.5);
>   CREATE TABLE public.test_mismatching_types (
> in 014_binary.pl, I can see the anomaly without running tests in parallel,
> just when running that test in a loop:
> for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C 
> src/test/subscription/ PROVE_TESTS="t/014*"; done
> ...
> iteration 2
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 
> csys =  0.42 CPU)
> Result: PASS
> iteration 3
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 
> csys =  0.55 CPU)
> Result: PASS
> ...
>
> In other words, the abnormal test execution takes place due to the
> following events:
> 1. logicalrep worker launcher launches replication worker and waits for it
>   to attach:
>   ApplyLauncherMain() -> logicalrep_worker_launch() -> 
> WaitForReplicationWorkerAttach()
> 2. logicalrep worker exits due to some error (logical replication target
>   relation does not exist, in our case) and sends a signal to set the latch
>   for launcher
> 3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets
>   it inside WaitForReplicationWorkerAttach()
> 4. launcher gets back to ApplyLauncherMain() where it waits for the latch
>   (not set) 

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-02-01 Thread Alexander Lakhin

Hello Vignesh and Hou-san,

01.02.2024 07:59, vignesh C wrote:

Here is an updated patch which changes the boolean variable to a
tri-state enum and sets the table state to valid only if no invalidations
occurred while the list is being prepared.



While testing the v3 patch, I observed another anomaly with the 014_binary
test. When I run the test in parallel, I see that sometimes one of the test
instances runs much longer than others. For example:
49  All tests successful.
49  Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys + 0.43 cusr  
0.30 csys =  0.77 CPU)
49  Result: PASS
12  All tests successful.
12  Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys + 0.46 cusr  
0.40 csys =  0.89 CPU)
12  Result: PASS

As far as I can see, this happens due to another race condition, this time
in launcher.c.
For such a three-minute case I see in _subscriber.log:
2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation 
"public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message 
type "INSERT" in transaction 757, finished at 0/153C838
2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does 
not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message 
type "INSERT" in transaction 757, finished at 0/153C838

...
2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE 
TABLE public.test_mismatching_types (
    a int PRIMARY KEY
    );
2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for 
subscription 16398"
2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with 
exit code 1

...
2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER 
SUBSCRIPTION tsub REFRESH PUBLICATION;
...
2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for 
subscription "tsub"
(there is no interesting activity between 14:33:13 and 14:36:13)

So the logical replication apply worker exited because CREATE TABLE was not
executed on the subscriber yet, and a new replication worker started because a
timeout occurred in WaitLatch(..., wait_time, ...) inside
ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE
(180 sec)).

But in a normal (fast) case the same WaitLatch exits due to MyLatch being
set as a result of:
logical replication apply worker| logicalrep_worker_onexit() ->
  ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch)).

In a bad case, I see that SetLatch() is called as well, but then the latch
gets reset by the following code in WaitForReplicationWorkerAttach():
    rc = WaitLatch(MyLatch,
   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
   10L, WAIT_EVENT_BGWORKER_STARTUP);

    if (rc & WL_LATCH_SET)
    {
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
    }

With pg_usleep(30); added just before ResetLatch and
 $node_subscriber->safe_psql(
 'postgres', qq(
+SELECT pg_sleep(0.5);
 CREATE TABLE public.test_mismatching_types (
in 014_binary.pl, I can see the anomaly without running tests in parallel,
just when running that test in a loop:
for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ 
PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 
csys =  0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 
csys =  0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the
following events:
1. logicalrep worker launcher launches replication worker and waits for it
 to attach:
 ApplyLauncherMain() -> logicalrep_worker_launch() -> 
WaitForReplicationWorkerAttach()
2. logicalrep worker exits due to some error (logical replication target
 relation does not exist, in our case) and sends a signal to set the latch
 for launcher
3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets
 it inside WaitForReplicationWorkerAttach()
4. launcher gets back to ApplyLauncherMain() where it waits for the latch
 (not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms).

Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the
test 027_nosuperuser executes for 3+ minutes on each run for me.
And we can find such abnormal execution on the buildfarm too:

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-01-31 Thread vignesh C
On Tue, 30 Jan 2024 at 17:22, Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, January 30, 2024 11:21 AM vignesh C  wrote:
> >
> > On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) 
> > wrote:
> > >
> > > On Monday, January 29, 2024 9:22 PM vignesh C 
> > wrote:
> > > >
> > > > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin 
> > wrote:
> > > > >
> > > > > Hello hackers,
> > > > >
> > > > > After determining a possible cause for intermittent failures of
> > > > > the test subscription/031_column_list [1], I was wondering what
> > > > > makes another subscription test (014_binary) fail on the buildfarm:
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly
> > > > > t=20
> > > > > 24-01-22%2001%3A19%3A03
> > > > >
> > > >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2
> > > > 02
> > > > > 4-01-14%2018%3A19%3A20
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet
> > > > > =202
> > > > > 3-12-21%2001%3A11%3A52
> > > > >
> > > >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2
> > > > 02
> > > > > 3-11-27%2001%3A42%3A39
> > > > >
> > > > > All those failures caused by a timeout when waiting for a message
> > > > > expected in _subscriber.log. For example, in the snakefly's case:
> > > > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with
> > > > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)?
> > > > > incorrect binary data format) at
> > > > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_bin
> > > > ary.pl
> > > > line 269.
> > > > >
> > > > > _subscriber.log contains:
> > > > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement:
> > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > > > 2023-12-21 01:17:46.756 UTC [40] ERROR:  could not receive
> > > > > data from WAL stream: server closed the connection unexpectedly
> > > > >  This probably means the server terminated abnormally
> > > > >  before or while processing the request.
> > > > > 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker
> > > > > "logical replication apply worker" (PID 40) exited with exit
> > > > > code 1
> > > > > 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication
> > > > > apply worker for subscription "tsub" has started ...
> > > > >
> > > > > While _subscriber.log from a successful test run contains:
> > > > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement:
> > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection:
> > > > > session time: 0:00:00.014 user=postgres database=postgres
> > > > > host=[local]
> > > > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication
> > > > > apply worker for subscription "tsub" has started
> > > > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication
> > > > > table synchronization worker for subscription "tsub", table
> > > > > "test_mismatching_types" has started
> > > > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data
> > > > > format
> > > > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY
> > > > > test_mismatching_types, line 1, column a
> > > > >
> > > > > In this case, "logical replication apply worker for subscription
> > > > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3
> > > > > minutes
> > > > later.
> > > > >
> > > > > I've managed to reproduce this failure locally by running multiple
> > > > > tests in parallel, and my analysis shows that it is caused by a
> > > > > race condition when accessing variable table_states_valid inside
> > tablesync.c.
> > > > >
> > > > > tablesync.c does the following with table_states_valid:
> > > > > /*
> > > > >   * Callback from syscache invalidation.
> > > > >   */
> > > > > void
> > > > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32
> > > > > hashvalue) {
> > > > >  table_states_valid = false;
> > > > > }
> > > > > ...
> > > > > static bool
> > > > > FetchTableStates(bool *started_tx) { ...
> > > > >  if (!table_states_valid)
> > > > >  {
> > > > > ...
> > > > >  /* Fetch all non-ready tables. */
> > > > >  rstates = GetSubscriptionRelations(MySubscription->oid,
> > > > > true); ...
> > > > >  table_states_valid = true;
> > > > >  }
> > > > >
> > > > > So, when syscache invalidation occurs inside the code block "if
> > > > > (!table_states_valid)", that invalidation is effectively ignored.
> > > > >
> > > > > In a failed case I observe the following events:
> > > > > 1. logical replication apply worker performs
> > > > >   LogicalRepApplyLoop() -> process_syncing_tables() ->
> > > > >   process_syncing_tables_for_apply() -> FetchTableStates() 
> > > > > periodically.
> > > > >
> > > > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates
> > syscache
> > > > >   for SUBSCRIPTIONRELMAP, and that leads to calling
> > > > 

RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-01-30 Thread Zhijie Hou (Fujitsu)
On Tuesday, January 30, 2024 11:21 AM vignesh C  wrote:
> 
> On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Monday, January 29, 2024 9:22 PM vignesh C 
> wrote:
> > >
> > > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin 
> wrote:
> > > >
> > > > Hello hackers,
> > > >
> > > > After determining a possible cause for intermittent failures of
> > > > the test subscription/031_column_list [1], I was wondering what
> > > > makes another subscription test (014_binary) fail on the buildfarm:
> > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly
> > > > t=20
> > > > 24-01-22%2001%3A19%3A03
> > > >
> > >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2
> > > 02
> > > > 4-01-14%2018%3A19%3A20
> > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet
> > > > =202
> > > > 3-12-21%2001%3A11%3A52
> > > >
> > >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2
> > > 02
> > > > 3-11-27%2001%3A42%3A39
> > > >
> > > > All those failures caused by a timeout when waiting for a message
> > > > expected in _subscriber.log. For example, in the snakefly's case:
> > > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with
> > > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)?
> > > > incorrect binary data format) at
> > > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_bin
> > > ary.pl
> > > line 269.
> > > >
> > > > _subscriber.log contains:
> > > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement:
> > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > > 2023-12-21 01:17:46.756 UTC [40] ERROR:  could not receive
> > > > data from WAL stream: server closed the connection unexpectedly
> > > >  This probably means the server terminated abnormally
> > > >  before or while processing the request.
> > > > 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker
> > > > "logical replication apply worker" (PID 40) exited with exit
> > > > code 1
> > > > 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication
> > > > apply worker for subscription "tsub" has started ...
> > > >
> > > > While _subscriber.log from a successful test run contains:
> > > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement:
> > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection:
> > > > session time: 0:00:00.014 user=postgres database=postgres
> > > > host=[local]
> > > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication
> > > > apply worker for subscription "tsub" has started
> > > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication
> > > > table synchronization worker for subscription "tsub", table
> > > > "test_mismatching_types" has started
> > > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data
> > > > format
> > > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY
> > > > test_mismatching_types, line 1, column a
> > > >
> > > > In this case, "logical replication apply worker for subscription
> > > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3
> > > > minutes
> > > later.
> > > >
> > > > I've managed to reproduce this failure locally by running multiple
> > > > tests in parallel, and my analysis shows that it is caused by a
> > > > race condition when accessing variable table_states_valid inside
> tablesync.c.
> > > >
> > > > tablesync.c does the following with table_states_valid:
> > > > /*
> > > >   * Callback from syscache invalidation.
> > > >   */
> > > > void
> > > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32
> > > > hashvalue) {
> > > >  table_states_valid = false;
> > > > }
> > > > ...
> > > > static bool
> > > > FetchTableStates(bool *started_tx) { ...
> > > >  if (!table_states_valid)
> > > >  {
> > > > ...
> > > >  /* Fetch all non-ready tables. */
> > > >  rstates = GetSubscriptionRelations(MySubscription->oid,
> > > > true); ...
> > > >  table_states_valid = true;
> > > >  }
> > > >
> > > > So, when syscache invalidation occurs inside the code block "if
> > > > (!table_states_valid)", that invalidation is effectively ignored.
> > > >
> > > > In a failed case I observe the following events:
> > > > 1. logical replication apply worker performs
> > > >   LogicalRepApplyLoop() -> process_syncing_tables() ->
> > > >   process_syncing_tables_for_apply() -> FetchTableStates() periodically.
> > > >
> > > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates
> syscache
> > > >   for SUBSCRIPTIONRELMAP, and that leads to calling
> > > >   invalidate_syncing_table_states().
> > > >
> > > > 3. If the apply worker manages to fetch no non-ready tables in
> > > >   FetchTableStates() and ignore "table_states_valid = false" from
> > > >   invalidate_syncing_table_states(), then it just misses the 
> > > > invalidation
> > > >   event, 

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-01-29 Thread vignesh C
On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, January 29, 2024 9:22 PM vignesh C  wrote:
> >
> > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin  wrote:
> > >
> > > Hello hackers,
> > >
> > > After determining a possible cause for intermittent failures of the
> > > test subscription/031_column_list [1], I was wondering what makes
> > > another subscription test (014_binary) fail on the buildfarm:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly=20
> > > 24-01-22%2001%3A19%3A03
> > >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=202
> > > 4-01-14%2018%3A19%3A20
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=202
> > > 3-12-21%2001%3A11%3A52
> > >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=202
> > > 3-11-27%2001%3A42%3A39
> > >
> > > All those failures caused by a timeout when waiting for a message
> > > expected in _subscriber.log. For example, in the snakefly's case:
> > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with
> > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)?
> > > incorrect binary data format) at
> > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl
> > line 269.
> > >
> > > _subscriber.log contains:
> > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement:
> > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > 2023-12-21 01:17:46.756 UTC [40] ERROR:  could not receive data
> > > from WAL stream: server closed the connection unexpectedly
> > >  This probably means the server terminated abnormally
> > >  before or while processing the request.
> > > 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical
> > > replication apply worker" (PID 40) exited with exit code 1
> > > 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply
> > > worker for subscription "tsub" has started ...
> > >
> > > While _subscriber.log from a successful test run contains:
> > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement:
> > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection:
> > > session time: 0:00:00.014 user=postgres database=postgres host=[local]
> > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply
> > > worker for subscription "tsub" has started
> > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table
> > > synchronization worker for subscription "tsub", table
> > > "test_mismatching_types" has started
> > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data
> > > format
> > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY
> > > test_mismatching_types, line 1, column a
> > >
> > > In this case, "logical replication apply worker for subscription
> > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3 minutes
> > later.
> > >
> > > I've managed to reproduce this failure locally by running multiple
> > > tests in parallel, and my analysis shows that it is caused by a race
> > > condition when accessing variable table_states_valid inside tablesync.c.
> > >
> > > tablesync.c does the following with table_states_valid:
> > > /*
> > >   * Callback from syscache invalidation.
> > >   */
> > > void
> > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32
> > > hashvalue) {
> > >  table_states_valid = false;
> > > }
> > > ...
> > > static bool
> > > FetchTableStates(bool *started_tx)
> > > {
> > > ...
> > >  if (!table_states_valid)
> > >  {
> > > ...
> > >  /* Fetch all non-ready tables. */
> > >  rstates = GetSubscriptionRelations(MySubscription->oid,
> > > true); ...
> > >  table_states_valid = true;
> > >  }
> > >
> > > So, when syscache invalidation occurs inside the code block "if
> > > (!table_states_valid)", that invalidation is effectively ignored.
> > >
> > > In a failed case I observe the following events:
> > > 1. logical replication apply worker performs
> > >   LogicalRepApplyLoop() -> process_syncing_tables() ->
> > >   process_syncing_tables_for_apply() -> FetchTableStates() periodically.
> > >
> > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
> > >   for SUBSCRIPTIONRELMAP, and that leads to calling
> > >   invalidate_syncing_table_states().
> > >
> > > 3. If the apply worker manages to fetch no non-ready tables in
> > >   FetchTableStates() and ignore "table_states_valid = false" from
> > >   invalidate_syncing_table_states(), then it just misses the invalidation
> > >   event, so it keeps working without noticing non-ready tables appeared as
> > >   the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply()
> > skips
> > >   a loop "foreach(lc, table_states_not_ready) ..." until some other event
> > >   occurs).
> > >
> > > pg_usleep(10) added just below GetSubscriptionRelations(...)
> > > 

RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-01-29 Thread Zhijie Hou (Fujitsu)
On Monday, January 29, 2024 9:22 PM vignesh C  wrote:
> 
> On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin  wrote:
> >
> > Hello hackers,
> >
> > After determining a possible cause for intermittent failures of the
> > test subscription/031_column_list [1], I was wondering what makes
> > another subscription test (014_binary) fail on the buildfarm:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly=20
> > 24-01-22%2001%3A19%3A03
> >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=202
> > 4-01-14%2018%3A19%3A20
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=202
> > 3-12-21%2001%3A11%3A52
> >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=202
> > 3-11-27%2001%3A42%3A39
> >
> > All those failures caused by a timeout when waiting for a message
> > expected in _subscriber.log. For example, in the snakefly's case:
> > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with
> > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)?
> > incorrect binary data format) at
> /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl
> line 269.
> >
> > _subscriber.log contains:
> > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement:
> > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > 2023-12-21 01:17:46.756 UTC [40] ERROR:  could not receive data
> > from WAL stream: server closed the connection unexpectedly
> >  This probably means the server terminated abnormally
> >  before or while processing the request.
> > 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical
> > replication apply worker" (PID 40) exited with exit code 1
> > 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply
> > worker for subscription "tsub" has started ...
> >
> > While _subscriber.log from a successful test run contains:
> > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement:
> > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection:
> > session time: 0:00:00.014 user=postgres database=postgres host=[local]
> > 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply
> > worker for subscription "tsub" has started
> > 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table
> > synchronization worker for subscription "tsub", table
> > "test_mismatching_types" has started
> > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data
> > format
> > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY
> > test_mismatching_types, line 1, column a
> >
> > In this case, "logical replication apply worker for subscription
> > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3 minutes
> later.
> >
> > I've managed to reproduce this failure locally by running multiple
> > tests in parallel, and my analysis shows that it is caused by a race
> > condition when accessing variable table_states_valid inside tablesync.c.
> >
> > tablesync.c does the following with table_states_valid:
> > /*
> >   * Callback from syscache invalidation.
> >   */
> > void
> > invalidate_syncing_table_states(Datum arg, int cacheid, uint32
> > hashvalue) {
> >  table_states_valid = false;
> > }
> > ...
> > static bool
> > FetchTableStates(bool *started_tx)
> > {
> > ...
> >  if (!table_states_valid)
> >  {
> > ...
> >  /* Fetch all non-ready tables. */
> >  rstates = GetSubscriptionRelations(MySubscription->oid, true);
> > ...
> >  table_states_valid = true;
> >  }
> >
> > So, when syscache invalidation occurs inside the code block "if
> > (!table_states_valid)", that invalidation is effectively ignored.
> >
> > In a failed case I observe the following events:
> > 1. logical replication apply worker performs
> >   LogicalRepApplyLoop() -> process_syncing_tables() ->
> >   process_syncing_tables_for_apply() -> FetchTableStates() periodically.
> >
> > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
> >   for SUBSCRIPTIONRELMAP, and that leads to calling
> >   invalidate_syncing_table_states().
> >
> > 3. If the apply worker manages to fetch no non-ready tables in
> >   FetchTableStates() and ignore "table_states_valid = false" from
> >   invalidate_syncing_table_states(), then it just misses the invalidation
> >   event, so it keeps working without noticing non-ready tables appeared as
> >   the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips
> >   a loop "foreach(lc, table_states_not_ready) ..." until some other event
> >   occurs).
> >
> > pg_usleep(10) added just below GetSubscriptionRelations(...)
> > proves my analysis -- without it, I need tens of iterations (with 50
> > tests running in
> > parallel) to catch the failure, but with it, I get the failure on the
> > first iteration.
> 
> Thanks for the analysis, I was able to reproduce the issue with the steps
> you had shared.
>

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-01-29 Thread vignesh C
On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin  wrote:
>
> Hello hackers,
>
> After determining a possible cause for intermittent failures of the test
> subscription/031_column_list [1], I was wondering what makes another
> subscription test (014_binary) fail on the buildfarm:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39
>
> All those failures are caused by a timeout when waiting for a message expected
> in _subscriber.log. For example, in the snakefly's case:
> [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type
> timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data 
> format) at
> /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl 
> line 269.
>
> _subscriber.log contains:
> 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement: ALTER 
> SUBSCRIPTION tsub REFRESH PUBLICATION;
> 2023-12-21 01:17:46.756 UTC [40] ERROR:  could not receive data from WAL 
> stream: server closed the connection
> unexpectedly
>  This probably means the server terminated abnormally
>  before or while processing the request.
> 2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical 
> replication apply worker" (PID 40) exited with
> exit code 1
> 2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply worker 
> for subscription "tsub" has started
> ...
>
> While _subscriber.log from a successful test run contains:
> 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement: ALTER 
> SUBSCRIPTION tsub REFRESH PUBLICATION;
> 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: 
> session time: 0:00:00.014 user=postgres
> database=postgres host=[local]
> 2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply worker 
> for subscription "tsub" has started
> 2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table 
> synchronization worker for subscription "tsub",
> table "test_mismatching_types" has started
> 2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data format
> 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY test_mismatching_types, 
> line 1, column a
>
> In this case, "logical replication apply worker for subscription "tsub" has
> started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later.
>
> I've managed to reproduce this failure locally by running multiple tests in
> parallel, and my analysis shows that it is caused by a race condition when
> accessing variable table_states_valid inside tablesync.c.
>
> tablesync.c does the following with table_states_valid:
> /*
>   * Callback from syscache invalidation.
>   */
> void
> invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
> {
>  table_states_valid = false;
> }
> ...
> static bool
> FetchTableStates(bool *started_tx)
> {
> ...
>  if (!table_states_valid)
>  {
> ...
>  /* Fetch all non-ready tables. */
>  rstates = GetSubscriptionRelations(MySubscription->oid, true);
> ...
>  table_states_valid = true;
>  }
>
> So, when syscache invalidation occurs inside the code block
> "if (!table_states_valid)", that invalidation is effectively ignored.
>
> In a failed case I observe the following events:
> 1. logical replication apply worker performs
>   LogicalRepApplyLoop() -> process_syncing_tables() ->
>   process_syncing_tables_for_apply() -> FetchTableStates() periodically.
>
> 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
>   for SUBSCRIPTIONRELMAP, and that leads to calling
>   invalidate_syncing_table_states().
>
> 3. If the apply worker manages to fetch no non-ready tables in
>   FetchTableStates() and ignore "table_states_valid = false" from
>   invalidate_syncing_table_states(), then it just misses the invalidation
>   event, so it keeps working without noticing non-ready tables appeared as
>   the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips
>   a loop "foreach(lc, table_states_not_ready) ..." until some other event
>   occurs).
>
> pg_usleep(10) added just below GetSubscriptionRelations(...) proves my
> analysis -- without it, I need tens of iterations (with 50 tests running in
> parallel) to catch the failure, but with it, I get the failure on the first
> iteration.

Thanks for the analysis, I was able to reproduce the issue with the
steps you had shared. I agree with your analysis. I added some logs to
verify that the invalidation was getting missed.

I felt that this invalidation is getting ignored because we have used
a boolean variable here; how about changing it slightly so that
table_states_invalid gets incremented for