Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Thu, 25 Apr 2024 at 16:20, Amit Kapila wrote: > > On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Wednesday, April 24, 2024 6:29 PM vignesh C wrote: > > > > > > > > > The attached patch > > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch > > > applies for master and PG16 branch, > > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch > > > applies for PG15 branch. > > > > Thanks, I have verified that the patches can be applied cleanly and fix the > > issue on each branch. The regression test can also pass after applying the > > patch > > on my machine. > > > > Pushed. Thanks for pushing this, I have marked the commitfest entry at [1] as committed. [1] - https://commitfest.postgresql.org/48/4816/ Regards, Vignesh
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Wed, 24 Apr 2024 at 15:49, Amit Kapila wrote: > > On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila wrote: > > > > On Wed, Mar 13, 2024 at 11:59 AM vignesh C wrote: > > > > > > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu) > > > wrote: > > > > > > > > > > > > For 0002, instead of avoid resetting the latch, is it possible to let > > > > the > > > > logical rep worker wake up the launcher once after attaching ? > > > > > > Waking up of the launch process uses the same latch that is used for > > > subscription creation/modification and apply worker process exit. As > > > the handling of this latch for subscription creation/modification and > > > worker process exit can be done only by ApplyLauncherMain, we will not > > > be able to reset the latch in WaitForReplicationWorkerAttach. I feel > > > waking up the launcher process might not help in this case as > > > currently we will not be able to differentiate between worker > > > attached, subscription creation/modification and apply worker process > > > exit. > > > > > > > IIUC, even if we set the latch once the worker attaches, the other > > set_latch by subscription creation/modification or apply_worker_exit > > could also be consumed due to reset of latch in > > WaitForReplicationWorkerAttach(). Is that understanding correct? If > > so, can we use some other way to wake up > > WaitForReplicationWorkerAttach() say condition variable? > > > > The other possibility is to have a GUC launcher_retry_time or > something like that instead of using a DEFAULT_NAPTIME_PER_CYCLE. This > still may not resolve the issue if launcher_retry_time is longer but > at least users would be able to configure it. I am not sure if this is > a good idea or not but just trying to brainstorm different ideas to > solve this problem. > > BTW, as far as I understand, this is an improvement in the existing > code, so should be done only for HEAD (probably PG18) and should be > discussed in a separate thread. 
I have started a new thread at [1] to discuss this: https://www.postgresql.org/message-id/CALDaNm01_KEgHM1tKtgXkCGLJ5209SMSmGw3UmhZbOz365_%3DeA%40mail.gmail.com Regards, Vignesh
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu) wrote: > > On Wednesday, April 24, 2024 6:29 PM vignesh C wrote: > > > > > > The attached patch > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch > > applies for master and PG16 branch, > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch > > applies for PG15 branch. > > Thanks, I have verified that the patches can be applied cleanly and fix the > issue on each branch. The regression test can also pass after applying the > patch > on my machine. > Pushed. -- With Regards, Amit Kapila.
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Wednesday, April 24, 2024 6:29 PM vignesh C wrote: > > On Wed, 24 Apr 2024 at 11:59, Amit Kapila wrote: > > > > On Wed, Mar 13, 2024 at 9:19 AM vignesh C wrote: > > > > > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian wrote: > > > > > > > > > > > > > > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C > wrote: > > > >> > > > >> > > > >> Thanks, I have created the following Commitfest entry for this: > > > >> https://commitfest.postgresql.org/47/4816/ > > > >> > > > >> Regards, > > > >> Vignesh > > > > > > > > > > > > Thanks for the patch, I have verified that the fix works well by > > > > following > the steps mentioned to reproduce the problem. > > > > Reviewing the patch, it seems good and is well documented. Just one > minor comment I had was probably to change the name of the variable > table_states_valid to table_states_validity. The current name made sense when > it was a bool, but now that it is a tri-state enum, it doesn't fit well. > > > > > > Thanks for reviewing the patch, the attached v6 patch has the > > > changes for the same. > > > > > > > v6_0001* looks good to me. This should be backpatched unless you or > > others think otherwise. > > This patch needs to be committed in master,PG16 and PG15. > This is not required from PG14 onwards because they don't have > HasSubscriptionRelations call before updating table_states_valid: > /* > * Does the subscription have tables? > * > * If there were not-READY relations found then we know it does. But > * if table_states_not_ready was empty we still need to check again to > * see if there are 0 tables. > */ > has_subrels = (table_states_not_ready != NIL) || > HasSubscriptionRelations(MySubscription->oid); > > So the invalidation function will not be called here. 
> > Whereas for PG15 and newer versions this is applicable: > HasSubscriptionRelations calls table_open function which will get the > invalidate callback like in: > HasSubscriptionRelations -> table_open -> relation_open -> LockRelationOid > -> AcceptInvalidationMessages -> ReceiveSharedInvalidMessages -> > LocalExecuteInvalidationMessage -> CallSyscacheCallbacks -> > invalidate_syncing_table_states > > The attached patch > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch > applies for master and PG16 branch, > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch > applies for PG15 branch. Thanks, I have verified that the patches can be applied cleanly and fix the issue on each branch. The regression test can also pass after applying the patch on my machine. Best Regards, Hou zj
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Wed, 24 Apr 2024 at 11:59, Amit Kapila wrote:
>
> On Wed, Mar 13, 2024 at 9:19 AM vignesh C wrote:
> >
> > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian wrote:
> > >
> > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C wrote:
> > >>
> > >> Thanks, I have created the following Commitfest entry for this:
> > >> https://commitfest.postgresql.org/47/4816/
> > >>
> > >> Regards,
> > >> Vignesh
> > >
> > > Thanks for the patch, I have verified that the fix works well by
> > > following the steps mentioned to reproduce the problem.
> > > Reviewing the patch, it seems good and is well documented. Just one minor
> > > comment I had was probably to change the name of the variable
> > > table_states_valid to table_states_validity. The current name made sense
> > > when it was a bool, but now that it is a tri-state enum, it doesn't fit
> > > well.
> >
> > Thanks for reviewing the patch, the attached v6 patch has the changes
> > for the same.
> >
>
> v6_0001* looks good to me. This should be backpatched unless you or
> others think otherwise.

This patch needs to be committed on master, PG16 and PG15. It is not
required for PG14 and earlier, because those branches do not have the
HasSubscriptionRelations call before updating table_states_valid:

/*
 * Does the subscription have tables?
 *
 * If there were not-READY relations found then we know it does. But
 * if table_states_not_ready was empty we still need to check again to
 * see if there are 0 tables.
 */
has_subrels = (table_states_not_ready != NIL) ||
	HasSubscriptionRelations(MySubscription->oid);

So the invalidation function will not be called there.
Whereas for PG15 and newer versions this is applicable:
HasSubscriptionRelations calls the table_open function, which receives the
invalidation callback like so:

HasSubscriptionRelations -> table_open -> relation_open -> LockRelationOid
-> AcceptInvalidationMessages -> ReceiveSharedInvalidMessages ->
LocalExecuteInvalidationMessage -> CallSyscacheCallbacks ->
invalidate_syncing_table_states

The attached patch
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
applies to the master and PG16 branches;
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
applies to the PG15 branch.

Regards,
Vignesh

From 93116320bd9ffbcddc83a1fc253cba90c56cb47e Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v7] Table sync missed due to race condition in subscription
 cache handling.

Table sync was missed if an invalidation arrived while the list of
not-ready tables was being prepared. This occurred because the table
state was being set to valid at the end of the not-ready table list
preparation, irrespective of any invalidations that occurred while the
list was being prepared.

Fixed it by changing the boolean variable to a tri-state enum and by
setting the table state to valid only if no invalidations occurred
while the list was being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 81fff1919d..7d4ee48206 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -122,7 +122,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
@@ -272,7 +279,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1556,13 +1563,15 @@ FetchTableStates(bool *started_tx)
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_validity != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_validity = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1596,7 +1605,15 @@ FetchTableStates(bool *started_tx)
 	has_subrels = (table_states_not_ready != NIL) ||
 		HasSubscriptionRelations(MySubscription->oid);
 
-	table_states_valid = true;
+	/*
+	 * If the subscription relation cache has been invalidated since we
+	 * entered this routine, we still use and return the relations we just
+	 * finished constructing, to avoid infinite loops, but we leave the
+	 * table states marked as stale so that we'll rebuild
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila wrote: > > On Wed, Mar 13, 2024 at 11:59 AM vignesh C wrote: > > > > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu) > > wrote: > > > > > > > > > For 0002, instead of avoid resetting the latch, is it possible to let the > > > logical rep worker wake up the launcher once after attaching ? > > > > Waking up of the launch process uses the same latch that is used for > > subscription creation/modification and apply worker process exit. As > > the handling of this latch for subscription creation/modification and > > worker process exit can be done only by ApplyLauncherMain, we will not > > be able to reset the latch in WaitForReplicationWorkerAttach. I feel > > waking up the launcher process might not help in this case as > > currently we will not be able to differentiate between worker > > attached, subscription creation/modification and apply worker process > > exit. > > > > IIUC, even if we set the latch once the worker attaches, the other > set_latch by subscription creation/modification or apply_worker_exit > could also be consumed due to reset of latch in > WaitForReplicationWorkerAttach(). Is that understanding correct? If > so, can we use some other way to wake up > WaitForReplicationWorkerAttach() say condition variable? > The other possibility is to have a GUC launcher_retry_time or something like that instead of using a DEFAULT_NAPTIME_PER_CYCLE. This still may not resolve the issue if launcher_retry_time is longer but at least users would be able to configure it. I am not sure if this is a good idea or not but just trying to brainstorm different ideas to solve this problem. BTW, as far as I understand, this is an improvement in the existing code, so should be done only for HEAD (probably PG18) and should be discussed in a separate thread. -- With Regards, Amit Kapila.
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Wed, Mar 13, 2024 at 9:19 AM vignesh C wrote: > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian wrote: > > > > > > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C wrote: > >> > >> > >> Thanks, I have created the following Commitfest entry for this: > >> https://commitfest.postgresql.org/47/4816/ > >> > >> Regards, > >> Vignesh > > > > > > Thanks for the patch, I have verified that the fix works well by following > > the steps mentioned to reproduce the problem. > > Reviewing the patch, it seems good and is well documented. Just one minor > > comment I had was probably to change the name of the variable > > table_states_valid to table_states_validity. The current name made sense > > when it was a bool, but now that it is a tri-state enum, it doesn't fit > > well. > > Thanks for reviewing the patch, the attached v6 patch has the changes > for the same. > v6_0001* looks good to me. This should be backpatched unless you or others think otherwise. -- With Regards, Amit Kapila.
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Wed, Mar 13, 2024 at 11:59 AM vignesh C wrote:
>
> On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu) wrote:
> >
> > For 0002, instead of avoiding resetting the latch, is it possible to let
> > the logical rep worker wake up the launcher once after attaching?
>
> Waking up the launcher process uses the same latch that is used for
> subscription creation/modification and apply worker process exit. As
> the handling of this latch for subscription creation/modification and
> worker process exit can be done only by ApplyLauncherMain, we will not
> be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> waking up the launcher process might not help in this case, as
> currently we will not be able to differentiate between worker attach,
> subscription creation/modification, and apply worker process exit.
>

IIUC, even if we set the latch once the worker attaches, the other
set_latch by subscription creation/modification or apply_worker_exit
could also be consumed due to the reset of the latch in
WaitForReplicationWorkerAttach(). Is that understanding correct? If
so, can we use some other way to wake up
WaitForReplicationWorkerAttach(), say a condition variable?

The current proposal can fix the issue but looks a bit ad hoc.

--
With Regards,
Amit Kapila.
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Wednesday, March 13, 2024 11:49 AM vignesh C wrote:
>
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian wrote:
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C wrote:
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> > Thanks for the patch, I have verified that the fix works well by following
> > the steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor
> > comment I had was probably to change the name of the variable
> > table_states_valid to table_states_validity. The current name made sense
> > when it was a bool, but now that it is a tri-state enum, it doesn't fit
> > well.
>
> Thanks for reviewing the patch, the attached v6 patch has the changes for
> the same.

FYI, I noticed that there is one open item on
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items which is related
to the fix in this thread:

--
Intermittent failures in 040_standby_failover_slots_sync test
Possible solution in this thread: Race condition in FetchTableStates
--

AFAICS, the bug discussed here is not a new issue in PG17, so I am thinking
of moving the item to the "Older bugs affecting stable branches" section if
there are no objections.

Best Regards,
Hou zj
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu) wrote:
>
> On Wednesday, March 13, 2024 11:49 AM vignesh C wrote:
> > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian wrote:
> > >
> > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C wrote:
> > >>
> > >> Thanks, I have created the following Commitfest entry for this:
> > >> https://commitfest.postgresql.org/47/4816/
> > >>
> > >> Regards,
> > >> Vignesh
> > >
> > > Thanks for the patch, I have verified that the fix works well by
> > > following the steps mentioned to reproduce the problem.
> > > Reviewing the patch, it seems good and is well documented. Just one minor
> > > comment I had was probably to change the name of the variable
> > > table_states_valid to table_states_validity. The current name made sense
> > > when it was a bool, but now that it is a tri-state enum, it doesn't fit
> > > well.
> >
> > Thanks for reviewing the patch, the attached v6 patch has the changes for
> > the same.
>
> Thanks for the patches.
>
> I saw a recent similar BF error[1] which seems related to the issue that the
> 0001 patch is trying to solve, i.e. the table sync worker is somehow not
> started after refreshing the publication on the subscriber. I didn't see
> other related ERRORs in the log, so I think the reason is the same as the
> one being discussed in this thread, which is that the table state
> invalidation got lost. The 0001 patch looks good to me.
>
> For 0002, instead of avoiding resetting the latch, is it possible to let the
> logical rep worker wake up the launcher once after attaching?

Waking up the launcher process uses the same latch that is used for
subscription creation/modification and apply worker process exit. As
the handling of this latch for subscription creation/modification and
worker process exit can be done only by ApplyLauncherMain, we will not
be able to reset the latch in WaitForReplicationWorkerAttach.

I feel waking up the launcher process might not help in this case, as
currently we will not be able to differentiate between worker attach,
subscription creation/modification, and apply worker process exit.

Regards,
Vignesh
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Wednesday, March 13, 2024 11:49 AM vignesh C wrote:
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian wrote:
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C wrote:
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> > Thanks for the patch, I have verified that the fix works well by following
> > the steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor
> > comment I had was probably to change the name of the variable
> > table_states_valid to table_states_validity. The current name made sense
> > when it was a bool, but now that it is a tri-state enum, it doesn't fit
> > well.
>
> Thanks for reviewing the patch, the attached v6 patch has the changes for
> the same.

Thanks for the patches.

I saw a recent similar BF error[1] which seems related to the issue that the
0001 patch is trying to solve, i.e. the table sync worker is somehow not
started after refreshing the publication on the subscriber. I didn't see
other related ERRORs in the log, so I think the reason is the same as the one
being discussed in this thread, which is that the table state invalidation
got lost. The 0001 patch looks good to me.

For 0002, instead of avoiding resetting the latch, is it possible to let the
logical rep worker wake up the launcher once after attaching?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2024-03-11%2000%3A52%3A42

Best Regards,
Hou zj
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Tue, 12 Mar 2024 at 09:34, Ajin Cherian wrote:
>
> On Tue, Mar 12, 2024 at 2:59 PM vignesh C wrote:
>>
>> Thanks, I have created the following Commitfest entry for this:
>> https://commitfest.postgresql.org/47/4816/
>>
>> Regards,
>> Vignesh
>
> Thanks for the patch, I have verified that the fix works well by following
> the steps mentioned to reproduce the problem.
> Reviewing the patch, it seems good and is well documented. Just one minor
> comment I had was probably to change the name of the variable
> table_states_valid to table_states_validity. The current name made sense
> when it was a bool, but now that it is a tri-state enum, it doesn't fit
> well.

Thanks for reviewing the patch, the attached v6 patch has the changes
for the same.

Regards,
Vignesh

From ab68e343b4b3f7ec09f758089a9f90a16f21de42 Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v6 2/2] Apply worker will get started after 180 seconds by
 the launcher in case the apply worker exits immediately after startup.

The apply worker was getting started only after the launcher's 180-second
timeout in case the apply worker exits immediately after startup. This was
happening because the launcher process's latch was getting reset after the
apply worker was started, which caused the launcher to wait for the next
180-second timeout before starting the apply worker.

Fixed this issue by not resetting the latch, as this latch will be set for
subscription modifications and apply worker exit. We should check whether
the new worker needs to be started or not and reset the latch in
ApplyLauncherMain.
---
 src/backend/replication/logical/launcher.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 66070e9131..f2545482c8 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -185,7 +185,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -220,16 +219,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/*
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach. But we don't expect to have to wait long.
+		 * Since this latch is also used for subscription creation/modification
+		 * and apply worker process exit signal handling, the latch must not be
+		 * reset here. We should check if the new worker needs to be started or
+		 * not and reset the latch in ApplyLauncherMain.
 		 */
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		(void) WaitLatch(MyLatch,
+						 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+						 10L, WAIT_EVENT_BGWORKER_STARTUP);
 	}
 }
--
2.34.1

From 47c6cd886902693074ece4b2a1188dce6f066bf8 Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v6 1/2] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if an invalidation arrived while the list of not-ready
tables was being prepared. This occurred because the table state was being
set to valid at the end of the not-ready table list preparation,
irrespective of any invalidations that occurred while the list was being
prepared.
Fixed it by changing the boolean variable to a tri-state enum and by setting
the table state to valid only if no invalidations occurred while the list
was being prepared.

---
 src/backend/replication/logical/tablesync.c | 25 +
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ee06629088..eefe620e5e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 	*started_tx = false;
 
-	if
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Tue, Mar 12, 2024 at 2:59 PM vignesh C wrote: > > Thanks, I have created the following Commitfest entry for this: > https://commitfest.postgresql.org/47/4816/ > > Regards, > Vignesh > Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem. Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well. regards, Ajin Cherian Fujitsu Australia
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Thu, 8 Feb 2024 at 23:30, Alexander Lakhin wrote: > > 08.02.2024 12:25, vignesh C wrote: > > Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was > > getting missed in this case. The wakeup call can be sent during > > subscription creation/modification and when the apply worker exits. > > WaitForReplicationWorkerAttach should not reset the latch here as it > > will end up delaying the apply worker to get started after 180 seconds > > timeout(DEFAULT_NAPTIME_PER_CYCLE). The attached patch does not reset > > the latch and lets ApplyLauncherMain to reset the latch and checks if > > any new worker or missing worker needs to be started. > > Thank you for the updated patch! > I ran all the subscription tests in a loop (with the sleeps added as > before) and observed no failures and 180+ seconds duration. Thanks, I have created the following Commitfest entry for this: https://commitfest.postgresql.org/47/4816/ Regards, Vignesh
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
08.02.2024 12:25, vignesh C wrote: Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was getting missed in this case. The wakeup call can be sent during subscription creation/modification and when the apply worker exits. WaitForReplicationWorkerAttach should not reset the latch here as it will end up delaying the apply worker to get started after 180 seconds timeout(DEFAULT_NAPTIME_PER_CYCLE). The attached patch does not reset the latch and lets ApplyLauncherMain to reset the latch and checks if any new worker or missing worker needs to be started. Thank you for the updated patch! I ran all the subscription tests in a loop (with the sleeps added as before) and observed no failures and 180+ seconds duration. Best regards, Alexander
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin wrote:
>
> 05.02.2024 13:13, vignesh C wrote:
> > Thanks for the steps for the issue, I was able to reproduce this issue
> > in my environment with the steps provided. The attached patch has a
> > proposed fix where the latch will not be set in case of the apply
> > worker exiting immediately after starting.
>
> It looks like the proposed fix doesn't help when ApplyLauncherWakeup() is
> called by a backend executing the CREATE SUBSCRIPTION command.
> That is, with the v4-0002 patch applied and pg_usleep(30L); added
> just below
> if (!worker_in_use)
> return worker_in_use;
> I still observe the test 027_nosuperuser running for 3+ minutes:
> t/027_nosuperuser.pl .. ok
> All tests successful.
> Files=1, Tests=19, 187 wallclock secs ( 0.01 usr 0.00 sys + 4.82 cusr 4.47 csys = 9.30 CPU)
>
> IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
> regression_sub ...", gets missed when the launcher waits for the start of
> another worker (the logical replication worker for subscription
> "admin_sub"), launched just before that command.

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as that
would delay the start of the apply worker until the 180-second timeout
(DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not reset
the latch and lets ApplyLauncherMain reset the latch and check whether
any new worker or missing worker needs to be started.

Regards,
Vignesh

From f04db050a583c9c01eb77766f830a0cf77b0a6c7 Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v5 2/2] Apply worker will get started after 180 seconds by
 the launcher in case the apply worker exits immediately after startup.
The apply worker was getting started only after the launcher's 180-second
timeout in case the apply worker exits immediately after startup. This was
happening because the launcher process's latch was getting reset after the
apply worker was started, which caused the launcher to wait for the next
180-second timeout before starting the apply worker.

Fixed this issue by not resetting the latch, as this latch will be set for
subscription modifications and apply worker exit. We should check whether
the new worker needs to be started or not and reset the latch in
ApplyLauncherMain.

---
 src/backend/replication/logical/launcher.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..a754f2c757 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -191,7 +191,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -226,16 +225,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/*
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach. But we don't expect to have to wait long.
+		 * Since this latch is also used for subscription creation/modification
+		 * and apply worker process exit signal handling, the latch must not be
+		 * reset here. We should check if the new worker needs to be started or
+		 * not and reset the latch in ApplyLauncherMain.
*/ - rc = WaitLatch(MyLatch, - WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, - 10L, WAIT_EVENT_BGWORKER_STARTUP); - - if (rc & WL_LATCH_SET) - { - ResetLatch(MyLatch); - CHECK_FOR_INTERRUPTS(); - } + (void) WaitLatch(MyLatch, + WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + 10L, WAIT_EVENT_BGWORKER_STARTUP); } } -- 2.34.1 From 385c96a71396ea8efa86d6136bbf0bfe5282f1d1 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Thu, 1 Feb 2024 09:46:40 +0530 Subject: [PATCH v5 1/2] Table sync missed for newly added tables because subscription relation cache invalidation was not handled in certain concurrent scenarios. Table sync was missed if the invalidation of table sync occurs while the non ready tables were getting prepared. This occurrs because the table state was being set to valid at the end of non ready table list preparation irrespective of any inavlidations occurred while preparing the list. Fixed it by changing the boolean variable to an tri-state enum and by setting table state to valid only if no invalidations have been occurred while the list is being prepared. --- src/backend/replication/logical/tablesync.c | 25 + src/tools/pgindent/typedefs.list| 1 + 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
05.02.2024 13:13, vignesh C wrote:
> Thanks for the steps for the issue, I was able to reproduce this issue
> in my environment with the steps provided. The attached patch has a
> proposed fix where the latch will not be set in case of the apply
> worker exiting immediately after starting.

It looks like the proposed fix doesn't help when ApplyLauncherWakeup() is called by a backend executing a CREATE SUBSCRIPTION command. That is, with the v4-0002 patch applied and pg_usleep(30L); added just below
if (!worker_in_use)
return worker_in_use;
I still observe the test 027_nosuperuser running for 3+ minutes:
t/027_nosuperuser.pl .. ok
All tests successful.
Files=1, Tests=19, 187 wallclock secs ( 0.01 usr 0.00 sys + 4.82 cusr 4.47 csys = 9.30 CPU)

IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION regression_sub ...", gets missed when the launcher waits for the start of another worker (the logical replication worker for subscription "admin_sub"), launched just before that command.

Best regards,
Alexander
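The lost-wakeup mechanism discussed in this thread can be modeled in isolation. The sketch below is a deliberately simplified stand-in (the names `Latch`, `set_latch`, `wait_with_reset`, and `wait_without_reset` are hypothetical, not the real latch API in src/backend/storage/ipc/latch.c): it only shows why a wait that resets the latch consumes a wakeup that ApplyLauncherMain still needs, while a wait that leaves the latch alone preserves it.

```c
#include <stdbool.h>

/* A minimal stand-in for a process latch: just a "set" flag. */
typedef struct Latch
{
	bool		is_set;
} Latch;

/* Stand-in for SetLatch(), as called from the SIGUSR1 handler on wakeup. */
static void
set_latch(Latch *latch)
{
	latch->is_set = true;
}

/*
 * Pre-fix WaitForReplicationWorkerAttach behaviour: observe the latch,
 * then reset it -- consuming any wakeup that arrived in the meantime.
 */
static bool
wait_with_reset(Latch *latch)
{
	bool		was_set = latch->is_set;

	latch->is_set = false;		/* ResetLatch: the wakeup is lost here */
	return was_set;
}

/*
 * Post-fix behaviour: wait only on timeout/postmaster death and leave
 * the latch untouched, so ApplyLauncherMain still sees the wakeup.
 */
static bool
wait_without_reset(const Latch *latch)
{
	return latch->is_set;
}
```

With `wait_with_reset`, a wakeup sent while the launcher waits for a worker to attach is silently consumed, and the subsequent wait in ApplyLauncherMain falls back to the 180 second DEFAULT_NAPTIME_PER_CYCLE timeout; with `wait_without_reset`, the wakeup survives for the main loop to act on.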
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin wrote: > > Hello Vignesh and Hou-san, > > 01.02.2024 07:59, vignesh C wrote: > > Here is an updated patch which changes the boolean variable to a > > tri-state enum and set stable state to valid only if no invalidations > > have been occurred while the list is being prepared. > > > > While testing the v3 patch, I observed another anomaly with the 014_binary > test. When I run the test in parallel, I see that sometimes one of the test > instances runs much longer than others. For example: > 49 All tests successful. > 49 Files=1, Tests=8, 4 wallclock secs ( 0.03 usr 0.01 sys + 0.43 cusr > 0.30 csys = 0.77 CPU) > 49 Result: PASS > 12 All tests successful. > 12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr 0.01 sys + 0.46 cusr > 0.40 csys = 0.89 CPU) > 12 Result: PASS > > As far as I can see, this happens due to another race condition, this time > in launcher.c. > For such a three-minute case I see in _subscriber.log: > 2024-02-01 14:33:13.604 UTC [949255] DEBUG: relation > "public.test_mismatching_types" does not exist > 2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for > replication origin "pg_16398" during message > type "INSERT" in transaction 757, finished at 0/153C838 > 2024-02-01 14:33:13.604 UTC [949255] ERROR: logical replication target > relation "public.test_mismatching_types" does > not exist > 2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for > replication origin "pg_16398" during message > type "INSERT" in transaction 757, finished at 0/153C838 > ... 
> 2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG: statement: CREATE > TABLE public.test_mismatching_types ( > a int PRIMARY KEY > ); > 2024-02-01 14:33:13.605 UTC [942451] DEBUG: unregistering background worker > "logical replication apply worker for > subscription 16398" > 2024-02-01 14:33:13.605 UTC [942451] LOG: background worker "logical > replication apply worker" (PID 949255) exited with > exit code 1 > ... > 2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG: statement: ALTER > SUBSCRIPTION tsub REFRESH PUBLICATION; > ... > 2024-02-01 14:36:13.642 UTC [942527] DEBUG: starting logical replication > worker for subscription "tsub" > (there is no interesting activity between 14:33:13 and 14:36:13) > > So logical replication apply worker exited because CREATE TABLE was not > executed on subscriber yet, and new replication worker started because of a > timeout occurred in WaitLatch(..., wait_time, ...) inside > ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE > (180 sec)). > > But in a normal (fast) case the same WaitLatch exits due to MyLatch set as > a result of: > logical replication apply worker| logicalrep_worker_onexit() -> >ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) -> > launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch)). 
> > In a bad case, I see that the SetLatch() called as well, but then the latch > gets reset by the following code in WaitForReplicationWorkerAttach(): > rc = WaitLatch(MyLatch, > WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > 10L, WAIT_EVENT_BGWORKER_STARTUP); > > if (rc & WL_LATCH_SET) > { > ResetLatch(MyLatch); > CHECK_FOR_INTERRUPTS(); > } > > With pg_usleep(30); added just before ResetLatch and > $node_subscriber->safe_psql( > 'postgres', qq( > +SELECT pg_sleep(0.5); > CREATE TABLE public.test_mismatching_types ( > in 014_binary.pl, I can see the anomaly without running tests in parallel, > just when running that test in a loop: > for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C > src/test/subscription/ PROVE_TESTS="t/014*"; done > ... > iteration 2 > # +++ tap check in src/test/subscription +++ > t/014_binary.pl .. ok > All tests successful. > Files=1, Tests=8, 5 wallclock secs ( 0.00 usr 0.00 sys + 0.24 cusr 0.18 > csys = 0.42 CPU) > Result: PASS > iteration 3 > # +++ tap check in src/test/subscription +++ > t/014_binary.pl .. ok > All tests successful. > Files=1, Tests=8, 183 wallclock secs ( 0.02 usr 0.00 sys + 0.28 cusr 0.25 > csys = 0.55 CPU) > Result: PASS > ... > > In other words, the abnormal test execution takes place due to the > following events: > 1. logicalrep worker launcher launches replication worker and waits for it > to attach: > ApplyLauncherMain() -> logicalrep_worker_launch() -> > WaitForReplicationWorkerAttach() > 2. logicalrep worker exits due to some error (logical replication target > relation does not exist, in our case) and sends a signal to set the latch > for launcher > 3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets > it inside WaitForReplicationWorkerAttach() > 4. launcher gets back to ApplyLauncherMain() where it waits for the latch > (not set)
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Hello Vignesh and Hou-san, 01.02.2024 07:59, vignesh C wrote: Here is an updated patch which changes the boolean variable to a tri-state enum and set stable state to valid only if no invalidations have been occurred while the list is being prepared. While testing the v3 patch, I observed another anomaly with the 014_binary test. When I run the test in parallel, I see that sometimes one of the test instances runs much longer than others. For example: 49 All tests successful. 49 Files=1, Tests=8, 4 wallclock secs ( 0.03 usr 0.01 sys + 0.43 cusr 0.30 csys = 0.77 CPU) 49 Result: PASS 12 All tests successful. 12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr 0.01 sys + 0.46 cusr 0.40 csys = 0.89 CPU) 12 Result: PASS As far as I can see, this happens due to another race condition, this time in launcher.c. For such a three-minute case I see in _subscriber.log: 2024-02-01 14:33:13.604 UTC [949255] DEBUG: relation "public.test_mismatching_types" does not exist 2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838 2024-02-01 14:33:13.604 UTC [949255] ERROR: logical replication target relation "public.test_mismatching_types" does not exist 2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838 ... 2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG: statement: CREATE TABLE public.test_mismatching_types ( a int PRIMARY KEY ); 2024-02-01 14:33:13.605 UTC [942451] DEBUG: unregistering background worker "logical replication apply worker for subscription 16398" 2024-02-01 14:33:13.605 UTC [942451] LOG: background worker "logical replication apply worker" (PID 949255) exited with exit code 1 ... 2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; ... 
2024-02-01 14:36:13.642 UTC [942527] DEBUG: starting logical replication worker for subscription "tsub"
(there is no interesting activity between 14:33:13 and 14:36:13)

So the logical replication apply worker exited because CREATE TABLE was not executed on the subscriber yet, and a new replication worker started because a timeout occurred in WaitLatch(..., wait_time, ...) inside ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE (180 sec)).

But in a normal (fast) case the same WaitLatch exits due to MyLatch being set as a result of:
logical replication apply worker| logicalrep_worker_onexit() -> ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch).

In a bad case, I see that SetLatch() is called as well, but then the latch gets reset by the following code in WaitForReplicationWorkerAttach():
rc = WaitLatch(MyLatch,
               WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
               10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}

With pg_usleep(30); added just before ResetLatch and
$node_subscriber->safe_psql(
    'postgres', qq(
+SELECT pg_sleep(0.5);
CREATE TABLE public.test_mismatching_types (
in 014_binary.pl, I can see the anomaly without running tests in parallel, just when running that test in a loop:
for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 5 wallclock secs ( 0.00 usr 0.00 sys + 0.24 cusr 0.18 csys = 0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr 0.00 sys + 0.28 cusr 0.25 csys = 0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the following events: 1.
logicalrep worker launcher launches a replication worker and waits for it to attach:
ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
2. logicalrep worker exits due to some error (logical replication target relation does not exist, in our case) and sends a signal to set the latch for the launcher
3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets it inside WaitForReplicationWorkerAttach()
4. launcher gets back to ApplyLauncherMain(), where it waits for the latch (not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms).

Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the test 027_nosuperuser executes for 3+ minutes on each run for me. And we can find such abnormal execution on the buildfarm too:
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Tue, 30 Jan 2024 at 17:22, Zhijie Hou (Fujitsu) wrote: > > On Tuesday, January 30, 2024 11:21 AM vignesh C wrote: > > > > On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Monday, January 29, 2024 9:22 PM vignesh C > > wrote: > > > > > > > > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin > > wrote: > > > > > > > > > > Hello hackers, > > > > > > > > > > After determining a possible cause for intermittent failures of > > > > > the test subscription/031_column_list [1], I was wondering what > > > > > makes another subscription test (014_binary) fail on the buildfarm: > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly > > > > > t=20 > > > > > 24-01-22%2001%3A19%3A03 > > > > > > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2 > > > > 02 > > > > > 4-01-14%2018%3A19%3A20 > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet > > > > > =202 > > > > > 3-12-21%2001%3A11%3A52 > > > > > > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2 > > > > 02 > > > > > 3-11-27%2001%3A42%3A39 > > > > > > > > > > All those failures caused by a timeout when waiting for a message > > > > > expected in _subscriber.log. For example, in the snakefly's case: > > > > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with > > > > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? > > > > > incorrect binary data format) at > > > > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_bin > > > > ary.pl > > > > line 269. 
> > > > > > > > > > _subscriber.log contains: > > > > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement: > > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > > > > > 2023-12-21 01:17:46.756 UTC [40] ERROR: could not receive > > > > > data from WAL stream: server closed the connection unexpectedly > > > > > This probably means the server terminated abnormally > > > > > before or while processing the request. > > > > > 2023-12-21 01:17:46.760 UTC [405057] LOG: background worker > > > > > "logical replication apply worker" (PID 40) exited with exit > > > > > code 1 > > > > > 2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication > > > > > apply worker for subscription "tsub" has started ... > > > > > > > > > > While _subscriber.log from a successful test run contains: > > > > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement: > > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > > > > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: > > > > > session time: 0:00:00.014 user=postgres database=postgres > > > > > host=[local] > > > > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication > > > > > apply worker for subscription "tsub" has started > > > > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication > > > > > table synchronization worker for subscription "tsub", table > > > > > "test_mismatching_types" has started > > > > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data > > > > > format > > > > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY > > > > > test_mismatching_types, line 1, column a > > > > > > > > > > In this case, "logical replication apply worker for subscription > > > > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3 > > > > > minutes > > > > later. 
> > > > > > > > > > I've managed to reproduce this failure locally by running multiple > > > > > tests in parallel, and my analysis shows that it is caused by a > > > > > race condition when accessing variable table_states_valid inside > > tablesync.c. > > > > > > > > > > tablesync.c does the following with table_states_valid: > > > > > /* > > > > > * Callback from syscache invalidation. > > > > > */ > > > > > void > > > > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32 > > > > > hashvalue) { > > > > > table_states_valid = false; > > > > > } > > > > > ... > > > > > static bool > > > > > FetchTableStates(bool *started_tx) { ... > > > > > if (!table_states_valid) > > > > > { > > > > > ... > > > > > /* Fetch all non-ready tables. */ > > > > > rstates = GetSubscriptionRelations(MySubscription->oid, > > > > > true); ... > > > > > table_states_valid = true; > > > > > } > > > > > > > > > > So, when syscache invalidation occurs inside the code block "if > > > > > (!table_states_valid)", that invalidation is effectively ignored. > > > > > > > > > > In a failed case I observe the following events: > > > > > 1. logical replication apply worker performs > > > > > LogicalRepApplyLoop() -> process_syncing_tables() -> > > > > > process_syncing_tables_for_apply() -> FetchTableStates() > > > > > periodically. > > > > > > > > > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates > > syscache > > > > > for SUBSCRIPTIONRELMAP, and that leads to calling > > > >
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Tuesday, January 30, 2024 11:21 AM vignesh C wrote: > > On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) > wrote: > > > > On Monday, January 29, 2024 9:22 PM vignesh C > wrote: > > > > > > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin > wrote: > > > > > > > > Hello hackers, > > > > > > > > After determining a possible cause for intermittent failures of > > > > the test subscription/031_column_list [1], I was wondering what > > > > makes another subscription test (014_binary) fail on the buildfarm: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly > > > > t=20 > > > > 24-01-22%2001%3A19%3A03 > > > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2 > > > 02 > > > > 4-01-14%2018%3A19%3A20 > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet > > > > =202 > > > > 3-12-21%2001%3A11%3A52 > > > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2 > > > 02 > > > > 3-11-27%2001%3A42%3A39 > > > > > > > > All those failures caused by a timeout when waiting for a message > > > > expected in _subscriber.log. For example, in the snakefly's case: > > > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with > > > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? > > > > incorrect binary data format) at > > > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_bin > > > ary.pl > > > line 269. > > > > > > > > _subscriber.log contains: > > > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement: > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > > > > 2023-12-21 01:17:46.756 UTC [40] ERROR: could not receive > > > > data from WAL stream: server closed the connection unexpectedly > > > > This probably means the server terminated abnormally > > > > before or while processing the request. 
> > > > 2023-12-21 01:17:46.760 UTC [405057] LOG: background worker > > > > "logical replication apply worker" (PID 40) exited with exit > > > > code 1 > > > > 2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication > > > > apply worker for subscription "tsub" has started ... > > > > > > > > While _subscriber.log from a successful test run contains: > > > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement: > > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > > > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: > > > > session time: 0:00:00.014 user=postgres database=postgres > > > > host=[local] > > > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication > > > > apply worker for subscription "tsub" has started > > > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication > > > > table synchronization worker for subscription "tsub", table > > > > "test_mismatching_types" has started > > > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data > > > > format > > > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY > > > > test_mismatching_types, line 1, column a > > > > > > > > In this case, "logical replication apply worker for subscription > > > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3 > > > > minutes > > > later. > > > > > > > > I've managed to reproduce this failure locally by running multiple > > > > tests in parallel, and my analysis shows that it is caused by a > > > > race condition when accessing variable table_states_valid inside > tablesync.c. > > > > > > > > tablesync.c does the following with table_states_valid: > > > > /* > > > > * Callback from syscache invalidation. > > > > */ > > > > void > > > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32 > > > > hashvalue) { > > > > table_states_valid = false; > > > > } > > > > ... > > > > static bool > > > > FetchTableStates(bool *started_tx) { ... 
> > > > if (!table_states_valid) > > > > { > > > > ... > > > > /* Fetch all non-ready tables. */ > > > > rstates = GetSubscriptionRelations(MySubscription->oid, > > > > true); ... > > > > table_states_valid = true; > > > > } > > > > > > > > So, when syscache invalidation occurs inside the code block "if > > > > (!table_states_valid)", that invalidation is effectively ignored. > > > > > > > > In a failed case I observe the following events: > > > > 1. logical replication apply worker performs > > > > LogicalRepApplyLoop() -> process_syncing_tables() -> > > > > process_syncing_tables_for_apply() -> FetchTableStates() periodically. > > > > > > > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates > syscache > > > > for SUBSCRIPTIONRELMAP, and that leads to calling > > > > invalidate_syncing_table_states(). > > > > > > > > 3. If the apply worker manages to fetch no non-ready tables in > > > > FetchTableStates() and ignore "table_states_valid = false" from > > > > invalidate_syncing_table_states(), then it just misses the > > > > invalidation > > > > event,
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) wrote: > > On Monday, January 29, 2024 9:22 PM vignesh C wrote: > > > > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin wrote: > > > > > > Hello hackers, > > > > > > After determining a possible cause for intermittent failures of the > > > test subscription/031_column_list [1], I was wondering what makes > > > another subscription test (014_binary) fail on the buildfarm: > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly=20 > > > 24-01-22%2001%3A19%3A03 > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=202 > > > 4-01-14%2018%3A19%3A20 > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=202 > > > 3-12-21%2001%3A11%3A52 > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=202 > > > 3-11-27%2001%3A42%3A39 > > > > > > All those failures caused by a timeout when waiting for a message > > > expected in _subscriber.log. For example, in the snakefly's case: > > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with > > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? > > > incorrect binary data format) at > > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl > > line 269. > > > > > > _subscriber.log contains: > > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement: > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > > > 2023-12-21 01:17:46.756 UTC [40] ERROR: could not receive data > > > from WAL stream: server closed the connection unexpectedly > > > This probably means the server terminated abnormally > > > before or while processing the request. > > > 2023-12-21 01:17:46.760 UTC [405057] LOG: background worker "logical > > > replication apply worker" (PID 40) exited with exit code 1 > > > 2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication apply > > > worker for subscription "tsub" has started ... 
> > > > > > While _subscriber.log from a successful test run contains: > > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement: > > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: > > > session time: 0:00:00.014 user=postgres database=postgres host=[local] > > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication apply > > > worker for subscription "tsub" has started > > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication table > > > synchronization worker for subscription "tsub", table > > > "test_mismatching_types" has started > > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data > > > format > > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY > > > test_mismatching_types, line 1, column a > > > > > > In this case, "logical replication apply worker for subscription > > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3 minutes > > later. > > > > > > I've managed to reproduce this failure locally by running multiple > > > tests in parallel, and my analysis shows that it is caused by a race > > > condition when accessing variable table_states_valid inside tablesync.c. > > > > > > tablesync.c does the following with table_states_valid: > > > /* > > > * Callback from syscache invalidation. > > > */ > > > void > > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32 > > > hashvalue) { > > > table_states_valid = false; > > > } > > > ... > > > static bool > > > FetchTableStates(bool *started_tx) > > > { > > > ... > > > if (!table_states_valid) > > > { > > > ... > > > /* Fetch all non-ready tables. */ > > > rstates = GetSubscriptionRelations(MySubscription->oid, > > > true); ... > > > table_states_valid = true; > > > } > > > > > > So, when syscache invalidation occurs inside the code block "if > > > (!table_states_valid)", that invalidation is effectively ignored. 
> > > > > > In a failed case I observe the following events: > > > 1. logical replication apply worker performs > > > LogicalRepApplyLoop() -> process_syncing_tables() -> > > > process_syncing_tables_for_apply() -> FetchTableStates() periodically. > > > > > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache > > > for SUBSCRIPTIONRELMAP, and that leads to calling > > > invalidate_syncing_table_states(). > > > > > > 3. If the apply worker manages to fetch no non-ready tables in > > > FetchTableStates() and ignore "table_states_valid = false" from > > > invalidate_syncing_table_states(), then it just misses the invalidation > > > event, so it keeps working without noticing non-ready tables appeared as > > > the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() > > skips > > > a loop "foreach(lc, table_states_not_ready) ..." until some other event > > > occurs). > > > > > > pg_usleep(10) added just below GetSubscriptionRelations(...) > > >
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Monday, January 29, 2024 9:22 PM vignesh C wrote: > > On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin wrote: > > > > Hello hackers, > > > > After determining a possible cause for intermittent failures of the > > test subscription/031_column_list [1], I was wondering what makes > > another subscription test (014_binary) fail on the buildfarm: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly=20 > > 24-01-22%2001%3A19%3A03 > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=202 > > 4-01-14%2018%3A19%3A20 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=202 > > 3-12-21%2001%3A11%3A52 > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=202 > > 3-11-27%2001%3A42%3A39 > > > > All those failures caused by a timeout when waiting for a message > > expected in _subscriber.log. For example, in the snakefly's case: > > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with > > custom type timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? > > incorrect binary data format) at > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl > line 269. > > > > _subscriber.log contains: > > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement: > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > > 2023-12-21 01:17:46.756 UTC [40] ERROR: could not receive data > > from WAL stream: server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > 2023-12-21 01:17:46.760 UTC [405057] LOG: background worker "logical > > replication apply worker" (PID 40) exited with exit code 1 > > 2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication apply > > worker for subscription "tsub" has started ... 
> > > > While _subscriber.log from a successful test run contains: > > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement: > > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: > > session time: 0:00:00.014 user=postgres database=postgres host=[local] > > 2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication apply > > worker for subscription "tsub" has started > > 2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication table > > synchronization worker for subscription "tsub", table > > "test_mismatching_types" has started > > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data > > format > > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY > > test_mismatching_types, line 1, column a > > > > In this case, "logical replication apply worker for subscription > > "tsub" has started" appears just after "ALTER SUBSCRIPTION", not 3 minutes > later. > > > > I've managed to reproduce this failure locally by running multiple > > tests in parallel, and my analysis shows that it is caused by a race > > condition when accessing variable table_states_valid inside tablesync.c. > > > > tablesync.c does the following with table_states_valid: > > /* > > * Callback from syscache invalidation. > > */ > > void > > invalidate_syncing_table_states(Datum arg, int cacheid, uint32 > > hashvalue) { > > table_states_valid = false; > > } > > ... > > static bool > > FetchTableStates(bool *started_tx) > > { > > ... > > if (!table_states_valid) > > { > > ... > > /* Fetch all non-ready tables. */ > > rstates = GetSubscriptionRelations(MySubscription->oid, > > true); ... > > table_states_valid = true; > > } > > > > So, when syscache invalidation occurs inside the code block "if > > (!table_states_valid)", that invalidation is effectively ignored. > > > > In a failed case I observe the following events: > > 1. 
logical replication apply worker performs > > LogicalRepApplyLoop() -> process_syncing_tables() -> > > process_syncing_tables_for_apply() -> FetchTableStates() periodically. > > > > 2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache > > for SUBSCRIPTIONRELMAP, and that leads to calling > > invalidate_syncing_table_states(). > > > > 3. If the apply worker manages to fetch no non-ready tables in > > FetchTableStates() and ignore "table_states_valid = false" from > > invalidate_syncing_table_states(), then it just misses the invalidation > > event, so it keeps working without noticing non-ready tables appeared as > > the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() > skips > > a loop "foreach(lc, table_states_not_ready) ..." until some other event > > occurs). > > > > pg_usleep(10) added just below GetSubscriptionRelations(...) > > proves my analysis -- without it, I need tens of iterations (with 50 > > tests running in > > parallel) to catch the failure, but with it, I get the failure on the > > first iteration. > > Thanks for the analysis, I was able to reproduce the issue with the steps you > had >
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin wrote: > > Hello hackers, > > After determining a possible cause for intermittent failures of the test > subscription/031_column_list [1], I was wondering what makes another > subscription test (014_binary) fail on the buildfarm: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39 > > All those failures were caused by a timeout when waiting for a message expected > in _subscriber.log. For example, in snakefly's case: > [01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type > timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data > format) at > /home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl > line 269. > > _subscriber.log contains: > 2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement: ALTER > SUBSCRIPTION tsub REFRESH PUBLICATION; > 2023-12-21 01:17:46.756 UTC [40] ERROR: could not receive data from WAL > stream: server closed the connection > unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > 2023-12-21 01:17:46.760 UTC [405057] LOG: background worker "logical > replication apply worker" (PID 40) exited with > exit code 1 > 2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication apply worker > for subscription "tsub" has started > ... 
> > While _subscriber.log from a successful test run contains: > 2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement: ALTER > SUBSCRIPTION tsub REFRESH PUBLICATION; > 2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: > session time: 0:00:00.014 user=postgres > database=postgres host=[local] > 2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication apply worker > for subscription "tsub" has started > 2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication table > synchronization worker for subscription "tsub", > table "test_mismatching_types" has started > 2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data format > 2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY test_mismatching_types, > line 1, column a > > In this case, "logical replication apply worker for subscription "tsub" has > started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later. > > I've managed to reproduce this failure locally by running multiple tests in > parallel, and my analysis shows that it is caused by a race condition when > accessing variable table_states_valid inside tablesync.c. > > tablesync.c does the following with table_states_valid: > /* > * Callback from syscache invalidation. > */ > void > invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue) > { > table_states_valid = false; > } > ... > static bool > FetchTableStates(bool *started_tx) > { > ... > if (!table_states_valid) > { > ... > /* Fetch all non-ready tables. */ > rstates = GetSubscriptionRelations(MySubscription->oid, true); > ... > table_states_valid = true; > } > > So, when syscache invalidation occurs inside the code block > "if (!table_states_valid)", that invalidation is effectively ignored. > > In a failed case I observe the following events: > 1. logical replication apply worker performs > LogicalRepApplyLoop() -> process_syncing_tables() -> > process_syncing_tables_for_apply() -> FetchTableStates() periodically. > > 2. 
ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache > for SUBSCRIPTIONRELMAP, and that leads to calling > invalidate_syncing_table_states(). > > 3. If the apply worker manages to fetch no non-ready tables in > FetchTableStates() and ignore "table_states_valid = false" from > invalidate_syncing_table_states(), then it just misses the invalidation > event, so it keeps working without noticing non-ready tables appeared as > the result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips > a loop "foreach(lc, table_states_not_ready) ..." until some other event > occurs). > > pg_usleep(10) added just below GetSubscriptionRelations(...) proves my > analysis -- without it, I need tens of iterations (with 50 tests running in > parallel) to catch the failure, but with it, I get the failure on the first > iteration. Thanks for the analysis, I was able to reproduce the issue with the steps you had shared. I agree with your analysis. I added some logs to verify that the invalidation was getting missed. I felt that this invalidation is getting ignored because we have used a boolean variable here; how about changing it slightly so that table_states_invalid gets incremented for