Dear hackers,

> > >
> > > Pushed!
> >
> > Hi all, the CF entry for this is marked RfC, and CI is trying to apply
> > the last patch committed. Is there further work that needs to be
> > re-attached and/or rebased?
> >
> 
> No. I have marked it as committed.
>

I found another failure related with the commit [1]. I think it is caused by the
autovacuum. I want to propose a patch which disables the feature for old 
publisher.

More detail, please see below.

# Analysis of the failure

Summary: this failure occurs when the autovacuum starts after the subscription
is disabled but before doing pg_upgrade.

According to the regress file, it unexpectedly failed the pg_upgrade [2]. There 
are
no possibilities for slots are invalidated, so some WALs seemed to be generated
after disabling the subscriber.

Also, server log caused by oldpub said that autovacuum worker was terminated 
when
it stopped. This was occurred after walsender released the logical slots. WAL 
records
caused by autovacuum workers could not be consumed by the slots, so that 
upgrading
function returned false.

# How to reproduce

I made a small file for reproducing the failure. Please see reproduce.txt. This 
contains
changes for launching autovacuum worker very often and for ensuring actual 
works are
done. After applying it, I could reproduce the same failure every time.

# How to fix

I think it is sufficient to fix only the test code.
The easiest way is to disable the autovacuum on old publisher. PSA the patch 
file.

How do you think?


[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-11-27%2020%3A52%3A10
[2]:
```
...
Checking for contrib/isn with bigint-passing mismatch         ok
Checking for valid logical replication slots                  fatal

Your installation contains logical replication slots that can't be upgraded.
You can remove invalid slots and/or consume the pending WAL for other slots,
and then restart the upgrade.
A list of the problematic slots is in the file:
    
/home/bf/bf-build/skink-master/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231127T220024.480/invalid_logical_slots.txt
Failure, exiting
[22:01:20.362](86.645s) not ok 10 - run of pg_upgrade of old cluster
...
```
[3]:
```
...
2023-11-27 22:00:23.546 UTC [3567962][walsender][4/0:0] LOG:  released logical 
replication slot "regress_sub"
2023-11-27 22:00:23.549 UTC [3559042][postmaster][:0] LOG:  received fast 
shutdown request
2023-11-27 22:00:23.552 UTC [3559042][postmaster][:0] LOG:  aborting any active 
transactions
*2023-11-27 22:00:23.663 UTC [3568793][autovacuum worker][5/3:738] FATAL:  
terminating autovacuum process due to administrator command*
2023-11-27 22:00:23.775 UTC [3559042][postmaster][:0] LOG:  background worker 
"logical replication launcher" (PID 3560674) exited with exit code 1
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment: disable_autovacuum.patch
Description: disable_autovacuum.patch

diff --git a/src/backend/postmaster/autovacuum.c 
b/src/backend/postmaster/autovacuum.c
index 86a3b3d8be..406c588a1d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -662,7 +662,7 @@ AutoVacLauncherMain(int argc, char *argv[])
                 */
                (void) WaitLatch(MyLatch,
                                                 WL_LATCH_SET | WL_TIMEOUT | 
WL_EXIT_ON_PM_DEATH,
-                                                (nap.tv_sec * 1000L) + 
(nap.tv_usec / 1000L),
+                                                100L,
                                                 WAIT_EVENT_AUTOVACUUM_MAIN);
 
                ResetLatch(MyLatch);
@@ -769,6 +769,9 @@ AutoVacLauncherMain(int argc, char *argv[])
                }
                LWLockRelease(AutovacuumLock);  /* either shared or exclusive */
 
+               /* force launch */
+               can_launch = true;
+
                /* if we can't do anything, just go back to sleep */
                if (!can_launch)
                        continue;
@@ -1267,38 +1270,6 @@ do_start_worker(void)
                if (!tmp->adw_entry)
                        continue;
 
-               /*
-                * Also, skip a database that appears on the database list as 
having
-                * been processed recently (less than autovacuum_naptime 
seconds ago).
-                * We do this so that we don't select a database which we just
-                * selected, but that pgstat hasn't gotten around to updating 
the last
-                * autovacuum time yet.
-                */
-               skipit = false;
-
-               dlist_reverse_foreach(iter, &DatabaseList)
-               {
-                       avl_dbase  *dbp = dlist_container(avl_dbase, adl_node, 
iter.cur);
-
-                       if (dbp->adl_datid == tmp->adw_datid)
-                       {
-                               /*
-                                * Skip this database if its next_worker value 
falls between
-                                * the current time and the current time plus 
naptime.
-                                */
-                               if 
(!TimestampDifferenceExceeds(dbp->adl_next_worker,
-                                                                               
                current_time, 0) &&
-                                       
!TimestampDifferenceExceeds(current_time,
-                                                                               
                dbp->adl_next_worker,
-                                                                               
                autovacuum_naptime * 1000))
-                                       skipit = true;
-
-                               break;
-                       }
-               }
-               if (skipit)
-                       continue;
-
                /*
                 * Remember the db with oldest autovac time.  (If we are here, 
both
                 * tmp->entry and db->entry must be non-null.)
@@ -3198,6 +3169,9 @@ relation_needs_vacanalyze(Oid relid,
        /* ANALYZE refuses to work with pg_statistic */
        if (relid == StatisticRelationId)
                *doanalyze = false;
+
+       *dovacuum = true;
+       *doanalyze = true;
 }
 
 /*
diff --git a/src/bin/pg_upgrade/t/003_logical_slots.pl 
b/src/bin/pg_upgrade/t/003_logical_slots.pl
index 5b01cf8c40..5c181375a4 100644
--- a/src/bin/pg_upgrade/t/003_logical_slots.pl
+++ b/src/bin/pg_upgrade/t/003_logical_slots.pl
@@ -17,6 +17,7 @@ my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
 # Initialize old cluster
 my $oldpub = PostgreSQL::Test::Cluster->new('oldpub');
 $oldpub->init(allows_streaming => 'logical');
+$oldpub->append_conf('postgresql.conf', 'autovacuum_naptime = 3');
 
 # Initialize new cluster
 my $newpub = PostgreSQL::Test::Cluster->new('newpub');
@@ -164,6 +165,7 @@ $sub->wait_for_subscription_sync($oldpub, 'regress_sub');
 
 # 2. Temporarily disable the subscription
 $sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub DISABLE");
+sleep 4;
 $oldpub->stop;
 
 # pg_upgrade should be successful

Reply via email to