Dear hackers, > > > > > > Pushed! > > > > Hi all, the CF entry for this is marked RfC, and CI is trying to apply > > the last patch committed. Is there further work that needs to be > > re-attached and/or rebased? > > > > No. I have marked it as committed. >
I found another failure related with the commit [1]. I think it is caused by the autovacuum. I want to propose a patch which disables the feature for old publisher. More detail, please see below. # Analysis of the failure Summary: this failure occurs when the autovacuum starts after the subscription is disabled but before doing pg_upgrade. According to the regress file, it unexpectedly failed the pg_upgrade [2]. There are no possibilities for slots are invalidated, so some WALs seemed to be generated after disabling the subscriber. Also, server log caused by oldpub said that autovacuum worker was terminated when it stopped. This was occurred after walsender released the logical slots. WAL records caused by autovacuum workers could not be consumed by the slots, so that upgrading function returned false. # How to reproduce I made a small file for reproducing the failure. Please see reproduce.txt. This contains changes for launching autovacuum worker very often and for ensuring actual works are done. After applying it, I could reproduce the same failure every time. # How to fix I think it is sufficient to fix only the test code. The easiest way is to disable the autovacuum on old publisher. PSA the patch file. How do you think? [1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-11-27%2020%3A52%3A10 [2]: ``` ... Checking for contrib/isn with bigint-passing mismatch ok Checking for valid logical replication slots fatal Your installation contains logical replication slots that can't be upgraded. You can remove invalid slots and/or consume the pending WAL for other slots, and then restart the upgrade. A list of the problematic slots is in the file: /home/bf/bf-build/skink-master/HEAD/pgsql.build/src/bin/pg_upgrade/tmp_check/t_003_logical_slots_newpub_data/pgdata/pg_upgrade_output.d/20231127T220024.480/invalid_logical_slots.txt Failure, exiting [22:01:20.362](86.645s) not ok 10 - run of pg_upgrade of old cluster ... ``` [3]: ``` ... 2023-11-27 22:00:23.546 UTC [3567962][walsender][4/0:0] LOG: released logical replication slot "regress_sub" 2023-11-27 22:00:23.549 UTC [3559042][postmaster][:0] LOG: received fast shutdown request 2023-11-27 22:00:23.552 UTC [3559042][postmaster][:0] LOG: aborting any active transactions *2023-11-27 22:00:23.663 UTC [3568793][autovacuum worker][5/3:738] FATAL: terminating autovacuum process due to administrator command* 2023-11-27 22:00:23.775 UTC [3559042][postmaster][:0] LOG: background worker "logical replication launcher" (PID 3560674) exited with exit code 1 ... ``` Best Regards, Hayato Kuroda FUJITSU LIMITED
disable_autovacuum.patch
Description: disable_autovacuum.patch
diff --git a/src/backend/postmaster/autovacuum.c
b/src/backend/postmaster/autovacuum.c
index 86a3b3d8be..406c588a1d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -662,7 +662,7 @@ AutoVacLauncherMain(int argc, char *argv[])
*/
(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH,
- (nap.tv_sec * 1000L) +
(nap.tv_usec / 1000L),
+ 100L,
WAIT_EVENT_AUTOVACUUM_MAIN);
ResetLatch(MyLatch);
@@ -769,6 +769,9 @@ AutoVacLauncherMain(int argc, char *argv[])
}
LWLockRelease(AutovacuumLock); /* either shared or exclusive */
+ /* force launch */
+ can_launch = true;
+
/* if we can't do anything, just go back to sleep */
if (!can_launch)
continue;
@@ -1267,38 +1270,6 @@ do_start_worker(void)
if (!tmp->adw_entry)
continue;
- /*
- * Also, skip a database that appears on the database list as
having
- * been processed recently (less than autovacuum_naptime
seconds ago).
- * We do this so that we don't select a database which we just
- * selected, but that pgstat hasn't gotten around to updating
the last
- * autovacuum time yet.
- */
- skipit = false;
-
- dlist_reverse_foreach(iter, &DatabaseList)
- {
- avl_dbase *dbp = dlist_container(avl_dbase, adl_node,
iter.cur);
-
- if (dbp->adl_datid == tmp->adw_datid)
- {
- /*
- * Skip this database if its next_worker value
falls between
- * the current time and the current time plus
naptime.
- */
- if
(!TimestampDifferenceExceeds(dbp->adl_next_worker,
-
current_time, 0) &&
-
!TimestampDifferenceExceeds(current_time,
-
dbp->adl_next_worker,
-
autovacuum_naptime * 1000))
- skipit = true;
-
- break;
- }
- }
- if (skipit)
- continue;
-
/*
* Remember the db with oldest autovac time. (If we are here,
both
* tmp->entry and db->entry must be non-null.)
@@ -3198,6 +3169,9 @@ relation_needs_vacanalyze(Oid relid,
/* ANALYZE refuses to work with pg_statistic */
if (relid == StatisticRelationId)
*doanalyze = false;
+
+ *dovacuum = true;
+ *doanalyze = true;
}
/*
diff --git a/src/bin/pg_upgrade/t/003_logical_slots.pl
b/src/bin/pg_upgrade/t/003_logical_slots.pl
index 5b01cf8c40..5c181375a4 100644
--- a/src/bin/pg_upgrade/t/003_logical_slots.pl
+++ b/src/bin/pg_upgrade/t/003_logical_slots.pl
@@ -17,6 +17,7 @@ my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
# Initialize old cluster
my $oldpub = PostgreSQL::Test::Cluster->new('oldpub');
$oldpub->init(allows_streaming => 'logical');
+$oldpub->append_conf('postgresql.conf', 'autovacuum_naptime = 3');
# Initialize new cluster
my $newpub = PostgreSQL::Test::Cluster->new('newpub');
@@ -164,6 +165,7 @@ $sub->wait_for_subscription_sync($oldpub, 'regress_sub');
# 2. Temporarily disable the subscription
$sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub DISABLE");
+sleep 4;
$oldpub->stop;
# pg_upgrade should be successful
