On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
> Would it be possible to make it robust so that we could always run it
> with "make check"? This seems like an important corner case to
> regression test.

Okay, I've attached a new version of the patch and a new version of
the repro that may be fast and stable enough to commit. It is more
minimal than the previous version. I made the table as small as I
could to still trigger two rounds of index vacuuming. I tried to make
it as stable as possible. I also removed the cursor on the standby
that could trigger recovery conflicts. It would be super helpful if
someone could take a look at the test and point out any ways I could
make it even more likely to be stable.
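For reviewers who would rather see the scenario than trace the Perl, the
test boils down to roughly the following SQL. This is only a sketch: it
omits the node setup, the primary_conninfo toggling on the standby, and
the polling/waits the test does between steps.

  -- Session A on the primary: create enough garbage to force more than
  -- one round of index vacuuming
  CREATE TABLE vac_horizon_floor_table(col1 int)
      WITH (autovacuum_enabled = false, fillfactor = 10);
  INSERT INTO vac_horizon_floor_table SELECT generate_series(1, 400000) % 3;
  CREATE INDEX ON vac_horizon_floor_table(col1);
  UPDATE vac_horizon_floor_table SET col1 = 3 WHERE col1 = 0;

  -- (the standby's primary_conninfo is broken at this point, so its
  -- hot_standby_feedback stops arriving)

  -- Session A: the tuple whose xmax will be older than OldestXmin but
  -- newer than the horizon the reconnected standby pushes back
  INSERT INTO vac_horizon_floor_table VALUES (99);
  UPDATE vac_horizon_floor_table SET col1 = 100 WHERE col1 = 99;

  -- Session B: pin the first page so VACUUM computes OldestXmin and
  -- GlobalVisState but then blocks before pruning
  BEGIN;
  DECLARE vac_horizon_floor_cursor CURSOR FOR
      SELECT col1 FROM vac_horizon_floor_table;
  FETCH FORWARD FROM vac_horizon_floor_cursor;

  -- Session A: waits for the cleanup lock; the standby reconnects while
  -- it waits, pulling the horizon backward
  VACUUM FREEZE vac_horizon_floor_table;

  -- Session B: release the pin; pruning then finds the tuple
  -- HEAPTUPLE_RECENTLY_DEAD against maybe_needed but HEAPTUPLE_DEAD
  -- against OldestXmin
  COMMIT;
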

- Melanie
From 7ecc180c4762c258dab01c7929c90a9c8e48c95b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 20 Jun 2024 17:21:11 -0400
Subject: [PATCH v2 2/2] Ensure vacuum removes all visibly dead tuples older
 than OldestXmin

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed,
it may attempt to freeze the tuple's xmax and then ERROR out in
pre-freeze checks with "cannot freeze committed xmax".

Fix this by having vacuum always remove tuples older than OldestXmin.

It is possible for GlobalVisState->maybe_needed to precede OldestXmin if
maybe_needed is forced to go backward while vacuum is running. This can
happen if a disconnected standby with a running transaction older than
VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin.

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to loop infinitely in
lazy_scan_prune(), as investigated in this thread [1]. After 1ccc1e05ae
removed the retry loop in lazy_scan_prune() and stopped comparing tuples
to OldestXmin, the hang could no longer happen, but we could still
attempt to freeze dead tuples with xmax older than OldestXmin --
resulting in an ERROR.

Always removing dead tuples with xmax older than
VacuumCutoffs->OldestXmin is safe because the standby won't replay the
tuple removal until the tuple is removable. Thus, the worst that can
happen is a recovery conflict.

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee

Back-patch through 14.

Author: Melanie Plageman
Reviewed-by: Peter Geoghegan, Robert Haas, Andres Freund, Heikki Linnakangas, and Noah Misch
Discussion: https://postgr.es/m/CAAKRu_bDD7oq9ZwB2OJqub5BovMG6UjEYsoK2LVttadjEqyRGg%40mail.gmail.com
---
 src/backend/access/heap/pruneheap.c  | 23 ++++++++++++++++++++++-
 src/backend/access/heap/vacuumlazy.c | 14 +++++++-------
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 3cdfc5b7f1b..869d82ad667 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -325,6 +325,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * cutoffs contains the freeze cutoffs, established by VACUUM at the beginning
  * of vacuuming the relation.  Required if HEAP_PRUNE_FREEZE option is set.
+ * cutoffs->OldestXmin is also used to determine if dead tuples are
+ * HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD.
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
@@ -922,8 +924,27 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
 	if (res != HEAPTUPLE_RECENTLY_DEAD)
 		return res;
 
+	/*
+	 * For VACUUM, we must be sure to prune tuples with xmax older than
+	 * OldestXmin -- a visibility cutoff determined at the beginning of
+	 * vacuuming the relation. OldestXmin is used for freezing determination
+	 * and we cannot freeze dead tuples' xmaxes.
+	 */
+	if (prstate->cutoffs &&
+		TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
+		NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
+		return HEAPTUPLE_DEAD;
+
+	/*
+	 * Determine whether or not the tuple is considered dead when compared
+	 * with the provided GlobalVisState. On-access pruning does not provide
+	 * VacuumCutoffs. And for vacuum, even if the tuple's xmax is not older
+	 * than OldestXmin, GlobalVisTestIsRemovableXid() could find the row dead
+	 * if the GlobalVisState has been updated since the beginning of vacuuming
+	 * the relation.
+	 */
 	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
-		res = HEAPTUPLE_DEAD;
+		return HEAPTUPLE_DEAD;
 
 	return res;
 }
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3f88cf1e8ef..abb47ae5960 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -438,13 +438,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * as an upper bound on the XIDs stored in the pages we'll actually scan
 	 * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
 	 *
-	 * Next acquire vistest, a related cutoff that's used in pruning.  We
-	 * expect vistest will always make heap_page_prune_and_freeze() remove any
-	 * deleted tuple whose xmax is < OldestXmin.  lazy_scan_prune must never
-	 * become confused about whether a tuple should be frozen or removed.  (In
-	 * the future we might want to teach lazy_scan_prune to recompute vistest
-	 * from time to time, to increase the number of dead tuples it can prune
-	 * away.)
+	 * Next acquire vistest, a related cutoff that's used in pruning.  We use
+	 * vistest in combination with OldestXmin to ensure that
+	 * heap_page_prune_and_freeze() always removes any deleted tuple whose
+	 * xmax is < OldestXmin.  lazy_scan_prune must never become confused about
+	 * whether a tuple should be frozen or removed.  (In the future we might
+	 * want to teach lazy_scan_prune to recompute vistest from time to time,
+	 * to increase the number of dead tuples it can prune away.)
 	 */
 	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
-- 
2.34.1

From aada45c02e996740bc544def6efbb6d324cf3633 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 20 Jun 2024 16:38:30 -0400
Subject: [PATCH v2 1/2] Test that vacuum removes tuples older than OldestXmin

If vacuum fails to prune a tuple killed before OldestXmin, it will
decide to freeze its xmax and later error out in pre-freeze checks.
---
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_vacuum_horizon_floor.pl    | 215 ++++++++++++++++++
 2 files changed, 216 insertions(+)
 create mode 100644 src/test/recovery/t/043_vacuum_horizon_floor.pl

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..1d55d6bf560 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_vacuum_horizon_floor.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_vacuum_horizon_floor.pl b/src/test/recovery/t/043_vacuum_horizon_floor.pl
new file mode 100644
index 00000000000..e22f0af26dd
--- /dev/null
+++ b/src/test/recovery/t/043_vacuum_horizon_floor.pl
@@ -0,0 +1,215 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Test that vacuum prunes away all dead tuples killed before OldestXmin
+
+# Set up nodes
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 'physical');
+
+$node_primary->append_conf(
+	'postgresql.conf', qq[
+hot_standby_feedback = on
+log_recovery_conflict_waits = true
+autovacuum = off
+maintenance_work_mem = 1024
+]);
+$node_primary->start;
+
+my $node_replica = PostgreSQL::Test::Cluster->new('standby');
+
+$node_primary->backup('my_backup');
+$node_replica->init_from_backup($node_primary, 'my_backup',
+	has_streaming => 1);
+
+$node_replica->start;
+
+my $test_db = "test_db";
+$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
+
+my $orig_conninfo = $node_primary->connstr();
+
+my $table1 = "vac_horizon_floor_table";
+
+# Long-running Primary Session A
+my $psql_primaryA =
+  $node_primary->background_psql($test_db, on_error_stop => 0);
+
+# Long-running Primary Session B
+my $psql_primaryB  =
+  $node_primary->background_psql($test_db, on_error_stop => 0);
+
+# Insert and update enough rows that we force at least one round of index
+# vacuuming before getting to a dead tuple which was killed after the standby
+# is disconnected.
+#
+# We need multiple index vacuuming passes to repro because after the standby
+# reconnects to the primary, our backend's GlobalVisStates will not have been
+# updated with the new horizon until something forces them to be updated.
+#
+# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the
+# end of a round of index vacuuming, updating the backend's GlobalVisState
+# and, in our case, moving maybe_needed backwards.
+#
+# Then vacuum's first pass will continue and pruning will find our later
+# inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
+# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
+$node_primary->safe_psql($test_db, qq[
+	CREATE TABLE ${table1}(col1 int) with (autovacuum_enabled=false, fillfactor=10);
+	INSERT INTO $table1 SELECT generate_series(1, 400000) % 3;
+	CREATE INDEX on ${table1}(col1);
+	UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
+]);
+
+# We will insert and update the tuple that vacuum will error out processing
+# later. For now, there is no reason not to wait for the standby to catch up.
+# Do so to make the test conditions as clear as possible.
+my $primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
+
+# Test that the WAL receiver is up and running.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 't');
+
+# Set primary_conninfo to something invalid on the replica and reload the
+# config. Once the config is reloaded, the startup process will force the WAL
+# receiver to restart and it will be unable to reconnect because of the invalid
+# connection information.
+$node_replica->safe_psql($test_db, qq[
+		ALTER SYSTEM SET primary_conninfo = '';
+		SELECT pg_reload_conf();
+	]);
+
+# Wait until the WAL receiver has shut down and been unable to start up again.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 'f');
+
+# Now insert and update a tuple which will be visible to the vacuum on the
+# primary but which will have xmax newer than the oldest xmin on the standby
+# that was recently disconnected.
+my $res = $psql_primaryA->query_safe(
+	qq[
+		INSERT INTO $table1 VALUES (99);
+		UPDATE $table1 SET col1 = 100 WHERE col1 = 99;
+		SELECT 'after_update';
+        ]
+	);
+
+# Make sure the UPDATE finished
+like($res, qr/^after_update$/m, "UPDATE occurred on primary session A");
+
+# Open a cursor on the primary whose lock will keep VACUUM from getting a
+# cleanup lock on the first page of the relation. We want VACUUM to be able to
+# start, calculate initial values for OldestXmin and GlobalVisState and then be
+# unable to proceed with pruning our dead tuples. This will allow us to
+# reconnect the standby and push the horizon back before we start actual
+# pruning and vacuuming.
+my $primary_cursor1 = "vac_horizon_floor_cursor";
+
+# The first value inserted into the table was a 1, so FETCH FORWARD should
+# return a 1. That's how we know the cursor has a pin and lock.
+$res = $psql_primaryB->query_until(
+	qr/^1$/m,
+	qq[
+        BEGIN;
+			DECLARE $primary_cursor1 CURSOR FOR SELECT col1 FROM $table1;
+			FETCH FORWARD FROM $primary_cursor1;
+        ]
+	);
+
+# Get the PID of the session which will run the VACUUM FREEZE so that we can
+# use it to filter pg_stat_activity later.
+my $vacuum_pid = $psql_primaryA->query_safe("SELECT pg_backend_pid();");
+
+# Now start a VACUUM FREEZE on the primary. It will call vacuum_get_cutoffs()
+# and establish values of OldestXmin and GlobalVisState which are newer than
+# all of our dead tuples. Then it will be unable to get a cleanup lock to start
+# pruning, so it will hang.
+$psql_primaryA->{stdin} .= qq[
+		VACUUM FREEZE $table1;
+		\\echo VACUUM
+        ];
+
+# Make sure the VACUUM command makes it to the server.
+$psql_primaryA->{run}->pump_nb();
+
+# Make sure that the VACUUM has already called vacuum_get_cutoffs() and is just
+# waiting on the lock to start vacuuming. We don't want the standby to
+# re-establish a connection to the primary and push the horizon back until
+# we've saved initial values in GlobalVisState and calculated OldestXmin.
+$node_primary->poll_query_until($test_db,
+	qq[
+	SELECT count(*) >= 1 FROM pg_stat_activity
+		WHERE pid = $vacuum_pid
+		AND wait_event = 'BufferPin';
+	],
+	't');
+
+# Ensure the WAL receiver is still not active on the replica.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 'f');
+
+# Allow the WAL receiver connection to re-establish.
+$node_replica->safe_psql(
+	$test_db, qq[
+		ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
+		SELECT pg_reload_conf();
+	]);
+
+# Ensure the new WAL receiver has connected.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 't');
+
+# Once the WAL sender is shown on the primary, the replica should have
+# connected with the primary and pushed the horizon backward. Primary Session A
+# won't see that until the VACUUM FREEZE proceeds and does its first round of
+# index vacuuming.
+$node_primary->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_replication);] , 't');
+
+# Commit the cursor so that the VACUUM can proceed.
+$psql_primaryB->query_until(
+		qr/^commit$/m,
+		qq[
+			COMMIT;
+			\\echo commit
+        ]
+	);
+
+# VACUUM proceeds with pruning and does a visibility check on each tuple.
+# Before [sha], pruning found our final dead tuple non-removable
+# (HEAPTUPLE_RECENTLY_DEAD) since its xmax is after the new value of
+# maybe_needed. Then heap_prepare_freeze_tuple() would decide the tuple
+# xmax should be frozen because it precedes OldestXmin. Vacuum would then
+# error out in heap_pre_freeze_checks() with "cannot freeze committed
+# xmax". After [sha], pruning finds the tuple HEAPTUPLE_DEAD because its
+# xmax precedes OldestXmin and the tuple is removed.
+
+# With the fix, VACUUM should finish successfully, incrementing the table
+# vacuum_count.
+$node_primary->poll_query_until($test_db,
+	qq[
+	SELECT vacuum_count > 0
+	FROM pg_stat_all_tables WHERE relname = '${table1}';
+	]
+	, 't');
+
+$primary_lsn = $node_primary->lsn('flush');
+
+# Make sure something causes us to flush
+$node_primary->safe_psql($test_db, "INSERT INTO $table1 VALUES (1);");
+
+# Nothing on the replica should cause a recovery conflict, so this should
+# finish successfully.
+$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
+
+## Shut down psqls
+$psql_primaryA->quit;
+$psql_primaryB->quit;
+
+$node_replica->stop();
+$node_primary->stop();
+
+done_testing();
-- 
2.34.1
