Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-13 Thread Thomas Munro
On Sat, May 14, 2022 at 3:33 AM Robert Haas  wrote:
> This seems fine, but I think you should add a non-trivial comment about it.

Thanks for looking.  Done, and pushed.  Let's see if 180s per query is enough...




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-13 Thread Robert Haas
On Thu, May 12, 2022 at 10:20 PM Thomas Munro  wrote:
> As for skink failing, the timeout was hard coded 300s for the whole
> test, but apparently that wasn't enough under valgrind.  Let's use the
> standard PostgreSQL::Test::Utils::timeout_default (180s usually), but
> reset it for each query we send.

@@ -202,6 +198,9 @@ sub send_query_and_wait
  my ($psql, $query, $untl) = @_;
  my $ret;

+ $psql_timeout->reset();
+ $psql_timeout->start();
+
  # send query
  $$psql{stdin} .= $query;
  $$psql{stdin} .= "\n";

This seems fine, but I think you should add a non-trivial comment about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-12 Thread Thomas Munro
On Thu, May 12, 2022 at 4:57 PM Thomas Munro  wrote:
> On Thu, May 12, 2022 at 3:13 PM Thomas Munro  wrote:
> > error running SQL: 'psql::1: ERROR:  source database
> > "conflict_db_template" is being accessed by other users
> > DETAIL:  There is 1 other session using the database.'
>
> Oh, for this one I think it may just be that the autovacuum worker
> with PID 23757 took longer to exit than the 5 seconds
> CountOtherDBBackends() is prepared to wait, after sending it SIGTERM.

In this test, autovacuum_naptime is set to 1s (per Andres, AV was
implicated when he first saw the problem with pg_upgrade, hence the
desire to crank it up).  That's not necessary: commenting out the active line
in ProcessBarrierSmgrRelease() shows that the tests reliably reproduce
data corruption without it.  Let's just take that out.
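For reference, since the rework went in, that handler is essentially nothing
but the "active line"; roughly:

	bool
	ProcessBarrierSmgrRelease(void)
	{
		smgrreleaseall();	/* close every SMgrRelation's segments/fds */
		return true;
	}

so commenting that call out turns the barrier into a no-op and lets the
stale-fd corruption through.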

As for skink failing, the timeout was hard coded 300s for the whole
test, but apparently that wasn't enough under valgrind.  Let's use the
standard PostgreSQL::Test::Utils::timeout_default (180s usually), but
reset it for each query we send.

See attached.
From ffcb61004fd06c9b2db56c7fe045c7c726d67a72 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 13 May 2022 13:40:03 +1200
Subject: [PATCH] Fix slow animal timeouts in 032_relfilenode_reuse.pl.

Per BF animal chipmunk:  CREATE DATABASE could apparently fail due to an
AV process being in the template database and not quitting fast enough
for the 5 second timeout in CountOtherDBBackends().  The test script had
autovacuum_naptime=1s to encourage more activity that opens fds, but
that wasn't strictly necessary for this test.  Take it out.

Per BF animal skink:  the test had a global 300s timeout, but apparently
that was not enough under valgrind.  Use the standard timeout
PostgreSQL::Test::Utils::timeout_default, but reset it for each query we
run.

Discussion:
---
 src/test/recovery/t/032_relfilenode_reuse.pl | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/src/test/recovery/t/032_relfilenode_reuse.pl b/src/test/recovery/t/032_relfilenode_reuse.pl
index ac9340b7dd..5a6a759aa5 100644
--- a/src/test/recovery/t/032_relfilenode_reuse.pl
+++ b/src/test/recovery/t/032_relfilenode_reuse.pl
@@ -14,7 +14,6 @@ log_connections=on
 # to avoid "repairing" corruption
 full_page_writes=off
 log_min_messages=debug2
-autovacuum_naptime=1s
 shared_buffers=1MB
 ]);
 $node_primary->start;
@@ -28,11 +27,8 @@ $node_standby->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
 $node_standby->start;
 
-# To avoid hanging while expecting some specific input from a psql
-# instance being driven by us, add a timeout high enough that it
-# should never trigger even on very slow machines, unless something
-# is really wrong.
-my $psql_timeout = IPC::Run::timer(300);
+# We'll reset this timeout for each individual query we run.
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
 
 my %psql_primary = (stdin => '', stdout => '', stderr => '');
 $psql_primary{run} = IPC::Run::start(
@@ -202,6 +198,9 @@ sub send_query_and_wait
 	my ($psql, $query, $untl) = @_;
 	my $ret;
 
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
 	# send query
 	$$psql{stdin} .= $query;
 	$$psql{stdin} .= "\n";
-- 
2.36.0



Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-11 Thread Thomas Munro
On Thu, May 12, 2022 at 3:13 PM Thomas Munro  wrote:
> Chipmunk, another little early model Raspberry Pi:
>
> error running SQL: 'psql::1: ERROR:  source database
> "conflict_db_template" is being accessed by other users
> DETAIL:  There is 1 other session using the database.'

Oh, for this one I think it may just be that the autovacuum worker
with PID 23757 took longer to exit than the 5 seconds
CountOtherDBBackends() is prepared to wait, after sending it SIGTERM.
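The five seconds come from the retry loop in CountOtherDBBackends(); a
simplified sketch, where others_still_present() is just a placeholder for the
ProcArray scan and autovacuum SIGTERMs the real function performs:

	for (int tries = 0; tries < 50; tries++)	/* 50 * 100ms = 5 seconds */
	{
		if (!others_still_present(databaseId))	/* placeholder check */
			return false;	/* nobody left: CREATE DATABASE can proceed */
		/* SIGTERM any autovacuum workers found, then wait and retry */
		pg_usleep(100000L);
	}
	return true;	/* still busy: caller raises the error quoted above */

so an AV worker only has to linger for about five seconds to trip it.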




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-11 Thread Thomas Munro
On Sat, May 7, 2022 at 9:37 PM Thomas Munro  wrote:
> So far "grison" failed.  I think it's probably just that the test
> forgot to wait for replay of CREATE EXTENSION before using pg_prewarm
> on the standby, hence "ERROR:  function pg_prewarm(oid) does not exist
> at character 12".  I'll wait for more animals to report before I try
> to fix that tomorrow.

That one was addressed by commit a22652e.  Unfortunately two new kinds
of failure showed up:

Chipmunk, another little early model Raspberry Pi:

error running SQL: 'psql::1: ERROR:  source database
"conflict_db_template" is being accessed by other users
DETAIL:  There is 1 other session using the database.'
while running 'psql -XAtq -d port=57394 host=/tmp/luyJopPv9L
dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CREATE DATABASE
conflict_db TEMPLATE conflict_db_template OID = 50001;' at
/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
line 1836.

I think that might imply that when you do two
$node_primary->safe_psql() calls in a row, the backend running the
second one might still see the ghost of the first backend in the
database, even though the first psql process has exited.  Hmmm.

Skink, the valgrind animal, also failed.  After first 8 tests, it times out:

[07:18:26.237](14.827s) ok 8 - standby: post move contents as expected
[07:18:42.877](16.641s) Bail out!  aborting wait: program timed out
stream contents: >><<
pattern searched for: (?^m:warmed_buffers)

That'd be in the perl routine cause_eviction(), while waiting for a
pg_prewarm query to return, but I'm not yet sure what's going on (I
have to suspect it's in the perl scripting rather than anything in the
server, given the location of the failure).  Will try to repro
locally.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-11 Thread Thomas Munro
On Tue, May 10, 2022 at 1:07 AM Robert Haas  wrote:
> On Sun, May 8, 2022 at 7:30 PM Thomas Munro  wrote:
> > LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> > STATEMENT:  alter database mydb set tablespace ts1;

> This is a very good idea.

OK, I pushed this, after making the ereport call look a bit more like
others that talk about backend PIDs.

> > Another thought is that it might be nice to be able to test with a
> > dummy PSB that doesn't actually do anything.  You could use it to see
> > how fast your system processes it, while doing various other things,
> > and to find/debug problems in other code that fails to handle
> > interrupts correctly.  Here's an attempt at that.  I guess it could go
> > into a src/test/modules/something instead of core, but on the other
> > hand the PSB itself has to be in core anyway, so maybe not.  Thoughts?
> >  No documentation yet, just seeing if people think this is worth
> > having... better names/ideas welcome.
>
> I did this at one point, but I wasn't convinced it was going to find
> enough bugs to be worth committing. It's OK if you're convinced of
> things that didn't convince me, though.

I'll leave this here for now.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-09 Thread Robert Haas
On Sun, May 8, 2022 at 7:30 PM Thomas Munro  wrote:
> Simple idea: how about logging the PID of processes that block
> progress for too long?  In the attached, I arbitrarily picked 5
> seconds as the wait time between LOG messages.  Also, DEBUG1 messages
> let you see the processing speed on eg build farm animals.  Thoughts?
>
> To test this, kill -STOP a random backend, and then try an ALTER
> DATABASE SET TABLESPACE in another backend.  Example output:
>
> DEBUG:  waiting for all backends to process ProcSignalBarrier generation 1
> LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT:  alter database mydb set tablespace ts1;
> LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT:  alter database mydb set tablespace ts1;
> LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT:  alter database mydb set tablespace ts1;
> LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT:  alter database mydb set tablespace ts1;

This is a very good idea.

> Another thought is that it might be nice to be able to test with a
> dummy PSB that doesn't actually do anything.  You could use it to see
> how fast your system processes it, while doing various other things,
> and to find/debug problems in other code that fails to handle
> interrupts correctly.  Here's an attempt at that.  I guess it could go
> into a src/test/modules/something instead of core, but on the other
> hand the PSB itself has to be in core anyway, so maybe not.  Thoughts?
>  No documentation yet, just seeing if people think this is worth
> having... better names/ideas welcome.

I did this at one point, but I wasn't convinced it was going to find
enough bugs to be worth committing. It's OK if you're convinced of
things that didn't convince me, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-08 Thread Thomas Munro
On Sat, May 7, 2022 at 4:52 PM Thomas Munro  wrote:
> I think we'll probably also want to invent a way
> to report which backend is holding up progress, since without that
> it's practically impossible for an end user to understand why their
> command is hanging.

Simple idea: how about logging the PID of processes that block
progress for too long?  In the attached, I arbitrarily picked 5
seconds as the wait time between LOG messages.  Also, DEBUG1 messages
let you see the processing speed on eg build farm animals.  Thoughts?

To test this, kill -STOP a random backend, and then try an ALTER
DATABASE SET TABLESPACE in another backend.  Example output:

DEBUG:  waiting for all backends to process ProcSignalBarrier generation 1
LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT:  alter database mydb set tablespace ts1;
LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT:  alter database mydb set tablespace ts1;
LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT:  alter database mydb set tablespace ts1;
LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT:  alter database mydb set tablespace ts1;

... then kill -CONT:

DEBUG:  finished waiting for all backends to process ProcSignalBarrier
generation 1

Another thought is that it might be nice to be able to test with a
dummy PSB that doesn't actually do anything.  You could use it to see
how fast your system processes it, while doing various other things,
and to find/debug problems in other code that fails to handle
interrupts correctly.  Here's an attempt at that.  I guess it could go
into a src/test/modules/something instead of core, but on the other
hand the PSB itself has to be in core anyway, so maybe not.  Thoughts?
 No documentation yet, just seeing if people think this is worth
having... better names/ideas welcome.

To test this, just SELECT pg_test_procsignal_barrier().
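For the record, the function boils down to something like this, where
PROCSIGNAL_BARRIER_TEST stands in for whatever barrier type the attached
patch actually defines:

	Datum
	pg_test_procsignal_barrier(PG_FUNCTION_ARGS)
	{
		uint64		generation;

		if (!superuser())
			ereport(ERROR,
					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
					 errmsg("must be superuser to use this function")));

		/* interrupt every process attached to shared memory... */
		generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_TEST);
		/* ...and wait until all of them absorb the no-op barrier */
		WaitForProcSignalBarrier(generation);

		PG_RETURN_VOID();
	}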
From 9fe475b557143cc324ce14de4f29ad8bef206fc8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 9 May 2022 10:24:01 +1200
Subject: [PATCH 1/2] Add logging for ProcSignalBarrier mechanism.

To help with diagnosis of systems that are not processing
ProcSignalBarrier requests promptly, add a LOG message every 5 seconds
if we seem to be wedged.  Although you could already see this state via
wait events in pg_stat_activity, the log message also shows the PID of
the process that is preventing progress.

Also add DEBUG1 logging around the whole loop.
---
 src/backend/storage/ipc/procsignal.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 00d66902d8..4fe6f087db 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -393,6 +393,11 @@ WaitForProcSignalBarrier(uint64 generation)
 {
 	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
 
+	elog(DEBUG1,
+		 "waiting for all backends to process ProcSignalBarrier generation "
+		 UINT64_FORMAT,
+		 generation);
+
 	for (int i = NumProcSignalSlots - 1; i >= 0; i--)
 	{
 		ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
@@ -407,13 +412,22 @@ WaitForProcSignalBarrier(uint64 generation)
 		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		while (oldval < generation)
 		{
-			ConditionVariableSleep(&slot->pss_barrierCV,
-								   WAIT_EVENT_PROC_SIGNAL_BARRIER);
+			if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
+											5000,
+											WAIT_EVENT_PROC_SIGNAL_BARRIER))
+				elog(LOG,
+					 "still waiting for pid %d to accept ProcSignalBarrier",
+					 slot->pss_pid);
 			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		}
 		ConditionVariableCancelSleep();
 	}
 
+	elog(DEBUG1,
+		 "finished waiting for all backends to process ProcSignalBarrier generation "
+		 UINT64_FORMAT,
+		 generation);
+
 	/*
 	 * The caller is probably calling this function because it wants to read
 	 * the shared state or perform further writes to shared state once all
-- 
2.30.2

From 266528b553dcc128d2d38039ae821fe515ab1db1 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 9 May 2022 11:06:48 +1200
Subject: [PATCH 2/2] Add pg_test_procsignal_barrier().

A superuser-only test function to exercise the ProcSignalBarrier
mechanism.  It has no effect, other than to interrupt every process
connected to shared memory and wait for them all to acknowledge the
request.
---
 src/backend/storage/ipc/procsignal.c | 20 
 src/include/catalog/pg_proc.dat  |  4 
 src/include/storage/procsignal.h |  1 +
 3 files changed, 25 insertions(+)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fe6f087db..93c31bc6b9 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -541,6 +541,10 @@ ProcessProcSignalBarrier(void)
 type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);
 switch (type)
 {

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-07 Thread Thomas Munro
On Sat, May 7, 2022 at 4:52 PM Thomas Munro  wrote:
> Done.  Time to watch the build farm.

So far "grison" failed.  I think it's probably just that the test
forgot to wait for replay of CREATE EXTENSION before using pg_prewarm
on the standby, hence "ERROR:  function pg_prewarm(oid) does not exist
at character 12".  I'll wait for more animals to report before I try
to fix that tomorrow.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-06 Thread Thomas Munro
On Wed, May 4, 2022 at 2:23 PM Thomas Munro  wrote:
> Assuming no
> objections or CI failures show up, I'll consider pushing the first two
> patches tomorrow.

Done.  Time to watch the build farm.

It's possible that these changes will produce some blowback, now that
we're using PROCSIGNAL_BARRIER_SMGRRELEASE on common platforms.
Obviously the earlier work started down this path already, but that
was Windows-only, so it probably didn't get much attention from our
mostly Unix crowd.

For example, if there are bugs in the procsignal barrier system, or if
there are places that don't process interrupts at all or promptly, we
might hear complaints about that.  That bug surface includes, for
example, background workers created by extensions.  An example on my
mind is a place where we hold interrupts while cleaning up temporary
files (= a loop of arbitrary size with filesystem ops that might hang
on slow storage), so a concurrent DROP TABLESPACE will have to wait,
which is kinda strange because it would in fact be perfectly safe to
process this particular "interrupt".  In that case we really just
don't want the kinds of interrupts that perform non-local exits and
prevent our cleanup work from running to completion, but we don't have
a way to say that.  I think we'll probably also want to invent a way
to report which backend is holding up progress, since without that
it's practically impossible for an end user to understand why their
command is hanging.
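To illustrate the shape of the pattern (placeholder names, not a specific
function): while interrupts are held, ProcessProcSignalBarrier() never runs,
so a concurrent barrier-emitting command waits for however long the loop
takes.

	HOLD_INTERRUPTS();
	while ((de = ReadDirExtended(dir, tmp_dir, LOG)) != NULL)
	{
		/* arbitrary amount of filesystem work; may hang on slow storage */
		unlink(tmp_path);	/* tmp_path: placeholder for the entry's path */
	}
	RESUME_INTERRUPTS();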




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-03 Thread Thomas Munro
On Wed, May 4, 2022 at 8:53 AM Thomas Munro  wrote:
> Got some off-list clues: that's just distracting Perl cleanup noise
> after something else went wrong (thanks Robert), and now I'm testing a
> theory from Andres that we're missing a barrier on the redo side when
> replaying XLOG_DBASE_CREATE_FILE_COPY.  More soon.

Yeah, looks like that was the explanation.  Presumably in older
releases, recovery can fail with EACCES here, and since commit
e2f0f8ed we get ENOENT, because someone's got an unlinked file open,
and ReadDir() can still see it.  (I've wondered before if ReadDir()
should also hide zombie Windows directory entries, but that's kinda
independent and would only get us one step further, a later rmdir()
would still fail.)  Adding the barrier fixes the problem.  Assuming no
objections or CI failures show up, I'll consider pushing the first two
patches tomorrow.
From 1c5e04e7704ab9d5bd2559338ce1b6bd8a397e51 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 2 Apr 2022 17:05:39 +1300
Subject: [PATCH v4 1/3] Rethink PROCSIGNAL_BARRIER_SMGRRELEASE.

With sufficiently bad luck, it was possible for IssuePendingWritebacks()
to reopen files after we'd processed PROCSIGNAL_BARRIER_SMGRRELEASE and
before the file was unlinked by some other backend, leading to potential
errors on Windows.

Defend against that by closing md.c's segments, instead of just closing
fd.c's descriptors, and then teaching smgrwriteback() not to open files
that aren't already open.

This fixes an overlooked edge case for commit 4eb21763.

Reported-by: Andres Freund 
Reviewed-by: Robert Haas 
Discussion: https://postgr.es/m/20220209220004.kb3dgtn2x2k2gtdm%40alap3.anarazel.de
---
 src/backend/storage/smgr/md.c   | 22 +---
 src/backend/storage/smgr/smgr.c | 45 +++--
 src/include/storage/md.h|  1 -
 src/include/storage/smgr.h  |  2 ++
 4 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 286dd3f755..41fa73087b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -116,6 +116,8 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
  * mdnblocks().
  */
 #define EXTENSION_DONT_CHECK_SIZE	(1 << 4)
+/* don't try to open a segment, if not already open */
+#define EXTENSION_DONT_OPEN			(1 << 5)
 
 
 /* local routines */
@@ -551,12 +553,6 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	}
 }
 
-void
-mdrelease(void)
-{
-	closeAllVfds();
-}
-
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
@@ -605,11 +601,14 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 	segnum_end;
 
 		v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
-		 EXTENSION_RETURN_NULL);
+		 EXTENSION_DONT_OPEN);
 
 		/*
 		 * We might be flushing buffers of already removed relations, that's
-		 * ok, just ignore that case.
+		 * ok, just ignore that case.  If the segment file wasn't open already
+		 * (ie from a recent mdwrite()), then we don't want to re-open it, to
+		 * avoid a race with PROCSIGNAL_BARRIER_SMGRRELEASE that might leave us
+		 * with a descriptor to a file that is about to be unlinked.
 		 */
 		if (!v)
 			return;
@@ -1202,7 +1201,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 	/* some way to handle non-existent segments needs to be specified */
 	Assert(behavior &
-		   (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL));
+		   (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL |
+			EXTENSION_DONT_OPEN));
 
 	targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
 
@@ -1213,6 +1213,10 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 		return v;
 	}
 
+	/* The caller only wants the segment if we already had it open. */
+	if (behavior & EXTENSION_DONT_OPEN)
+		return NULL;
+
 	/*
 	 * The target segment is not yet open. Iterate over all the segments
 	 * between the last opened and the target segment. This way missing
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 2c7a2b2857..9cbfaa86cb 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -41,7 +41,6 @@ typedef struct f_smgr
 {
 	void		(*smgr_init) (void);	/* may be NULL */
 	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_release) (void); /* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
@@ -70,7 +69,6 @@ static const f_smgr smgrsw[] = {
 	{
 		.smgr_init = mdinit,
 		.smgr_shutdown = NULL,
-		.smgr_release = mdrelease,
 		.smgr_open = mdopen,
 		.smgr_close = mdclose,
 		.smgr_create = mdcreate,
@@ -281,6 +279,42 @@ smgrclose(SMgrRelation reln)
 		*owner = NULL;
 }
 
+/*
+ *	smgrrelease() -- Release all resources used by this object.
+ *
+ *	The object remains valid.
+ 

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-03 Thread Thomas Munro
On Wed, May 4, 2022 at 7:44 AM Thomas Munro  wrote:
> It passes sometimes and fails sometimes.  Here's the weird failure I
> need to debug:
>
> https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse
>
> Right at the end, it says:
>
> Warning: unable to close filehandle GEN26 properly: Bad file
> descriptor during global destruction.
> Warning: unable to close filehandle GEN21 properly: Bad file
> descriptor during global destruction.
> Warning: unable to close filehandle GEN6 properly: Bad file descriptor
> during global destruction.
>
> I don't know what it means (Windows' fd->handle mapping table got
> corrupted?) or even which program is printing it (you'd think maybe
> perl?  but how could that be affected by anything I did in
> postgres.exe, but if it's not perl why is it always at the end like
> that?).  Hrmph.

Got some off-list clues: that's just distracting Perl cleanup noise
after something else went wrong (thanks Robert), and now I'm testing a
theory from Andres that we're missing a barrier on the redo side when
replaying XLOG_DBASE_CREATE_FILE_COPY.  More soon.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-03 Thread Thomas Munro
On Wed, May 4, 2022 at 6:36 AM Robert Haas  wrote:
> On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro  wrote:
> > So, to summarise the new patch that I'm attaching to this email as 0001:
>
> This all makes sense to me, and I didn't see anything obviously wrong
> looking through the patch, either.

Thanks.

> > However it seems that I have something wrong, because CI is failing on
> > Windows; I ran out of time for looking into that today, but wanted to
> > post what I have so far since I know we have an open item or two to
> > close here ASAP...
>
> :-(

It passes sometimes and fails sometimes.  Here's the weird failure I
need to debug:

https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse

Right at the end, it says:

Warning: unable to close filehandle GEN26 properly: Bad file
descriptor during global destruction.
Warning: unable to close filehandle GEN21 properly: Bad file
descriptor during global destruction.
Warning: unable to close filehandle GEN6 properly: Bad file descriptor
during global destruction.

I don't know what it means (Windows' fd->handle mapping table got
corrupted?) or even which program is printing it (you'd think maybe
perl?  but how could that be affected by anything I did in
postgres.exe, but if it's not perl why is it always at the end like
that?).  Hrmph.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-05-03 Thread Robert Haas
On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro  wrote:
> So, to summarise the new patch that I'm attaching to this email as 0001:

This all makes sense to me, and I didn't see anything obviously wrong
looking through the patch, either.

> However it seems that I have something wrong, because CI is failing on
> Windows; I ran out of time for looking into that today, but wanted to
> post what I have so far since I know we have an open item or two to
> close here ASAP...

:-(

> Patches 0002-0004 are Andres's, with minor tweaks:
>
> v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch:
>
> I'm still slightly confused about whether we need an *extra* global
> barrier in DROP TABLESPACE, not just if destroy_tablespace_directory()
> failed.

Andres wrote this code, but it looks correct to me. Currently, the
reason why we use USE_BARRIER_SMGRRELEASE is that we want to make sure
that we don't fail to remove a tablespace because some other backend
might have files open. However, it might also be that no other backend
has files open, and in that case we don't need to do anything, so the
current placement of the call is correct. With this patch, though, we
want to make sure that no FD that is open before we start dropping
files remains open after we drop files - and that means we need to
force files to be closed before we even try to remove files the first
time. It seems to me that Andres's patch (your 0002) doesn't add a
second call - it moves the existing one earlier. And that seems right:
no new files should be getting opened once we force them closed the
first time, I hope anyway.
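Conceptually the ordering becomes something like this (an ordering sketch
only, not the actual tablespace.c code):

	/* force every backend to close its smgr fds before the first attempt */
	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

	if (!destroy_tablespace_directories(tablespaceoid, false))
	{
		/* retry path: checkpoint and try again before giving up */
	}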

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-22 Thread Thomas Munro
On Wed, Apr 6, 2022 at 5:07 AM Robert Haas  wrote:
> On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro  wrote:
> > > The checkpointer never takes heavyweight locks, so the opportunity
> > > you're describing can't arise.
> >
> >   Hmm, oh, you probably meant the buffer interlocking
> > in SyncOneBuffer().  It's true that my most recent patch throws away
> > more requests than it could, by doing the level check at the end of
> > the loop over all buffers instead of adding some kind of
> > DropPendingWritebacks() in the barrier handler.  I guess I could find
> > a way to improve that, basically checking the level more often instead
> > of at the end, but I don't know if it's worth it; we're still throwing
> > out an arbitrary percentage of writeback requests.
>
> Doesn't every backend have its own set of pending writebacks?
> BufferAlloc() calls
> ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)?

Yeah.  Alright, for my exaggeration above, s/can't arise/probably
won't often arise/, since when regular backends do this it's for
random dirty buffers, not necessarily stuff they hold a relation lock
on.

I think you are right that if we find ourselves calling smgrwrite() on
another buffer for that relfilenode a bit later, then it's safe to
"unzonk" the relation.  In the worst case, IssuePendingWritebacks()
might finish up calling sync_file_range() on a file that we originally
wrote to for generation 1 of that relfilenode, even though we now have
a file descriptor for generation 2 of that relfilenode that was
reopened by a later smgrwrite().  That would be acceptable, because a
bogus sync_file_range() call would be harmless.  The key point here is
that we absolutely must avoid re-opening files without careful
interlocking, because that could lead to later *writes* for generation
2 going to the defunct generation 1 file that we opened just as it was
being unlinked.

(For completeness:  we know of another way for smgrwrite() to write to
the wrong generation of a recycled relfilenode, but that's hopefully
very unlikely and will hopefully be addressed by adding more bits and
killing off OID-wraparound[1].  In this thread we're concerned only
with these weird "explicit OID reycling" cases we're trying to fix
with PROCSIGNAL_BARRIER_SMGRRELEASE sledgehammer-based cache
invalidation.)

My new attempt, attached, is closer to what Andres proposed, except at
the level of md.c segment objects instead of level of SMgrRelation
objects.  This avoids the dangling SMgrRelation pointer problem
discussed earlier, and is also a little like your "zonk" idea, except
instead of a zonk flag, the SMgrRelation object is reset to a state
where it knows nothing at all except its relfilenode.  Any access
through smgrXXX() functions *except* smgrwriteback() will rebuild the
state, just as you would clear the hypothetical zonk flag.

So, to summarise the new patch that I'm attaching to this email as 0001:

1.  When PROCSIGNAL_BARRIER_SMGRRELEASE is handled, we call
smgrreleaseall(), which calls smgrrelease() on all SMgrRelation
objects, telling them to close all their files.

2.  All SMgrRelation objects are still valid, and any pointers to them
that were acquired before a CFI() and then used in an smgrXXX() call
afterwards will still work (the example I know of being
RelationCopyStorage()[2]).

3.  mdwriteback() internally uses a new "behavior" flag
EXTENSION_DONT_OPEN when getting its hands on the internal segment
object, which says that we should just give up immediately if we don't
already have the file open (concretely: if
PROCSIGNAL_BARRIER_SMGRRELEASE came along between our recent
smgrwrite() call and the writeback machinery's smgrwriteback() call,
it'll do nothing at all).

Someone might say that it's weird that smgrwriteback() has that
behaviour internally, and we might eventually want to make it more
explicit by adding a "behavior" argument to the function itself, so
that it's the caller that controls it.  It didn't seem worth it for
now though; the whole thing is a hint to the operating system anyway.
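For point 1, the release functions are roughly as follows (simplified from
the attached 0001): close every fork's file descriptors and forget the cached
state, but keep the SMgrRelationData itself valid.

	void
	smgrrelease(SMgrRelation reln)
	{
		for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
		{
			smgrsw[reln->smgr_which].smgr_close(reln, forknum);
			reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
		}
		reln->smgr_targblock = InvalidBlockNumber;
	}

	void
	smgrreleaseall(void)
	{
		HASH_SEQ_STATUS status;
		SMgrRelation reln;

		if (SMgrRelationHash == NULL)
			return;

		hash_seq_init(&status, SMgrRelationHash);
		while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
			smgrrelease(reln);
	}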

However it seems that I have something wrong, because CI is failing on
Windows; I ran out of time for looking into that today, but wanted to
post what I have so far since I know we have an open item or two to
close here ASAP...

Patches 0002-0004 are Andres's, with minor tweaks:

v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch:

It seems useless to even keep the macro USE_BARRIER_SMGRRELEASE if
we're going to define it always, so I removed it.  I guess it's useful
to be able to disable that logic easily to see that the assertion in
the other patch fails, but you can do that by commenting out a line in
ProcessBarrierSmgrRelease().

I'm still slightly confused about whether we need an *extra* global
barrier in DROP TABLESPACE, not just if destroy_tablespace_directory()
failed.

v3-0003-WIP-test-for-file-reuse-dangers-around-database-a.patch

Was missing:

-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding 

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-05 Thread Robert Haas
On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro  wrote:
> > The checkpointer never takes heavyweight locks, so the opportunity
> > you're describing can't arise.
>
>   Hmm, oh, you probably meant the buffer interlocking
> in SyncOneBuffer().  It's true that my most recent patch throws away
> more requests than it could, by doing the level check at the end of
> the loop over all buffers instead of adding some kind of
> DropPendingWritebacks() in the barrier handler.  I guess I could find
> a way to improve that, basically checking the level more often instead
> of at the end, but I don't know if it's worth it; we're still throwing
> out an arbitrary percentage of writeback requests.

Doesn't every backend have its own set of pending writebacks?
BufferAlloc() calls
ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Thread Thomas Munro
On Tue, Apr 5, 2022 at 10:24 AM Thomas Munro  wrote:
> On Tue, Apr 5, 2022 at 2:18 AM Robert Haas  wrote:
> > I'm not sure that it really matters, but with the idea that I proposed
> > it's possible to "save" a pending writeback, if we notice that we're
> > accessing the relation with a proper lock after the barrier arrives
> > and before we actually try to do the writeback. With this approach we
> > throw them out immediately, so they're just gone. I don't mind that
> > because I don't think this will happen often enough to matter, and I
> > don't think it will be very expensive when it does, but it's something
> > to think about.
>
> The checkpointer never takes heavyweight locks, so the opportunity
> you're describing can't arise.

  Hmm, oh, you probably meant the buffer interlocking
in SyncOneBuffer().  It's true that my most recent patch throws away
more requests than it could, by doing the level check at the end of
the loop over all buffers instead of adding some kind of
DropPendingWritebacks() in the barrier handler.  I guess I could find
a way to improve that, basically checking the level more often instead
of at the end, but I don't know if it's worth it; we're still throwing
out an arbitrary percentage of writeback requests.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Thread Thomas Munro
On Tue, Apr 5, 2022 at 2:18 AM Robert Haas  wrote:
> On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro  wrote:
> > Another idea would be to call a new function DropPendingWritebacks(),
> > and also tell all the SMgrRelation objects to close all their internal
> > state (ie the fds + per-segment objects) but not free the main
> > SMgrRelationData object, and for good measure also invalidate the
> > small amount of cached state (which hasn't been mentioned in this
> > thread, but it kinda bothers me that that state currently survives, so
> > it was one unspoken reason I liked the smgrcloseall() idea).  Since
> > DropPendingWritebacks() is in a rather different module, perhaps if we
> > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> > something else, because it's not really a SMGR-only thing anymore.
>
> I'm not sure that it really matters, but with the idea that I proposed
> it's possible to "save" a pending writeback, if we notice that we're
> accessing the relation with a proper lock after the barrier arrives
> and before we actually try to do the writeback. With this approach we
> throw them out immediately, so they're just gone. I don't mind that
> because I don't think this will happen often enough to matter, and I
> don't think it will be very expensive when it does, but it's something
> to think about.

The checkpointer never takes heavyweight locks, so the opportunity
you're describing can't arise.

This stuff happens entirely within the scope of a call to
BufferSync(), which is called by CheckPointBuffers().  BufferSync()
does WritebackContextInit() to set up the object that accumulates the
writeback requests, and then runs for a while performing a checkpoint
(usually spread out over time), and then at the end does
IssuePendingWritebacks().  A CFI() can be processed any time up until
the IssuePendingWritebacks(), but not during IssuePendingWritebacks().
That's the size of the gap we need to cover.

I still like the patch I posted most recently.  Note that it's not
quite as I described above: there is no DropPendingWritebacks(),
because that turned out to be impossible to implement in a way that
the barrier handler could call -- it needs access to the writeback
context.  Instead I had to expose a counter so that
IssuePendingWritebacks() could detect whether any smgrreleaseall()
calls had happened while the writeback context was being filled up
with writeback requests, by comparing with the level recorded therein.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-04 Thread Robert Haas
On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro  wrote:
> Another idea would be to call a new function DropPendingWritebacks(),
> and also tell all the SMgrRelation objects to close all their internal
> state (ie the fds + per-segment objects) but not free the main
> SMgrRelationData object, and for good measure also invalidate the
> small amount of cached state (which hasn't been mentioned in this
> thread, but it kinda bothers me that that state currently survives, so
> it was one unspoken reason I liked the smgrcloseall() idea).  Since
> DropPendingWritebacks() is in a rather different module, perhaps if we
> were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> something else, because it's not really a SMGR-only thing anymore.

I'm not sure that it really matters, but with the idea that I proposed
it's possible to "save" a pending writeback, if we notice that we're
accessing the relation with a proper lock after the barrier arrives
and before we actually try to do the writeback. With this approach we
throw them out immediately, so they're just gone. I don't mind that
because I don't think this will happen often enough to matter, and I
don't think it will be very expensive when it does, but it's something
to think about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-01 Thread Thomas Munro
On Sat, Apr 2, 2022 at 10:03 AM Thomas Munro  wrote:
> Another idea would be to call a new function DropPendingWritebacks(),
> and also tell all the SMgrRelation objects to close all their internal
> state (ie the fds + per-segment objects) but not free the main
> SMgrRelationData object, and for good measure also invalidate the
> small amount of cached state (which hasn't been mentioned in this
> thread, but it kinda bothers me that that state currently survives, so
> it was one unspoken reason I liked the smgrcloseall() idea).  Since
> DropPendingWritebacks() is in a rather different module, perhaps if we
> were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> something else, because it's not really a SMGR-only thing anymore.

Here's a sketch of that idea.
From 58262253e167e25d4bbb6777574a98fb12850c29 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 2 Apr 2022 17:05:39 +1300
Subject: [PATCH 1/2] WIP: Rethink PROCSIGNAL_BARRIER_SMGRRELEASE.

With sufficiently bad luck, it was possible for IssuePendingWritebacks()
to reopen relation files just after we'd processed
PROCSIGNAL_BARRIER_SMGRRELEASE but before the file was unlinked by some
other backend.

While commit 4eb21763 initially appeared to have fixed the spate of test
failures we'd seen on Windows, testing with extra instrumentation to
look out for writes to unlinked files revealed the dangerous nature of
these deferred file references.

This second attempt introduces smgrrelease(reln), smgrreleaseall(), and
smgrreleaseallcount().  That last is used by IssuePendingWritebacks()
to notice when pending writebacks have to be considered potentially
invalid, by watching out for level changes during the time writeback
instructions were accumulated.

XXX Experimental

Reported-by: Andres Freund 
Discussion: https://postgr.es/m/20220209220004.kb3dgtn2x2k2gtdm%40alap3.anarazel.de
---
 src/backend/storage/buffer/bufmgr.c | 14 +++
 src/backend/storage/smgr/md.c   |  6 ---
 src/backend/storage/smgr/smgr.c | 58 +
 src/include/storage/buf_internals.h |  3 ++
 src/include/storage/md.h|  1 -
 src/include/storage/smgr.h  |  3 ++
 6 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d73a40c1bc..43652b757a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4873,6 +4873,7 @@ WritebackContextInit(WritebackContext *context, int *max_pending)
 
 	context->max_pending = max_pending;
 	context->nr_pending = 0;
+	context->smgr_release_count = smgrreleaseallcount();
 }
 
 /*
@@ -4924,6 +4925,18 @@ IssuePendingWritebacks(WritebackContext *context)
 {
 	int			i;
 
+	/*
+	 * Throw away the contents if smgrreleaseall() has been invoked since
+	 * "context" was initialized or cleared.  Since the main loop below doesn't
+	 * check for interrupts or otherwise handle barrier requests, we don't have
+	 * to worry about invalidation if we get past this check.
+	 */
+	if (context->smgr_release_count != smgrreleaseallcount())
+	{
+		context->nr_pending = 0;
+		context->smgr_release_count = smgrreleaseallcount();
+	}
+
 	if (context->nr_pending == 0)
 		return;
 
@@ -4983,6 +4996,7 @@ IssuePendingWritebacks(WritebackContext *context)
 	}
 
 	context->nr_pending = 0;
+	context->smgr_release_count = smgrreleaseallcount();
 }
 
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 879f647dbc..d26c915f90 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -549,12 +549,6 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	}
 }
 
-void
-mdrelease(void)
-{
-	closeAllVfds();
-}
-
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index d71a557a35..087c8e76d1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -41,7 +41,6 @@ typedef struct f_smgr
 {
 	void		(*smgr_init) (void);	/* may be NULL */
 	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_release) (void); /* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
@@ -70,7 +69,6 @@ static const f_smgr smgrsw[] = {
 	{
 		.smgr_init = mdinit,
 		.smgr_shutdown = NULL,
-		.smgr_release = mdrelease,
 		.smgr_open = mdopen,
 		.smgr_close = mdclose,
 		.smgr_create = mdcreate,
@@ -100,6 +98,8 @@ static dlist_head unowned_relns;
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
 
+static uint64 releaseall_count = 0;
+
 
 /*
  *	smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -281,6 +281,44 @@ smgrclose(SMgrRelation reln)
 		*owner = NULL;
 }
 
+/*
+ *	smgrrelease() -- Release all resources used by this object.
+ *
+ *	The object remains 

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-01 Thread Thomas Munro
On Sat, Apr 2, 2022 at 2:52 AM Robert Haas  wrote:
> On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro  wrote:
> > The v1-0003 patch introduced smgropen_cond() to avoid the problem of
> > IssuePendingWritebacks(), which does desynchronised smgropen() calls
> > and could open files after the barrier but just before they are
> > unlinked.  Makes sense, but...
> >
> > 1.  For that to actually work, we'd better call smgrcloseall() when
> > PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
> > closeAllVfds().  Here's a patch for that.  But...
>
> What if we instead changed our approach to fixing
> IssuePendingWritebacks()?  Design sketch:
>
> 1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false.
> 2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
> 3. If you want, you can set it back to false any time you access the
> smgr while holding a relation lock.
> 4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that
> function now refuses either to create a new smgr or to return one
> where smgr_zonk is true.
>
> I think this would be sufficient to guarantee that we never try to
> write back to a relfilenode if we haven't had a lock on it since the
> last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we
> need here?
>
> My thinking here is that it's a lot easier to get away with marking
> the content of a data structure invalid than it is to get away with
> invalidating pointers to that data structure. If we can tinker with
> the design so that the SMgrRelationData doesn't actually go away but
> just gets a little bit more thoroughly invalidated, we could avoid
> having to audit the entire code base for code that keeps smgr handles
> around longer than would be possible with the design you demonstrate
> here.

Another idea would be to call a new function DropPendingWritebacks(),
and also tell all the SMgrRelation objects to close all their internal
state (ie the fds + per-segment objects) but not free the main
SMgrRelationData object, and for good measure also invalidate the
small amount of cached state (which hasn't been mentioned in this
thread, but it kinda bothers me that that state currently survives, so
it was one unspoken reason I liked the smgrcloseall() idea).  Since
DropPendingWritebacks() is in a rather different module, perhaps if we
were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
something else, because it's not really a SMGR-only thing anymore.




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-01 Thread Robert Haas
On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro  wrote:
> The v1-0003 patch introduced smgropen_cond() to avoid the problem of
> IssuePendingWritebacks(), which does desynchronised smgropen() calls
> and could open files after the barrier but just before they are
> unlinked.  Makes sense, but...
>
> 1.  For that to actually work, we'd better call smgrcloseall() when
> PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
> closeAllVfds().  Here's a patch for that.  But...

What if we instead changed our approach to fixing
IssuePendingWritebacks()?  Design sketch:

1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false.
2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
3. If you want, you can set it back to false any time you access the
smgr while holding a relation lock.
4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that
function now refuses either to create a new smgr or to return one
where smgr_zonk is true.

I think this would be sufficient to guarantee that we never try to
write back to a relfilenode if we haven't had a lock on it since the
last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we
need here?
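Spelled out as a hypothetical sketch (none of these names exist in the tree;
smgr_zonk, ZonkAllSmgrRelations() and smgr_lookup() are just the proposal
made concrete):

	/* Step 2: the barrier handler marks entries instead of destroying them. */
	static void
	ZonkAllSmgrRelations(void)
	{
		HASH_SEQ_STATUS status;
		SMgrRelation reln;

		if (SMgrRelationHash == NULL)
			return;

		hash_seq_init(&status, SMgrRelationHash);
		while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
			reln->smgr_zonk = true;
	}

	/* Step 4: refuse to create a new entry or to hand out a zonked one. */
	SMgrRelation
	smgropen_cond(RelFileNode rnode, BackendId backend)
	{
		SMgrRelation reln = smgr_lookup(rnode, backend);	/* no create */

		if (reln == NULL || reln->smgr_zonk)
			return NULL;
		return reln;
	}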

My thinking here is that it's a lot easier to get away with marking
the content of a data structure invalid than it is to get away with
invalidating pointers to that data structure. If we can tinker with
the design so that the SMgrRelationData doesn't actually go away but
just gets a little bit more thoroughly invalidated, we could avoid
having to audit the entire code base for code that keeps smgr handles
around longer than would be possible with the design you demonstrate
here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-04-01 Thread Thomas Munro
Some thoughts:

The v1-0003 patch introduced smgropen_cond() to avoid the problem of
IssuePendingWritebacks(), which does desynchronised smgropen() calls
and could open files after the barrier but just before they are
unlinked.  Makes sense, but...

1.  For that to actually work, we'd better call smgrcloseall() when
PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
closeAllVfds().  Here's a patch for that.  But...

2.  The reason I used closeAllVfds() in 4eb21763 is because I didn't
want random CHECK_FOR_INTERRUPTS() calls to break code like this, from
RelationCopyStorage():

for (blkno = 0; blkno < nblocks; blkno++)
{
/* If we got a cancel signal during the copy of the data, quit */
CHECK_FOR_INTERRUPTS();

smgrread(src, forkNum, blkno, buf.data);

Perhaps we could change RelationCopyStorage() to take Relation, and
use CreateFakeRelcacheEntry() in various places to satisfy that, and
also extend that mechanism to work with temporary tables.  Here's an
initial sketch of that idea, to see what problems it crashes into...
Fake Relations are rather unpleasant though; I wonder if there is an
SMgrRelationHandle concept that could make this all nicer.  There is
possibly also some cleanup missing to avoid an SMgrRelation that
points to a defunct fake Relation on non-local exit (?).
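A usage sketch of the idea (the Relation-taking RelationCopyStorage() and the
extra BackendId plumbing are what the WIP 0002 below experiments with, nothing
committed):

	/* A caller that today passes SMgrRelations would instead do roughly: */
	Relation	src = CreateFakeRelcacheEntry(src_rnode);
	Relation	dst = CreateFakeRelcacheEntry(dst_rnode);

	RelationCopyStorage(src, dst, forkNum, relpersistence);

	FreeFakeRelcacheEntry(src);
	FreeFakeRelcacheEntry(dst);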
From 943228b5d4e0dc42ad219c3fe97adb59e7262593 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 1 Apr 2022 17:02:51 +1300
Subject: [PATCH 1/2] WIP: Rethink PROCSIGNAL_BARRIER_SMGRRELEASE.

Commit 4eb21763 closed kernel file descriptors, but left SMgrRelations
in existence.  Unfortunately, IssuePendingWritebacks() can reopen them
at any time, including right before a file is unlinked.  In order
for IssuePendingWritebacks() to be able to avoid this hazard in a later
commit, go further and close the SMgrRelations.

XXX We had better make sure that no code does:

  reln = smgropen(...);
  CHECK_FOR_INTERRUPTS();
  smgrXXX(reln, ...);
---
 src/backend/storage/smgr/md.c   |  6 --
 src/backend/storage/smgr/smgr.c | 10 ++
 src/include/storage/md.h|  1 -
 3 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 879f647dbc..d26c915f90 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -549,12 +549,6 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	}
 }
 
-void
-mdrelease(void)
-{
-	closeAllVfds();
-}
-
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index efa83ae394..994728db04 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -41,7 +41,6 @@ typedef struct f_smgr
 {
 	void		(*smgr_init) (void);	/* may be NULL */
 	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_release) (void); /* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
@@ -70,7 +69,6 @@ static const f_smgr smgrsw[] = {
 	{
 		.smgr_init = mdinit,
 		.smgr_shutdown = NULL,
-		.smgr_release = mdrelease,
 		.smgr_open = mdopen,
 		.smgr_close = mdclose,
 		.smgr_create = mdcreate,
@@ -717,17 +715,13 @@ AtEOXact_SMgr(void)
 }
 
 /*
- * This routine is called when we are ordered to release all open files by a
+ * This routine is called when we are ordered to close all SMgrRelations by a
  * ProcSignalBarrier.
  */
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	for (int i = 0; i < NSmgr; i++)
-	{
-		if (smgrsw[i].smgr_release)
-			smgrsw[i].smgr_release();
-	}
+	smgrcloseall();
 
 	return true;
 }
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 6e46d8d96a..a40db7 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,7 +23,6 @@
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
 extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdrelease(void);
 extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-- 
2.35.1

From 7070d98f0287033119f7217cdcfe88d3a2eef937 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 1 Apr 2022 18:35:26 +1300
Subject: [PATCH 2/2] WIP: Handle invalidation in RelationCopyStorage().

Since the loop in RelationCopyStorage() processes interrupts, and since
the proposal is that interrupts might invalidate SMgrRelation objects,
this function had better deal in Relation objects instead.

Various codepaths now need to use CreateFakeRelcacheEntry() to obtain
one.  That function therefore now needs to support temporary relations
too, hence addition of BackendId parameter.

XXX it's now extra silly to have that stuff in xlogutils.c
XXX maybe what 

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-03 Thread Robert Haas
On Thu, Mar 3, 2022 at 1:28 PM Andres Freund  wrote:
> > I can't remember that verify() is the one that accesses conflict.db large
> > while cause_eviction() is the one that accesses postgres.replace_sb for more
> > than like 15 seconds.
>
> For more than 15 seconds? The whole test runs in a few seconds for me...

I'm not talking about running the test. I'm talking about reading it
and trying to keep straight what is happening. The details of which
function is accessing which database keep going out of my head.
Quickly. Maybe because I'm dumb, but I think some better naming could
help, just in case other people are dumb, too.

> > As to (5), the identifiers are just primary and standby, without
> > mentioning the database name, which adds to the confusion, and there
> > are no comments explaining why we have two nodes.
>
> I don't follow this one - physical rep doesn't do anything below the database
> level?

Right but ... those handles are connected to a particular DB on that
node, not just the node in general.

> > I think it would be helpful to find a way to divide the test case up
> > into sections that are separated from each other visually, and explain
> > the purpose of each test a little more in a comment. For example I'm
> > struggling to understand the exact purpose of this bit:
> >
> > +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
> > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
> > +
> > +verify($node_primary, $node_standby, 3,
> > +  "restored contents as expected");
> >
> > I'm all for having test coverage of VACUUM FULL, but it isn't clear to
> > me what this does that is related to FD reuse.
>
> It's just trying to clean up prior damage, so the test continues to later
> ones. Definitely not needed once there's a fix. But it's useful while trying
> to make the test actually detect various corruptions.

Ah, OK, I definitely did not understand that before.

> > Similarly for the later ones. I generally grasp that you are trying to
> > make sure that things are OK with respect to FD reuse in various
> > scenarios, but I couldn't explain to you what we'd be losing in terms
> > of test coverage if we removed this line:
> >
> > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");
> >
> > I am not sure how much all of this potential cleanup matters because
> > I'm pretty skeptical that this would be stable in the buildfarm.
>
> Which "potential cleanup" are you referring to?

I mean, whatever improvements you might consider making based on my comments.

> > It relies on a lot of assumptions about timing and row sizes and details of
> > when invalidations are sent that feel like they might not be entirely
> > predictable at the level you would need to have this consistently pass
> > everywhere. Perhaps I am too pessimistic?
>
> I don't think the test passing should be dependent on row size, invalidations
> etc to a significant degree (unless the tables are too small to reach s_b
> etc)? The exact symptoms of failures are highly unstable, but obviously we'd
> fix them in-tree before committing a test. But maybe I'm missing a dependency?
>
> FWIW, the test did pass on freebsd, linux, macos and windows with the
> fix. After a few iterations of improving the fix ;)

Hmm. Well, I admit that's already more than I would have expected

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

2022-03-03 Thread Andres Freund
Hi,

On 2022-03-03 13:11:17 -0500, Robert Haas wrote:
> On Wed, Mar 2, 2022 at 3:00 PM Andres Freund  wrote:
> > On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> > > - I am having some trouble understanding clearly what 0001 is doing.
> > > I'll try to study it further.
> >
> > It tests for the various scenarios I could think of that could lead to FD
> > reuse, to state the obvious ;). Anything particularly unclear?
> 
> As I understand it, the basic idea of this test is:
> 
> 1. create a database called 'conflict_db' containing a table called 'large'
> 2. also create a table called 'replace_sb' in the postgres db.
> 3. have a long running transaction open in 'postgres' database that
> repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
> table, sometimes causing write-outs
> 4. try to get it to write out the data to a stale file descriptor by
> fiddling with various things, and then detect the corruption that
> results
> 5. use both a primary and a standby not because they are intrinsically
> necessary but because you want to test that both cases work

Right.


I wasn't proposing to commit the test as-is, to be clear. It was meant as a
demonstration of the types of problems I can see, and what's needed to fix
them...


> I can't remember that verify() is the one that accesses conflict_db.large
> while cause_eviction() is the one that accesses postgres.replace_sb for more
> than like 15 seconds.

For more than 15 seconds? The whole test runs in a few seconds for me...


> As to (5), the identifiers are just primary and standby, without
> mentioning the database name, which adds to the confusion, and there
> are no comments explaining why we have two nodes.

I don't follow this one - physical rep doesn't do anything below the database
level?



> I think it would be helpful to find a way to divide the test case up
> into sections that are separated from each other visually, and explain
> the purpose of each test a little more in a comment. For example I'm
> struggling to understand the exact purpose of this bit:
> 
> +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
> +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
> +
> +verify($node_primary, $node_standby, 3,
> +  "restored contents as expected");
> 
> I'm all for having test coverage of VACUUM FULL, but it isn't clear to
> me what this does that is related to FD reuse.

It's just trying to clean up prior damage, so the test continues to later
ones. Definitely not needed once there's a fix. But it's useful while trying
to make the test actually detect various corruptions.


> Similarly for the later ones. I generally grasp that you are trying to
> make sure that things are OK with respect to FD reuse in various
> scenarios, but I couldn't explain to you what we'd be losing in terms
> of test coverage if we removed this line:
> 
> +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");
> 
> I am not sure how much all of this potential cleanup matters because
> I'm pretty skeptical that this would be stable in the buildfarm.

Which "potential cleanup" are you referring to?


> It relies on a lot of assumptions about timing and row sizes and details of
> when invalidations are sent that feel like they might not be entirely
> predictable at the level you would need to have this consistently pass
> everywhere. Perhaps I am too pessimistic?

I don't think the test passing should be dependent on row size, invalidations
etc to a significant degree (unless the tables are too small to reach s_b
etc)? The exact symptoms of failures are highly unstable, but obviously we'd
fix them in-tree before committing a test. But maybe I'm missing a dependency?

FWIW, the test did pass on freebsd, linux, macos and windows with the
fix. After a few iterations of improving the fix ;)

Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-03-03 Thread Robert Haas
On Wed, Mar 2, 2022 at 3:00 PM Andres Freund  wrote:
> On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> > - I am having some trouble understanding clearly what 0001 is doing.
> > I'll try to study it further.
>
> It tests for the various scenarios I could think of that could lead to FD
> reuse, to state the obvious ;). Anything particularly unclear?

As I understand it, the basic idea of this test is:

1. create a database called 'conflict_db' containing a table called 'large'
2. also create a table called 'replace_sb' in the postgres db.
3. have a long running transaction open in 'postgres' database that
repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
table, sometimes causing write-outs
4. try to get it to write out the data to a stale file descriptor by
fiddling with various things, and then detect the corruption that
results
5. use both a primary and a standby not because they are intrinsically
necessary but because you want to test that both cases work

As to (1) and (2), I note that 'large' is not a hugely informative
table name, especially when it's the smaller of the two test tables.
And also, the names are all over the place. The table names don't have
much to do with each other (large, replace_sb) and they're completely
unrelated to the database names (postgres, conflict_db). And then you
have Perl functions whose names also don't make it obvious what we're
talking about. I can't remember that verify() is the one that accesses
conflict_db.large while cause_eviction() is the one that accesses
postgres.replace_sb for more than like 15 seconds. It'd make it a lot
easier if the table names, database names, and Perl function names had
some common elements.

As to (5), the identifiers are just primary and standby, without
mentioning the database name, which adds to the confusion, and there
are no comments explaining why we have two nodes.

Also, some of the comments seem like they might be in the wrong place:

+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.

Right after this comment, we create a table and then a template
database just after, so I think we're not avoiding any invalidations
at this point. We're avoiding them at later points where the comments
aren't talking about this.

I think it would be helpful to find a way to divide the test case up
into sections that are separated from each other visually, and explain
the purpose of each test a little more in a comment. For example I'm
struggling to understand the exact purpose of this bit:

+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+  "restored contents as expected");

I'm all for having test coverage of VACUUM FULL, but it isn't clear to
me what this does that is related to FD reuse.

Similarly for the later ones. I generally grasp that you are trying to
make sure that things are OK with respect to FD reuse in various
scenarios, but I couldn't explain to you what we'd be losing in terms
of test coverage if we removed this line:

+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");

I am not sure how much all of this potential cleanup matters because
I'm pretty skeptical that this would be stable in the buildfarm. It
relies on a lot of assumptions about timing and row sizes and details
of when invalidations are sent that feel like they might not be
entirely predictable at the level you would need to have this
consistently pass everywhere. Perhaps I am too pessimistic?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-03-02 Thread Robert Haas
On Wed, Mar 2, 2022 at 3:00 PM Andres Freund  wrote:
> What I am stuck on is what we can do for the released branches. Data
> corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like
> something we need to address.

I think we should consider back-porting the ProcSignalBarrier stuff
eventually. I realize that it's not really battle-tested yet and I am
not saying we should do it right now, but I think that if we get these
changes into v15, back-porting it in let's say May of next year could
be a reasonable thing to do. Sure, there is some risk there, but on
the other hand, coming up with completely different fixes for the
back-branches is not risk-free either, nor is it clear that there is
any alternative fix that is nearly as good. In the long run, I am
fairly convinced that ProcSignalBarrier is the way forward not only
for this purpose but for other things as well, and everybody's got to
get on the train or be left behind.

Also, I am aware of multiple instances where the project waited a
long, long time to fix bugs because we didn't have a back-patchable
fix. I disagree with that on principle. A master-only fix now is
better than a back-patchable fix two or three years from now. Of
course a back-patchable fix now is better still, but we have to pick
from the options we have, not the ones we'd like to have.



It seems to me that if we were going to try to construct an
alternative fix for the back-branches, it would have to be something
that didn't involve a new invalidation mechanism -- because the
ProcSignalBarrier stuff is an invalidation mechanism in effect, and I
feel that it can't be better to invent two new invalidation mechanisms
rather than one. And the only idea I have is trying to detect a
dangerous sequence of operations and just outright block it. We have
some cases sort of like that already - e.g. you can't prepare a
transaction if it's done certain things. But, the existing precedents
that occur to me are, I think, all cases where all of the related
actions are being performed in the same backend. It doesn't sound
crazy to me to have some rule like "you can't ALTER TABLESPACE on the
same tablespace in the same backend twice in a row without an
intervening checkpoint", or whatever, and install the book-keeping to
enforce that. But I don't think anything like that can work, both
because the two ALTER TABLESPACE commands could be performed in
different sessions, and also because an intervening checkpoint is no
guarantee of safety anyway, IIUC. So I'm just not really seeing a
reasonable strategy that isn't basically the barrier stuff.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-03-02 Thread Andres Freund
Hi,

On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> - I am having some trouble understanding clearly what 0001 is doing.
> I'll try to study it further.

It tests for the various scenarios I could think of that could lead to FD
reuse, to state the obvious ;). Anything particularly unclear?


> - In general, 0003 makes a lot of sense to me.
> 
> +   /*
> +    * Finally tell the kernel to write the data to storage. Don't smgr if
> +    * previously closed, otherwise we could end up evading fd-reuse
> +    * protection.
> +    */
> 
> - I think the above hunk is missing a word, because it uses smgr as a
> verb. I also think that it's easy to imagine this explanation being
> insufficient for some future hacker to understand the issue.

Yea, it's definitely not sufficient or grammatically correct. I think I wanted
to send something out, but was very tired by that point...


> - While 0004 seems useful for testing, it's an awfully big hammer. I'm
> not sure we should be thinking about committing something like that,
> or at least not as a default part of the build. But ... maybe?

Agreed. I think it's racy anyway, because further files could get deleted
(e.g. segments > 0) after the barrier has been processed.


What I am stuck on is what we can do for the released branches. Data
corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like
something we need to address.

Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-03-02 Thread Robert Haas
On Tue, Feb 22, 2022 at 4:40 AM Andres Freund  wrote:
> On 2022-02-22 01:11:21 -0800, Andres Freund wrote:
> > I've started to work on a few debugging aids to find problems like
> > these. Attached are two WIP patches:
>
> Forgot to attach. Also importantly includes a TAP test for several of these
> issues.

Hi,

Just a few very preliminary comments:

- I am having some trouble understanding clearly what 0001 is doing.
I'll try to study it further.

- 0002 seems like it's generally a good idea. I haven't yet dug into
why the call sites for AssertFileNotDeleted() are where they are, or
whether that's a complete set of places to check.

- In general, 0003 makes a lot of sense to me.

+   /*
+    * Finally tell the kernel to write the data to storage. Don't smgr if
+    * previously closed, otherwise we could end up evading fd-reuse
+    * protection.
+    */

- I think the above hunk is missing a word, because it uses smgr as a
verb. I also think that it's easy to imagine this explanation being
insufficient for some future hacker to understand the issue.

- While 0004 seems useful for testing, it's an awfully big hammer. I'm
not sure we should be thinking about committing something like that,
or at least not as a default part of the build. But ... maybe?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-22 01:11:21 -0800, Andres Freund wrote:
> I've started to work on a few debugging aids to find problems like
> these. Attached are two WIP patches:

Forgot to attach. Also importantly includes a TAP test for several of these
issues.

Greetings,

Andres Freund
>From 0bc64874f8e5faae9a38731a83aa7b001095cc35 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 21 Feb 2022 15:44:02 -0800
Subject: [PATCH v1 1/4] WIP: test for file reuse dangers around database and
 tablespace commands.

---
 src/test/recovery/t/029_relfilenode_reuse.pl | 233 +++
 1 file changed, 233 insertions(+)
 create mode 100644 src/test/recovery/t/029_relfilenode_reuse.pl

diff --git a/src/test/recovery/t/029_relfilenode_reuse.pl b/src/test/recovery/t/029_relfilenode_reuse.pl
new file mode 100644
index 000..22d8e85614c
--- /dev/null
+++ b/src/test/recovery/t/029_relfilenode_reuse.pl
@@ -0,0 +1,233 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Basename;
+
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', q[
+allow_in_place_tablespaces = true
+log_connections=on
+# to avoid "repairing" corruption
+full_page_writes=off
+log_min_messages=debug2
+autovacuum_naptime=1s
+shared_buffers=1MB
+]);
+$node_primary->start;
+
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(300);
+
+my %psql_primary = (stdin => '', stdout => '', stderr => '');
+$psql_primary{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_primary->connstr('postgres') ],
+	'<',
+	\$psql_primary{stdin},
+	'>',
+	\$psql_primary{stdout},
+	'2>',
+	\$psql_primary{stderr},
+	$psql_timeout);
+
+my %psql_standby = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_standby->connstr('postgres') ],
+	'<',
+	\$psql_standby{stdin},
+	'>',
+	\$psql_standby{stdout},
+	'2>',
+	\$psql_standby{stderr},
+	$psql_timeout);
+
+
+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.
+
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db_template OID = 5;");
+$node_primary->safe_psql('conflict_db_template', q[
+CREATE TABLE large(id serial primary key, dataa text, datab text);
+INSERT INTO large(dataa, datab) SELECT g.i::text, 1 FROM generate_series(1, 4000) g(i);]);
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;");
+
+$node_primary->safe_psql('postgres', q[
+CREATE EXTENSION pg_prewarm;
+CREATE TABLE replace_sb(data text);
+INSERT INTO replace_sb(data) SELECT random()::text FROM generate_series(1, 15000);]);
+
+# Use longrunning transactions, so that AtEOXact_SMgr doesn't close files
+send_query_and_wait(
+	\%psql_primary,
+	q[BEGIN;],
+	qr/BEGIN/m);
+send_query_and_wait(
+	\%psql_standby,
+	q[BEGIN;],
+	qr/BEGIN/m);
+
+# Cause lots of dirty rows in shared_buffers
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 1;");
+
+# Now do a bunch of work in another database. That will end up needing to
+# write back dirty data from the previous step, opening the relevant file
+# descriptors
+cause_eviction(\%psql_primary, \%psql_standby);
+
+# drop and recreate database
+$node_primary->safe_psql('postgres', "DROP DATABASE conflict_db;");
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;");
+
+verify($node_primary, $node_standby, 1,
+	   "initial contents as expected");
+
+# Again cause lots of dirty rows in shared_buffers, but use a different update
+# value so we can check everything is OK
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 2;");
+
+# Again cause a lot of IO. That'll again write back dirty data, but uses (XXX
+# adjust after bugfix) the already opened file descriptor.
+# FIXME
+cause_eviction(\%psql_primary, \%psql_standby);
+
+verify($node_primary, $node_standby, 2,
+	   "update to reused relfilenode (due to DB oid conflict) is not lost");
+
+
+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+	   "restored 

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
> 
> Yes.

I wrote a test to show the problem. While looking at the code I unfortunately
found that CREATE DATABASE ... OID isn't the only problem. Two consecutive
ALTER DATABASE ... SET TABLESPACE can also cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of
allow_in_place_tablespaces). But manually it's unfortunately quite
reproducible :(

Start a server with

shared_buffers = 1MB
autovacuum=off
bgwriter_lru_maxpages=1
bgwriter_delay=1

these are just to give more time / to require smaller tables.


Start two psql sessions

s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;

s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i);
s2: UPDATE large SET datab = 1;

-- prevent AtEOXact_SMgr
s1: BEGIN;

-- Fill shared_buffers with other contents. This needs to write out the prior dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;

-- this again writes out the dirty buffers. But because nothing forced the smgr handles to be closed,
-- fds pointing to the *old* file contents are used. I.e. we'll write to the wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

┌───────┬───────┐
│ datab │ count │
├───────┼───────┤
│ 1     │  2117 │
│ 2     │    67 │
└───────┴───────┘
(2 rows)


oops.


I don't yet have a good idea how to tackle this in the backbranches.



I've started to work on a few debugging aids to find problems like
these. Attached are two WIP patches:

1) Use fstat(...).st_nlink == 0 to detect dangerous actions on fds with
   deleted files. That seems to work (almost) everywhere. Optionally, on
   Linux, use readlink(/proc/$pid/fd/$fd) to get the filename (sketched below).
2) On Linux, iterate over PGPROC to get a list of pids. Then iterate over
   /proc/$pid/fd/ to check for deleted files. This can only be done right
   after we've made sure there are no fds open to deleted files.
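
To make 1) concrete, here is a minimal standalone sketch of that check
(illustration only, not the attached WIP patch; plain POSIX fstat() plus the
optional Linux readlink(), with error handling trimmed):

/*
 * Illustration only: warn when an fd refers to an already-unlinked file,
 * and on Linux recover the "(deleted)" path via the /proc magic link.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

static void
warn_if_fd_deleted(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0 || st.st_nlink != 0)
		return;					/* fstat failed, or the file still has a name */

#ifdef __linux__
	{
		char		proc_path[64];
		char		target[1024];
		ssize_t		len;

		snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
		len = readlink(proc_path, target, sizeof(target) - 1);
		if (len > 0)
		{
			target[len] = '\0';
			fprintf(stderr, "fd %d refers to deleted file: %s\n", fd, target);
			return;
		}
	}
#endif
	fprintf(stderr, "fd %d refers to a deleted file\n", fd);
}

In the WIP patches this kind of check backs the AssertFileNotDeleted() calls
discussed elsewhere in the thread; the sketch only shows the detection itself.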

Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-10 Thread Andres Freund
Hi,

On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> I was about to commit that, because the original Windows problem it
> solved is showing up occasionally in CI failures (that is, it already
> solves a live problem, albeit a different and non-data-corrupting
> one):

+1

> It seems like I should go ahead and do that today, and we can study
> further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?

Yes.

I wonder whether we really should make the barriers be conditional on
defined(WIN32) || defined(USE_ASSERT_CHECKING) as done in that patch, even
leaving wraparound dangers aside. On !windows we still have the issues of the
space for the dropped / moved files not being released if there are processes
having them open. That can be a lot of space if there's long-lived connections
in a cluster that doesn't fit into s_b (because processes will have random fds
open for writing back dirty buffers). And we don't truncate the files before
unlinking when done as part of a DROP DATABASE...

But that's something we can fine-tune later as well...

Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-10 Thread Andres Freund
Hi,

On 2022-02-10 13:49:50 -0500, Robert Haas wrote:
> I agree. While I feel sort of bad about missing this issue in review,
> I also feel like it's pretty surprising that there isn't something
> plugging this hole already. It feels unexpected that our FD management
> layer might hand you an FD that references the previous file that had
> the same name instead of the one that exists right now, and I don't
> think it's too much of a stretch of the imagination to suppose that
> this won't be the last problem we have if we don't fix it.

Arguably it's not really the FD layer that's the issue - it's on the smgr
level. But that's semantics...


> The main question in my mind is who is going to actually make that
> happen. It was your idea (I think), Thomas coded it, and my commit
> made it a live problem. So who's going to get something committed
> here?

I think it'd make sense for Thomas to commit the current patch for the
tablespace issues first. And then we can go from there?


> > If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> > XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.
> 
> What about movedb()?

My first reaction was that it should be fine, because RelFileNode.spcNode
would differ. But of course that doesn't protect against moving a database
back and forth. So yes, you're right, it's needed there too.


Are there other commands that could be problematic? ALTER TABLE SET TABLESPACE
should be fine, that allocates a new relfilenode.


> I definitely think it would be nice to cover this issue not only for
> database OIDs and tablespace OIDs but also for relfilenode OIDs.
> Otherwise we're right on the cusp of having the same bug in that case.
> pg_upgrade doesn't drop and recreate tables the way it does databases,
> but if somebody made it do that in the future, would they think about
> this? I bet not.

And even just plugging the wraparound aspect would be great. It's not actually
*that* hard to wrap oids around, because of toast.


> Dilip's been working on your idea of making relfilenode into a 56-bit
> quantity. That would make the problem of detecting wraparound go away.
> In the meantime, I suppose we could do think of trying to do something
> here:
> 
> /* wraparound, or first post-initdb assignment, in normal mode */
> ShmemVariableCache->nextOid = FirstNormalObjectId;
> ShmemVariableCache->oidCount = 0;
> 
> That's a bit sketch, because this is the OID counter, which isn't only
> used for relfilenodes.

I can see a few ways to deal with that:

1) Split nextOid into two, nextRelFileNode and nextOid. Have the wraparound
   behaviour only in the nextRelFileNode case.  That'd be a nice improvement,
   but not entirely trivial, because we currently use GetNewRelFileNode() for
   oid assignment as well (c.f. heap_create_with_catalog()).

2) Keep track of the last assigned relfilenode in ShmemVariableCache, and wait
   for a barrier in GetNewRelFileNode() if the assigned oid is wrapped
   around.  While there would be a chance of "missing" a ->nextOid wraparound,
   e.g. because of >2**32 toast oids being allocated, I think that could only
   happen if there were no "wrapped around relfilenodes" allocated, so that
   should be ok.

3) Any form of oid wraparound might cause problems, not just relfilenode
   issues. So having a place to emit barriers on oid wraparound might be a
   good idea.


One thing I wonder is how to make sure the path would be tested. Perhaps we
could emit the anti-wraparound barriers more frequently when assertions are
enabled?


> I'm not sure whether there would be problems waiting for a barrier here - I
> think we might be holding some lwlocks in some code paths.

It seems like a bad idea to call GetNewObjectId() with lwlocks held. From what
I can see the two callers of GetNewObjectId() both have loops that can go on
for a long time after a wraparound.



> > I think we need to add some debugging code to detect "fd to deleted file of
> > same name" style problems. Especially during regression tests they're quite
> > hard to notice, because most of the contents read/written are identical
> > across recycling.
> 
> Maybe. If we just plug the holes here, I am not sure we need permanent
> instrumentation. But I'm also not violently opposed to it.

The question is how we can find all the places where we need to emit & wait for
barriers. I, e.g., didn't initially realize that there's wraparound danger
in WAL replay as well.

Greetings,

Andres Freund




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-10 Thread Robert Haas
On Thu, Feb 10, 2022 at 3:11 PM Thomas Munro  wrote:
> On Fri, Feb 11, 2022 at 7:50 AM Robert Haas  wrote:
> > The main question in my mind is who is going to actually make that
> > happen. It was your idea (I think), Thomas coded it, and my commit
> > made it a live problem. So who's going to get something committed
> > here?
>
> I was about to commit that, because the original Windows problem it
> solved is showing up occasionally in CI failures (that is, it already
> solves a live problem, albeit a different and non-data-corrupting
> one):
>
> https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com
>
> It seems like I should go ahead and do that today, and we can study
> further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?

A bird in the hand is worth two in the bush. Unless it's a vicious,
hand-biting bird.

In other words, if you think what you've got is better than what we
have now and won't break anything worse than it is today, +1 for
committing, and more improvements can come when we get enough round
tuits.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-10 Thread Thomas Munro
On Fri, Feb 11, 2022 at 7:50 AM Robert Haas  wrote:
> The main question in my mind is who is going to actually make that
> happen. It was your idea (I think), Thomas coded it, and my commit
> made it a live problem. So who's going to get something committed
> here?

I was about to commit that, because the original Windows problem it
solved is showing up occasionally in CI failures (that is, it already
solves a live problem, albeit a different and non-data-corrupting
one):

https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com

It seems like I should go ahead and do that today, and we can study
further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-10 Thread Robert Haas
On Wed, Feb 9, 2022 at 5:00 PM Andres Freund  wrote:
> The problem starts with
>
> commit aa01051418f10afbdfa781b8dc109615ca785ff9
> Author: Robert Haas 
> Date:   2022-01-24 14:23:15 -0500
>
> pg_upgrade: Preserve database OIDs.

Well, that's sad.

> I think the most realistic way to address this is the mechanism is to use the
> mechanism from
> https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

I agree. While I feel sort of bad about missing this issue in review,
I also feel like it's pretty surprising that there isn't something
plugging this hole already. It feels unexpected that our FD management
layer might hand you an FD that references the previous file that had
the same name instead of the one that exists right now, and I don't
think it's too much of a stretch of the imagination to suppose that
this won't be the last problem we have if we don't fix it. I also
agree that the mechanism proposed in that thread is the best way to
fix the problem. The sinval mechanism won't work, since there's
nothing to ensure that the invalidation messages are processed in a
sufficiently timely fashion, and there's no obvious way to get around
that problem. But the ProcSignalBarrier mechanism is not intertwined
with heavyweight locking in the way that sinval is, so it should be a
good fit for this problem.

The main question in my mind is who is going to actually make that
happen. It was your idea (I think), Thomas coded it, and my commit
made it a live problem. So who's going to get something committed
here?

> If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

What about movedb()?

> We perhaps could avoid all relfilenode recycling issues on the fd level by
> adding a barrier whenever relfilenode allocation wraps around. But I'm not
> sure how easy that is to detect.

I definitely think it would be nice to cover this issue not only for
database OIDs and tablespace OIDs but also for relfilenode OIDs.
Otherwise we're right on the cusp of having the same bug in that case.
pg_upgrade doesn't drop and recreate tables the way it does databases,
but if somebody made it do that in the future, would they think about
this? I bet not.

Dilip's been working on your idea of making relfilenode into a 56-bit
quantity. That would make the problem of detecting wraparound go away.
In the meantime, I suppose we could do think of trying to do something
here:

/* wraparound, or first post-initdb assignment, in normal mode */
ShmemVariableCache->nextOid = FirstNormalObjectId;
ShmemVariableCache->oidCount = 0;

That's a bit sketch, because this is the OID counter, which isn't only
used for relfilenodes. I'm not sure whether there would be problems
waiting for a barrier here - I think we might be holding some lwlocks
in some code paths.

> I think we might actually be able to remove the checkpoint from dropdb() by
> using barriers?

That looks like it might work. We'd need to be sure that the
checkpointer would both close any open FDs and also absorb fsync
requests before acknowledging the barrier.
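
For reference, the emit-and-wait pattern being discussed looks roughly like
this (a sketch assuming the PROCSIGNAL_BARRIER_SMGRRELEASE barrier type from
Thomas's patch; not committed code, and the wrapper name is made up):

#include "postgres.h"
#include "storage/procsignal.h"

/*
 * Sketch only: make every backend run the SMGRRELEASE barrier handler,
 * closing its smgr-level file descriptors, before the caller goes on to
 * reuse the dropped database's files.
 */
static void
ForceAllBackendsToCloseSmgrFds(void)
{
	uint64		generation;

	/* ask all backends to absorb the barrier... */
	generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE);

	/* ...and wait until every one of them has done so */
	WaitForProcSignalBarrier(generation);
}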

> I think we need to add some debugging code to detect "fd to deleted file of
> same name" style problems. Especially during regression tests they're quite
> hard to notice, because most of the contents read/written are identical across
> recycling.

Maybe. If we just plug the holes here, I am not sure we need permanent
instrumentation. But I'm also not violently opposed to it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-02-09 Thread Justin Pryzby
On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:
> On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
> a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
> == 0.

You could also stat() the file in /proc/self/fd/N and compare st_ino.  It
"looks" like a symlink (in which case that wouldn't work) but it's actually a
Very Special File.  You can even recover deleted, still-opened files that way.
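
A rough standalone sketch of that comparison (assumptions: Linux /proc, and
that 'path' is the name the fd was originally opened under; the helper name
is made up):

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

static bool
fd_still_matches_path(int fd, const char *path)
{
	char		proc_path[64];
	struct stat fd_st;
	struct stat path_st;

	/* stat() through the /proc "magic link" resolves to the open file itself */
	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
	if (stat(proc_path, &fd_st) != 0)
		return false;

	/* stat() on the name resolves to whatever file currently has that name */
	if (stat(path, &path_st) != 0)
		return false;			/* the name no longer exists at all */

	/* a different inode/device means the fd refers to an unlinked predecessor */
	return fd_st.st_ino == path_st.st_ino && fd_st.st_dev == path_st.st_dev;
}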

PS. I didn't know pg_upgrade knew about Reply-To ;)