Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Sat, May 14, 2022 at 3:33 AM Robert Haas wrote: > This seems fine, but I think you should add a non-trivial comment about it. Thanks for looking. Done, and pushed. Let's see if 180s per query is enough...
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Thu, May 12, 2022 at 10:20 PM Thomas Munro wrote: > As for skink failing, the timeout was hard coded 300s for the whole > test, but apparently that wasn't enough under valgrind. Let's use the > standard PostgreSQL::Test::Utils::timeout_default (180s usually), but > reset it for each query we send. @@ -202,6 +198,9 @@ sub send_query_and_wait my ($psql, $query, $untl) = @_; my $ret; + $psql_timeout->reset(); + $psql_timeout->start(); + # send query $$psql{stdin} .= $query; $$psql{stdin} .= "\n"; This seems fine, but I think you should add a non-trivial comment about it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Thu, May 12, 2022 at 4:57 PM Thomas Munro wrote: > On Thu, May 12, 2022 at 3:13 PM Thomas Munro wrote: > > error running SQL: 'psql::1: ERROR: source database > > "conflict_db_template" is being accessed by other users > > DETAIL: There is 1 other session using the database.' > > Oh, for this one I think it may just be that the autovacuum worker > with PID 23757 took longer to exit than the 5 seconds > CountOtherDBBackends() is prepared to wait, after sending it SIGTERM. In this test, autovacuum_naptime is set to 1s (per Andres, AV was implicated when he first saw the problem with pg_upgrade, hence desire to crank it up). That's not necessary: commenting out the active line in ProcessBarrierSmgrRelease() shows that the tests reliably reproduce data corruption without it. Let's just take that out. As for skink failing, the timeout was hard coded 300s for the whole test, but apparently that wasn't enough under valgrind. Let's use the standard PostgreSQL::Test::Utils::timeout_default (180s usually), but reset it for each query we send. See attached. From ffcb61004fd06c9b2db56c7fe045c7c726d67a72 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 13 May 2022 13:40:03 +1200 Subject: [PATCH] Fix slow animal timeouts in 032_relfilenode_reuse.pl. Per BF animal chipmunk: CREATE DATABASE could apparently fail due to an AV process being in the template database and not quitting fast enough for the 5 second timeout in CountOtherDBBackends(). The test script had autovacuum_naptime=1s to encourage more activity that opens fds, but that wasn't strictly necessary for this test. Take it out. Per BF animal skink: the test had a global 300s timeout, but apparently that was not enough under valgrind. Use the standard timeout PostgreSQL::Test::Utils::timeout_default, but reset it for each query we run. 
Discussion:
---
 src/test/recovery/t/032_relfilenode_reuse.pl | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/src/test/recovery/t/032_relfilenode_reuse.pl b/src/test/recovery/t/032_relfilenode_reuse.pl
index ac9340b7dd..5a6a759aa5 100644
--- a/src/test/recovery/t/032_relfilenode_reuse.pl
+++ b/src/test/recovery/t/032_relfilenode_reuse.pl
@@ -14,7 +14,6 @@
 log_connections=on
 # to avoid "repairing" corruption
 full_page_writes=off
 log_min_messages=debug2
-autovacuum_naptime=1s
 shared_buffers=1MB
 ]);
 $node_primary->start;
@@ -28,11 +27,8 @@
 $node_standby->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
 $node_standby->start;
 
-# To avoid hanging while expecting some specific input from a psql
-# instance being driven by us, add a timeout high enough that it
-# should never trigger even on very slow machines, unless something
-# is really wrong.
-my $psql_timeout = IPC::Run::timer(300);
+# We'll reset this timeout for each individual query we run.
+my $psql_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
 
 my %psql_primary = (stdin => '', stdout => '', stderr => '');
 $psql_primary{run} = IPC::Run::start(
@@ -202,6 +198,9 @@ sub send_query_and_wait
 	my ($psql, $query, $untl) = @_;
 	my $ret;
 
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
 	# send query
 	$$psql{stdin} .= $query;
 	$$psql{stdin} .= "\n";
-- 
2.36.0
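The change above can be sketched generically: instead of one timer covering the whole test run, the deadline is reset and restarted before each query, so every query gets the full allowance. A minimal Python illustration (the class and function names here are hypothetical stand-ins, not IPC::Run's API):

```python
import time

class QueryTimeout:
    """Stand-in for IPC::Run's resettable timer object."""
    def __init__(self, seconds):
        self.seconds = seconds
        self.deadline = None
    def reset(self):
        self.deadline = None          # forget any previous deadline
    def start(self):
        self.deadline = time.monotonic() + self.seconds
    def expired(self):
        return self.deadline is not None and time.monotonic() > self.deadline

def send_query_and_wait(timeout, run_query, query):
    # Each query gets the full allowance, instead of sharing one
    # global budget for the whole test as before.
    timeout.reset()
    timeout.start()
    result = run_query(query)
    if timeout.expired():
        raise TimeoutError(f"query timed out: {query}")
    return result

t = QueryTimeout(180)                 # timeout_default is usually 180s
print(send_query_and_wait(t, lambda q: "ok", "SELECT 1"))
```

With the old scheme, a slow machine could exhaust the single 300s budget across many queries; with the per-query reset, only an individual query that stalls for the whole default timeout fails.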
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Thu, May 12, 2022 at 3:13 PM Thomas Munro wrote: > Chipmunk, another little early model Raspberry Pi: > > error running SQL: 'psql::1: ERROR: source database > "conflict_db_template" is being accessed by other users > DETAIL: There is 1 other session using the database.' Oh, for this one I think it may just be that the autovacuum worker with PID 23757 took longer to exit than the 5 seconds CountOtherDBBackends() is prepared to wait, after sending it SIGTERM.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Sat, May 7, 2022 at 9:37 PM Thomas Munro wrote: > So far "grison" failed. I think it's probably just that the test > forgot to wait for replay of CREATE EXTENSION before using pg_prewarm > on the standby, hence "ERROR: function pg_prewarm(oid) does not exist > at character 12". I'll wait for more animals to report before I try > to fix that tomorrow. That one was addressed by commit a22652e. Unfortunately two new kinds of failure showed up: Chipmunk, another little early model Raspberry Pi: error running SQL: 'psql::1: ERROR: source database "conflict_db_template" is being accessed by other users DETAIL: There is 1 other session using the database.' while running 'psql -XAtq -d port=57394 host=/tmp/luyJopPv9L dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;' at /home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 1836. I think that might imply that when you do two $node_primary->safe_psql() calls in a row, the backend running the second one might still see the ghost of the first backend in the database, even though the first psql process has exited. Hmmm. Skink, the valgrind animal, also failed. After the first 8 tests, it times out: [07:18:26.237](14.827s) ok 8 - standby: post move contents as expected [07:18:42.877](16.641s) Bail out! aborting wait: program timed out stream contents: >><< pattern searched for: (?^m:warmed_buffers) That'd be in the perl routine cause_eviction(), while waiting for a pg_prewarm query to return, but I'm not yet sure what's going on (I have to suspect it's in the perl scripting rather than anything in the server, given the location of the failure). Will try to repro locally.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Tue, May 10, 2022 at 1:07 AM Robert Haas wrote: > On Sun, May 8, 2022 at 7:30 PM Thomas Munro wrote: > > LOG: still waiting for pid 1651417 to accept ProcSignalBarrier > > STATEMENT: alter database mydb set tablespace ts1; > This is a very good idea. OK, I pushed this, after making the ereport call look a bit more like others that talk about backend PIDs. > > Another thought is that it might be nice to be able to test with a > > dummy PSB that doesn't actually do anything. You could use it to see > > how fast your system processes it, while doing various other things, > > and to find/debug problems in other code that fails to handle > > interrupts correctly. Here's an attempt at that. I guess it could go > > into a src/test/modules/something instead of core, but on the other > > hand the PSB itself has to be in core anyway, so maybe not. Thoughts? > > No documentation yet, just seeing if people think this is worth > > having... better names/ideas welcome. > > I did this at one point, but I wasn't convinced it was going to find > enough bugs to be worth committing. It's OK if you're convinced of > things that didn't convince me, though. I'll leave this here for now.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Sun, May 8, 2022 at 7:30 PM Thomas Munro wrote: > Simple idea: how about logging the PID of processes that block > progress for too long? In the attached, I arbitrarily picked 5 > seconds as the wait time between LOG messages. Also, DEBUG1 messages > let you see the processing speed on eg build farm animals. Thoughts? > > To test this, kill -STOP a random backend, and then try an ALTER > DATABASE SET TABLESPACE in another backend. Example output: > > DEBUG: waiting for all backends to process ProcSignalBarrier generation 1 > LOG: still waiting for pid 1651417 to accept ProcSignalBarrier > STATEMENT: alter database mydb set tablespace ts1; > LOG: still waiting for pid 1651417 to accept ProcSignalBarrier > STATEMENT: alter database mydb set tablespace ts1; > LOG: still waiting for pid 1651417 to accept ProcSignalBarrier > STATEMENT: alter database mydb set tablespace ts1; > LOG: still waiting for pid 1651417 to accept ProcSignalBarrier > STATEMENT: alter database mydb set tablespace ts1; This is a very good idea. > Another thought is that it might be nice to be able to test with a > dummy PSB that doesn't actually do anything. You could use it to see > how fast your system processes it, while doing various other things, > and to find/debug problems in other code that fails to handle > interrupts correctly. Here's an attempt at that. I guess it could go > into a src/test/modules/something instead of core, but on the other > hand the PSB itself has to be in core anyway, so maybe not. Thoughts? > No documentation yet, just seeing if people think this is worth > having... better names/ideas welcome. I did this at one point, but I wasn't convinced it was going to find enough bugs to be worth committing. It's OK if you're convinced of things that didn't convince me, though. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Sat, May 7, 2022 at 4:52 PM Thomas Munro wrote: > I think we'll probably also want to invent a way > to report which backend is holding up progress, since without that > it's practically impossible for an end user to understand why their > command is hanging. Simple idea: how about logging the PID of processes that block progress for too long? In the attached, I arbitrarily picked 5 seconds as the wait time between LOG messages. Also, DEBUG1 messages let you see the processing speed on eg build farm animals. Thoughts? To test this, kill -STOP a random backend, and then try an ALTER DATABASE SET TABLESPACE in another backend. Example output: DEBUG: waiting for all backends to process ProcSignalBarrier generation 1 LOG: still waiting for pid 1651417 to accept ProcSignalBarrier STATEMENT: alter database mydb set tablespace ts1; LOG: still waiting for pid 1651417 to accept ProcSignalBarrier STATEMENT: alter database mydb set tablespace ts1; LOG: still waiting for pid 1651417 to accept ProcSignalBarrier STATEMENT: alter database mydb set tablespace ts1; LOG: still waiting for pid 1651417 to accept ProcSignalBarrier STATEMENT: alter database mydb set tablespace ts1; ... then kill -CONT: DEBUG: finished waiting for all backends to process ProcSignalBarrier generation 1 Another thought is that it might be nice to be able to test with a dummy PSB that doesn't actually do anything. You could use it to see how fast your system processes it, while doing various other things, and to find/debug problems in other code that fails to handle interrupts correctly. Here's an attempt at that. I guess it could go into a src/test/modules/something instead of core, but on the other hand the PSB itself has to be in core anyway, so maybe not. Thoughts? No documentation yet, just seeing if people think this is worth having... better names/ideas welcome. To test this, just SELECT pg_test_procsignal_barrier(). 
From 9fe475b557143cc324ce14de4f29ad8bef206fc8 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 9 May 2022 10:24:01 +1200
Subject: [PATCH 1/2] Add logging for ProcSignalBarrier mechanism.

To help with diagnosis of systems that are not processing
ProcSignalBarrier requests promptly, add a LOG message every 5 seconds
if we seem to be wedged.  Although you could already see this state via
wait events in pg_stat_activity, the log message also shows the PID of
the process that is preventing progress.

Also add DEBUG1 logging around the whole loop.
---
 src/backend/storage/ipc/procsignal.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 00d66902d8..4fe6f087db 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -393,6 +393,11 @@ WaitForProcSignalBarrier(uint64 generation)
 {
 	Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration));
 
+	elog(DEBUG1,
+		 "waiting for all backends to process ProcSignalBarrier generation "
+		 UINT64_FORMAT,
+		 generation);
+
 	for (int i = NumProcSignalSlots - 1; i >= 0; i--)
 	{
 		ProcSignalSlot *slot = &ProcSignal->psh_slot[i];
@@ -407,13 +412,22 @@ WaitForProcSignalBarrier(uint64 generation)
 		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		while (oldval < generation)
 		{
-			ConditionVariableSleep(&slot->pss_barrierCV,
-								   WAIT_EVENT_PROC_SIGNAL_BARRIER);
+			if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
+											5000,
+											WAIT_EVENT_PROC_SIGNAL_BARRIER))
+				elog(LOG,
+					 "still waiting for pid %d to accept ProcSignalBarrier",
+					 slot->pss_pid);
 			oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
 		}
 		ConditionVariableCancelSleep();
 	}
 
+	elog(DEBUG1,
+		 "finished waiting for all backends to process ProcSignalBarrier generation "
+		 UINT64_FORMAT,
+		 generation);
+
 	/*
	 * The caller is probably calling this function because it wants to read
	 * the shared state or perform further writes to shared state once all
-- 
2.30.2

From
266528b553dcc128d2d38039ae821fe515ab1db1 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 9 May 2022 11:06:48 +1200
Subject: [PATCH 2/2] Add pg_test_procsignal_barrier().

A superuser-only test function to exercise the ProcSignalBarrier
mechanism.  It has no effect, other than to interrupt every process
connected to shared memory and wait for them all to acknowledge the
request.
---
 src/backend/storage/ipc/procsignal.c | 20 ++++++++++++++++++++
 src/include/catalog/pg_proc.dat      |  4 ++++
 src/include/storage/procsignal.h     |  1 +
 3 files changed, 25 insertions(+)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4fe6f087db..93c31bc6b9 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -541,6 +541,10 @@ ProcessProcSignalBarrier(void)
 	type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags);

 	switch (type)
 	{
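The waiting loop in the patch above follows a general pattern: sleep on a condition variable with a timeout, and each time the sleep times out rather than being woken, log which process is holding things up, then re-check the shared generation counter. A rough Python analogue of that pattern (illustrative names only, not PostgreSQL's API; `threading.Condition.wait` returns False on timeout, playing the role of `ConditionVariableTimedSleep`):

```python
import threading

def wait_for_generation(state, cond, target, log, poll_seconds=5):
    """Sleep on the CV; on each timeout, log the laggard's PID and re-check."""
    with cond:
        while state["generation"] < target:
            if not cond.wait(timeout=poll_seconds):   # False means timed out
                log(f"still waiting for pid {state['pid']} to accept ProcSignalBarrier")

state = {"generation": 0, "pid": 1234}
cond = threading.Condition()
messages = []

def absorb_barrier():
    # Plays the part of the backend absorbing the barrier and waking waiters.
    with cond:
        state["generation"] = 1
        cond.notify_all()

t = threading.Thread(target=absorb_barrier)
t.start()
wait_for_generation(state, cond, 1, messages.append, poll_seconds=0.01)
t.join()
print(state["generation"])
```

The key property, as in the patch, is that the timeout only adds logging: correctness still comes from re-reading the generation after every wakeup, so a spurious wakeup or a timeout never terminates the wait early.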
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Sat, May 7, 2022 at 4:52 PM Thomas Munro wrote: > Done. Time to watch the build farm. So far "grison" failed. I think it's probably just that the test forgot to wait for replay of CREATE EXTENSION before using pg_prewarm on the standby, hence "ERROR: function pg_prewarm(oid) does not exist at character 12". I'll wait for more animals to report before I try to fix that tomorrow.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Wed, May 4, 2022 at 2:23 PM Thomas Munro wrote: > Assuming no > objections or CI failures show up, I'll consider pushing the first two > patches tomorrow. Done. Time to watch the build farm. It's possible that these changes will produce some blowback, now that we're using PROCSIGNAL_BARRIER_SMGRRELEASE on common platforms. Obviously the earlier work started down this path already, but that was Windows-only, so it probably didn't get much attention from our mostly Unix crowd. For example, if there are bugs in the procsignal barrier system, or if there are places that don't process interrupts at all or promptly, we might hear complaints about that. That bug surface includes, for example, background workers created by extensions. An example on my mind is a place where we hold interrupts while cleaning up temporary files (= a loop of arbitrary size with filesystem ops that might hang on slow storage), so a concurrent DROP TABLESPACE will have to wait, which is kinda strange because it would in fact be perfectly safe to process this particular "interrupt". In that case we really just don't want the kinds of interrupts that perform non-local exits and prevent our cleanup work from running to completion, but we don't have a way to say that. I think we'll probably also want to invent a way to report which backend is holding up progress, since without that it's practically impossible for an end user to understand why their command is hanging.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Wed, May 4, 2022 at 8:53 AM Thomas Munro wrote: > Got some off-list clues: that's just distracting Perl cleanup noise > after something else went wrong (thanks Robert), and now I'm testing a > theory from Andres that we're missing a barrier on the redo side when > replaying XLOG_DBASE_CREATE_FILE_COPY. More soon. Yeah, looks like that was the explanation. Presumably in older releases, recovery can fail with EACCES here, and since commit e2f0f8ed we get ENOENT, because someone's got an unlinked file open, and ReadDir() can still see it. (I've wondered before if ReadDir() should also hide zombie Windows directory entries, but that's kinda independent and would only get us one step further, a later rmdir() would still fail.) Adding the barrier fixes the problem. Assuming no objections or CI failures show up, I'll consider pushing the first two patches tomorrow. From 1c5e04e7704ab9d5bd2559338ce1b6bd8a397e51 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 2 Apr 2022 17:05:39 +1300 Subject: [PATCH v4 1/3] Rethink PROCSIGNAL_BARRIER_SMGRRELEASE. With sufficiently bad luck, it was possible for IssuePendingWritebacks() to reopen files after we'd processed PROCSIGNAL_BARRIER_SMGRRELEASE and before the file was unlinked by some other backend, leading to potential errors on Windows. Defend against that by closing md.c's segments, instead of just closing fd.c's descriptors, and then teaching smgrwriteback() not to open files that aren't already open. This fixes an overlooked edge case for commit 4eb21763. 
Reported-by: Andres Freund Reviewed-by: Robert Haas Discussion: https://postgr.es/m/20220209220004.kb3dgtn2x2k2gtdm%40alap3.anarazel.de --- src/backend/storage/smgr/md.c | 22 +--- src/backend/storage/smgr/smgr.c | 45 +++-- src/include/storage/md.h| 1 - src/include/storage/smgr.h | 2 ++ 4 files changed, 52 insertions(+), 18 deletions(-) diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 286dd3f755..41fa73087b 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -116,6 +116,8 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */ * mdnblocks(). */ #define EXTENSION_DONT_CHECK_SIZE (1 << 4) +/* don't try to open a segment, if not already open */ +#define EXTENSION_DONT_OPEN (1 << 5) /* local routines */ @@ -551,12 +553,6 @@ mdclose(SMgrRelation reln, ForkNumber forknum) } } -void -mdrelease(void) -{ - closeAllVfds(); -} - /* * mdprefetch() -- Initiate asynchronous read of the specified block of a relation */ @@ -605,11 +601,14 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum, segnum_end; v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ , - EXTENSION_RETURN_NULL); + EXTENSION_DONT_OPEN); /* * We might be flushing buffers of already removed relations, that's - * ok, just ignore that case. + * ok, just ignore that case. If the segment file wasn't open already + * (ie from a recent mdwrite()), then we don't want to re-open it, to + * avoid a race with PROCSIGNAL_BARRIER_SMGRRELEASE that might leave us + * with a descriptor to a file that is about to be unlinked. 
*/ if (!v) return; @@ -1202,7 +1201,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, /* some way to handle non-existent segments needs to be specified */ Assert(behavior & - (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL)); + (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL | + EXTENSION_DONT_OPEN)); targetseg = blkno / ((BlockNumber) RELSEG_SIZE); @@ -1213,6 +1213,10 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, return v; } + /* The caller only wants the segment if we already had it open. */ + if (behavior & EXTENSION_DONT_OPEN) + return NULL; + /* * The target segment is not yet open. Iterate over all the segments * between the last opened and the target segment. This way missing diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 2c7a2b2857..9cbfaa86cb 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -41,7 +41,6 @@ typedef struct f_smgr { void (*smgr_init) (void); /* may be NULL */ void (*smgr_shutdown) (void); /* may be NULL */ - void (*smgr_release) (void); /* may be NULL */ void (*smgr_open) (SMgrRelation reln); void (*smgr_close) (SMgrRelation reln, ForkNumber forknum); void (*smgr_create) (SMgrRelation reln, ForkNumber forknum, @@ -70,7 +69,6 @@ static const f_smgr smgrsw[] = { { .smgr_init = mdinit, .smgr_shutdown = NULL, - .smgr_release = mdrelease, .smgr_open = mdopen, .smgr_close = mdclose, .smgr_create = mdcreate, @@ -281,6 +279,42 @@ smgrclose(SMgrRelation reln) *owner = NULL; } +/* + * smgrrelease() -- Release all resources used by this object. + * + * The object remains valid. +
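The EXTENSION_DONT_OPEN behavior in the patch above boils down to: the writeback path may only use a segment that some earlier write already opened, and must never open one itself, since the file may be mid-unlink. A tiny Python sketch of that rule (hypothetical names, with a string standing in for a file descriptor):

```python
class SegmentCache:
    """Toy model of md.c's per-relation open-segment cache."""
    def __init__(self):
        self.open_segments = {}              # segno -> open handle
    def get_seg(self, segno, dont_open=False):
        if segno in self.open_segments:
            return self.open_segments[segno]
        if dont_open:
            return None                      # writeback hint: just give up
        handle = f"fd-for-segment-{segno}"   # stand-in for an open() call
        self.open_segments[segno] = handle
        return handle

cache = SegmentCache()
assert cache.get_seg(0, dont_open=True) is None   # nothing open: skip the hint
cache.get_seg(0)                                  # a recent write opened it
assert cache.get_seg(0, dont_open=True) is not None
```

Skipping is safe here precisely because sync_file_range()-style writeback is only a hint; failing to issue it costs a little performance, while re-opening could resurrect a descriptor to an unlinked file.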
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Wed, May 4, 2022 at 7:44 AM Thomas Munro wrote: > It passes sometimes and fails sometimes. Here's the weird failure I > need to debug: > > https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse > > Right at the end, it says: > > Warning: unable to close filehandle GEN26 properly: Bad file > descriptor during global destruction. > Warning: unable to close filehandle GEN21 properly: Bad file > descriptor during global destruction. > Warning: unable to close filehandle GEN6 properly: Bad file descriptor > during global destruction. > > I don't know what it means (Windows' fd->handle mapping table got > corrupted?) or even which program is printing it (you'd think maybe > perl? but how could that be affected by anything I did in > postgres.exe, but if it's not perl why is it always at the end like > that?). Hrmph. Got some off-list clues: that's just distracting Perl cleanup noise after something else went wrong (thanks Robert), and now I'm testing a theory from Andres that we're missing a barrier on the redo side when replaying XLOG_DBASE_CREATE_FILE_COPY. More soon.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Wed, May 4, 2022 at 6:36 AM Robert Haas wrote: > On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro wrote: > > So, to summarise the new patch that I'm attaching to this email as 0001: > > This all makes sense to me, and I didn't see anything obviously wrong > looking through the patch, either. Thanks. > > However it seems that I have something wrong, because CI is failing on > > Windows; I ran out of time for looking into that today, but wanted to > > post what I have so far since I know we have an open item or two to > > close here ASAP... > > :-( It passes sometimes and fails sometimes. Here's the weird failure I need to debug: https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse Right at the end, it says: Warning: unable to close filehandle GEN26 properly: Bad file descriptor during global destruction. Warning: unable to close filehandle GEN21 properly: Bad file descriptor during global destruction. Warning: unable to close filehandle GEN6 properly: Bad file descriptor during global destruction. I don't know what it means (Windows' fd->handle mapping table got corrupted?) or even which program is printing it (you'd think maybe perl? but how could that be affected by anything I did in postgres.exe, but if it's not perl why is it always at the end like that?). Hrmph.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro wrote: > So, to summarise the new patch that I'm attaching to this email as 0001: This all makes sense to me, and I didn't see anything obviously wrong looking through the patch, either. > However it seems that I have something wrong, because CI is failing on > Windows; I ran out of time for looking into that today, but wanted to > post what I have so far since I know we have an open item or two to > close here ASAP... :-( > Patches 0002-0004 are Andres's, with minor tweaks: > > v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch: > > I'm still slightly confused about whether we need an *extra* global > barrier in DROP TABLESPACE, not just if destroy_tablespace_directory() > failed. Andres wrote this code, but it looks correct to me. Currently, the reason why we use USE_BARRIER_SMGRRELEASE is that we want to make sure that we don't fail to remove a tablespace because some other backend might have files open. However, it might also be that no other backend has files open, and in that case we don't need to do anything, so the current placement of the call is correct. With this patch, though, we want to make sure that no FD that is open before we start dropping files remains open after we drop files - and that means we need to force files to be closed before we even try to remove files the first time. It seems to me that Andres's patch (your 0002) doesn't add a second call - it moves the existing one earlier. And that seems right: no new files should be getting opened once we force them closed the first time, I hope anyway. -- Robert Haas EDB: http://www.enterprisedb.com
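The ordering point Robert makes can be modeled in a few lines: if the force-close barrier runs only after a failed removal, a backend's stale descriptor makes the first attempt fail; issuing it before the first attempt means no fd opened earlier can survive into the removal. A toy Python model (all names invented for illustration; sets of paths stand in for open file descriptors):

```python
class Backend:
    """Toy backend holding open files into a tablespace."""
    def __init__(self):
        self.open_files = {"ts1/16384/50001"}    # stale fd into the tablespace
    def close_all_files(self):
        self.open_files.clear()                  # barrier handler closes everything

def try_remove(backends, path):
    # Removal only succeeds once no backend still holds the file open.
    return all(path not in b.open_files for b in backends)

backends = [Backend()]
# Old placement: the first removal attempt can fail on a stale fd.
assert not try_remove(backends, "ts1/16384/50001")
# Barrier moved before the first attempt: every backend force-closes first.
for b in backends:
    b.close_all_files()
assert try_remove(backends, "ts1/16384/50001")
```

This also shows why moving the existing barrier earlier suffices, rather than adding a second one: once all backends have closed their files, no new opens of those files should occur before the removal completes.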
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Wed, Apr 6, 2022 at 5:07 AM Robert Haas wrote: > On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro wrote: > > > The checkpointer never takes heavyweight locks, so the opportunity > > > you're describing can't arise. > > > > Hmm, oh, you probably meant the buffer interlocking > > in SyncOneBuffer(). It's true that my most recent patch throws away > > more requests than it could, by doing the level check at the end of > > the loop over all buffers instead of adding some kind of > > DropPendingWritebacks() in the barrier handler. I guess I could find > > a way to improve that, basically checking the level more often instead > > of at the end, but I don't know if it's worth it; we're still throwing > > out an arbitrary percentage of writeback requests. > > Doesn't every backend have its own set of pending writebacks? > BufferAlloc() calls > ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)? Yeah. Alright, for my exaggeration above, s/can't arise/probably won't often arise/, since when regular backends do this it's for random dirty buffers, not necessarily stuff they hold a relation lock on. I think you are right that if we find ourselves calling smgrwrite() on another buffer for that relfilenode a bit later, then it's safe to "unzonk" the relation. In the worst case, IssuePendingWritebacks() might finish up calling sync_file_range() on a file that we originally wrote to for generation 1 of that relfilenode, even though we now have a file descriptor for generation 2 of that relfilenode that was reopened by a later smgrwrite(). That would be acceptable, because a bogus sync_file_range() call would be harmless. The key point here is that we absolutely must avoid re-opening files without careful interlocking, because that could lead to later *writes* for generation 2 going to the defunct generation 1 file that we opened just as it was being unlinked.
(For completeness: we know of another way for smgrwrite() to write to the wrong generation of a recycled relfilenode, but that's hopefully very unlikely and will hopefully be addressed by adding more bits and killing off OID-wraparound[1]. In this thread we're concerned only with these weird "explicit OID reycling" cases we're trying to fix with PROCSIGNAL_BARRIER_SMGRRELEASE sledgehammer-based cache invalidation.) My new attempt, attached, is closer to what Andres proposed, except at the level of md.c segment objects instead of level of SMgrRelation objects. This avoids the dangling SMgrRelation pointer problem discussed earlier, and is also a little like your "zonk" idea, except instead of a zonk flag, the SMgrRelation object is reset to a state where it knows nothing at all except its relfilenode. Any access through smgrXXX() functions *except* smgrwriteback() will rebuild the state, just as you would clear the hypothetical zonk flag. So, to summarise the new patch that I'm attaching to this email as 0001: 1. When PROCSIGNAL_BARRIER_SMGRRELEASE is handled, we call smgrreleaseall(), which calls smgrrelease() on all SMgrRelation objects, telling them to close all their files. 2. All SMgrRelation objects are still valid, and any pointers to them that were acquired before a CFI() and then used in an smgrXXX() call afterwards will still work (the example I know of being RelationCopyStorage()[2]). 3. mdwriteback() internally uses a new "behavior" flag EXTENSION_DONT_OPEN when getting its hands on the internal segment object, which says that we should just give up immediately if we don't already have the file open (concretely: if PROCSIGNAL_BARRIER_SMGRRELEASE came along between our recent smgrwrite() call and the writeback machinery's smgrwriteback() call, it'll do nothing at all). 
Someone might say that it's weird that smgrwriteback() has that behaviour internally, and we might eventually want to make it more explicit by adding a "behavior" argument to the function itself, so that it's the caller that controls it. It didn't seem worth it for now though; the whole thing is a hint to the operating system anyway. However it seems that I have something wrong, because CI is failing on Windows; I ran out of time for looking into that today, but wanted to post what I have so far since I know we have an open item or two to close here ASAP... Patches 0002-0004 are Andres's, with minor tweaks: v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch: It seems useless to even keep the macro USE_BARRIER_SMGRRELEASE if we're going to define it always, so I removed it. I guess it's useful to be able to disable that logic easily to see that the assertion in the other patch fails, but you can do that by commenting out a line in ProcessBarrierSmgrRelease(). I'm still slightly confused about whether we need an *extra* global barrier in DROP TABLESPACE, not just if destroy_tablespace_directory() failed. v3-0003-WIP-test-for-file-reuse-dangers-around-database-a.patch Was missing: -EXTRA_INSTALL=contrib/test_decoding +EXTRA_INSTALL=contrib/test_decoding
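The smgrrelease() design summarized in point 2 above — the relation object stays valid, only its cached file state is dropped, and any later access rebuilds it lazily — can be sketched in a few lines of Python (hypothetical names; a string stands in for a file descriptor):

```python
class SMgrRelation:
    """Toy model: the object survives release(); fds are rebuilt lazily."""
    def __init__(self, relfilenode):
        self.relfilenode = relfilenode
        self._fd = None
    def _ensure_open(self):
        if self._fd is None:                         # rebuild state on demand
            self._fd = f"fd-for-{self.relfilenode}"  # stand-in for open()
        return self._fd
    def write(self, blockno, data):
        return self._ensure_open()                   # any access reopens
    def release(self):
        self._fd = None                              # close fds, keep the object

rel = SMgrRelation(50001)
rel.write(0, b"x")
rel.release()              # what smgrreleaseall() does to every relation
assert rel._fd is None
rel.write(1, b"y")         # an old pointer still works; a fresh fd is opened
assert rel._fd is not None
```

This is why pointers acquired before a CFI() remain usable afterwards: release() never frees the object, so the only operation that must tolerate a closed state without reopening is the writeback hint, handled separately via EXTENSION_DONT_OPEN.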
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro wrote: > > The checkpointer never takes heavyweight locks, so the opportunity > > you're describing can't arise. > > Hmm, oh, you probably meant the buffer interlocking > in SyncOneBuffer(). It's true that my most recent patch throws away > more requests than it could, by doing the level check at the end of > the loop over all buffers instead of adding some kind of > DropPendingWritebacks() in the barrier handler. I guess I could find > a way to improve that, basically checking the level more often instead > of at the end, but I don't know if it's worth it; we're still throwing > out an arbitrary percentage of writeback requests. Doesn't every backend have its own set of pending writebacks? BufferAlloc() calls ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)? -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Tue, Apr 5, 2022 at 10:24 AM Thomas Munro wrote: > On Tue, Apr 5, 2022 at 2:18 AM Robert Haas wrote: > > I'm not sure that it really matters, but with the idea that I proposed > > it's possible to "save" a pending writeback, if we notice that we're > > accessing the relation with a proper lock after the barrier arrives > > and before we actually try to do the writeback. With this approach we > > throw them out immediately, so they're just gone. I don't mind that > > because I don't think this will happen often enough to matter, and I > > don't think it will be very expensive when it does, but it's something > > to think about. > > The checkpointer never takes heavyweight locks, so the opportunity > you're describing can't arise. Hmm, oh, you probably meant the buffer interlocking in SyncOneBuffer(). It's true that my most recent patch throws away more requests than it could, by doing the level check at the end of the loop over all buffers instead of adding some kind of DropPendingWritebacks() in the barrier handler. I guess I could find a way to improve that, basically checking the level more often instead of at the end, but I don't know if it's worth it; we're still throwing out an arbitrary percentage of writeback requests.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
On Tue, Apr 5, 2022 at 2:18 AM Robert Haas wrote:
> On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro wrote:
> > Another idea would be to call a new function DropPendingWritebacks(),
> > and also tell all the SMgrRelation objects to close all their internal
> > state (ie the fds + per-segment objects) but not free the main
> > SMgrRelationData object, and for good measure also invalidate the
> > small amount of cached state (which hasn't been mentioned in this
> > thread, but it kinda bothers me that that state currently survives, so
> > it was one unspoken reason I liked the smgrcloseall() idea). Since
> > DropPendingWritebacks() is in a rather different module, perhaps if we
> > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> > something else, because it's not really a SMGR-only thing anymore.
>
> I'm not sure that it really matters, but with the idea that I proposed
> it's possible to "save" a pending writeback, if we notice that we're
> accessing the relation with a proper lock after the barrier arrives
> and before we actually try to do the writeback. With this approach we
> throw them out immediately, so they're just gone. I don't mind that
> because I don't think this will happen often enough to matter, and I
> don't think it will be very expensive when it does, but it's something
> to think about.

The checkpointer never takes heavyweight locks, so the opportunity
you're describing can't arise. This stuff happens entirely within the
scope of a call to BufferSync(), which is called by CheckPointBuffers().
BufferSync() does WritebackContextInit() to set up the object that
accumulates the writeback requests, then runs for a while performing a
checkpoint (usually spread out over time), and at the end does
IssuePendingWritebacks(). A CFI() can be processed any time up until
IssuePendingWritebacks(), but not during it. That's the size of the gap
we need to cover.

I still like the patch I posted most recently. Note that it's not quite
as I described above: there is no DropPendingWritebacks(), because that
turned out to be impossible to implement in a form the barrier handler
could call -- it needs access to the writeback context. Instead I had to
expose a counter so that IssuePendingWritebacks() can detect whether any
smgrreleaseall() calls happened while the writeback context was being
filled up with writeback requests, by comparing with the level recorded
therein.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro wrote: > Another idea would be to call a new function DropPendingWritebacks(), > and also tell all the SMgrRelation objects to close all their internal > state (ie the fds + per-segment objects) but not free the main > SMgrRelationData object, and for good measure also invalidate the > small amount of cached state (which hasn't been mentioned in this > thread, but it kinda bothers me that that state currently survives, so > it was one unspoken reason I liked the smgrcloseall() idea). Since > DropPendingWritebacks() is in a rather different module, perhaps if we > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to > something else, because it's not really a SMGR-only thing anymore. I'm not sure that it really matters, but with the idea that I proposed it's possible to "save" a pending writeback, if we notice that we're accessing the relation with a proper lock after the barrier arrives and before we actually try to do the writeback. With this approach we throw them out immediately, so they're just gone. I don't mind that because I don't think this will happen often enough to matter, and I don't think it will be very expensive when it does, but it's something to think about. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
On Sat, Apr 2, 2022 at 10:03 AM Thomas Munro wrote:
> Another idea would be to call a new function DropPendingWritebacks(),
> and also tell all the SMgrRelation objects to close all their internal
> state (ie the fds + per-segment objects) but not free the main
> SMgrRelationData object, and for good measure also invalidate the
> small amount of cached state (which hasn't been mentioned in this
> thread, but it kinda bothers me that that state currently survives, so
> it was one unspoken reason I liked the smgrcloseall() idea). Since
> DropPendingWritebacks() is in a rather different module, perhaps if we
> were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> something else, because it's not really a SMGR-only thing anymore.

Here's a sketch of that idea.

From 58262253e167e25d4bbb6777574a98fb12850c29 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 2 Apr 2022 17:05:39 +1300
Subject: [PATCH 1/2] WIP: Rethink PROCSIGNAL_BARRIER_SMGRRELEASE.

With sufficiently bad luck, it was possible for IssuePendingWritebacks()
to reopen relation files just after we'd processed
PROCSIGNAL_BARRIER_SMGRRELEASE but before the file was unlinked by some
other backend. While commit 4eb21763 initially appeared to have fixed
the spate of test failures we'd seen on Windows, testing with extra
instrumentation to look out for writes to unlinked files revealed the
dangerous nature of these deferred file references.

This second attempt introduces smgrrelease(reln), smgrreleaseall(), and
smgrreleaseallcount(). That last is used by IssuePendingWritebacks() to
notice when pending writebacks have to be considered potentially
invalid, by watching out for level changes during the time writeback
instructions were accumulated.
XXX Experimental Reported-by: Andres Freund Discussion: https://postgr.es/m/20220209220004.kb3dgtn2x2k2gtdm%40alap3.anarazel.de --- src/backend/storage/buffer/bufmgr.c | 14 +++ src/backend/storage/smgr/md.c | 6 --- src/backend/storage/smgr/smgr.c | 58 + src/include/storage/buf_internals.h | 3 ++ src/include/storage/md.h| 1 - src/include/storage/smgr.h | 3 ++ 6 files changed, 70 insertions(+), 15 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index d73a40c1bc..43652b757a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4873,6 +4873,7 @@ WritebackContextInit(WritebackContext *context, int *max_pending) context->max_pending = max_pending; context->nr_pending = 0; + context->smgr_release_count = smgrreleaseallcount(); } /* @@ -4924,6 +4925,18 @@ IssuePendingWritebacks(WritebackContext *context) { int i; + /* + * Throw away the contents if smgrreleaseall() has been invoked since + * "context" was initialized or cleared. Since the main loop below doesn't + * check for interrupts or otherwise handle barrier requests, we don't have + * to worry about invalidation if we get past this check. 
+ */ + if (context->smgr_release_count != smgrreleaseallcount()) + { + context->nr_pending = 0; + context->smgr_release_count = smgrreleaseallcount(); + } + if (context->nr_pending == 0) return; @@ -4983,6 +4996,7 @@ IssuePendingWritebacks(WritebackContext *context) } context->nr_pending = 0; + context->smgr_release_count = smgrreleaseallcount(); } diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 879f647dbc..d26c915f90 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -549,12 +549,6 @@ mdclose(SMgrRelation reln, ForkNumber forknum) } } -void -mdrelease(void) -{ - closeAllVfds(); -} - /* * mdprefetch() -- Initiate asynchronous read of the specified block of a relation */ diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index d71a557a35..087c8e76d1 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -41,7 +41,6 @@ typedef struct f_smgr { void (*smgr_init) (void); /* may be NULL */ void (*smgr_shutdown) (void); /* may be NULL */ - void (*smgr_release) (void); /* may be NULL */ void (*smgr_open) (SMgrRelation reln); void (*smgr_close) (SMgrRelation reln, ForkNumber forknum); void (*smgr_create) (SMgrRelation reln, ForkNumber forknum, @@ -70,7 +69,6 @@ static const f_smgr smgrsw[] = { { .smgr_init = mdinit, .smgr_shutdown = NULL, - .smgr_release = mdrelease, .smgr_open = mdopen, .smgr_close = mdclose, .smgr_create = mdcreate, @@ -100,6 +98,8 @@ static dlist_head unowned_relns; /* local function prototypes */ static void smgrshutdown(int code, Datum arg); +static uint64 releaseall_count = 0; + /* * smgrinit(), smgrshutdown() -- Initialize or shut down storage @@ -281,6 +281,44 @@ smgrclose(SMgrRelation reln) *owner = NULL; } +/* + * smgrrelease() -- Release all resources used by this object. + * + * The object remains
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
On Sat, Apr 2, 2022 at 2:52 AM Robert Haas wrote: > On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro wrote: > > The v1-0003 patch introduced smgropen_cond() to avoid the problem of > > IssuePendingWritebacks(), which does desynchronised smgropen() calls > > and could open files after the barrier but just before they are > > unlinked. Makes sense, but... > > > > 1. For that to actually work, we'd better call smgrcloseall() when > > PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls > > closeAllVds(). Here's a patch for that. But... > > What if we instead changed our approach to fixing > IssuePendingWritebacks()? Design sketch: > > 1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's > false. > 2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true. > 3. If you want, you can set it back to false any time you access the > smgr while holding a relation lock. > 4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that > function now refuses either to create a new smgr or to return one > where smgr_zonk is true. > > I think this would be sufficient to guarantee that we never try to > write back to a relfilenode if we haven't had a lock on it since the > last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we > need here? > > My thinking here is that it's a lot easier to get away with marking > the content of a data structure invalid than it is to get away with > invalidating pointers to that data structure. If we can tinker with > the design so that the SMgrRelationData doesn't actually go away but > just gets a little bit more thoroughly invalidated, we could avoid > having to audit the entire code base for code that keeps smgr handles > around longer than would be possible with the design you demonstrate > here. 
Another idea would be to call a new function DropPendingWritebacks(), and also tell all the SMgrRelation objects to close all their internal state (ie the fds + per-segment objects) but not free the main SMgrRelationData object, and for good measure also invalidate the small amount of cached state (which hasn't been mentioned in this thread, but it kinda bothers me that that state currently survives, so it was one unspoken reason I liked the smgrcloseall() idea). Since DropPendingWritebacks() is in a rather different module, perhaps if we were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to something else, because it's not really a SMGR-only thing anymore.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro wrote: > The v1-0003 patch introduced smgropen_cond() to avoid the problem of > IssuePendingWritebacks(), which does desynchronised smgropen() calls > and could open files after the barrier but just before they are > unlinked. Makes sense, but... > > 1. For that to actually work, we'd better call smgrcloseall() when > PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls > closeAllVds(). Here's a patch for that. But... What if we instead changed our approach to fixing IssuePendingWritebacks()? Design sketch: 1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false. 2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true. 3. If you want, you can set it back to false any time you access the smgr while holding a relation lock. 4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that function now refuses either to create a new smgr or to return one where smgr_zonk is true. I think this would be sufficient to guarantee that we never try to write back to a relfilenode if we haven't had a lock on it since the last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we need here? My thinking here is that it's a lot easier to get away with marking the content of a data structure invalid than it is to get away with invalidating pointers to that data structure. If we can tinker with the design so that the SMgrRelationData doesn't actually go away but just gets a little bit more thoroughly invalidated, we could avoid having to audit the entire code base for code that keeps smgr handles around longer than would be possible with the design you demonstrate here. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Some thoughts:

The v1-0003 patch introduced smgropen_cond() to avoid the problem of
IssuePendingWritebacks(), which does desynchronised smgropen() calls and
could open files after the barrier but just before they are unlinked.
Makes sense, but...

1. For that to actually work, we'd better call smgrcloseall() when
PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
closeAllVfds(). Here's a patch for that. But...

2. The reason I used closeAllVfds() in 4eb21763 is because I didn't want
random CHECK_FOR_INTERRUPTS() calls to break code like this, from
RelationCopyStorage():

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* If we got a cancel signal during the copy of the data, quit */
        CHECK_FOR_INTERRUPTS();
        smgrread(src, forkNum, blkno, buf.data);

Perhaps we could change RelationCopyStorage() to take a Relation, and
use CreateFakeRelcacheEntry() in various places to satisfy that, and
also extend that mechanism to work with temporary tables. Here's an
initial sketch of that idea, to see what problems it crashes into...
Fake Relations are rather unpleasant though; I wonder if there is an
SMgrRelationHandle concept that could make this all nicer. There is
possibly also some cleanup missing to avoid an SMgrRelation that points
to a defunct fake Relation on non-local exit (?).

From 943228b5d4e0dc42ad219c3fe97adb59e7262593 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 1 Apr 2022 17:02:51 +1300
Subject: [PATCH 1/2] WIP: Rethink PROCSIGNAL_BARRIER_SMGRRELEASE.

Commit 4eb21763 closed kernel file descriptors, but left SMgrRelations
in existence. Unfortunately, IssuePendingWritebacks() can reopen them at
any time, including right before a file is unlinked. In order for
IssuePendingWritebacks() to be able to avoid this hazard in a later
commit, go further and close the SMgrRelations.
XXX We had better make sure that no code does: reln = smgropen(...); CHECK_FOR_INTERRUPTS(); smgrXXX(reln, ...); --- src/backend/storage/smgr/md.c | 6 -- src/backend/storage/smgr/smgr.c | 10 ++ src/include/storage/md.h| 1 - 3 files changed, 2 insertions(+), 15 deletions(-) diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 879f647dbc..d26c915f90 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -549,12 +549,6 @@ mdclose(SMgrRelation reln, ForkNumber forknum) } } -void -mdrelease(void) -{ - closeAllVfds(); -} - /* * mdprefetch() -- Initiate asynchronous read of the specified block of a relation */ diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index efa83ae394..994728db04 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -41,7 +41,6 @@ typedef struct f_smgr { void (*smgr_init) (void); /* may be NULL */ void (*smgr_shutdown) (void); /* may be NULL */ - void (*smgr_release) (void); /* may be NULL */ void (*smgr_open) (SMgrRelation reln); void (*smgr_close) (SMgrRelation reln, ForkNumber forknum); void (*smgr_create) (SMgrRelation reln, ForkNumber forknum, @@ -70,7 +69,6 @@ static const f_smgr smgrsw[] = { { .smgr_init = mdinit, .smgr_shutdown = NULL, - .smgr_release = mdrelease, .smgr_open = mdopen, .smgr_close = mdclose, .smgr_create = mdcreate, @@ -717,17 +715,13 @@ AtEOXact_SMgr(void) } /* - * This routine is called when we are ordered to release all open files by a + * This routine is called when we are ordered to close all SMgrRelations by a * ProcSignalBarrier. 
*/ bool ProcessBarrierSmgrRelease(void) { - for (int i = 0; i < NSmgr; i++) - { - if (smgrsw[i].smgr_release) - smgrsw[i].smgr_release(); - } + smgrcloseall(); return true; } diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 6e46d8d96a..a40db7 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -23,7 +23,6 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); -extern void mdrelease(void); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); -- 2.35.1 From 7070d98f0287033119f7217cdcfe88d3a2eef937 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 1 Apr 2022 18:35:26 +1300 Subject: [PATCH 2/2] WIP: Handle invalidation in RelationCopyStorage(). Since the loop in RelationCopyStorage() processes interrupts, and since the proposal is that interrupts might invalidate SMgrRelation objects, this function had better deal in Relation objects instead. Various codepaths now need to use CreateFakeRelcacheEntry() to obtain one. That function therefore now needs to support temporary relations too, hence addition of BackendId parameter. XXX it's now extra silly to have that stuff in xlogutils.c XXX maybe what
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
On Thu, Mar 3, 2022 at 1:28 PM Andres Freund wrote: > > I can't remember that verify() is the one that accesses conflict.db large > > while cause_eviction() is the one that accesses postgres.replace_sb for more > > than like 15 seconds. > > For more than 15seconds? The whole test runs in a few seconds for me... I'm not talking about running the test. I'm talking about reading it and trying to keep straight what is happening. The details of which function is accessing which database keep going out of my head. Quickly. Maybe because I'm dumb, but I think some better naming could help, just in case other people are dumb, too. > > As to (5), the identifiers are just primary and standby, without > > mentioning the database name, which adds to the confusion, and there > > are no comments explaining why we have two nodes. > > I don't follow this one - physical rep doesn't do anything below the database > level? Right but ... those handles are connected to a particular DB on that node, not just the node in general. > > I think it would be helpful to find a way to divide the test case up > > into sections that are separated from each other visually, and explain > > the purpose of each test a little more in a comment. For example I'm > > struggling to understand the exact purpose of this bit: > > > > +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;"); > > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;"); > > + > > +verify($node_primary, $node_standby, 3, > > + "restored contents as expected"); > > > > I'm all for having test coverage of VACUUM FULL, but it isn't clear to > > me what this does that is related to FD reuse. > > It's just trying to clean up prior damage, so the test continues to later > ones. Definitely not needed once there's a fix. But it's useful while trying > to make the test actually detect various corruptions. Ah, OK, I definitely did not understand that before. > > Similarly for the later ones. 
I generally grasp that you are trying to > > make sure that things are OK with respect to FD reuse in various > > scenarios, but I couldn't explain to you what we'd be losing in terms > > of test coverage if we removed this line: > > > > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;"); > > > > I am not sure how much all of this potential cleanup matters because > > I'm pretty skeptical that this would be stable in the buildfarm. > > Which "potential cleanup" are you referring to? I mean, whatever improvements you might consider making based on my comments. > > It relies on a lot of assumptions about timing and row sizes and details of > > when invalidations are sent that feel like they might not be entirely > > predictable at the level you would need to have this consistently pass > > everywhere. Perhaps I am too pessimistic? > > I don't think the test passing should be dependent on row size, invalidations > etc to a significant degree (unless the tables are too small to reach s_b > etc)? The exact symptoms of failures are highly unstable, but obviously we'd > fix them in-tree before committing a test. But maybe I'm missing a dependency? > > FWIW, the test did pass on freebsd, linux, macos and windows with the > fix. After a few iterations of improving the fix ;) Hmm. Well, I admit that's already more than I would have expected -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Hi,

On 2022-03-03 13:11:17 -0500, Robert Haas wrote:
> On Wed, Mar 2, 2022 at 3:00 PM Andres Freund wrote:
> > On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> > > - I am having some trouble understanding clearly what 0001 is doing.
> > > I'll try to study it further.
> >
> > It tests for the various scenarios I could think of that could lead to FD
> > reuse, to state the obvious ;). Anything particularly unclear?
>
> As I understand it, the basic idea of this test is:
>
> 1. create a database called 'conflict_db' containing a table called 'large'
> 2. also create a table called 'replace_sb' in the postgres db.
> 3. have a long running transaction open in 'postgres' database that
> repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
> table, sometimes causing write-outs
> 4. try to get it to write out the data to a stale file descriptor by
> fiddling with various things, and then detect the corruption that
> results
> 5. use both a primary and a standby not because they are intrinsically
> necessary but because you want to test that both cases work

Right. I wasn't proposing to commit the test as-is, to be clear. It was
meant as a demonstration of the types of problems I can see, and what's
needed to fix them...

> I can't remember that verify() is the one that accesses conflict_db.large
> while cause_eviction() is the one that accesses postgres.replace_sb for more
> than like 15 seconds.

For more than 15 seconds? The whole test runs in a few seconds for me...

> As to (5), the identifiers are just primary and standby, without
> mentioning the database name, which adds to the confusion, and there
> are no comments explaining why we have two nodes.

I don't follow this one - physical rep doesn't do anything below the
database level?

> I think it would be helpful to find a way to divide the test case up
> into sections that are separated from each other visually, and explain
> the purpose of each test a little more in a comment. For example I'm
> struggling to understand the exact purpose of this bit:
>
> +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
> +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
> +
> +verify($node_primary, $node_standby, 3,
> +   "restored contents as expected");
>
> I'm all for having test coverage of VACUUM FULL, but it isn't clear to
> me what this does that is related to FD reuse.

It's just trying to clean up prior damage, so the test continues to
later ones. Definitely not needed once there's a fix. But it's useful
while trying to make the test actually detect various corruptions.

> Similarly for the later ones. I generally grasp that you are trying to
> make sure that things are OK with respect to FD reuse in various
> scenarios, but I couldn't explain to you what we'd be losing in terms
> of test coverage if we removed this line:
>
> +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");
>
> I am not sure how much all of this potential cleanup matters because
> I'm pretty skeptical that this would be stable in the buildfarm.

Which "potential cleanup" are you referring to?

> It relies on a lot of assumptions about timing and row sizes and details of
> when invalidations are sent that feel like they might not be entirely
> predictable at the level you would need to have this consistently pass
> everywhere. Perhaps I am too pessimistic?

I don't think the test passing should be dependent on row size,
invalidations etc to a significant degree (unless the tables are too
small to reach s_b etc)? The exact symptoms of failures are highly
unstable, but obviously we'd fix them in-tree before committing a test.
But maybe I'm missing a dependency?

FWIW, the test did pass on freebsd, linux, macos and windows with the
fix. After a few iterations of improving the fix ;)

Greetings,

Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
On Wed, Mar 2, 2022 at 3:00 PM Andres Freund wrote:
> On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> > - I am having some trouble understanding clearly what 0001 is doing.
> > I'll try to study it further.
>
> It tests for the various scenarios I could think of that could lead to FD
> reuse, to state the obvious ;). Anything particularly unclear?

As I understand it, the basic idea of this test is:

1. create a database called 'conflict_db' containing a table called 'large'
2. also create a table called 'replace_sb' in the postgres db.
3. have a long running transaction open in 'postgres' database that
repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
table, sometimes causing write-outs
4. try to get it to write out the data to a stale file descriptor by
fiddling with various things, and then detect the corruption that
results
5. use both a primary and a standby not because they are intrinsically
necessary but because you want to test that both cases work

As to (1) and (2), I note that 'large' is not a hugely informative table
name, especially when it's the smaller of the two test tables. And also,
the names are all over the place. The table names don't have much to do
with each other (large, replace_sb) and they're completely unrelated to
the database names (postgres, conflict_db). And then you have Perl
functions whose names also don't make it obvious what we're talking
about. I can't remember that verify() is the one that accesses
conflict_db.large while cause_eviction() is the one that accesses
postgres.replace_sb for more than like 15 seconds. It'd make it a lot
easier if the table names, database names, and Perl function names had
some common elements.

As to (5), the identifiers are just primary and standby, without
mentioning the database name, which adds to the confusion, and there are
no comments explaining why we have two nodes.

Also, some of the comments seem like they might be in the wrong place:

+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.

Right after this comment, we create a table and then a template database
just after, so I think we're not avoiding any invalidations at this
point. We're avoiding them at later points where the comments aren't
talking about this.

I think it would be helpful to find a way to divide the test case up
into sections that are separated from each other visually, and explain
the purpose of each test a little more in a comment. For example I'm
struggling to understand the exact purpose of this bit:

+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+   "restored contents as expected");

I'm all for having test coverage of VACUUM FULL, but it isn't clear to
me what this does that is related to FD reuse. Similarly for the later
ones. I generally grasp that you are trying to make sure that things are
OK with respect to FD reuse in various scenarios, but I couldn't explain
to you what we'd be losing in terms of test coverage if we removed this
line:

+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");

I am not sure how much all of this potential cleanup matters because I'm
pretty skeptical that this would be stable in the buildfarm. It relies
on a lot of assumptions about timing and row sizes and details of when
invalidations are sent that feel like they might not be entirely
predictable at the level you would need to have this consistently pass
everywhere. Perhaps I am too pessimistic?

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
On Wed, Mar 2, 2022 at 3:00 PM Andres Freund wrote: > What I am stuck on is what we can do for the released branches. Data > corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like > something we need to address. I think we should consider back-porting the ProcSignalBarrier stuff eventually. I realize that it's not really battle-tested yet and I am not saying we should do it right now, but I think that if we get these changes into v15, back-porting it in let's say May of next year could be a reasonable thing to do. Sure, there is some risk there, but on the other hand, coming up with completely different fixes for the back-branches is not risk-free either, nor is it clear that there is any alternative fix that is nearly as good. In the long run, I am fairly convinced that ProcSignalBarrier is the way forward not only for this purpose but for other things as well, and everybody's got to get on the train or be left behind. Also, I am aware of multiple instances where the project waited a long, long time to fix bugs because we didn't have a back-patchable fix. I disagree with that on principle. A master-only fix now is better than a back-patchable fix two or three years from now. Of course a back-patchable fix now is better still, but we have to pick from the options we have, not the ones we'd like to have. It seems to me that if we were going to try to construct an alternative fix for the back-branches, it would have to be something that didn't involve a new invalidation mechanism -- because the ProcSignalBarrier stuff is an invalidation mechanism in effect, and I feel that it can't be better to invent two new invalidation mechanisms rather than one. And the only idea I have is trying to detect a dangerous sequence of operations and just outright block it. We have some cases sort of like that already - e.g. you can't prepare a transaction if it's done certain things. 
But, the existing precedents that occur to me are, I think, all cases where all of the related actions are being performed in the same backend. It doesn't sound crazy to me to have some rule like "you can't ALTER TABLESPACE on the same tablespace in the same backend twice in a row without an intervening checkpoint", or whatever, and install the book-keeping to enforce that. But I don't think anything like that can work, both because the two ALTER TABLESPACE commands could be performed in different sessions, and also because an intervening checkpoint is no guarantee of safety anyway, IIUC. So I'm just not really seeing a reasonable strategy that isn't basically the barrier stuff. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
Hi,

On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> - I am having some trouble understanding clearly what 0001 is doing.
> I'll try to study it further.

It tests for the various scenarios I could think of that could lead to FD reuse, to state the obvious ;). Anything particularly unclear?

> - In general, 0003 makes a lot of sense to me.
>
> + /*
> +  * Finally tell the kernel to write the data to storage. Don't smgr if
> +  * previously closed, otherwise we could end up evading fd-reuse
> +  * protection.
> +  */
>
> - I think the above hunk is missing a word, because it uses smgr as a
> verb. I also think that it's easy to imagine this explanation being
> insufficient for some future hacker to understand the issue.

Yea, it's definitely not sufficient or grammatically correct. I think I wanted to send something out, but was very tired by that point...

> - While 0004 seems useful for testing, it's an awfully big hammer. I'm
> not sure we should be thinking about committing something like that,
> or at least not as a default part of the build. But ... maybe?

Agreed. I think it's racy anyway, because further files could get deleted (e.g. segments > 0) after the barrier has been processed.

What I am stuck on is what we can do for the released branches. Data corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like something we need to address.

Greetings,

Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Tue, Feb 22, 2022 at 4:40 AM Andres Freund wrote: > On 2022-02-22 01:11:21 -0800, Andres Freund wrote: > > I've started to work on a few debugging aids to find problem like > > these. Attached are two WIP patches: > > Forgot to attach. Also importantly includes a tap test for several of these > issues Hi, Just a few very preliminary comments: - I am having some trouble understanding clearly what 0001 is doing. I'll try to study it further. - 0002 seems like it's generally a good idea. I haven't yet dug into why the call sites for AssertFileNotDeleted() are where they are, or whether that's a complete set of places to check. - In general, 0003 makes a lot of sense to me. + /* +* Finally tell the kernel to write the data to storage. Don't smgr if +* previously closed, otherwise we could end up evading fd-reuse +* protection. +*/ - I think the above hunk is missing a word, because it uses smgr as a verb. I also think that it's easy to imagine this explanation being insufficient for some future hacker to understand the issue. - While 0004 seems useful for testing, it's an awfully big hammer. I'm not sure we should be thinking about committing something like that, or at least not as a default part of the build. But ... maybe? -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
Hi, On 2022-02-22 01:11:21 -0800, Andres Freund wrote: > I've started to work on a few debugging aids to find problem like > these. Attached are two WIP patches: Forgot to attach. Also importantly includes a tap test for several of these issues Greetings, Andres Freund >From 0bc64874f8e5faae9a38731a83aa7b001095cc35 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 21 Feb 2022 15:44:02 -0800 Subject: [PATCH v1 1/4] WIP: test for file reuse dangers around database and tablespace commands. --- src/test/recovery/t/029_relfilenode_reuse.pl | 233 +++ 1 file changed, 233 insertions(+) create mode 100644 src/test/recovery/t/029_relfilenode_reuse.pl diff --git a/src/test/recovery/t/029_relfilenode_reuse.pl b/src/test/recovery/t/029_relfilenode_reuse.pl new file mode 100644 index 000..22d8e85614c --- /dev/null +++ b/src/test/recovery/t/029_relfilenode_reuse.pl @@ -0,0 +1,233 @@ +use strict; +use warnings; +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use Test::More; +use File::Basename; + + +my $node_primary = PostgreSQL::Test::Cluster->new('primary'); +$node_primary->init(allows_streaming => 1); +$node_primary->append_conf('postgresql.conf', q[ +allow_in_place_tablespaces = true +log_connections=on +# to avoid "repairing" corruption +full_page_writes=off +log_min_messages=debug2 +autovacuum_naptime=1s +shared_buffers=1MB +]); +$node_primary->start; + + +# Create streaming standby linking to primary +my $backup_name = 'my_backup'; +$node_primary->backup($backup_name); +my $node_standby = PostgreSQL::Test::Cluster->new('standby'); +$node_standby->init_from_backup($node_primary, $backup_name, + has_streaming => 1); +$node_standby->start; + +# To avoid hanging while expecting some specific input from a psql +# instance being driven by us, add a timeout high enough that it +# should never trigger even on very slow machines, unless something +# is really wrong. 
+my $psql_timeout = IPC::Run::timer(300); + +my %psql_primary = (stdin => '', stdout => '', stderr => ''); +$psql_primary{run} = IPC::Run::start( + [ 'psql', '-XA', '-f', '-', '-d', $node_primary->connstr('postgres') ], + '<', + \$psql_primary{stdin}, + '>', + \$psql_primary{stdout}, + '2>', + \$psql_primary{stderr}, + $psql_timeout); + +my %psql_standby = ('stdin' => '', 'stdout' => '', 'stderr' => ''); +$psql_standby{run} = IPC::Run::start( + [ 'psql', '-XA', '-f', '-', '-d', $node_standby->connstr('postgres') ], + '<', + \$psql_standby{stdin}, + '>', + \$psql_standby{stdout}, + '2>', + \$psql_standby{stderr}, + $psql_timeout); + + +# Create template database with a table that we'll update, to trigger dirty +# rows. Using a template database + preexisting rows makes it a bit easier to +# reproduce, because there's no cache invalidations generated. + +$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db_template OID = 5;"); +$node_primary->safe_psql('conflict_db_template', q[ +CREATE TABLE large(id serial primary key, dataa text, datab text); +INSERT INTO large(dataa, datab) SELECT g.i::text, 1 FROM generate_series(1, 4000) g(i);]); +$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;"); + +$node_primary->safe_psql('postgres', q[ +CREATE EXTENSION pg_prewarm; +CREATE TABLE replace_sb(data text); +INSERT INTO replace_sb(data) SELECT random()::text FROM generate_series(1, 15000);]); + +# Use longrunning transactions, so that AtEOXact_SMgr doesn't close files +send_query_and_wait( + \%psql_primary, + q[BEGIN;], + qr/BEGIN/m); +send_query_and_wait( + \%psql_standby, + q[BEGIN;], + qr/BEGIN/m); + +# Cause lots of dirty rows in shared_buffers +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 1;"); + +# Now do a bunch of work in another database. 
That will end up needing to +# write back dirty data from the previous step, opening the relevant file +# descriptors +cause_eviction(\%psql_primary, \%psql_standby); + +# drop and recreate database +$node_primary->safe_psql('postgres', "DROP DATABASE conflict_db;"); +$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;"); + +verify($node_primary, $node_standby, 1, + "initial contents as expected"); + +# Again cause lots of dirty rows in shared_buffers, but use a different update +# value so we can check everything is OK +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 2;"); + +# Again cause a lot of IO. That'll again write back dirty data, but uses (XXX +# adjust after bugfix) the already opened file descriptor. +# FIXME +cause_eviction(\%psql_primary, \%psql_standby); + +verify($node_primary, $node_standby, 2, + "update to reused relfilenode (due to DB oid conflict) is not lost"); + + +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;"); +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;"); + +verify($node_primary, $node_standby, 3, + "restored
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
>
> Yes.

I wrote a test to show the problem. While looking at the code I unfortunately found that CREATE DATABASE ... OID isn't the only problem. Two consecutive ALTER DATABASE ... SET TABLESPACE also can cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of allow_in_place_tablespaces). But manually it's unfortunately quite reproducible :(

Start a server with:

shared_buffers = 1MB
autovacuum = off
bgwriter_lru_maxpages = 1
bgwriter_delay = 1

(these are just to give more time / to require smaller tables)

Start two psql sessions:

s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;

s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i);
s2: UPDATE large SET datab = 1;

-- prevent AtEOXact_SMgr
s1: BEGIN;

-- Fill shared_buffers with other contents. This needs to write out the prior dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;

s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;

-- this again writes out the dirty buffers. But because nothing forced the
-- smgr handles to be closed, fds pointing to the *old* file contents are
-- used. I.e. we'll write to the wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

┌───────┬───────┐
│ datab │ count │
├───────┼───────┤
│     1 │  2117 │
│     2 │    67 │
└───────┴───────┘
(2 rows)

oops.

I don't yet have a good idea how to tackle this in the back branches.

I've started to work on a few debugging aids to find problems like these. Attached are two WIP patches:

1) Use fstat(...).st_nlink == 0 to detect dangerous actions on fds with deleted files. That seems to work (almost) everywhere. Optionally, on Linux, use readlink(/proc/$pid/fd/$fd) to get the filename.

2) On Linux, iterate over PGPROC to get a list of pids. Then iterate over /proc/$pid/fd/ to check for deleted files. This can only be done after we've just made sure there are no fds open to deleted files.

Greetings,

Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
Hi, On 2022-02-11 09:10:38 +1300, Thomas Munro wrote: > I was about to commit that, because the original Windows problem it > solved is showing up occasionally in CI failures (that is, it already > solves a live problem, albeit a different and non-data-corrupting > one): +1 > It seems like I should go ahead and do that today, and we can study > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work? Yes. I wonder whether we really should make the barriers be conditional on defined(WIN32) || defined(USE_ASSERT_CHECKING) as done in that patch, even leaving wraparound dangers aside. On !windows we still have the issues of the space for the dropped / moved files not being released if there are processes having them open. That can be a lot of space if there's long-lived connections in a cluster that doesn't fit into s_b (because processes will have random fds open for writing back dirty buffers). And we don't truncate the files before unlinking when done as part of a DROP DATABASE... But that's something we can fine-tune later as well... Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
Hi, On 2022-02-10 13:49:50 -0500, Robert Haas wrote: > I agree. While I feel sort of bad about missing this issue in review, > I also feel like it's pretty surprising that there isn't something > plugging this hole already. It feels unexpected that our FD management > layer might hand you an FD that references the previous file that had > the same name instead of the one that exists right now, and I don't > think it's too much of a stretch of the imagination to suppose that > this won't be the last problem we have if we don't fix it. Arguably it's not really the FD layer that's the issue - it's on the smgr level. But that's semantics... > The main question in my mind is who is going to actually make that > happen. It was your idea (I think), Thomas coded it, and my commit > made it a live problem. So who's going to get something committed > here? I think it'd make sense for Thomas to commit the current patch for the tablespace issues first. And then we can go from there? > > If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(), > > XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are > > foreclosed. > > What about movedb()? My first reaction was that it should be fine, because RelFileNode.spcNode would differ. But of course that doesn't protect against moving a database back and forth. So yes, you're right, it's needed there too. Are there other commands that could be problematic? ALTER TABLE SET TABLESPACE should be fine, that allocates a new relfilenode. > I definitely think it would be nice to cover this issue not only for > database OIDs and tablespace OIDs but also for relfilenode OIDs. > Otherwise we're right on the cusp of having the same bug in that case. > pg_upgrade doesn't drop and recreate tables the way it does databases, > but if somebody made it do that in the future, would they think about > this? I bet not. And even just plugging the wraparound aspect would be great. 
It's not actually *that* hard to wrap oids around, because of toast.

> Dilip's been working on your idea of making relfilenode into a 56-bit
> quantity. That would make the problem of detecting wraparound go away.
> In the meantime, I suppose we could think of trying to do something
> here:
>
> /* wraparound, or first post-initdb assignment, in normal mode */
> ShmemVariableCache->nextOid = FirstNormalObjectId;
> ShmemVariableCache->oidCount = 0;
>
> That's a bit sketchy, because this is the OID counter, which isn't only
> used for relfilenodes.

I can see a few ways to deal with that:

1) Split nextOid into two, nextRelFileNode and nextOid, and have the wraparound behaviour only in the nextRelFileNode case. That'd be a nice improvement, but not entirely trivial, because we currently use GetNewRelFileNode() for oid assignment as well (cf. heap_create_with_catalog()).

2) Keep track of the last assigned relfilenode in ShmemVariableCache, and wait for a barrier in GetNewRelFileNode() if the assigned oid has wrapped around. While there would be a chance of "missing" a ->nextOid wraparound, e.g. because of >2**32 toast oids being allocated, I think that could only happen if no "wrapped around relfilenodes" were allocated, so that should be ok.

3) Any form of oid wraparound might cause problems, not just relfilenode issues. So having a place to emit barriers on oid wraparound might be a good idea.

One thing I wonder is how to make sure the path would be tested. Perhaps we could emit the anti-wraparound barriers more frequently when assertions are enabled?

> I'm not sure whether there would be problems waiting for a barrier here - I
> think we might be holding some lwlocks in some code paths.

It seems like a bad idea to call GetNewObjectId() with lwlocks held. From what I can see, the two callers of GetNewObjectId() both have loops that can go on for a long time after a wraparound.
> > I think we need to add some debugging code to detect "fd to deleted file of > > same name" style problems. Especially during regression tests they're quite > > hard to notice, because most of the contents read/written are identical > > across > > recycling. > > Maybe. If we just plug the holes here, I am not sure we need permanent > instrumentation. But I'm also not violently opposed to it. The question is how we can find all the places we need to add barriers emit & wait to. I e.g. didn't initially didn't realize that there's wraparound danger in WAL replay as well. Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Thu, Feb 10, 2022 at 3:11 PM Thomas Munro wrote: > On Fri, Feb 11, 2022 at 7:50 AM Robert Haas wrote: > > The main question in my mind is who is going to actually make that > > happen. It was your idea (I think), Thomas coded it, and my commit > > made it a live problem. So who's going to get something committed > > here? > > I was about to commit that, because the original Windows problem it > solved is showing up occasionally in CI failures (that is, it already > solves a live problem, albeit a different and non-data-corrupting > one): > > https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com > > It seems like I should go ahead and do that today, and we can study > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work? A bird in the hand is worth two in the bush. Unless it's a vicious, hand-biting bird. In other words, if you think what you've got is better than what we have now and won't break anything worse than it is today, +1 for committing, and more improvements can come when we get enough round tuits. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Fri, Feb 11, 2022 at 7:50 AM Robert Haas wrote: > The main question in my mind is who is going to actually make that > happen. It was your idea (I think), Thomas coded it, and my commit > made it a live problem. So who's going to get something committed > here? I was about to commit that, because the original Windows problem it solved is showing up occasionally in CI failures (that is, it already solves a live problem, albeit a different and non-data-corrupting one): https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com It seems like I should go ahead and do that today, and we can study further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Wed, Feb 9, 2022 at 5:00 PM Andres Freund wrote:
> The problem starts with
>
> commit aa01051418f10afbdfa781b8dc109615ca785ff9
> Author: Robert Haas
> Date: 2022-01-24 14:23:15 -0500
>
> pg_upgrade: Preserve database OIDs.

Well, that's sad.

> I think the most realistic way to address this is to use the mechanism from
> https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

I agree. While I feel sort of bad about missing this issue in review, I also feel like it's pretty surprising that there isn't something plugging this hole already. It feels unexpected that our FD management layer might hand you an FD that references the previous file that had the same name instead of the one that exists right now, and I don't think it's too much of a stretch of the imagination to suppose that this won't be the last problem we have if we don't fix it.

I also agree that the mechanism proposed in that thread is the best way to fix the problem. The sinval mechanism won't work, since there's nothing to ensure that the invalidation messages are processed in a sufficiently timely fashion, and there's no obvious way to get around that problem. But the ProcSignalBarrier mechanism is not intertwined with heavyweight locking in the way that sinval is, so it should be a good fit for this problem.

The main question in my mind is who is going to actually make that happen. It was your idea (I think), Thomas coded it, and my commit made it a live problem. So who's going to get something committed here?

> If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

What about movedb()?

> We perhaps could avoid all relfilenode recycling issues on the fd level, by
> adding a barrier whenever relfilenode allocation wraps around. But I'm not
> sure how easy that is to detect.
I definitely think it would be nice to cover this issue not only for database OIDs and tablespace OIDs but also for relfilenode OIDs. Otherwise we're right on the cusp of having the same bug in that case. pg_upgrade doesn't drop and recreate tables the way it does databases, but if somebody made it do that in the future, would they think about this? I bet not.

Dilip's been working on your idea of making relfilenode into a 56-bit quantity. That would make the problem of detecting wraparound go away. In the meantime, I suppose we could think of trying to do something here:

/* wraparound, or first post-initdb assignment, in normal mode */
ShmemVariableCache->nextOid = FirstNormalObjectId;
ShmemVariableCache->oidCount = 0;

That's a bit sketchy, because this is the OID counter, which isn't only used for relfilenodes. I'm not sure whether there would be problems waiting for a barrier here - I think we might be holding some lwlocks in some code paths.

> I think we might actually be able to remove the checkpoint from dropdb() by
> using barriers?

That looks like it might work. We'd need to be sure that the checkpointer would both close any open FDs and also absorb fsync requests before acknowledging the barrier.

> I think we need to add some debugging code to detect "fd to deleted file of
> same name" style problems. Especially during regression tests they're quite
> hard to notice, because most of the contents read/written are identical across
> recycling.

Maybe. If we just plug the holes here, I am not sure we need permanent instrumentation. But I'm also not violently opposed to it.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:
> On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
> a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
> == 0.

You could also stat() the file in /proc/self/fd/N and compare st_ino. It "looks" like a symlink (in which case that wouldn't work) but it's actually a Very Special File. You can even recover deleted, still-opened files that way.

PS. I didn't know pg_upgrade knew about Reply-To ;)