Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Fri, Jun 7, 2024 at 8:05 PM Alvaro Herrera wrote:
> In passing, I noticed that WaitReadBuffers has zero comments, which
> seems an insufficient number of them.

Ack. Here is a patch for that. I guess I hadn't put a comment there
because it's hard to write anything without sort of duplicating what is
already said by the StartReadBuffers() comment and doubling down on
descriptions of future plans... but, in for a penny, in for a pound, as
they say.

From db5e56825e4b4c595885373c413011c50fedd3e8 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sun, 9 Jun 2024 09:51:02 +1200
Subject: [PATCH] Add comment for WaitReadBuffers().

Per complaint from Alvaro while reviewing something else.

Reported-by: Alvaro Herrera
Discussion: https://postgr.es/m/202406070805.icofqromovvn%40alvherre.pgsql
---
 src/backend/storage/buffer/bufmgr.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a8e3377263f..409210a6ed2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1371,6 +1371,17 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
 	return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
 }
 
+/*
+ * The second step of a StartReadBuffers(), WaitReadBuffers() sequence. It is
+ * only necessary to call this if StartReadBuffers() returned true, indicating
+ * that I/O was necessary.
+ *
+ * Currently, this function performs synchronous I/O system calls. The
+ * earlier StartReadBuffers() call might have issued advice to give the OS a
+ * head start so that it completes quickly. In future work, a true
+ * asynchronous I/O could be started by StartReadBuffers(), and this function
+ * would wait for it to complete.
+ */
 void
 WaitReadBuffers(ReadBuffersOperation *operation)
 {
-- 
2.45.1
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
New version. Same code as v2, but comments improved to explain the
reasoning, with reference to README's buffer access rules. I'm planning
to push this soon if there are no objections.

From 1fa26f407622cd69d82f3b4ea68fddf2c357cf06 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH v3] Fix RBM_ZERO_AND_LOCK.

Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool. It should always lock the page, but it should only
zero pages that were not already found, as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet.

This is done with the standard BM_IO_IN_PROGRESS infrastructure to test
for BM_VALID and wake concurrent waiters without races (as it was in
earlier releases).

Reported-by: Noah Misch
Reported-by: Alexander Lakhin
Reviewed-by: Alvaro Herrera (earlier version)
Reviewed-by: Robert Haas (earlier version)
Discussion: https://postgr.es/m/20240512171658.7e.nmi...@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 64 +++++++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..a8e3377263f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1010,43 +1010,69 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 }
 
 /*
- * Zero a buffer and lock it, as part of the implementation of
+ * Lock and optionally zero a buffer, as part of the implementation of
  * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK. The buffer must be already
- * pinned. It does not have to be valid, but it is valid and locked on
- * return.
+ * pinned. If the buffer is not already valid, it is zeroed and made valid.
  */
 static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 {
 	BufferDesc *bufHdr;
-	uint32		buf_state;
+	bool		need_to_zero;
 
 	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 
-	if (BufferIsLocal(buffer))
+	if (already_valid)
+	{
+		/*
+		 * If the caller already knew the buffer was valid, we can skip some
+		 * header interaction. The caller just wants to lock the buffer.
+		 */
+		need_to_zero = false;
+	}
+	else if (BufferIsLocal(buffer))
+	{
+		/* Simple case for non-shared buffers. */
 		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+		need_to_zero = (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
 	else
 	{
+		/*
+		 * Take BM_IO_IN_PROGRESS, or discover that BM_VALID has been set
+		 * concurrently. Even though we aren't doing I/O, that protocol
+		 * ensures that we don't zero a page that someone else has pinned. An
+		 * exclusive content lock wouldn't be enough, because readers are
+		 * allowed to drop the content lock after determining that a tuple is
+		 * visible (see buffer access rules in README).
+		 */
 		bufHdr = GetBufferDescriptor(buffer - 1);
+		need_to_zero = StartBufferIO(bufHdr, true, false);
+	}
+
+	if (need_to_zero)
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	/* Acquire the requested lock before setting BM_VALID. */
+	if (!BufferIsLocal(buffer))
+	{
 		if (mode == RBM_ZERO_AND_LOCK)
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		else
 			LockBufferForCleanup(buffer);
 	}
 
-	memset(BufferGetPage(buffer), 0, BLCKSZ);
-
-	if (BufferIsLocal(buffer))
-	{
-		buf_state = pg_atomic_read_u32(&bufHdr->state);
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
+	/*
+	 * Set BM_VALID, and wake anyone waiting for BM_IO_IN_PROGRESS to be
+	 * cleared.
+	 */
+	if (need_to_zero)
 	{
-		buf_state = LockBufHdr(bufHdr);
-		buf_state |= BM_VALID;
-		UnlockBufHdr(bufHdr, buf_state);
+		if (BufferIsLocal(buffer))
+			pg_atomic_write_u32(&bufHdr->state,
+								pg_atomic_read_u32(&bufHdr->state) | BM_VALID);
+		else
+			TerminateBufferIO(bufHdr, false, BM_VALID, true);
 	}
 }
 
@@ -1185,7 +1211,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
 								   forkNum, blockNum, strategy, &found);
-		ZeroBuffer(buffer, mode);
+		ZeroAndLockBuffer(buffer, mode, found);
 		return buffer;
 	}
-- 
2.45.1
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Sat, Jun 8, 2024 at 12:47 AM Robert Haas wrote:
> > On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera wrote:
> > >  static void
> > > -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> > > +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
> >
> > This change makes the API very strange. Should the function be called
> > ZeroAndLockBuffer() instead? Then the addition of a "bool zero"
> > argument makes a lot more sense.
>
> I agree that's better, but it still looks a bit weird. You have to
> realize that 'bool zero' means 'is already zeroed' here -- or at
> least, I guess that's the intention. But then I wonder why you'd call
> a function called ZeroAndLockBuffer if all you need to do is
> LockBuffer.

The name weirdness comes directly from RBM_ZERO_AND_LOCK (the fact that
it doesn't always zero despite shouting ZERO is probably what
temporarily confused me). But coming up with a better name is hard and
I certainly don't propose to change it now. I think it's reasonable for
this internal helper function to have that matching name, as Alvaro
suggested, with a good comment about that.

Even though that quick-demonstration change fixed the two reported
repros, I think it is still probably racy (or if it isn't, it relies on
higher level interlocking that I don't want to rely on). This case
really should be using the standard StartBufferIO/TerminateBufferIO
infrastructure, as it was before. I had moved that around to deal with
multi-block I/O, but dropped the ball on the zero case... sorry.

Here's a version like that. The "zero" argument (yeah, that was not a
good name) is now inverted and called "already_valid", but it's only a
sort of optimisation for the case where we already know for sure that
it's valid. If it isn't, we do the standard
BM_IO_IN_PROGRESS/BM_VALID/CV dance, for correct interaction with any
concurrent read or zero operation.

v2-0001-Fix-RBM_ZERO_AND_LOCK.patch
Description: Binary data
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro wrote:
> On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin wrote:
> > My bisect run ended with:
> > 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit
> >
> > Author: Thomas Munro
> > Date:   Wed Apr 3 00:03:08 2024 +1300
> >
> >     Provide vectored variant of ReadBuffer().
> >
> > Other buildfarm failures with this Assert I could find kind of confirm this:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-04-03%2003%3A32%3A18
> > (presumably a first failure of this sort)
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-04-04%2015%3A38%3A16
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay=2024-05-07%2004%3A00%3A08
>
> Looking...

What Noah described[1] is what should be happening already, I think, but
210622c6 unconditionally zeroed the page. Oops. The attached seems to
cure his repro for me. Does it also cure your test? I couldn't see that
variant myself for some reason, but it seems to make sense as the
explanation.

I would probably adjust the function name or perhaps consider
refactoring slightly, but first let's confirm that this is the same
issue and fix.

[1] https://www.postgresql.org/message-id/flat/20240512171658.7e.nmi...@google.com

From f3bb1d69a57bea820895efaf366371463e62235d Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH] Fix RBM_ZERO_AND_LOCK.

Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool. It should always lock the page, but it should only
zero pages that were not already found, as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet.

Reported-by: Noah Misch
Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/20240512171658.7e.nmi...@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..238fc0e3547 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1016,7 +1016,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * return.
  */
 static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
 {
 	BufferDesc *bufHdr;
 	uint32		buf_state;
@@ -1034,7 +1034,8 @@ ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 			LockBufferForCleanup(buffer);
 	}
 
-	memset(BufferGetPage(buffer), 0, BLCKSZ);
+	if (zero)
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
 
 	if (BufferIsLocal(buffer))
 	{
@@ -1185,7 +1186,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
 								   forkNum, blockNum, strategy, &found);
-		ZeroBuffer(buffer, mode);
+		ZeroBuffer(buffer, mode, !found);
 		return buffer;
 	}
-- 
2.45.1
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin wrote:
> My bisect run ended with:
> 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit
>
> Author: Thomas Munro
> Date:   Wed Apr 3 00:03:08 2024 +1300
>
>     Provide vectored variant of ReadBuffer().
>
> Other buildfarm failures with this Assert I could find kind of confirm this:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae=2024-04-03%2003%3A32%3A18
> (presumably a first failure of this sort)
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-04-04%2015%3A38%3A16
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay=2024-05-07%2004%3A00%3A08

Looking...
Re: Streaming I/O, vectored I/O (WIP)
On Wed, Nov 29, 2023 at 1:17 AM Thomas Munro wrote:
> Done. I like it, I just feel a bit bad about moving the p*v()
> replacement functions around a couple of times already! I figured it
> might as well be static inline even if we use the fallback (= Solaris
> and Windows).

Just for the record, since I'd said things like the above a few times
while writing about this stuff: Solaris 11.4.69 has gained preadv() and
pwritev(). That's interesting because it means that there will soon be
no live Unixoid operating systems left without them, and the fallback
code in src/include/port/pg_iovec.h will, in practice, be only for
Windows.

I wondered if that might have implications for how we code or comment
stuff like that, but it still seems to make sense as we have it. (I
don't think Windows can have a real synchronous implementation; the
kernel knows how to do scatter/gather, a feature implemented
specifically for databases, but only in asynchronous ("overlapped") +
direct I/O mode, a difference I don't know how to hide at this level.
In later AIO work we should be able to use it as intended, but not by
pretending to be Unix like this.)
Re: processes stuck in shutdown following OOM/recovery
On Thu, May 23, 2024 at 9:58 AM Martijn Wallet wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: not tested
> Implements feature:      not tested
> Spec compliant:          not tested
> Documentation:           not tested
>
> Hi, I somehow fail to be able to mark all checkboxes on this review
> page... However, it builds and all tests passed successfully on Rocky
> Linux release 8.9 (Green Obsidian).
> Not sure if more reviewing is needed on other operating systems, since
> this is only my second review.

Thanks! I'm also hoping to get review of the rather finickity state
machine logic involved from people familiar with that; I think it's
right, but I'd hate to break some other edge case...

> nb: second mail to see if SPF is fixed and Thomas receives this message.

FTR 171641337152.1103.7326466732639994038.p...@coridan.postgresql.org
and 171641505305.1105.9868637944637520353.p...@coridan.postgresql.org
both showed up in my inbox, and they both have headers
"Received-SPF: pass ...".
Re: Streaming read-ready sequential scan code
On Tue, May 21, 2024 at 9:11 AM Melanie Plageman wrote:
> So, if you are seeing the slow-down mostly go away by reducing
> blocknums array size, does the regression only appear when the scan
> data is fully in shared buffers? Or is this blocknums' other use
> (dealing with short reads)?

That must be true (the blocknums array is normally only "filled" in the
"fast path", where all buffers are found in cache).

> Is your theory that one worker ends up reading 16 blocks that should
> have been distributed across multiple workers?

Yes, it just jiggles the odds around a bit, introducing a bit of extra
unfairness by calling the callback in a tighter loop to build a little
batch, revealing a pre-existing problem.

The mistake in PHJ (problem #2 above) is that, once a worker decides it
would like all workers to stop inserting so it can increase the number
of buckets, it sets a flag to ask them to do that, and waits for them
to see it, but if there is a worker filtering all tuples out, it never
checks the "growth" flag. So it scans all the way to the end while the
other guy waits. Normally it checks that flag when it is time to
allocate a new chunk of memory, which seemed to make sense to me at the
time: if we've hit the needs-more-buckets (or needs-more-batches)
logic, then surely workers are inserting tuples and will soon allocate
a new chunk! But, of course, here is the edge case where that isn't
true: we had bad estimates so the hash table was too small (problem
#1), we got lots of tuples clustered over a few heap pages and decided
to expand the hash table, but right at that moment matching tuples ran
out, so somebody had to finish the whole scan without ever checking the
flag (problem #2), and that someone happened to have all the rest of
the pages because we made the lookahead a bit less fair (problem #3).
Nice confluence of problems.

I expect #2 and #3 to be easy to fix, and I didn't look at the
estimation problem #1 at all (perhaps a stats puzzle designed by the
TPC to trip us up?).
Re: commitfest.postgresql.org is no longer fit for purpose
On Mon, May 20, 2024 at 1:09 PM Tom Lane wrote:
> (The cfbot tends to discourage this, since as soon as one of the
> patches is committed it no longer knows how to apply the rest.
> Can we improve on that tooling somehow?)

Cfbot currently applies patches with (GNU) patch --no-backup-if-mismatch
-p1 -V none -f. The -f means that any question of the form "Reversed
(or previously applied) patch detected! Assume -R? [y]" is answered
with "yes", and the operation fails. I wondered if --forward would be
better. It's the right idea, but it seems to be useless in practice for
this purpose, because the command still fails:

tmunro@phonebox postgresql % patch --forward -p1 < x.patch || echo XXX failed $?
patching file 'src/backend/postmaster/postmaster.c'
Ignoring previously applied (or reversed) patch.
1 out of 1 hunks ignored--saving rejects to 'src/backend/postmaster/postmaster.c.rej'
XXX failed 1

I wondered if it might be distinguishable from other kinds of failure
that should stop progress, but nope: patch's exit status is 0 if all
hunks are applied successfully, 1 if some hunks cannot be applied or
there were merge conflicts, and 2 if there is more serious trouble.
When applying a set of patches in a loop it behooves you to check this
exit status so you don't apply a later patch to a partially patched
file. I guess I could parse stdout or whatever that is and detect the
all-hunks-ignored condition, but that doesn't sound like fun...

Perhaps cfbot should test explicitly for patches that have already been
applied with something like "git apply --reverse --check", and skip
them. That would work for exact patches, but of course it would be
confused by any tweaks made before committing. If going that way, it
might make sense to switch to git apply/am (instead of GNU patch), to
avoid contradictory conclusions.

The reason I was using GNU patch in the first place is that it is/was a
little more tolerant of some of the patches people used to send a few
years back, but now I expect everyone uses git format-patch and would
be prepared to change their ways if not. In the past we had a couple of
cases of the reverse, that is, GNU patch couldn't apply something that
format-patch produced (some edge case of renaming, IIRC), and I'm sorry
that I never got around to changing that.

Sometimes I question the sanity of the whole thing. Considering cfbot's
original "zero-effort CI" goal (or I guess "zero-extra-effort" would be
better), I was curious about what other projects had the same idea, or
whether we're really just starting at the "wrong end", and came up
with:

https://github.com/getpatchwork/patchwork
http://vger.kernel.org/bpfconf2022_material/lsfmmbpf2022-bpf-ci.pdf <-- example user
https://github.com/patchew-project/patchew

Actually cfbot requires more effort than those, because it's driven
first by Commitfest app registration. Those projects are extremists
IIUC: just write to a mailing list, no other bureaucracy at all (at
least for most participants; presumably administrators can adjust the
status in some database when things go wrong?). We're actually halfway
to Gitlab et al already, with a web account and interaction required to
start the process of submitting a patch for consideration. What I'm
less clear on is who else has come up with the "bitrot" test idea,
either at the mailing list or web extremist ends of the scale. Those
are also generic tools, and cfbot obviously knows lots of things about
PostgreSQL, like the "highlights" and probably more things I'm
forgetting.
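The "git apply --reverse --check" probe mentioned above can be sketched in a few lines of shell (a throwaway toy repository stands in for a submitted patch series; all names are made up):

```shell
# Toy demonstration: probe whether a patch is already applied before
# applying it, using `git apply --reverse --check`. If the patch can be
# reverse-applied cleanly, its content must already be in the tree.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo one > f.txt
git add f.txt
git -c user.email=cf@example.com -c user.name=cfbot commit -q -m base
echo two >> f.txt
git diff > change.patch
git checkout -- f.txt              # pristine tree: patch not applied yet

apply_or_skip() {
    if git apply --reverse --check change.patch 2>/dev/null; then
        echo "already applied: skip"
    else
        git apply change.patch
        echo "applied"
    fi
}

apply_or_skip                      # first pass: applies the patch
apply_or_skip                      # second pass: detects it and skips
```

As the message notes, this only recognizes byte-for-byte identical patches; any committer tweak before pushing defeats the reverse-check.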
Re: Streaming read-ready sequential scan code
On Sun, May 19, 2024 at 7:00 AM Alexander Lakhin wrote:
> With blocknums[1], timing is changed, but the effect is not persistent.
> 10 query15 executions in a row, b7b0f3f27:
> 277.932 ms
> 281.805 ms
> 278.335 ms
> 281.565 ms
> 284.167 ms
> 283.171 ms
> 281.165 ms
> 281.615 ms
> 285.394 ms
> 277.301 ms

The bad time 10/10.

> b7b0f3f27~1:
> 159.789 ms
> 165.407 ms
> 160.893 ms
> 159.343 ms
> 160.936 ms
> 161.577 ms
> 161.637 ms
> 163.421 ms
> 163.143 ms
> 167.109 ms

The good time 10/10.

> b7b0f3f27 + blocknums[1]:
> 164.133 ms
> 280.920 ms
> 160.748 ms
> 163.182 ms
> 161.709 ms
> 161.998 ms
> 161.239 ms
> 276.256 ms
> 161.601 ms
> 160.384 ms

The good time 8/10, the bad time 2/10. Thanks for checking! I bet all
branches can show that flip/flop instability in these adverse
conditions, depending on random scheduling details. I will start a new
thread with a patch for the root cause of that, ie problem #2 (this
will need back-patching), and post a fix for #3 (the v17 blocknums[N]
tweak affecting fairness/likelihood, which was probably basically a bit
of ill-advised premature optimisation) here in a few days.
Re: Requiring LLVM 14+ in PostgreSQL 18
On Sun, May 19, 2024 at 10:46 AM Ole Peder Brandtzæg wrote:
> On Wed, May 15, 2024 at 07:20:09AM +0200, Peter Eisentraut wrote:
> > Yes, let's get that v3-0001 patch into PG17.
>
> Upon seeing this get committed in 4dd29b6833, I noticed that the docs
> still advertise the llvm-config-$version search dance. That's still
> correct for Meson-based builds since we use their config-tool
> machinery, but no longer holds for configure-based builds. The
> attached patch updates the docs accordingly.

Oops, right, I didn't know we had that documented. Thanks. Will hold
off doing anything until the thaw.

Hmm, I also didn't know that Meson had its own list like our
just-removed one:

https://github.com/mesonbuild/meson/blob/master/mesonbuild/environment.py#L183

Unsurprisingly, it suffers from maintenance lag, priority issues etc
(new major versions pop out every 6 months):

https://github.com/mesonbuild/meson/issues/10483
Re: speed up a logical replica setup
040_pg_createsubscriber.pl seems to be failing occasionally on culicidae near "--dry-run on node S". I couldn't immediately see why. That animal is using EXEC_BACKEND and I also saw a one-off failure a bit like that on my own local Linux + EXEC_BACKEND test run (sorry I didn't keep the details around). Coincidence?
Re: race condition when writing pg_control
On Fri, May 17, 2024 at 4:46 PM Thomas Munro wrote:
> The specific problem here is that LocalProcessControlFile() runs in
> every launched child for EXEC_BACKEND builds. Windows uses
> EXEC_BACKEND, and Windows' NTFS file system is one of the two file
> systems known to this list to have the concurrent read/write data
> mashing problem (the other being ext4).

Phngh... this is surprisingly difficult to fix. Things that don't work:
We "just" need to acquire ControlFileLock while reading the file or
examining the object in shared memory, or get a copy of it, passed
through the EXEC_BACKEND BackendParameters, that was acquired while
holding the lock, but the current location of this code in child
startup is too early to use LWLocks, and the postmaster can't acquire
locks either, so it can't even safely take a copy to pass on.

You could reorder startup so that we are allowed to acquire LWLocks in
children at that point, but then you'd need to convince yourself that
there is no danger of breaking some ordering requirement in external
preload libraries, and figure out what to do about children that don't
even attach to shared memory. Maybe that's possible, but that doesn't
sound like a good idea to back-patch.

First idea I've come up with to avoid all of that: pass a copy of the
"proto-controlfile", to coin a term for the one read early in
postmaster startup by LocalProcessControlFile(). As far as I know, the
only reason we need it is to suck some settings out of it that don't
change while a cluster is running (mostly can't change after initdb,
and checksums can only be {en,dis}abled while down). Right? Children
can just "import" that sucker instead of calling
LocalProcessControlFile() to figure out the size of WAL segments yada
yada, I think? Later they will attach to the real one in shared memory
for all future purposes, once normal interlocking is allowed.

I dunno. Draft patch attached. Better plans welcome.
This passes CI on Linux systems afflicted by EXEC_BACKEND, and Windows.
Thoughts?

From 48c2de14bd9368b4708a99ecbb75452dc327e608 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 18 May 2024 13:41:09 +1200
Subject: [PATCH v1 1/3] Fix pg_control corruption in EXEC_BACKEND startup.

When backend processes were launched in EXEC_BACKEND builds, they would
run LocalProcessControlFile() to read in pg_control and extract several
important settings. This happens too early to acquire ControlFileLock,
and the postmaster is also not allowed to acquire ControlFileLock, so
it can't safely take a copy to give to the child.

Instead, pass down the "proto-controlfile" that was read by the
postmaster in LocalProcessControlFile(). Introduce functions
ExportProtoControlFile() and ImportProtoControlFile() to allow that.
Subprocesses will extract information from that, and then later attach
to the current control file in shared memory.

Reported-by: Melanie Plageman per Windows CI failure
Discussion: https://postgr.es/m/CAAKRu_YNGwEYrorQYza_W8tU%2B%3DtoXRHG8HpyHC-KDbZqA_ZVSA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 46 +++++++++++++++++++-----
 src/backend/postmaster/launch_backend.c | 19 ++++++++++
 src/include/access/xlog.h               |  5 +++
 3 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f2..b69a0d95af9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -568,6 +568,10 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
  */
 static ControlFileData *ControlFile = NULL;
 
+#ifdef EXEC_BACKEND
+static ControlFileData *ProtoControlFile = NULL;
+#endif
+
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -686,6 +690,7 @@ static bool PerformRecoveryXLogAction(void);
 static void InitControlFile(uint64 sysidentifier);
 static void WriteControlFile(void);
 static void ReadControlFile(void);
+static void ScanControlFile(void);
 static void UpdateControlFile(void);
 static char *str_time(pg_time_t tnow);
 
@@ -4309,9 +4314,7 @@ WriteControlFile(void)
 static void
 ReadControlFile(void)
 {
-	pg_crc32c	crc;
 	int			fd;
-	static char wal_segsz_str[20];
 	int			r;
 
 	/*
@@ -4344,6 +4347,15 @@ ReadControlFile(void)
 
 	close(fd);
 
+	ScanControlFile();
+}
+
+static void
+ScanControlFile(void)
+{
+	static char wal_segsz_str[20];
+	pg_crc32c	crc;
+
 	/*
 	 * Check for expected pg_control format version. If this is wrong, the
 	 * CRC check will likely fail because we'll be checking the wrong number
@@ -4815,8 +4827,33 @@ LocalProcessControlFile(bool reset)
 	Assert(reset || ControlFile == NULL);
 	ControlFile = palloc(sizeof(ControlFileData));
 	ReadControlFile();
+
+#ifdef EXEC_BACKEND
+	/* We need to be able to give this to subprocesses. */
+	ProtoControlFile = ControlFile;
+#endif
+}
+
+#ifdef EXEC_BACKEND
+void
+E
Re: Refactoring backend fork+exec code
On Mon, Mar 18, 2024 at 10:41 PM Heikki Linnakangas wrote:
> Committed, with some final cosmetic cleanups. Thanks everyone!

Nitpicking from UBSan with EXEC_BACKEND on Linux (line numbers may be a
bit off, from a branch of mine):

../src/backend/postmaster/launch_backend.c:772:2: runtime error: null pointer passed as argument 2, which is declared to never be null
==13303==Using libbacktrace symbolizer.
    #0 0x564b0202 in save_backend_variables ../src/backend/postmaster/launch_backend.c:772
    #1 0x564b0242 in internal_forkexec ../src/backend/postmaster/launch_backend.c:311
    #2 0x564b0bdd in postmaster_child_launch ../src/backend/postmaster/launch_backend.c:244
    #3 0x564b3121 in StartChildProcess ../src/backend/postmaster/postmaster.c:3928
    #4 0x564b933a in PostmasterMain ../src/backend/postmaster/postmaster.c:1357
    #5 0x562de4ad in main ../src/backend/main/main.c:197
    #6 0x7667ad09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
    #7 0x55e34279 in _start (/tmp/cirrus-ci-build/build/tmp_install/usr/local/pgsql/bin/postgres+0x8e0279)

This silences it:

-	memcpy(param->startup_data, startup_data, startup_data_len);
+	if (startup_data_len > 0)
+		memcpy(param->startup_data, startup_data, startup_data_len);

(I found that out by testing EXEC_BACKEND on CI. I also learned that
the Mac and FreeBSD tasks fail with EXEC_BACKEND because of SysV shmem
bleating. We probably should go and crank up the relevant sysctls in
the .cirrus.tasks.yml...)
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 11:30 AM Thomas Munro wrote:
> Andres happened to have TPC-DS handy, and reproduced that regression
> in q15. We tried some stuff and figured out that it requires
> parallel_leader_participation=on, ie that this looks like some kind of
> parallel fairness and/or timing problem. It seems to be a question of
> which worker finishes up processing matching rows, and the leader gets
> a ~10ms head start but may be a little more greedy with the new
> streaming code. He tried reordering the table contents and then saw
> 17 beat 16. So for q15, initial indications are that this isn't a
> fundamental regression, it's just a test that is sensitive to some
> arbitrary conditions.
>
> I'll try to figure out some more details about that, ie is it being
> too greedy on small-ish tables,

After more debugging, we learned a lot more things...

1. That query produces spectacularly bad estimates, so we finish up
having to increase the number of buckets in a parallel hash join many
times. That is quite interesting, but unrelated to new code.

2. Parallel hash join is quite slow at negotiating an increase in the
number of hash buckets, if all of the input tuples are being filtered
out by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily, I
think. I have put that on my todo list 'cause that's also my code, but
it's not a new issue, it's just one that is now highlighted...

3. This bit of read_stream.c is exacerbating unfairness in the
underlying scan, so that 1 and 2 come together and produce a nasty
slowdown, which goes away if you change it like so:

-	BlockNumber blocknums[16];
+	BlockNumber blocknums[1];

I will follow up after some more study.
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 8:09 AM Thomas Munro wrote:
> On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin wrote:
> > I decided to compare v17 vs v16 performance (as I did the last year [1])
> > and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> > benchmark, query15 (and several others, but I focused on this one):
> > Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15
> > Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15
> > Please look at the full html report attached in case you're interested.
> >
> > (I used my pg-mark tool to measure/analyze performance, but I believe the
> > same results can be seen without it.)
>
> Will investigate, but if it's easy for you to rerun, does it help if
> you increase Linux readahead, eg blockdev --setra setting?

Andres happened to have TPC-DS handy, and reproduced that regression in
q15. We tried some stuff and figured out that it requires
parallel_leader_participation=on, ie that this looks like some kind of
parallel fairness and/or timing problem. It seems to be a question of
which worker finishes up processing matching rows, and the leader gets
a ~10ms head start but may be a little more greedy with the new
streaming code. He tried reordering the table contents and then saw 17
beat 16. So for q15, initial indications are that this isn't a
fundamental regression, it's just a test that is sensitive to some
arbitrary conditions.

I'll try to figure out some more details about that, ie is it being too
greedy on small-ish tables, and generally I do wonder about the
interactions between the heuristics and batching working at different
levels (OS, seq scan, read stream, hence my earlier ra question, which
is likely a red herring) and how there might be unintended
consequences/interference patterns, but this particular case seems more
data dependent.
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin wrote: > I decided to compare v17 vs v16 performance (as I did the last year [1]) > and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds) > benchmark, query15 (and several others, but I focused on this one): > Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > > 151.03): pg_tpcds.query15 > Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > > 152.64): pg_tpcds.query15 > Please look at the full html report attached in case you're interested. > > (I used my pg-mark tool to measure/analyze performance, but I believe the > same results can be seen without it.) Will investigate, but if it's easy for you to rerun, does it help if you increase Linux readahead, eg blockdev --setra setting?
Re: race condition when writing pg_control
The specific problem here is that LocalProcessControlFile() runs in every launched child for EXEC_BACKEND builds. Windows uses EXEC_BACKEND, and Windows' NTFS file system is one of the two file systems known to this list to have the concurrent read/write data mashing problem (the other being ext4).
Re: Potential stack overflow in incremental base backup
On Tue, Apr 16, 2024 at 4:10 AM Robert Haas wrote: > On Wed, Apr 10, 2024 at 9:55 PM Thomas Munro wrote: > > To rescue my initdb --rel-segsize project[1] for v18, I will have a go > > at making that dynamic. It looks like we don't actually need to > > allocate that until we get to the GetFileBackupMethod() call, and at > > that point we have the file size. If I understand correctly, > > statbuf.st_size / BLCKSZ would be enough, so we could embiggen our > > block number buffer there if required, right? > > Yes. Here is a first attempt at that. Better factoring welcome. New observations made along the way: the current coding can exceed MaxAllocSize and error out, or overflow 32 bit size_t and allocate nonsense. Do you think it'd be better to raise an error explaining that, or silently fall back to full backup (tried that way in the attached), or that + log messages? Another option would be to use huge allocations, so we only have to deal with that sort of question for 32 bit systems (i.e. effectively hypothetical/non-relevant scenarios), but I don't love that idea. > ... > I do understand that a 1GB segment size is not that big in 2024, and > that filesystems with a 2GB limit are thought to have died out a long > time ago, and I'm not against using larger segments. I do think, > though, that increasing the segment size by 32768x in one shot is > likely to be overdoing it. My intuition is that the primary interesting lines to cross are at 2GB and 4GB due to data type stuff. I defend against the most basic problem in my proposal: I don't let you exceed your off_t type, but that doesn't mean we don't have off_t/ssize_t/size_t/long snafus lurking in our code that could bite a 32 bit system with large files. If you make it past those magic numbers and your tools are all happy, I think you should be home free until you hit file system limits, which are effectively unhittable on most systems except ext4's already bemoaned 16TB limit AFAIK. 
But all the same, I'm contemplating limiting the range to 1TB in the first version, not because of general fear of unknown unknowns, but just because it means we don't need to use "huge" allocations for this known place, maybe until we can address that. From 30efb6d39c83e9d4f338e10e1b8944c64f8799c5 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 15 May 2024 17:23:21 +1200 Subject: [PATCH v1] Limit block number buffer size in incremental backup. Previously, basebackup.c would allocate an array big enough to hold a block number for every block in a full sized md.c segment file. That works out to 512kB by default, which should be no problem. However, users can change the segment size at compile time, and the required space for the array could be many gigabytes, posing problems: 1. For segment sizes over 2TB, MaxAllocSize would be exceeded, raising an error. 2. For segment sizes over 8TB, size_t arithmetic would overflow on 32 bit systems, leading to the wrong size allocation. 3. For any very large segment size, it's non-ideal to allocate a huge amount of memory if you're not actually going to use it. This isn't a fundamental fix for the high memory requirement of the algorithm as coded, but it seems like a good idea to avoid those limits with a fallback strategy, and defer allocating until we see how big the segments actually are in the cluster being backed up, upsizing as required. These are mostly theoretical problems at the moment, because it is so hard to change the segment size in practice that people don't do it. A new proposal would make it changeable at run time, hence interest in tidying these rough edges up. 
Discussion: https://postgr.es/m/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c index 9a2bf59e84e..0703b77af94 100644 --- a/src/backend/backup/basebackup.c +++ b/src/backend/backup/basebackup.c @@ -34,6 +34,7 @@ #include "pgstat.h" #include "pgtar.h" #include "port.h" +#include "port/pg_bitutils.h" #include "postmaster/syslogger.h" #include "postmaster/walsummarizer.h" #include "replication/walsender.h" @@ -45,6 +46,7 @@ #include "storage/reinit.h" #include "utils/builtins.h" #include "utils/guc.h" +#include "utils/memutils.h" #include "utils/ps_status.h" #include "utils/relcache.h" #include "utils/resowner.h" @@ -1198,13 +1200,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool isGlobalDir = false; Oid dboid = InvalidOid; BlockNumber *relative_block_numbers = NULL; - - /* - * Since this array is relatively large, avoid putting it on the stack. - * But we don't need it at all if this is not an incremental backup. - */ - if (ib != NULL) - relative_block_numbers
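To put rough numbers on the size limits discussed above, a quick back-of-the-envelope check (Python standing in for the C arithmetic; BLCKSZ and MaxAllocSize are PostgreSQL's defaults, and BlockNumber is a 4-byte unsigned integer):

```python
BLCKSZ = 8192                  # default block size
BLOCKNUM_SIZE = 4              # sizeof(BlockNumber)
MAX_ALLOC_SIZE = 0x3fffffff    # PostgreSQL's MaxAllocSize, just under 1GB

def block_array_bytes(segment_size, size_t_bits=64):
    """Bytes needed for one BlockNumber per block in a segment,
    truncated the way C size_t arithmetic of that width would truncate."""
    nblocks = segment_size // BLCKSZ
    return (nblocks * BLOCKNUM_SIZE) & ((1 << size_t_bits) - 1)

GB, TB = 1 << 30, 1 << 40
print(block_array_bytes(1 * GB))                   # 524288: the default 512kB case
print(block_array_bytes(4 * TB) > MAX_ALLOC_SIZE)  # True: palloc would error out
print(block_array_bytes(8 * TB, size_t_bits=32))   # 0: 32 bit size_t wraps around
```

That matches the commit message's three scenarios: harmless at the default 1GB segment size, an allocation error somewhere past 2TB, and silent 32-bit wraparound at 8TB.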
Re: Requiring LLVM 14+ in PostgreSQL 18
On Fri, May 17, 2024 at 3:17 AM Nazir Bilal Yavuz wrote: > Actually, 32 bit builds are working but the Perl version needs to be > updated to 'perl5.36-i386-linux-gnu' in .cirrus.tasks.yml. I changed > 0001 with the working version of 32 bit builds [1] and the rest is the > same. All tests pass now [2]. Ahh, right, thanks! I will look at committing your CI/fixup patches.
Re: Requiring LLVM 14+ in PostgreSQL 18
On Wed, May 15, 2024 at 5:20 PM Peter Eisentraut wrote: > Yes, let's get that v3-0001 patch into PG17. Done. Bilal recently created the CI images for Debian Bookworm[1]. You can try them with s/bullseye/bookworm/ in .cirrus.tasks.yml, but it looks like he is still wrestling with a perl installation problem[2] in the 32 bit build, so here is a temporary patch to do that and also delete the 32 bit tests for now. This way cfbot should succeed with the remaining patches. Parked here for v18. [1] https://github.com/anarazel/pg-vm-images/commit/685ca7ccb7b3adecb11d948ac677d54cd9599e6c [2] https://cirrus-ci.com/task/5459439048720384 From c0a05c2929e03558c730b148bdeb5d301dbc4312 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 16 May 2024 14:10:09 +1200 Subject: [PATCH v4 1/3] XXX CI kludge: bullseye->bookworm Temporarily removed 32 bit tests, as the CI image is not fully baked yet. --- .cirrus.tasks.yml | 37 + 1 file changed, 5 insertions(+), 32 deletions(-) diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml index a2388cd5036..5ff6b6a0556 100644 --- a/.cirrus.tasks.yml +++ b/.cirrus.tasks.yml @@ -65,7 +65,7 @@ task: CPUS: 4 BUILD_JOBS: 8 TEST_JOBS: 8 -IMAGE_FAMILY: pg-ci-bullseye +IMAGE_FAMILY: pg-ci-bookworm CCACHE_DIR: ${CIRRUS_WORKING_DIR}/ccache_dir # no options enabled, should be small CCACHE_MAXSIZE: "150M" @@ -243,7 +243,7 @@ task: CPUS: 4 BUILD_JOBS: 4 TEST_JOBS: 8 # experimentally derived to be a decent choice -IMAGE_FAMILY: pg-ci-bullseye +IMAGE_FAMILY: pg-ci-bookworm CCACHE_DIR: /tmp/ccache_dir DEBUGINFOD_URLS: "https://debuginfod.debian.net; @@ -314,7 +314,7 @@ task: #DEBIAN_FRONTEND=noninteractive apt-get -y install ... 
matrix: -- name: Linux - Debian Bullseye - Autoconf +- name: Linux - Debian Bookworm - Autoconf env: SANITIZER_FLAGS: -fsanitize=address @@ -348,7 +348,7 @@ task: on_failure: <<: *on_failure_ac -- name: Linux - Debian Bullseye - Meson +- name: Linux - Debian Bookworm - Meson env: CCACHE_MAXSIZE: "400M" # tests two different builds @@ -364,24 +364,7 @@ task: build EOF - # Also build & test in a 32bit build - it's gotten rare to test that - # locally. - configure_32_script: | -su postgres <<-EOF - export CC='ccache gcc -m32' - meson setup \ ---buildtype=debug \ --Dcassert=true -Dinjection_points=true \ -${LINUX_MESON_FEATURES} \ --Dllvm=disabled \ ---pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \ --DPERL=perl5.32-i386-linux-gnu \ --DPG_TEST_EXTRA="$PG_TEST_EXTRA" \ -build-32 -EOF - build_script: su postgres -c 'ninja -C build -j${BUILD_JOBS}' - build_32_script: su postgres -c 'ninja -C build-32 -j${BUILD_JOBS}' upload_caches: ccache @@ -393,16 +376,6 @@ task: # so that we don't upload 64bit logs if 32bit fails rm -rf build/ - # There's currently no coverage of icu with LANG=C in the buildfarm. We - # can easily provide some here by running one of the sets of tests that - # way. Newer versions of python insist on changing the LC_CTYPE away - # from C, prevent that with PYTHONCOERCECLOCALE. - test_world_32_script: | -su postgres <<-EOF - ulimit -c unlimited - PYTHONCOERCECLOCALE=0 LANG=C meson test $MTEST_ARGS -C build-32 --num-processes ${TEST_JOBS} -EOF - on_failure: <<: *on_failure_meson @@ -652,7 +625,7 @@ task: env: CPUS: 4 BUILD_JOBS: 4 -IMAGE_FAMILY: pg-ci-bullseye +IMAGE_FAMILY: pg-ci-bookworm # Use larger ccache cache, as this task compiles with multiple compilers / # flag combinations -- 2.39.2 From bcdb3453b50cd37077ad71f012d7f92c9842495f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 19 Oct 2023 04:45:46 +1300 Subject: [PATCH v4 2/3] jit: Require at least LLVM 14, if enabled. Remove support for LLVM versions 10-13. 
The default on all non-EOL'd OSes represented in our build farm will be at least LLVM 14 when PostgreSQL 18 ships. Reviewed-by: Peter Eisentraut Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com --- config/llvm.m4 | 4 +- configure | 4 +- doc/src/sgml/installation.sgml | 4 +- meson.build | 2 +- src/backend/jit/llvm/llvmjit.c | 101 src/backend/jit/llvm/llvmjit_error.cpp | 25 -- src/backend/jit/llvm/llvmjit_inline.cpp | 13 --- src/backend/jit/llvm/llvmjit_wrap.cpp | 4 - 8 files changed, 7 insertions(+), 150 deletions(-) diff --git a/config/llvm.m4 b/conf
Re: Why is citext/regress failing on hamerkop?
On Thu, May 16, 2024 at 10:43 AM Thomas Munro wrote: > Any chance you could test this version please Alexander? Sorry, cancel that. v3 is not good. I assume it fixes the GSSAPI thing and is superficially better, but it doesn't handle code that calls it twice in a row and ignores the first result (I know that PostgreSQL does that occasionally in a few places), and it's also broken if someone gets recv() = 0 (EOF), and then decides to wait anyway. The only ways I can think of to get fully reliable poll()-like semantics are to do that peek every time, OR the complicated patch (per-socket-workspace + intercepting recv etc). So I'm back to v2.
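Incidentally, the "peek" being discussed is the classic way to get a level-triggered EOF indication from a socket. A toy demonstration with ordinary BSD-style sockets (Python on POSIX, so only an analogy for the Winsock case): a one-byte MSG_PEEK recv returns pending data without consuming it, and once the data is drained it reports EOF, and keeps reporting it on repeated calls.

```python
import socket

a, b = socket.socketpair()
b.sendall(b"x")
b.close()                                    # graceful close: data, then EOF

assert a.recv(1, socket.MSG_PEEK) == b"x"    # peek sees data, consumes nothing
assert a.recv(1) == b"x"                     # the data is still there to read
assert a.recv(1, socket.MSG_PEEK) == b""     # now the peek reports EOF...
assert a.recv(1, socket.MSG_PEEK) == b""     # ...and keeps reporting it
a.close()
```

That repeatable EOF report is exactly the level-triggered property that Winsock's one-shot FD_CLOSE lacks, and what the synchronous peek in the patch is simulating.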
Re: Why is citext/regress failing on hamerkop?
On Thu, May 16, 2024 at 9:46 AM Thomas Munro wrote: > Alright, unless anyone has an objection or ideas for improvements, I'm > going to go ahead and back-patch this slightly tidied up version > everywhere. Of course as soon as I wrote that I thought of a useful improvement myself: as far as I can tell, you only need to do the extra poll on the first wait for WL_SOCKET_READABLE for any given WaitEventSet. I don't think it's needed for later waits done by long-lived WaitEventSet objects, so we can track that with a flag. That avoids adding new overhead for regular backend socket waits after authentication; it's just in these code paths that do a bunch of WaitLatchOrSocket() calls that we need to consider FD_CLOSE events lost between the cracks. I also don't know if the condition should include "&& received == 0". It probably doesn't make much difference, but by leaving that off we don't have to wonder how peeking interacts with events, ie if it's a problem that we didn't do the "reset" step. Thinking about that, I realised that I should probably set reset = true in this new return path, just like the "normal" WL_SOCKET_READABLE notification path, just to be paranoid. (Programming computers you don't have requires extra paranoia.) Any chance you could test this version please Alexander? v3-0001-Fix-FD_CLOSE-socket-event-hangs-on-Windows.patch Description: Binary data
Re: Why is citext/regress failing on hamerkop?
On Wed, May 15, 2024 at 6:00 PM Alexander Lakhin wrote: > 15.05.2024 01:26, Thomas Munro wrote: > > OK, so we know what the problem is here. Here is the simplest > > solution I know of for that problem. I have proposed this in the past > > and received negative feedback because it's a really gross hack. But > > I don't personally know what else to do about the back-branches (or > > even if that complex solution is the right way forward for master). > > The attached kludge at least has the [de]merit of being a mirror image > > of the kludge that follows it for the "opposite" event. Does this fix > > it? > > Yes, I see that abandoned GSS connections are closed immediately, as > expected. I have also confirmed that `meson test` with the basic > configuration passes on REL_16_STABLE. So from the outside, the fix > looks good to me. Alright, unless anyone has an objection or ideas for improvements, I'm going to go ahead and back-patch this slightly tidied up version everywhere. v2-0001-Add-kludge-to-make-FD_CLOSE-level-triggered.patch Description: Binary data
Re: Requiring LLVM 14+ in PostgreSQL 18
On Mon, May 13, 2024 at 2:33 AM Peter Eisentraut wrote: > These patches look fine to me. The new cut-off makes sense, and it does > save quite a bit of code. We do need to get the Cirrus CI Debian images > updated first, as you had already written. Thanks for looking! > As part of this patch, you also sneak in support for LLVM 18 > (llvm-config-18, clang-18 in configure). Should this be a separate patch? Yeah, right, I didn't really think too hard about why we have that, and now that you question it... > And as I'm looking up how this was previously handled, I notice that > this list of clang-NN versions was last updated equally sneakily as part > of your patch to trim off LLVM <10 (820b5af73dc). I wonder if the > original intention of that configure code was that maintaining the > versioned list above clang-7/llvm-config-7 was not needed, because the > unversioning programs could be used, or maybe because pkg-config could > be used. It would be nice if we could get rid of having to update that. I probably misunderstood why we were doing that, perhaps something to do with the way some distro (Debian?) was doing things with older versions, and yeah I see that we went a long time after 7 without touching it and nobody cared. Yeah, it would be nice to get rid of it. Here's a patch. Meson didn't have that. From 025f8b0106821b5b6f2ab7992da388404e3e406c Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 15 May 2024 15:51:28 +1200 Subject: [PATCH v3 1/3] jit: Remove configure probes for {lvm-config,clang}-N. Previously we searched for llvm-config-N and clang-N as well as the unversioned names, and maintained a list of expected values of N. There doesn't seem to be any reason to think that the default llvm-config and clang won't be good enough, and if they aren't, they can be overridden with LLVM_CONFIG and CLANG, so let's stop maintaining that list. The Meson build system didn't do that, so this brings them into sync. 
Suggested-by: Peter Eisentraut Discussion: https://postgr.es/m/CA%2BhUKG%2BSOP-aR%3DYF_n0dtXGWeCy6x%2BCn-RMWURU5ySQdmeKW1Q%40mail.gmail.com --- config/llvm.m4 | 4 ++-- configure | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/config/llvm.m4 b/config/llvm.m4 index 44769d819a0..c6cf8858f64 100644 --- a/config/llvm.m4 +++ b/config/llvm.m4 @@ -13,7 +13,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT], AC_REQUIRE([AC_PROG_AWK]) AC_ARG_VAR(LLVM_CONFIG, [path to llvm-config command]) - PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 llvm-config-13 llvm-config-12 llvm-config-11 llvm-config-10) + PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config) # no point continuing if llvm wasn't found if test -z "$LLVM_CONFIG"; then @@ -32,7 +32,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT], # need clang to create some bitcode files AC_ARG_VAR(CLANG, [path to clang compiler to generate bitcode]) - PGAC_PATH_PROGS(CLANG, clang clang-17 clang-16 clang-15 clang-14 clang-13 clang-12 clang-11 clang-10) + PGAC_PATH_PROGS(CLANG, clang) if test -z "$CLANG"; then AC_MSG_ERROR([clang not found, but required when compiling --with-llvm, specify with CLANG=]) fi diff --git a/configure b/configure index 89644f2249e..8e7704d54bd 100755 --- a/configure +++ b/configure @@ -5065,7 +5065,7 @@ if test "$with_llvm" = yes; then : if test -z "$LLVM_CONFIG"; then - for ac_prog in llvm-config llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 llvm-config-13 llvm-config-12 llvm-config-11 llvm-config-10 + for ac_prog in llvm-config do # Extract the first word of "$ac_prog", so it can be a program name with args. 
set dummy $ac_prog; ac_word=$2 @@ -5138,7 +5138,7 @@ $as_echo "$as_me: using llvm $pgac_llvm_version" >&6;} # need clang to create some bitcode files if test -z "$CLANG"; then - for ac_prog in clang clang-17 clang-16 clang-15 clang-14 clang-13 clang-12 clang-11 clang-10 + for ac_prog in clang do # Extract the first word of "$ac_prog", so it can be a program name with args. set dummy $ac_prog; ac_word=$2 -- 2.39.2 From ef1c38dbb1ec8a6a762c88ee50c55a87693d44f6 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 19 Oct 2023 04:45:46 +1300 Subject: [PATCH v3 2/3] jit: Require at least LLVM 14, if enabled. Remove support for LLVM versions 10-13. The default on all non-EOL'd OSes represented in our build farm will be at least LLVM 14 when PostgreSQL 18 ships. Reviewed-by: Peter Eisentraut Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com --- config/llvm.m4 | 4 +- configure | 4 +- doc/src/sgml/installation.sgml | 4 +- meson.build | 2 +- src/backend/jit/llvm/llvmjit.c | 101
Re: CFbot does not recognize patch contents
On Sun, May 12, 2024 at 10:11 PM Tatsuo Ishii wrote: > > I am able to apply all your patches. I found that a similar thing > > happened before [0] and I guess your case is similar. Adding Thomas to > > CC, he may be able to help more. > > Ok. Thanks for the info. This obviously fixed itself automatically soon after this message, but I figured out what happened: I had not actually fixed that referenced bug in cfbot :-(. It was checking for HTTP error codes correctly in the place that reads emails from the archives, but not the place that downloads patches, so in this case I think when it tried to follow the link[1] to download the patch, I guess it must have pulled down a transient Varnish error message (I don't know what, I don't store it anywhere), and tried to apply that as a patch. Oops. Fixed[2]. [1] https://www.postgresql.org/message-id/attachment/160138/v18-0001-Row-pattern-recognition-patch-for-raw-parser.patch [2] https://github.com/macdice/cfbot/commit/ec33a65a877a88befc29ea220e87b98c89b27dcc
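For the record, the shape of that fix is simply to refuse to treat an HTTP error response (or anything that doesn't look like a patch) as patch content before invoking git apply. A sketch of the idea in Python (hypothetical helper names, not cfbot's actual code):

```python
import urllib.request
import urllib.error

def looks_like_patch(text):
    """Cheap sanity check: a transient proxy/Varnish error page is HTML,
    while a real patch starts with something resembling a unified diff
    or a git mbox header."""
    stripped = text.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return first_line.startswith(("diff --git", "From ", "Index:", "--- "))

def fetch_patch(url):
    """Fetch a patch, failing loudly rather than handing junk to git apply."""
    try:
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as e:
        raise RuntimeError(f"HTTP {e.code} while fetching {url}") from e
    if not looks_like_patch(body):
        raise RuntimeError(f"response from {url} does not look like a patch")
    return body
```

The content check is belt-and-braces on top of the status-code check: even a 200 response could theoretically carry an error page, and a loud failure is much easier to diagnose than a mysterious apply failure.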
Re: Why is citext/regress failing on hamerkop?
On Tue, May 14, 2024 at 9:00 PM Alexander Lakhin wrote: > 14.05.2024 03:38, Thomas Munro wrote: > > I was beginning to suspect that lingering odour myself. I haven't > > look at the GSS code but I was imagining that what we have here is > > perhaps not unsent data dropped on the floor due to linger policy > > (unclean socket close on process exist), but rather that the server > > didn't see the socket as ready to read because it lost track of the > > FD_CLOSE somewhere because the client closed it gracefully, and our > > server-side FD_CLOSE handling has always been a bit suspect. I wonder > > if the GSS code is somehow more prone to brokenness. One thing we > > learned in earlier problems was that abortive/error disconnections > > generate FD_CLOSE repeatedly, while graceful ones give you only one. > > In other words, if the other end politely calls closesocket(), the > > server had better not miss the FD_CLOSE event, because it won't come > > again. That's what > > > > https://commitfest.postgresql.org/46/3523/ > > > > is intended to fix. Does it help here? Unfortunately that's > > unpleasantly complicated and unbackpatchable (keeping a side-table of > > socket FDs and event handles, so we don't lose events between the > > cracks). > > Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last > state when that patch set can be applied). Thanks for checking, and generally for your infinite patience with all these horrible Windows problems. OK, so we know what the problem is here. Here is the simplest solution I know of for that problem. I have proposed this in the past and received negative feedback because it's a really gross hack. But I don't personally know what else to do about the back-branches (or even if that complex solution is the right way forward for master). The attached kludge at least has the [de]merit of being a mirror image of the kludge that follows it for the "opposite" event. Does this fix it? 
From cbe4680c3e561b26c0bb49fc39dc8a6f40e84134 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 15 May 2024 10:01:19 +1200 Subject: [PATCH] Add kludge to make FD_CLOSE level-triggered. Winsock only signals an FD_CLOSE event once, if the other end of the socket shuts down gracefully. Because WaitLatchOrSocket() constructs and destroys a new event handle, we can miss the FD_CLOSE event that is signaled just as we're destroying the handle. The next WaitLatchOrSocket() will not see it. Fix that race with some extra polling. We wouldn't need this if we had long-lived event handles for sockets, which has been proposed and tested and shown to work, but it's far too complex to back-patch. --- src/backend/storage/ipc/latch.c | 37 + 1 file changed, 37 insertions(+) diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index a7d88ebb048..9a5273f17d6 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -1999,6 +1999,43 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout, cur_event->reset = false; } + /* + * Because we associate the socket with different event handles at + * different times, and because FD_CLOSE is only generated once the + * other end closes gracefully, we might miss an FD_CLOSE event that + * was signaled already on a handle we've already closed. We close + * that race by synchronously polling for EOF, after adjusting the + * event above and before sleeping below. + * + * XXX If we arranged to have one event handle for the lifetime of a + * socket, we wouldn't need this. + */ + if (cur_event->events & WL_SOCKET_READABLE) + { + char c; + WSABUF buf; + DWORD received; + DWORD flags; + int r; + + buf.buf = &c; + buf.len = 1; + + /* + * Peek to see if EOF condition is true. Don't worry about error + * handling or pending data, just be careful not to consume it. 
+ */ + flags = MSG_PEEK; + if (WSARecv(cur_event->fd, &buf, 1, &received, &flags, NULL, NULL) == 0) + { + occurred_events->pos = cur_event->pos; + occurred_events->user_data = cur_event->user_data; + occurred_events->events = WL_SOCKET_READABLE; + occurred_events->fd = cur_event->fd; + return 1; + } + } + /* * Windows does not guarantee to log an FD_WRITE network event * indicating that more data can be sent unless the previous send() -- 2.44.0
Re: Why is citext/regress failing on hamerkop?
On Tue, May 14, 2024 at 8:17 AM Tom Lane wrote: > I'm not sure whether we've got unsent data pending in the problematic > condition, nor why it'd remain unsent if we do (shouldn't the backend > consume it anyway?). But this has the right odor for an explanation. > > I'm pretty hesitant to touch this area myself, because it looks an > awful lot like commits 6051857fc and ed52c3707, which eventually > had to be reverted. I think we need a deeper understanding of > exactly what Winsock is doing or not doing before we try to fix it. I was beginning to suspect that lingering odour myself. I haven't look at the GSS code but I was imagining that what we have here is perhaps not unsent data dropped on the floor due to linger policy (unclean socket close on process exist), but rather that the server didn't see the socket as ready to read because it lost track of the FD_CLOSE somewhere because the client closed it gracefully, and our server-side FD_CLOSE handling has always been a bit suspect. I wonder if the GSS code is somehow more prone to brokenness. One thing we learned in earlier problems was that abortive/error disconnections generate FD_CLOSE repeatedly, while graceful ones give you only one. In other words, if the other end politely calls closesocket(), the server had better not miss the FD_CLOSE event, because it won't come again. That's what https://commitfest.postgresql.org/46/3523/ is intended to fix. Does it help here? Unfortunately that's unpleasantly complicated and unbackpatchable (keeping a side-table of socket FDs and event handles, so we don't lose events between the cracks).
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
On Sat, May 11, 2024 at 5:00 PM Alexander Lakhin wrote: > 11.05.2024 07:25, Thomas Munro wrote: > > On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin > > wrote: > >> 11.05.2024 06:26, Thomas Munro wrote: > >>> Perhaps a no-image, no-change registered buffer should not be > >>> including an image, even for XLR_CHECK_CONSISTENCY? It's actually > >>> useless for consistency checking too I guess, this issue aside, > >>> because it doesn't change anything so there is nothing to check. > >> Yes, I think something wrong is here. I've reduced the reproducer to: > > Does it reproduce if you do this? > > > > - include_image = needs_backup || (info & > > XLR_CHECK_CONSISTENCY) != 0; > > + include_image = needs_backup || > > + ((info & XLR_CHECK_CONSISTENCY) != 0 && > > +(regbuf->flags & REGBUF_NO_CHANGE) == 0); > > No, it doesn't (at least with the latter, more targeted reproducer). OK so that seems like a candidate fix, but ... > > Unfortunately the back branches don't have that new flag from 00d7fb5e > > so, even if this is the right direction (not sure, I don't understand > > this clean registered buffer trick) then ... but wait, why are there > > are no failures like this in the back branches (yet at least)? Does > > your reproducer work for 16? I wonder if something relevant changed > > recently, like f56a9def. CC'ing Michael and Amit K for info. > > Maybe it's hard to hit (autovacuum needs to process the index page in a > narrow time frame), but locally I could reproduce the issue even on > ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching > hash indexes, didn't dig deeper). ... we'd need to figure out how to fix this in the back-branches too. One idea would be to back-patch REGBUF_NO_CHANGE, and another might be to deduce that case from other variables. 
Let me CC a couple more people from this thread, which most recently hacked on this stuff, to see if they have insights: https://www.postgresql.org/message-id/flat/d2c31606e6bb9b83a02ed4835d65191b38d4ba12.camel%40j-davis.com
Re: Why is citext/regress failing on hamerkop?
On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan wrote: > Well, this is more or less where I came in back in about 2002 :-) I've been > trying to help support it ever since, mainly motivated by stubborn > persistence than anything else. Still, I agree that the lack of support for > the Windows port from Microsoft over the years has been more than > disappointing. I think "state of the Windows port" would make a good discussion topic at pgconf.dev (with write-up for those who can't be there). If there is interest, I could organise that with a short presentation of the issues I am aware of so far and possible improvements and smaller-things-we-could-drop-instead-of-dropping-the-whole-port. I would focus on technical stuff, not who-should-be-doing-what, 'cause I can't make anyone do anything. For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so here's hoping for 100% green on master by Tuesday or so.
Re: Why is citext/regress failing on hamerkop?
On Sat, May 11, 2024 at 1:14 PM Thomas Munro wrote: > Either way, it seems like we'll need to skip that test on Windows if > we want hamerkop to be green. That can probably be cribbed from > collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's > prelude... I just don't know how to explain it in the comment 'cause I > don't know why. Here's a minimal patch like that. I don't think it's worth back-patching. Only 15 and 16 could possibly be affected, I think, because the test wasn't enabled before that. I think this is all just a late-appearing blow-up predicted by the commit that enabled it: commit c2e8bd27519f47ff56987b30eb34a01969b9a9e8 Author: Tom Lane Date: Wed Jan 5 13:30:07 2022 -0500 Enable routine running of citext's UTF8-specific test cases. These test cases have been commented out since citext was invented, because at the time we had no nice way to deal with tests that have restrictions such as requiring UTF8 encoding. But now we do have a convention for that, ie put them into a separate test file with an early-exit path. So let's enable these tests to run when their prerequisites are satisfied. (We may have to tighten the prerequisites beyond the "encoding = UTF8 and locale != C" checks made here. But let's put it on the buildfarm and see what blows up.) Hamerkop is already green on the 15 and 16 branches, apparently because it's using the pre-meson test stuff that I guess just didn't run the relevant test. In other words, nobody would notice the difference anyway, and a master-only fix would be enough to end this 44-day red streak. From 1f0e1dc21d4055a0e5109bac3b290508e2d8 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 12 May 2024 10:20:06 +1200 Subject: [PATCH] Skip the citext_utf8 test on Windows. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On other Windows build farm animals it is already skipped because they don't use UTF-8 encoding. 
On "hamerkop", UTF-8 is used, and then the test fails. It is not clear to me (a non-Windows person looking only at buildfarm evidence) whether Windows is less sophisticated than other OSes and doesn't know how to downcase Turkish İ with the standard Unicode database, or if it is more sophisticated than other systems and uses locale-specific behavior like ICU does. Whichever the reason, the result is the same: we need to skip the test on Windows, just as we already do for ICU, at least until a Windows-savvy developer comes up with a better idea. The technique for detecting the OS is borrowed from collate.windows.win1252.sql. Discussion: https://postgr.es/m/CA%2BhUKGJ1LeC3aE2qQYTK95rFVON3ZVoTQpTKJqxkHdtEyawH4A%40mail.gmail.com --- contrib/citext/expected/citext_utf8.out | 3 +++ contrib/citext/expected/citext_utf8_1.out | 3 +++ contrib/citext/sql/citext_utf8.sql| 3 +++ 3 files changed, 9 insertions(+) diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out index 5d988dcd485..19538db674e 100644 --- a/contrib/citext/expected/citext_utf8.out +++ b/contrib/citext/expected/citext_utf8.out @@ -6,8 +6,11 @@ * Turkish dotted I is not correct for many ICU locales. citext always * uses the default collation, so it's not easy to restrict the test * to the "tr-TR-x-icu" collation where it will succeed. + * + * Also disable for Windows. It fails similarly, at least in some locales. */ SELECT getdatabaseencoding() <> 'UTF8' OR + version() ~ '(Visual C\+\+|mingw32|windows)' OR (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i' FROM pg_database WHERE datname=current_database()) diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out index 7065a5da190..874ec8519e1 100644 --- a/contrib/citext/expected/citext_utf8_1.out +++ b/contrib/citext/expected/citext_utf8_1.out @@ -6,8 +6,11 @@ * Turkish dotted I is not correct for many ICU locales. 
citext always * uses the default collation, so it's not easy to restrict the test * to the "tr-TR-x-icu" collation where it will succeed. + * + * Also disable for Windows. It fails similarly, at least in some locales. */ SELECT getdatabaseencoding() <> 'UTF8' OR + version() ~ '(Visual C\+\+|mingw32|windows)' OR (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i' FROM pg_database WHERE datname=current_database()) diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql index 34b232d64e2..ba283320797 100644 --- a/contrib/citext/sql/citext_utf8.sql +++ b/contrib/citext/sql/citext_utf8.sql @@ -6,9 +6,12 @@ * Turkish dotted I is not correct for many ICU locales. citext always * uses the default collation, so it's not easy to restrict the test * to the "tr-TR-x-icu" collation where it
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin wrote: > 11.05.2024 06:26, Thomas Munro wrote: > > Perhaps a no-image, no-change registered buffer should not be > > including an image, even for XLR_CHECK_CONSISTENCY? It's actually > > useless for consistency checking too I guess, this issue aside, > > because it doesn't change anything so there is nothing to check. > > Yes, I think something wrong is here. I've reduced the reproducer to: Does it reproduce if you do this? - include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0; + include_image = needs_backup || + ((info & XLR_CHECK_CONSISTENCY) != 0 && + (regbuf->flags & REGBUF_NO_CHANGE) == 0); Unfortunately the back branches don't have that new flag from 00d7fb5e so, even if this is the right direction (not sure, I don't understand this clean registered buffer trick) then ... but wait, why are there no failures like this in the back branches (yet at least)? Does your reproducer work for 16? I wonder if something relevant changed recently, like f56a9def. CC'ing Michael and Amit K for info.
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
On Sat, May 11, 2024 at 3:57 AM Andres Freund wrote: > On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote: > > and discovered that XLogRecordAssemble() calculates CRC over a buffer, > > that might be modified by another process. > > If, with "might", you mean that it's legitimate for that buffer to be > modified, I don't think so. The bug is that something is modifying the buffer > despite it being exclusively locked. > > I.e. what we likely have here is a bug somewhere in the hash index code. I don't have a good grip on the higher level locking protocols of hash.c, but one microscopic thing jumps out: /* * bucket buffer was not changed, but still needs to be * registered to ensure that we can acquire a cleanup lock on * it during replay. */ if (!xlrec.is_primary_bucket_page) { uint8 flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_NO_CHANGE; XLogRegisterBuffer(0, bucket_buf, flags); } That registers a buffer that is pinned but not content-locked, and we tell xloginsert.c not to copy its image into the WAL, but it does it anyway because: /* * If needs_backup is true or WAL checking is enabled for current * resource manager, log a full-page write for the current block. */ include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0; So I guess it copies the image on dodo, which has: 'PG_TEST_EXTRA' => 'ssl ldap kerberos wal_consistency_checking libpq_encryption xid_wraparound' Perhaps a no-image, no-change registered buffer should not be including an image, even for XLR_CHECK_CONSISTENCY? It's actually useless for consistency checking too I guess, this issue aside, because it doesn't change anything so there is nothing to check.
Why is citext/regress failing on hamerkop?
For example, 'i'::citext = 'İ'::citext fails to be true. It must now be using UTF-8 (unlike, say, Drongo) and non-C ctype, given that the test isn't skipped. This isn't the first time that we've noticed that Windows doesn't seem to know about İ→i (see [1]), but I don't think anyone has explained exactly why, yet. It could be that it just doesn't know about that in any locale, or that it is locale-dependent and would only do that for Turkish, the same reason we skip the test for ICU, or ... Either way, it seems like we'll need to skip that test on Windows if we want hamerkop to be green. That can probably be cribbed from collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's prelude... I just don't know how to explain it in the comment 'cause I don't know why. One new development in Windows-land is that the system now does actually support UTF-8 in the runtime libraries[2]. You can set it at a system level, or for an application at build time, or by adding ".UTF-8" to a locale name when opening the locale (apparently much more like Unix systems, but I don't know what exactly it does). I wonder why we see this change now... whether hamerkop had that ACP=UTF-8 setting applied on purpose, or computers in Japan started doing that by default instead of using Shift-JIS, or it only started picking UTF-8 around the time of the Meson change somehow, or the initdb-template change. It's a little hard to tell from the logs. [1] https://www.postgresql.org/message-id/CAC%2BAXB10p%2BmnJ6wrAEm6jb51%2B8%3DBfYzD%3Dw6ftHRbMjMuSFN3kQ%40mail.gmail.com [2] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
Re: First draft of PG 17 release notes
On Thu, May 9, 2024 at 4:04 PM Bruce Momjian wrote: > I welcome feedback. For some reason it was an easier job than usual. > 2024-01-25 [820b5af73] jit: Require at least LLVM 10. > Require LLVM version 10 or later (Peter Eisentraut) Peter reviewed, I authored, and I think you intend to list authors in parentheses.
Re: cannot abort transaction 2737414167, it was already committed
On Thu, Dec 28, 2023 at 11:42 AM Tom Lane wrote: > Thomas Munro writes: > > In CommitTransaction() there is a stretch of code beginning s->state = > > TRANS_COMMIT and ending s->state = TRANS_DEFAULT, from which we call > > out to various subsystems' AtEOXact_XXX() functions. There is no way > > to roll back in that state, so anything that throws ERROR from those > > routines is going to get something much like $SUBJECT. Hmm, we'd know > > which exact code path got that EIO from your smoldering core if we'd > > put an explicit critical section there (if we're going to PANIC > > anyway, it might as well not be from a different stack after > > longjmp()...). > > +1, there's basically no hope of debugging this sort of problem > as things stand. I was reminded of this thread by Justin's other file system snafu thread. Naively defining a critical section to match the extent of the TRANS_COMMIT state doesn't work, as a bunch of code under there uses palloc(). That reminds me of the nearby RelationTruncate() thread, and there is possibly even some overlap, plus more in this case... ugh. Hmm, AtEOXact_RelationMap() is one of those steps, but lives just outside the crypto-critical-section created by TRANS_COMMIT, though has its own normal CS for logging. I wonder, given that "updating the map file is effectively commit of the relocation", why wouldn't it have a variant of the problem solved by DELAY_CHKPT_START for normal commit records, under diabolical scheduling? It's a stretch, but: You log XLOG_RELMAP_UPDATE, a concurrent checkpoint runs with REDO after that record, you crash before/during durable_rename(), and then you perform crash recovery. Now your catalog is still using the old relfilenode on the primary, but any replica following along replays XLOG_RELMAP_UPDATE and is using the new relfilenode, frozen in time, for queries, while replaying changes to the old relfilenode. Right?
Re: backend stuck in DataFileExtend
On Wed, May 8, 2024 at 6:54 AM Justin Pryzby wrote: > On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote: > > https://github.com/openzfs/zfs/issues/11641 > > > > I don't know enough to say anything useful about that but it certainly > > smells similar... > > Wow - I'd completely forgotten about that problem report. > The symptoms are the same, even with a zfs version 3+ years newer. > I wish the ZFS people would do more with their problem reports. If I had to guess, my first idea would be that your 1MB or ginormous 16MB recordsize (a relatively new option) combined with PostgreSQL's 8KB block-at-a-time random order I/O patterns are tickling strange corners and finding a bug that no one has seen before. I would imagine that almost everyone in the galaxy who uses very large records does so with 'settled' data that gets streamed out once sequentially (for example I think some of the OpenZFS maintainers are at Lawrence Livermore National Lab where I guess they might pump around petabytes of data produced by particle physics research or whatever it might be, probably why they are also adding direct I/O to avoid caches completely...). But for us, if we have lots of backends reading, writing and extending random 8KB fragments of a 16MB page concurrently (2048 pages per record!), maybe we hit some broken edge... I'd be sure to include that sort of detail in any future reports. Normally I suppress urges to blame problems on kernels, file systems etc and in the past accusations that ZFS was buggy turned out to be bugs in PostgreSQL IIRC, but user space sure seems to be off the hook for this one...
Re: backend stuck in DataFileExtend
On Tue, May 7, 2024 at 6:21 AM Justin Pryzby wrote: > FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org. ... > Yes, they're running centos7 with the indicated kernels. So far we've got: * spurious EIO when opening a file (your previous report) * hanging with CPU spinning (?) inside pwritev() * old kernel, bleeding edge ZFS From an (uninformed) peek at the ZFS code, if it really is spinning there it seems like a pretty low level problem: it's finished the write, and now is just trying to release (something like our unpin) and unlock the buffers, which involves various code paths that might touch various mutexes and spinlocks, and to get stuck like that I guess it's either corrupted itself or it is deadlocking against something else, but what? Do you see any other processes (including kernel threads) with any stuck stacks that might be a deadlock partner? While looking around for reported issues I found your abandoned report against an older ZFS version from a few years ago, same old Linux version: https://github.com/openzfs/zfs/issues/11641 I don't know enough to say anything useful about that but it certainly smells similar... I see you've been busy reporting lots of issues, which seems to involve big data, big "recordsize" (= ZFS block sizes), compression and PostgreSQL: https://github.com/openzfs/zfs/issues?q=is%3Aissue+author%3Ajustinpryzby
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Thu, Feb 15, 2024 at 10:40 PM Anton Voloshin wrote: > On 19/01/2024 01:35, Thomas Munro wrote: > > I don't yet have an opinion on the best way to > > do it though. Would it be enough to add emit_message($node, 0) after > > advance_out_of_record_splitting_zone()? > > Yes, indeed that seems to be enough. At least I could not produce any > more "xl_tot_len zero" failures with that addition. > > I like this solution the best. Oh, it looks like this new build farm animal "skimmer" might be reminding us about this issue: https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skimmer&br=HEAD I don't know why it changed, but since this is an LSN/page alignment thing, it could be due to external things like an OS upgrade adding more locales or something that affects initdb. Will look soon and fix.
Re: Streaming I/O, vectored I/O (WIP)
On Wed, May 1, 2024 at 2:51 PM David Rowley wrote: > On Wed, 24 Apr 2024 at 14:32, David Rowley wrote: > > I've attached a patch with a few typo fixes and what looks like an > > incorrect type for max_ios. It's an int16 and I think it needs to be > > an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do > > anything when max_ios is int16. > > No feedback, so I'll just push this in a few hours unless anyone has anything. Patch looks correct, thanks. Please do. (Sorry, running a bit behind on email ATM... I also have a few more typos around here from an off-list email from Mr Lakhin, will get to that soon...)
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On Tue, Apr 30, 2024 at 7:17 AM Thomas Munro wrote: > On Tue, Apr 30, 2024 at 6:47 AM Maksim Milyutin wrote: > >> Should not we call at the end the StrategyFreeBuffer() function to add > >> target buffer to freelist and not miss it after invalidation? > > > Please take a look at this issue, current implementation of > > EvictUnpinnedBuffer() IMO is erroneous - evicted buffers are lost > > permanently and will not be reused again I don't think that's true: it is not lost permanently, it'll be found by the regular clock hand. Perhaps it should be put on the freelist so it can be found again quickly, but I'm not sure that's a bug, is it? If it were true, even basic testing eg select count(pg_buffercache_evict(bufferid)) from pg_buffercache would leave the system non-functional, but it doesn't, the usual CLOCK algorithm just does its thing.
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On Tue, Apr 30, 2024 at 6:47 AM Maksim Milyutin wrote: >> Should not we call at the end the StrategyFreeBuffer() function to add >> target buffer to freelist and not miss it after invalidation? > Please take a look at this issue, current implementation of > EvictUnpinnedBuffer() IMO is erroneous - evicted buffers are lost permanently > and will not be reused again Hi Maksim, Oops, thanks, will fix.
Re: WIP: Vectored writeback
tive strategy. A ring that starts small, and grows/shrinks in response to dirty data (instead of "rejecting"). That would have at least superficial similarities to the ARC algorithm, the "adaptive" bit that controls ring size (it's interested in recency vs frequency, but here it's more like "we're willing to waste more memory on dirty data, because we need to keep it around longer, to avoid flushing the WAL, but not longer than that" which may be a different dimension to value cached data on, I'm not sure). Of course there must be some workloads/machines where using a strategy (instead of BAS_BULKREAD when it degrades to BAS_NORMAL behaviour) will be slower because of WAL flushes, but that's not a fair fight: the flip side of that coin is that you've trashed the buffer pool, which is an external cost paid by someone else, ie it's anti-social, BufferAccessStrategy's very raison d'être. Anyway, in the meantime, I hacked heapam.c to use BAS_BULKWRITE just to see how it would work with this patch set. (This causes an assertion to fail in some test, something about the stats for different IO contexts that was upset by IOCONTEXT_BULKWRITE, which I didn't bother to debug, it's only a demo hack.) Unsurprisingly, it looks like this: postgres=# delete from t; ... pread(25,...,131072,0xc89e000) = 131072 (0x20000) <-- already committed pread(25,...,131072,0xc8be000) = 131072 (0x20000) read-stream behaviour kill(75954,SIGURG) = 0 (0x0) <-- hey WAL writer! pread(25,...,131072,0xc8de000) = 131072 (0x20000) pread(25,...,131072,0xc8fe000) = 131072 (0x20000) ... pwrite(25,...,131072,0x15200000) = 131072 (0x20000) <-- write-behind! pwrite(25,...,131072,0x151e0000) = 131072 (0x20000) pwrite(25,...,131072,0x151c0000) = 131072 (0x20000) ... UPDATE and INSERT conceptually work too, but they suffer from other stupid page-at-a-time problems around extension so it's more fun to look at DELETE first.
The whole write-behind notion, and the realisation that we already have it and should just make it into a "real thing", jumped out at me while studying Melanie's VACUUM pass 1 and VACUUM pass 2 patches for adding read streams. Rebased and attached here. That required hacking on the new tidstore.c stuff a bit. (We failed to get the VACUUM read stream bits into v17, but the study of that led to the default BAS_VACUUM size being cranked up to reflect modern realities, and generally sent me down this rabbit hole for a while.) Some problems: * If you wake the WAL writer more often, throughput might actually go down on high latency storage due to serialisation of WAL flushes. So far I have declined to try to write an adaptive algorithm to figure out whether to do it, and where the threshold should be. I suspect it might involve measuring time and hill-climbing... One option is to abandon this part (ie just do no worse than master at WAL flushing), or at least consider that a separate project. * This might hold too many pins! It does respect the limit mechanism, but that can let you have a lot of pins (it's a bit TOCTOU-racy too, we might need something smarter). One idea would be to release pins while writes are in the LSN queue, and reacquire them with ReadRecentBuffer() as required, since we don't really care if someone else evicts them in the meantime. * It seems a bit weird that we *also* have the WritebackContext machinery. I could probably subsume that whole mechanism into write_stream.c. If you squint, sync_file_range() is a sort of dual of POSIX_FADV_WILLNEED, which the read counterpart looks after. * I would like to merge Heikki's bulk write stuff into this somehow, not yet thought about it much. The patches are POC-quality only and certainly have bugs/missed edge cases/etc. Thoughts, better ideas, references to writing about this problem space, etc, welcome. 
From 63cb8f88fd65ef34536c7d4360b964ca5e6cf62d Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 25 Apr 2024 23:45:48 +1200 Subject: [PATCH v2 01/11] Teach WritebackContext to work with block ranges. Instead of having to feed it one block at a time, allow writeback of ranges to be scheduled, in preparation for I/O combining. --- src/backend/storage/buffer/bufmgr.c | 28 src/include/storage/buf_internals.h | 4 +++- 2 files changed, 19 insertions(+), 13 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 49637284f91..d7e434daaf1 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1992,7 +1992,7 @@ again: LWLockRelease(content_lock); ScheduleBufferTagForWriteback(&wb_context, io_context, - &buf_hdr->tag); + &buf_hdr->tag, 1); } @@ -3486,7 +3486,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context) * SyncOneBuffer() is only called by checkpointer and bgwriter, so * IOContext will always be IOCONT
Re: Requiring LLVM 14+ in PostgreSQL 18
Rebased over ca89db5f. I looked into whether we could drop the "old pass manager" code too[1]. Almost, but nope, even the C++ API lacks a way to set the inline threshold before LLVM 16, so that would cause a regression. Although we just hard-code the threshold to 512 with a comment that sounds like it's pretty arbitrary, a change to the default (225?) would be unjustifiable just for code cleanup. Oh well. [1] https://github.com/macdice/postgres/commit/0d40abdf1feb75210c3a3d2a35e3d6146185974c v2-0001-jit-Require-at-least-LLVM-14-if-enabled.patch Description: Binary data v2-0002-jit-Use-opaque-pointers-in-all-supported-LLVM-ver.patch Description: Binary data
Re: fix tablespace handling in pg_combinebackup
On Tue, Apr 23, 2024 at 8:05 AM Robert Haas wrote: > I reworked the test cases so that they don't (I think) rely on > symlinks working as they do on normal platforms. Cool. (It will remain a mystery for now why perl readlink() can't read the junction points that PostgreSQL creates (IIUC), but the OS can follow them and PostgreSQL itself can read them with apparently similar code. I find myself wondering if symlinks should go on the list of "things we pretended Windows had out of convenience, that turned out to be more inconvenient than we expected, and we'd do better to tackle head-on with a more portable idea". Perhaps we could just use a tablespace map file instead to do our own path construction, or something like that. I suspect that would be one of those changes that is technically easy, but community-wise hard as it affects a load of backup tools and procedures...)
Re: fix tablespace handling in pg_combinebackup
On Mon, Apr 22, 2024 at 12:00 AM Alexander Lakhin wrote: > From what I can see, the following condition (namely, -l): > if ($path =~ /^pg_tblspc\/(\d+)$/ && -l "$backup_path/$path") > { > push @tsoids, $1; > return 0; > } > > is false for junction points on Windows (cf [1]), but the target path is: Ah, yes, right, -l doesn't like junction points. Well, we're already using the Win32API::File package (it looks like a CPAN library, but I guess the Windows perl distros like Strawberry are all including it for free?). See PostgreSQL::Test::Utils::is_symlink(), attached. That seems to work as expected, but someone who actually knows perl can surely make it better. Then I hit the next problem: readlink C:\cirrus\build/testrun/pg_combinebackup/002_compare_backups\data/t_002_compare_backups_primary_data/backup/backup1/pg_tblspc/16415: Inappropriate I/O control operation at C:/cirrus/src/test/perl/PostgreSQL/Test/Cluster.pm line 927. https://cirrus-ci.com/task/5162332353986560 I don't know where exactly that error message is coming from, but assuming that Strawberry Perl contains this code: https://github.com/Perl/perl5/blob/f936cd91ee430786a1bb6068a4a7c8362610dd5f/win32/win32.c#L2041 https://github.com/Perl/perl5/blob/f936cd91ee430786a1bb6068a4a7c8362610dd5f/win32/win32.c#L1976 ... then it's *very* similar to what we're doing in our own pgreadlink() code. I wondered if the buffer size might be too small for our path, but it doesn't seem so: https://github.com/Perl/perl5/blob/f936cd91ee430786a1bb6068a4a7c8362610dd5f/win32/win32.c#L1581C1-L1581C35 (I think MAX_PATH is 256 on Windows.) If there is some unfixable problem with what they're doing in their readlink(), then I guess it should be possible to read the junction point directly in Perl using Win32API::File::DeviceIoControl()... but I can't see what's wrong with it! Maybe it's not the same code? Attached are the new test support functions, and the fixup to Robert's 6bf5c42b that uses them. 
To be clear, this doesn't work, yet. It has got to be close though... 0001-More-Windows-pseudo-symlink-support-for-Perl-code.patch Description: Binary data 0002-fixup.patch Description: Binary data
Re: fix tablespace handling in pg_combinebackup
I don't know how to fix 82023d47^ on Windows[1][2], but in case it's useful, here's a small clue. I see that Perl's readlink() does in fact know how to read "junction points" on Windows[3], so if that was the only problem it might have been very close to working. One difference is that our own src/port/dirmod.c's pgreadlink() also strips \??\ from the returned paths (something to do with the various forms of NT path), but when I tried that: my $olddir = readlink("$backup_path/pg_tblspc/$tsoid") || die "readlink $backup_path/pg_tblspc/$tsoid: $!"; + # strip NT path prefix (see src/port/dirmod.c pgreadlink()) + $olddir =~ s/^\\\?\?\\// if $PostgreSQL::Test::Utils::windows_os; ... it still broke[4]. So I'm not sure what's going on... [1] https://github.com/postgres/postgres/runs/24040897199 [2] https://api.cirrus-ci.com/v1/artifact/task/5550091866472448/testrun/build/testrun/pg_combinebackup/002_compare_backups/log/002_compare_backups_pitr1.log [3] https://github.com/Perl/perl5/blob/f936cd91ee430786a1bb6068a4a7c8362610dd5f/win32/win32.c#L2041 [4] https://cirrus-ci.com/task/6746621638082560
Re: cfbot is failing all tests on FreeBSD/Meson builds
On Fri, Apr 19, 2024 at 10:36 AM Thomas Munro wrote: > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276535 > > So perhaps it's time for me to undo what I did before... looking now. It turned out that I still needed the previous work-around, but I was too clever for my own boots last time. For the record, here is the new change to the image building script: https://github.com/anarazel/pg-vm-images/commit/faff91cd40d6af0cbc658f5c11da47e2aa88d332 I should have listened to Bilal's prediction[1] about this the first time. But this time, I know that the real fix is coming in the next package very soon, per bugzilla link above. One thing I noticed is that 010_tab_completion is actually being skipped, with that fix. It used to run, I'm sure. Not sure why, but I'll look more seriously when the 1.21 or whatever shows up. At least we should soon have green CI again in the meantime. [1] https://github.com/anarazel/pg-vm-images/pull/92
Re: Allow tests to pass in OpenSSL FIPS mode
On Fri, Apr 19, 2024 at 4:00 PM Tom Lane wrote: > Thomas Munro writes: > > Probably not this thread's fault, but following the breadcrumbs to the > > last thread to touch the relevant test lines in > > authentication/001_password, is it expected that we have these > > warnings? > > > psql:<stdin>:1: WARNING: roles created by regression test cases > > should have names starting with "regress_" > I think the policy is that we enforce that for cases reachable > via "make installcheck" (to avoid possibly clobbering global > objects in a live installation), but not for cases only reachable > via "make check", such as TAP tests. So I'm not that concerned > about this, although if someone is feeling anal enough to rename > the test role I won't stand in the way. Got it, thanks. Not me, just asking.
Re: Allow tests to pass in OpenSSL FIPS mode
On Sat, Nov 18, 2023 at 7:46 AM Peter Eisentraut wrote: > All done, thanks. Probably not this thread's fault, but following the breadcrumbs to the last thread to touch the relevant test lines in authentication/001_password, is it expected that we have these warnings? psql:<stdin>:1: WARNING: roles created by regression test cases should have names starting with "regress_"
Re: AIX support
On Fri, Apr 19, 2024 at 6:01 AM Andres Freund wrote: > On 2024-04-18 11:15:43 +, Sriram RK wrote: > > We (IBM-AIX team) looked into this issue > > > > https://www.postgresql.org/message-id/20240225194322...@rfd.leadboat.com > > > > This is related to the compiler issue. The compilers xlc(13.1) and gcc(8.0) > > have issues. But we verified that this issue is resolved with the newer > > compiler versions openXL(xlc17.1) and gcc(12.0) onwards. > > The reason we used these compilers was that those were the only ones we had > kinda somewhat reasonable access to, via the gcc projects' > "compile farm" https://portal.cfarm.net/ > We have to rely on whatever the aix machines there provide. They're not > particularly plentiful resource-wise either. To be fair, those OSUOSL machines are donated by IBM: https://osuosl.org/services/powerdev/ It's just that they seem to be mostly focused on supporting Linux on POWER, with only a couple of AIX hosts (partitions/virtual machines?) made available via portal.cfarm.net, and they only very recently added a modern AIX 7.3 host. That's cfarm119, upgraded in September-ish, long after many threads on this list that between-the-lines threatened to drop support. > This is generally one of the big issues with AIX support. There are other > niche-y OSs that don't have a lot of users, e.g. openbsd, but in contrast to > AIX I can just start an openbsd VM within a few minutes and iron out whatever > portability issue I'm dealing with. Yeah. It is a known secret that you can run AIX inside Qemu/kvm (it appears that IBM has made changes to it to make that possible, because earlier AIX versions didn't like Qemu's POWER emulation or virtualisation, there are blog posts about it), but IBM doesn't actually make the images available to non-POWER-hardware owners (you need a serial number). 
If I were an OS vendor and wanted developers to target my OS for free, at the very top of my TODO list I would have: provide an easy to use image for developers to be able to spin something up in minutes and possibly even use in CI systems. That's the reason I can fix any minor portability issue on Linux, illumos, *BSD quickly and Windows with only moderate extra pain. Even Oracle knows this, see Solaris CBE. > > We want to make a note that postgres is used extensively in our IBM product > > and is being exploited by multiple customers. > > To be blunt: Then it'd have been nice to see some investment in that before > now. Both on the code level and the infrastructure level (i.e. access to > machines etc). In the AIX space generally, there were even clues that funding had been completely cut even for packaging PostgreSQL. I was aware of two packaging projects (not sure how they were related): 1. The ATOS packaging group, who used to show up on our mailing lists and discuss code changes, which announced it was shutting down: https://github.com/power-devops/bullfreeware 2. And last time I looked a few months back, the IBM AIX Toolbox packaging project only had PostgreSQL 10 or 11 packages, already out of support by us, meaning that their maintainer had given up, too: https://www.ibm.com/support/pages/aix-toolbox-open-source-software-downloads-alpha However I see that recently (last month?) someone has added PostgreSQL 15, so something has only just reawoken there? There are quite a lot of threads about AIX problems, but they are almost all just us non-AIX-users trying to unbreak stupid stuff on the build farm, which at some points began to seem distinctly quixotic: chivalric hackers valiantly trying to keep the entire Unix family tree working even though we don't remember why and the versions involved are out of support even by the vendor.
Of the three old giant commercial Unixes, HP-UX was dropped without another mention (it really was a windmill after all), Solaris is somehow easier to deal with (I could guess it's because it influenced Linux and BSD so much, ELF and linker details spring to mind), while AIX fails on every dimension: unrepresented by users, lacking in build farm, unavailable to non-customers, and unusual among Unixen.
Re: cfbot is failing all tests on FreeBSD/Meson builds
On Thu, Feb 8, 2024 at 3:53 PM Tom Lane wrote: > Thomas Munro writes: > > On Tue, Jan 30, 2024 at 5:06 PM Tom Lane wrote: > >> Thomas Munro writes: > >>> Ahh, there is one: https://github.com/cpan-authors/IO-Tty/issues/38 > > >> Just for the archives' sake: I hit this today on a fresh install > >> of FreeBSD 14.0, which has pulled in p5-IO-Tty-1.18. Annoying... > > > FWIW here's what I did to downgrade: > > Thanks for the recipe --- this worked for me, although I noticed > it insisted on installing perl5.34-5.34.3_3 alongside 5.36. > Doesn't seem to be a problem though --- the main perl installation > is still 5.36. Looks like CI is broken in this way again, as of ~13 hours ago. Looking into that... > Also, I'm not entirely convinced that the above-cited issue report is > complaining about the same thing that's biting us. The reported error > messages are completely different. You could be right about that. It seems there was a clash between an upstream commit and a patch in FBSD's port tree, which has just been removed: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276535 So perhaps it's time for me to undo what I did before... looking now.
Re: Solaris tar issues, or other reason why margay fails 010_pg_basebackup?
On Fri, Apr 19, 2024 at 12:57 AM Marcel Hofstetter wrote: > SKIP_READLINE_TESTS works. margay is now green again. Great! FTR there was a third thing revealed by margay since you enabled the TAP tests: commit e2a23576. I would guess that the best chance of getting the readline stuff to actually work would be to interest someone who hacks on IPC::Run-and-related-stuff (*cough* Noah *cough*) and who has Solaris access to look at that... I would guess it needs a one-line fix relating to raw/cooked behaviour, but as the proverbial mechanic said, most of the fee is for knowing where to hit it...
Re: Cannot find a working 64-bit integer type on Illumos
On Thu, Apr 18, 2024 at 6:09 PM Japin Li wrote: > /home/japin/postgres/build/../src/common/config_info.c:198:11: error: > comparison of integer expressions of different signedness: 'int' and 'size_t' > {aka 'long unsigned int'} [-Werror=sign-compare] > 198 | Assert(i == *configdata_len); Right, PostgreSQL doesn't compile cleanly with the "sign-compare" warning. There have been a few threads about that, and someone might want to think harder about it, but it's a different topic unrelated to <stdint.h>. > /home/japin/postgres/build/../src/include/lib/simplehash.h:1138:9: error: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'uint64' {aka 'long unsigned int'} [-Werror=format=] It seems my v1 patch's configure probe for INT64_FORMAT was broken. In the v2 patch I tried not doing that probe at all, and instead inviting <inttypes.h> into our world (that's the standardised way to produce format strings, which has the slight complication that we are intercepting printf calls...). I suspect that'll work better for you.
Re: Cannot find a working 64-bit integer type on Illumos
On Thu, Apr 18, 2024 at 8:47 PM Peter Eisentraut wrote: > Maybe this means something like our int64 is long long int but the > system's int64_t is long int underneath, but I don't see how that would > matter for the limit macros. Agreed, so I don't think it's long vs long long (when they have the same width). I wonder if this comment is a clue: static char * inet_net_ntop_ipv6(const u_char *src, int bits, char *dst, size_t size) { /* * Note that int32_t and int16_t need only be "at least" large enough to * contain a value of the specified size. On some systems, like Crays, * there is no such thing as an integer variable with 16 bits. Keep this * in mind if you think this function should have been coded to use * pointer overlays. All the world's not a VAX. */ I'd seen that claim before somewhere else but I can't recall where. So there were systems using those names in an ad hoc unspecified way before C99 nailed this stuff down? In modern C, int32_t is definitely an exact width type (but there are other standardised variants like int_fast32_t to allow for Cray-like systems that would prefer to use a wider type, ie "at least", 32 bits wide, so I guess that's what happened to that idea?). Or perhaps it's referring to worries about the width of char, short, int or the assumption of two's-complement. I think if any of that stuff weren't as assumed we'd have many problems in many places, so I'm not seeing a problem. (FTR C23 finally nailed down two's-complement as a requirement, and although C might not say so, POSIX says that char is a byte, and our assumption that int = int32_t is pretty deeply baked into PostgreSQL, so it's almost impossible to imagine that short has a size other than 16 bits; but these are all assumptions made by the OLD coding, not by the patch I posted). In short, I guess that isn't what was meant.
Re: Cannot find a working 64-bit integer type on Illumos
On Thu, Apr 18, 2024 at 8:47 PM Peter Eisentraut wrote: > I'm not sure I understand the problem here. Do you mean that in theory > a platform's PRId64 could be something other than "l" or "ll"? Yes. I don't know why anyone would do that, and the systems I checked all have the obvious definitions, eg "ld", "lld" etc. Perhaps it's an acceptable risk? It certainly gives us a tidier result. For discussion, here is a variant that fully embraces <inttypes.h> and the PRI*64 macros. v2-0001-Use-int64_t-support-from-stdint.h-and-inttypes.h.patch Description: Binary data v2-0002-Remove-traces-of-BeOS.patch Description: Binary data
Re: Cannot find a working 64-bit integer type on Illumos
On Sat, Mar 23, 2024 at 3:23 PM Tom Lane wrote: > Thomas Munro writes: > > . o O ( int64_t, PRId64, etc were standardised a quarter of a century ago ) > > Yeah. Now that we require C99 it's probably reasonable to assume > that those things exist. I wouldn't be in favor of ripping out our > existing notations like UINT64CONST, because the code churn would be > substantial and the gain minimal. But we could imagine reimplementing > that stuff atop <stdint.h> and then getting rid of the configure-time > probes. I played around with this a bit, but am not quite there yet. printf() is a little tricky. The standard wants us to use <inttypes.h>'s PRId64 etc, but that might confuse our snprintf.c (in theory, probably not in practice). "ll" should have the right size on all systems, but gets warnings from the printf format string checker on systems where "l" is the right type. So I think we still need to probe for INT64_MODIFIER at configure-time. Here's one way, but I can see it's not working on Clang/Linux... perhaps instead of that dubious incantation I should try compiling some actual printfs and check for warnings/errors. I think INT64CONST should just point to standard INT64_C(). For limits, why do we have this: - * stdint.h limits aren't guaranteed to have compatible types with our fixed - * width types. So just define our own. ? I mean, how could they not have compatible types? I noticed that configure.ac checks if int64 (no "_t") might be defined already by system header pollution, but meson.build doesn't. That's an inconsistency that should be fixed, but which way? Hmm, commit 15abc7788e6 said that was done for BeOS, which we de-supported. So maybe we should get rid of that? 0001-Use-int64_t-and-friends-from-stdint.h.patch Description: Binary data 0002-Remove-traces-of-BeOS.patch Description: Binary data
Re: Solaris tar issues, or other reason why margay fails 010_pg_basebackup?
On Thu, Apr 18, 2024 at 1:40 AM Marcel Hofstetter wrote: > Using gnu tar helps to make pg_basebackup work. Thanks! I guess that'll remain a mystery. > It fails now at a later step. Oh, this rings a bell: [14:54:58] t/010_tab_completion.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00) We had another thread[1] where we figured out that Solaris's termios defaults include TABDLY=TAB3, meaning "expand tabs to spaces on output", and that was upsetting our tab-completion test. Other Unixes used to vary on this point too, but they all converged on not doing that, except Solaris, apparently. Perhaps IPC::Run could fix that by calling ->set_raw() on the pseudo-terminal, but I'm not very sure about that. This test suite is passing on pollock because it doesn't have IO::Pty installed. Could you try uninstalling that perl package for now, so we can see what breaks next? [06:34:40] t/010_tab_completion.pl .. skipped: IO::Pty is needed to run this test [1] https://www.postgresql.org/message-id/flat/MEYP282MB1669E2E11495A2DEAECE8736B6A7A%40MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
Re: Solaris tar issues, or other reason why margay fails 010_pg_basebackup?
On Wed, Apr 17, 2024 at 7:17 PM Marcel Hofstetter wrote: > Is there a way to configure which tar to use? > > gnu tar would be available. > > -bash-5.1$ ls -l /usr/gnu/bin/tar > -r-xr-xr-x 1 root bin 1226248 Jul 1 2022 /usr/gnu/bin/tar Cool. I guess you could fix the test either by setting TAR=/usr/gnu/bin/tar or PATH=/usr/gnu/bin:$PATH. If we want to understand *why* it doesn't work, someone would need to dig into that. It's possible that PostgreSQL is using some GNU extension (if so, apparently the BSDs' tar is OK with it too, and I guess AIX's and HP-UX's were too, back before we dropped those OSes). I vaguely recall (maybe 20 years ago, time flies) that Solaris tar wasn't able to extract some tarballs but I can't remember why... I'm also happy to leave it at "Sun's tar doesn't work for us, we don't know why" if you are.
Solaris tar issues, or other reason why margay fails 010_pg_basebackup?
Hi, I noticed that margay (Solaris) has started running more of the tests lately, but is failing in pg_basebackup/010_pg_basebackup. It runs successfully on wrasse (in older branches, Solaris 11.3 is desupported in 17/master), and also on pollock (illumos, forked from common ancestor Solaris 10 while it was open source). Hmm, wrasse is using "/opt/csw/bin/gtar xf ..." and pollock is using "/usr/gnu/bin/tar xf ...", while margay is using "/usr/bin/tar xf ...". The tar command is indicating success (it's run by system_or_bail and it's not bailing), but the replica doesn't want to come up: pg_ctl: directory "/home/marcel/build-farm-15/buildroot/HEAD/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_replica_data/pgdata" is not a database cluster directory" So one idea would be that our tar format is incompatible with Sun tar in some way that corrupts the output, or there is still some difference in the nesting of the directory structure it creates, or something like that. I wonder if this is already common knowledge in the repressed memories of this list, but I couldn't find anything specific. I'd be curious to know why exactly, if so (in terms of POSIX conformance etc, who is doing something wrong).
s/shm_mq_iovec/struct iovec/
Hi, I was grepping for iovec users and noticed that the shm_mq stuff defines its own iovec struct. Is there any reason not to use the standard one, now that we can? Will add to next commitfest. From 20b44cab0bb9f6218270aa0ae150ac0e560b49fe Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 15 Apr 2024 10:11:10 +1200 Subject: [PATCH] Use standard iovec in shm_mq.h interface. Commit 2bd9e412 (2014) introduced "shm_mq_iovec", perhaps because POSIX struct iovec was missing on Windows. Commit 13a021f3 (2021) introduced a standard-conforming replacement for Windows, for use in various other parts of the tree. Now that we can, we might as well use the standard struct in the shared memory queue API too. --- src/backend/libpq/pqmq.c | 10 +- src/backend/storage/ipc/shm_mq.c | 29 - src/include/storage/shm_mq.h | 10 ++ src/tools/pgindent/typedefs.list | 1 - 4 files changed, 23 insertions(+), 27 deletions(-) diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c index 00a44ca803f..d737663e230 100644 --- a/src/backend/libpq/pqmq.c +++ b/src/backend/libpq/pqmq.c @@ -117,7 +117,7 @@ mq_is_send_pending(void) static int mq_putmessage(char msgtype, const char *s, size_t len) { - shm_mq_iovec iov[2]; + struct iovec iov[2]; shm_mq_result result; /* @@ -147,10 +147,10 @@ mq_putmessage(char msgtype, const char *s, size_t len) pq_mq_busy = true; - iov[0].data = - iov[0].len = 1; - iov[1].data = s; - iov[1].len = len; + iov[0].iov_base = + iov[0].iov_len = 1; + iov[1].iov_base = unconstify(char *, s); + iov[1].iov_len = len; Assert(pq_mq_handle != NULL); diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c index 9235fcd08ec..0d73bbf1879 100644 --- a/src/backend/storage/ipc/shm_mq.c +++ b/src/backend/storage/ipc/shm_mq.c @@ -21,6 +21,7 @@ #include "miscadmin.h" #include "pgstat.h" #include "port/pg_bitutils.h" +#include "port/pg_iovec.h" #include "postmaster/bgworker.h" #include "storage/shm_mq.h" #include "storage/spin.h" @@ -329,10 +330,10 @@ 
shm_mq_result shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait, bool force_flush) { - shm_mq_iovec iov; + struct iovec iov; - iov.data = data; - iov.len = nbytes; + iov.iov_base = unconstify(void *, data); + iov.iov_len = nbytes; return shm_mq_sendv(mqh, , 1, nowait, force_flush); } @@ -358,7 +359,7 @@ shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait, * ring size. */ shm_mq_result -shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait, +shm_mq_sendv(shm_mq_handle *mqh, struct iovec *iov, int iovcnt, bool nowait, bool force_flush) { shm_mq_result res; @@ -374,7 +375,7 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait, /* Compute total size of write. */ for (i = 0; i < iovcnt; ++i) - nbytes += iov[i].len; + nbytes += iov[i].iov_len; /* Prevent writing messages overwhelming the receiver. */ if (nbytes > MaxAllocSize) @@ -423,9 +424,9 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait, Size chunksize; /* Figure out which bytes need to be sent next. */ - if (offset >= iov[which_iov].len) + if (offset >= iov[which_iov].iov_len) { - offset -= iov[which_iov].len; + offset -= iov[which_iov].iov_len; ++which_iov; if (which_iov >= iovcnt) break; @@ -442,16 +443,17 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait, * MAXALIGN'd. 
*/ if (which_iov + 1 < iovcnt && - offset + MAXIMUM_ALIGNOF > iov[which_iov].len) + offset + MAXIMUM_ALIGNOF > iov[which_iov].iov_len) { char tmpbuf[MAXIMUM_ALIGNOF]; int j = 0; for (;;) { -if (offset < iov[which_iov].len) +if (offset < iov[which_iov].iov_len) { - tmpbuf[j] = iov[which_iov].data[offset]; + char *data = iov[which_iov].iov_base; + tmpbuf[j] = data[offset]; j++; offset++; if (j == MAXIMUM_ALIGNOF) @@ -459,7 +461,7 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait, } else { - offset -= iov[which_iov].len; + offset -= iov[which_iov].iov_len; which_iov++; if (which_iov >= iovcnt) break; @@ -487,10 +489,11 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait, * isn't a multiple of MAXIMUM_ALIGNOF. Otherwise, we need to * MAXALIGN_DOWN the write size. */ - chunksize = iov[which_iov].len - offset; + chunksize = iov[which_iov].iov_len - offset; if (which_iov + 1 < iovcnt) chunksize = MAXALIGN_DOWN(chunksize); - res = shm_mq_send_bytes(mqh, chunksize, [which_iov].data[offset], + res = shm_mq_send_bytes(mqh, chunksize, +(char *) iov[which_iov].iov_base + offset, nowait
Re: Cleaning up threading code
On Mon, Jul 3, 2023 at 8:43 PM Heikki Linnakangas wrote: > On 10/06/2023 05:23, Thomas Munro wrote: > > 2. I don't like the way we have to deal with POSIX vs Windows at > > every site where we use threads, and each place has a different style > > of wrappers. I considered a few different approaches to cleaning this > > up: > > > > * provide centralised and thorough pthread emulation for Windows; I > > don't like this, I don't even like all of pthreads and there are many > > details to get lost in > > * adopt C11 <threads.h>; unfortunately it is too early, so you'd need > > to write/borrow replacements for at least 3 of our 11 target systems > > * invent our own mini-abstraction for a carefully controlled subset of stuff > > Google search on "c11 threads on Windows" found some emulation wrappers: > https://github.com/jtsiomb/c11threads and > https://github.com/tinycthread/tinycthread, for example. Would either of > those work for us? > > Even if we use an existing emulation wrapper, I wouldn't mind having our > own pg_* abstraction on top of it, to document which subset of the POSIX > or C11 functions we actually use. Yeah. I am still interested in getting our thread API tidied up, and I intend to do that for PG 18. The patch I posted on that front, which can be summarised as a very lightweight subset of standard <threads.h> except with pg_ prefixes everywhere mapping to Windows or POSIX threads, still needs one tiny bit more work: figuring out how to get the TLS destructors to run on Windows FLS or similar, or implementing destructors myself (= little destructor callback list that a thread exit hook would run, work I was hoping to avoid by using something from the OS libs... I will try again soon). Andres also opined upthread that we should think about offering a thread_local storage class and getting away from TLS with keys. 
One thing to note is that the ECPG code is using TLS with destructors (in fact they are b0rked in master, see git grep "FIXME: destructor", so ECPG leaks memory on Windows; the thing that I'm trying to fix in pg_threads.h is actually fixing a long-standing bug), and although thread_local has destructors in C++ it doesn't in C, so if we decided to add the storage class but not bother with the tss_create feature, then ECPG would need to do cleanup another way. I will look into that option. One new development since last time I wrote the above stuff is that the Microsoft toolchain finally implemented the library components of C11 <threads.h>: https://devblogs.microsoft.com/cppblog/c11-threads-in-visual-studio-2022-version-17-8-preview-2/ It seems like it would be a long time before we could contemplate using that stuff though, especially given our duty of keeping libpq and ECPG requirements low and reasonable. However, it seems fairly straightforward to say that we require C99 + some way to access a C11-like thread-local storage class. In practice that means a pg_thread_local macro that points to MSVC __declspec(thread) (which works since MSVC 2014?) or GCC/Clang _Thread_local (which works since GCC 4.9?) or whatever. Of course every serious toolchain supports this stuff somehow or other, having been required to for C11. I can't immediately see any build farm animals that would be affected. Grison's compiler info is out of date, it's really running 8.something. The old OpenBSD GCC 4.2 animal was upgraded, and antique AIX got the boot: that's not even a coincidence, since those retirements came about because those systems didn't support arbitrary alignment, another C11 feature that we now effectively require. (We could have worked around it if we had to, but they just weren't reasonable targets.) 
So I'll go ahead and add the storage class to the next version, and contemplate a couple of different options for the tss stuff, including perhaps leaving it out if that seems doable. https://stackoverflow.com/questions/29399494/what-is-the-current-state-of-support-for-thread-local-across-platforms
Re: In MacOS, psql reacts on SIGINT in a strange fashion (Linux is fine)
On Sun, Apr 14, 2024 at 11:49 AM Tom Lane wrote: > Dmitry Koterov writes: > > I wish it was zsh... I tested it with zsh, but with bash (and with > > low-level kill syscall), I observed the same effect unfortunately. > > > So it's still a puzzle. > > > 1. Even more, when I send a kill() low-level syscall using e.g. Perl - perl > > -e 'kill("INT", 16107)' - it is the same. > > 2. If any other program but psql is used (e.g. vim or my custom Perl script > > which ignores SIGINT), the effect is not reproducible. > > > Can it be e.g. readline? Or something related to tty or session settings > > which psql could modify (I did not find any in the source code though). > > OK, I tried dtruss'ing psql on macOS. What I see is that with > Apple's libedit, the response to SIGINT includes this: > > kill(0, 2) = 0 0 > > that is, "SIGINT my whole process group". If I build with libreadline > from MacPorts, I just see > > kill(30902, 2) = 0 0 > > (30902 being the process's own PID). I'm not real sure why either > library finds it necessary to re-signal the process --- maybe they > trap SIGINT and then try to hide that by reinstalling the app's > normal SIGINT handler and re-delivering the signal. But anyway, > libedit seems to be vastly exceeding its authority here. If > signaling the whole process group were wanted, it would have been > the responsibility of the original signaller to do that. Probably to intercept SIGWINCH -- I assume it's trying to propagate the signal to the "real" signal handler, but it's propagating it a little too far. https://github.com/NetBSD/src/blob/1de18f216411bce77e26740327b0782976a89965/lib/libedit/sig.c#L110
Re: In MacOS, psql reacts on SIGINT in a strange fashion (Linux is fine)
On Sun, Apr 14, 2024 at 11:18 AM Dmitry Koterov wrote: > Can it be e.g. readline? Or something related to tty or session settings > which psql could modify (I did not find any in the source code though). I was wondering about that. Are you using libedit or libreadline? What happens if you build without readline/edit support? From a quick glance at libedit, it does a bunch of signal interception, but I didn't check the details. It is interested in stuff like SIGWINCH, the window-resized-by-user signal.
Re: DROP DATABASE is interruptible
On Tue, Mar 12, 2024 at 9:00 PM Alexander Lakhin wrote: > I see two backends waiting: > law 2420132 2420108 0 09:05 ?00:00:00 postgres: node: law > postgres [local] DROP DATABASE waiting > law 2420135 2420108 0 09:05 ?00:00:00 postgres: node: law > postgres [local] startup waiting Ugh.
Re: Requiring LLVM 14+ in PostgreSQL 18
On Wed, Apr 10, 2024 at 1:38 PM Thomas Munro wrote: > Therefore, some time after the tree re-opens for hacking, we could rip > out a bunch of support code for LLVM 10-13, and then rip out support > for pre-opaque-pointer mode. Please see attached. ... or of course closer to the end of the cycle if that's what people prefer for some reason; I don't mind too much as long as it happens. I added this to the commitfest app, and it promptly failed for cfbot. That's expected: CI is still using Debian 11 "bullseye", which only has LLVM 11. It became what Debian calls "oldstable" last year, and reaches the end of oldstable in a couple of months from now. Debian 12 "bookworm" is the current stable release, and it has LLVM 14, so we should probably go and update those CI images...
Re: Potential stack overflow in incremental base backup
On Thu, Apr 11, 2024 at 12:11 AM Robert Haas wrote: > On Wed, Apr 10, 2024 at 6:21 AM Thomas Munro wrote: > > Could we just write the blocks directly into the output array, and > > then transpose them directly in place if start_blkno > 0? See > > attached. I may be missing something, but the only downside I can > > think of is that the output array is still clobbered even if we decide > > to return BACK_UP_FILE_FULLY because of the 90% rule, but that just > > requires a warning in the comment at the top. > > Yeah. This approach makes the name "relative_block_numbers" a bit > confusing, but not running out of memory is nice, too. Pushed. That fixes the stack problem. Of course we still have this: /* * Since this array is relatively large, avoid putting it on the stack. * But we don't need it at all if this is not an incremental backup. */ if (ib != NULL) relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE); To rescue my initdb --rel-segsize project[1] for v18, I will have a go at making that dynamic. It looks like we don't actually need to allocate that until we get to the GetFileBackupMethod() call, and at that point we have the file size. If I understand correctly, statbuf.st_size / BLCKSZ would be enough, so we could embiggen our block number buffer there if required, right? That way, you would only need shedloads of virtual memory if you used initdb --rel-segsize=shedloads and you actually have shedloads of data in a table/segment. For example, with initdb --rel-segsize=1TB and a table that contains 1TB+ of data, that'd allocate 512MB. It sounds borderline OK when put that way. It sounds not OK with --rel-segsize=infinite and 32TB of data -> palloc(16GB). I suppose one weasel-out would be to say we only support segments up to (say) 1TB, until eventually we figure out how to break this code's dependency on segments. 
I guess we'll have to do that one day to support incremental backups of other smgr implementations that don't even have segments (segments are a private detail of md.c after all, not part of the smgr abstraction). [1] https://commitfest.postgresql.org/48/4305/
Re: Potential stack overflow in incremental base backup
On Fri, Mar 8, 2024 at 6:53 AM Robert Haas wrote: > ... We could > avoid transposing relative block numbers to absolute block numbers > whenever start_blkno is 0, ... Could we just write the blocks directly into the output array, and then transpose them directly in place if start_blkno > 0? See attached. I may be missing something, but the only downside I can think of is that the output array is still clobbered even if we decide to return BACK_UP_FILE_FULLY because of the 90% rule, but that just requires a warning in the comment at the top. v2-0001-Fix-potential-stack-overflow-in-incremental-baseb.patch Description: Binary data
Re: Speed up clean meson builds by ~25%
On Wed, Apr 10, 2024 at 5:03 PM Tom Lane wrote: > I don't doubt that there are other clang versions where the problem > bites a lot harder. What result do you get from the test I tried > (turning mm_strdup into a no-op macro)? #define mm_strdup(x) (x) does this:

Apple clang 15, master: 14s -> 9s
MacPorts clang 16, master: 170s -> 10s
Re: Speed up clean meson builds by ~25%
On Wed, Apr 10, 2024 at 11:44 AM Tom Lane wrote: > ... On my Mac laptop > (Apple clang version 15.0.0), the compile time for preproc.o went from > 6.7sec to 5.5sec. Having seen multi-minute compile times on FreeBSD (where clang is the system compiler) and Debian (where I get packages from apt.llvm.org), I have been quietly waiting for this issue to hit Mac users too (where a clang with unknown proprietary changes is the system compiler), but it never did. Huh. I tried to understand a bit more about Apple's version soup. This seems to be an up-to-date table (though I don't understand their source of information): https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_(since_visionOS_support)_2 According to cc -v on my up-to-date MacBook Air, it has "Apple clang version 15.0.0 (clang-1500.3.9.4)", which, if the table is correct, means that it's using LLVM 16.0.0 (note, not 16.0.6, the final version of that branch of [open] LLVM, and the version I saw the issue with on FreeBSD and Debian). They relabel everything to match the Xcode version that shipped it, and they're currently off by one. I wondered if perhaps the table just wasn't accurate in the final digits, so I looked for clues in strings in the binary, and sure enough it contains "LLVM 15.0.0". My guess would be that they've clobbered the major version, but not the rest: the Xcode version is 15.3, and I don't see a 3, so I guess this is really derived from LLVM 16.0.0. One explanation would be that they rebase their proprietary bits and pieces over the .0 version of each major release, and then cherry-pick urgent fixes and stuff later, not pulling in the whole minor release; they also presumably have to maintain it for much longer than the LLVM project's narrow support window. Who knows. So now I wonder if it could be that LLVM 16.0.6 does this, but LLVM 16.0.0 doesn't. I installed clang-16 (16.0.6) with MacPorts, and it does show the problem.
Requiring LLVM 14+ in PostgreSQL 18
Hi

PostgreSQL 18 will ship after these vacuum horizon systems reach EOL[1]:

   animal   |  arch   | llvm_version |   os   | os_release | end_of_support
------------+---------+--------------+--------+------------+----------------
 branta     | s390x   | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 splitfin   | aarch64 | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 urutau     | s390x   | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 massasauga | aarch64 | 11.1.0       | Amazon | 2          | 2025-06-30
 snakefly   | aarch64 | 11.1.0       | Amazon | 2          | 2025-06-30

Therefore, some time after the tree re-opens for hacking, we could rip out a bunch of support code for LLVM 10-13, and then rip out support for pre-opaque-pointer mode. Please see attached. [1] https://www.postgresql.org/message-id/CA%2BhUKG%2B-g61yq7Ce4aoZtBDO98b4GXH8Cu3zxVk-Zn1Vh7TKpA%40mail.gmail.com From f5de5c6535b825033b1829eaf340baacc10ed654 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 19 Oct 2023 04:45:46 +1300 Subject: [PATCH 1/2] jit: Require at least LLVM 14, if enabled. Remove support for LLVM versions 10-13. The default on all non-EOL'd OSes represented in our build farm will be at least LLVM 14 when PostgreSQL 18 ships. 
Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com --- config/llvm.m4 | 8 +- configure | 8 +- doc/src/sgml/installation.sgml | 4 +- meson.build | 2 +- src/backend/jit/llvm/llvmjit.c | 109 src/backend/jit/llvm/llvmjit_error.cpp | 25 -- src/backend/jit/llvm/llvmjit_inline.cpp | 13 --- src/backend/jit/llvm/llvmjit_wrap.cpp | 4 - 8 files changed, 11 insertions(+), 162 deletions(-) diff --git a/config/llvm.m4 b/config/llvm.m4 index 44769d819a0..6ca088b5a45 100644 --- a/config/llvm.m4 +++ b/config/llvm.m4 @@ -13,7 +13,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT], AC_REQUIRE([AC_PROG_AWK]) AC_ARG_VAR(LLVM_CONFIG, [path to llvm-config command]) - PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 llvm-config-13 llvm-config-12 llvm-config-11 llvm-config-10) + PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-18 llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14) # no point continuing if llvm wasn't found if test -z "$LLVM_CONFIG"; then @@ -25,14 +25,14 @@ AC_DEFUN([PGAC_LLVM_SUPPORT], AC_MSG_ERROR([$LLVM_CONFIG does not work]) fi # and whether the version is supported - if echo $pgac_llvm_version | $AWK -F '.' '{ if ([$]1 >= 10) exit 1; else exit 0;}';then -AC_MSG_ERROR([$LLVM_CONFIG version is $pgac_llvm_version but at least 10 is required]) + if echo $pgac_llvm_version | $AWK -F '.' 
'{ if ([$]1 >= 14) exit 1; else exit 0;}';then +AC_MSG_ERROR([$LLVM_CONFIG version is $pgac_llvm_version but at least 14 is required]) fi AC_MSG_NOTICE([using llvm $pgac_llvm_version]) # need clang to create some bitcode files AC_ARG_VAR(CLANG, [path to clang compiler to generate bitcode]) - PGAC_PATH_PROGS(CLANG, clang clang-17 clang-16 clang-15 clang-14 clang-13 clang-12 clang-11 clang-10) + PGAC_PATH_PROGS(CLANG, clang clang-18 clang-17 clang-16 clang-15 clang-14) if test -z "$CLANG"; then AC_MSG_ERROR([clang not found, but required when compiling --with-llvm, specify with CLANG=]) fi diff --git a/configure b/configure index cfbd2a096f4..ba4ce5a5282 100755 --- a/configure +++ b/configure @@ -5065,7 +5065,7 @@ if test "$with_llvm" = yes; then : if test -z "$LLVM_CONFIG"; then - for ac_prog in llvm-config llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 llvm-config-13 llvm-config-12 llvm-config-11 llvm-config-10 + for ac_prog in llvm-config llvm-config-18 llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 do # Extract the first word of "$ac_prog", so it can be a program name with args. set dummy $ac_prog; ac_word=$2 @@ -5129,8 +5129,8 @@ fi as_fn_error $? "$LLVM_CONFIG does not work" "$LINENO" 5 fi # and whether the version is supported - if echo $pgac_llvm_version | $AWK -F '.' '{ if ($1 >= 10) exit 1; else exit 0;}';then -as_fn_error $? "$LLVM_CONFIG version is $pgac_llvm_version but at least 10 is required" "$LINENO" 5 + if echo $pgac_llvm_version | $AWK -F '.' '{ if ($1 >= 14) exit 1; else exit 0;}';then +as_fn_error $? "$LLVM_CONFIG version is $pgac_llvm_version but at least 14 is required" "$LINENO" 5 fi { $as_echo "$as_me:${as_lineno-$LINENO}: using llvm $pgac_llvm_version" >&5 $as_echo "$as_me: using llvm $pgac_llvm_version" >&6;} @@ -5138,7 +5138,7 @@ $as_echo "$as_me: using llvm $pgac_llvm_version" >&6;} # need clang to create some bitcode files if test -z "$CLANG"; then - for ac_prog in clang clang-17 clang
Re: broken JIT support on Fedora 40
On Tue, Apr 9, 2024 at 10:05 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > + /* In assertion builds, run the LLVM verify pass. */ > +#ifdef USE_ASSERT_CHECKING > + LLVMPassBuilderOptionsSetVerifyEach(options, true); > +#endif Thanks, that seems nicer. I think the question is whether it will slow down build farm/CI/local meson test runs to a degree that exceeds its value. Another option would be to have some other opt-in macro, like the existing #ifdef LLVM_PASS_DEBUG, for people who maintain JIT-related stuff to turn on. Supposing we go with USE_ASSERT_CHECKING, I have another question: - const char *nm = "llvm.lifetime.end.p0i8"; + const char *nm = "llvm.lifetime.end.p0"; Was that a mistake, or did the mangling rules change in some version? I don't currently feel inclined to go and test this on the ancient versions we claim to support in back-branches. Perhaps we should just do this in master, and then it'd be limited to worrying about LLVM versions 10-18 (see 820b5af7), which have the distinct advantage of being available in package repositories for testing. Or I suppose we could back-patch, but only do it if LLVM_VERSION_MAJOR >= 10. Or we could do it unconditionally, and wait for ancient-LLVM build farm animals to break if they're going to. I pushed the illegal attribute fix though. Thanks for the detective work! (It crossed my mind that perhaps deform functions should have their own template function, but if someone figures out that that's a good idea, I think we'll *still* need that change just pushed.)
Re: broken JIT support on Fedora 40
On Sat, Apr 6, 2024 at 5:01 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > Yep, I think this is it. I've spent some hours trying to understand why > > > suddenly deform function has noundef ret attribute, when it shouldn't -- > > > this explains it and the proposed change fixes the crash. One thing that > > > is still not clear to me though is why this copied attribute doesn't > > > show up in the bitcode dumped right before running inline pass (I've > > > added this to troubleshoot the issue). > > > > One thing to consider in this context is indeed adding "verify" pass as > > suggested in the PR, at least for the debugging configuration. Without the > > fix > > it immediately returns: > > > > Running analysis: VerifierAnalysis on deform_0_1 > > Attribute 'noundef' applied to incompatible type! > > > > llvm error: Broken function found, compilation aborted! > > Here is what I have in mind. Interestingly enough, it also shows few > more errors besides "noundef": > > Intrinsic name not mangled correctly for type arguments! Should be: > llvm.lifetime.end.p0 > ptr @llvm.lifetime.end.p0i8 > > It refers to the function from create_LifetimeEnd, not sure how > important is this. Would it be too slow to run the verify pass always, in assertion builds? Here's a patch for the original issue, and a patch to try that idea + a fix for that other complaint it spits out. The latter would only run for LLVM 17+, but that seems OK. From 57af42d9a1b47b7361c7200a17a210f2ca37bd5d Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Tue, 9 Apr 2024 18:10:30 +1200 Subject: [PATCH 1/2] Fix illegal LLVM attribute propagation. Commit 72559438 started copying more attributes from AttributeTemplate to the functions we generate on the fly. In the case of deform functions, which return void, this meant that "noundef", from AttributeTemplate's return value (a Datum) was copied to a void type. Older LLVM releases were OK with that, but LLVM 18 crashes. 
"noundef" is rejected for void by the verify pass, which would have alerted us to this problem (something we'll consider checking automatically in a later commit). Update our llvm_copy_attributes() function to skip copying the attribute for the return value, if the target function returns void. Thanks to Dmitry Dolgov for help chasing this down. Back-patch to all supported releases, like 72559438. Reported-by: Pavel Stehule Reviewed-by: Dmitry Dolgov <9erthali...@gmail.com> Discussion: https://postgr.es/m/CAFj8pRACpVFr7LMdVYENUkScG5FCYMZDDdSGNU-tch%2Bw98OxYg%40mail.gmail.com --- src/backend/jit/llvm/llvmjit.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c index ec0fdd49324..92b4993a98a 100644 --- a/src/backend/jit/llvm/llvmjit.c +++ b/src/backend/jit/llvm/llvmjit.c @@ -552,8 +552,11 @@ llvm_copy_attributes(LLVMValueRef v_from, LLVMValueRef v_to) /* copy function attributes */ llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeFunctionIndex); - /* and the return value attributes */ - llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeReturnIndex); + if (LLVMGetTypeKind(LLVMGetFunctionReturnType(v_to)) != LLVMVoidTypeKind) + { + /* and the return value attributes */ + llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeReturnIndex); + } /* and each function parameter's attribute */ param_count = LLVMCountParams(v_from); -- 2.44.0 From 91d8b747686aa07341337e1cd0b7c2244e955adb Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Tue, 9 Apr 2024 18:57:48 +1200 Subject: [PATCH 2/2] Run LLVM verify pass on all generated IR. We fixed a recent problem that crashed LLVM 18, but Dmitry pointed out that we'd have known about that all along if we'd automatically run the verify pass on the IR we generate. Turn that on in assertion builds. That reveals one other complaint about a name, which is also fixed here. 
Suggested-by: Dmitry Dolgov <9erthali...@gmail.com> Discussion: https://postgr.es/m/CAFj8pRACpVFr7LMdVYENUkScG5FCYMZDDdSGNU-tch%2Bw98OxYg%40mail.gmail.com --- src/backend/jit/llvm/llvmjit.c | 11 +-- src/backend/jit/llvm/llvmjit_expr.c | 2 +- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c index 92b4993a98a..bc429e970f2 100644 --- a/src/backend/jit/llvm/llvmjit.c +++ b/src/backend/jit/llvm/llvmjit.c @@ -703,10 +703,17 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module) LLVMErrorRef err; const char *passes; + /* In assertion builds, run the LLVM verify pass. */ +#ifdef USE_ASSERT_CHECKING +#define PASSES_PREFIX "verify," +#else +#define PASSES_PREFIX "" +#endif + if (context->b
Re: Speed up clean meson builds by ~25%
On Tue, Apr 9, 2024 at 5:01 PM Michael Paquier wrote: > On Mon, Apr 08, 2024 at 12:23:56PM +0300, Nazir Bilal Yavuz wrote: > > On Mon, 8 Apr 2024 at 11:00, Alexander Lakhin wrote: > >> As I wrote in [1], I didn't observe the issue with clang-18, so maybe it > >> is fixed already. > >> Perhaps it's worth rechecking... > >> > >> [1] > >> https://www.postgresql.org/message-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8%40gmail.com > > > > I had this problem on my local computer. My build times are: > > > > gcc: 20s > > clang-15: 24s > > clang-16: 105s > > clang-17: 111s > > clang-18: 25s > > Interesting. A parallel build of ecpg shows similar numbers here: > clang-16: 101s > clang-17: 112s > clang-18: 14s > gcc: 10s I don't expect it to get fixed BTW, because it's present in 16.0.6, and .6 is the terminal release, if I understand their system correctly. They're currently only doing bug fixes for 18, and even there not for much longer. Interesting that not everyone saw this at first, perhaps the bug arrived in a minor release that some people didn't have yet? Or perhaps there is something special required to trigger it?
Re: Vectored I/O in bulk_write.c
Here's a rebase. I decided against committing this for v17 in the end. There's not much wrong with it AFAIK, except perhaps an unprincipled chopping up of writes with large io_combine_limit due to simplistic flow control. I liked the idea of having a decent user of smgrwritev() in the tree, and it probably makes CREATE INDEX a bit faster, but... I'd like to try something more ambitious that streamifies this and also the "main" writeback paths. I shared some patches for that that are counterparts to this, over at [1].

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGK1in4FiWtisXZ%2BJo-cNSbWjmBcPww3w3DBM%2BwhJTABXA%40mail.gmail.com

From 7ee50aae3d4eba0df5bce05c196f411abb0bd9ab Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 8 Apr 2024 18:19:41 +1200
Subject: [PATCH v7 1/3] Use smgrwritev() for both overwriting and extending.

Since mdwrite() and mdextend() were very similar and both needed vectored variants, merge them into a single interface. This reduces duplication and fills out the set of operations. We still want to be able to assert that callers know the difference between overwriting and extending, and to activate slightly different behavior during recovery, so add a new "flags" argument to control that. The goal here is to provide the missing vectored interface without which the new bulk write facility from commit 8af25652 can't write to the file system in bulk. A following commit will connect them together. Like smgrwrite(), the traditional single-block smgrextend() function with skipFsync boolean argument is now translated to smgrwritev() by an inlinable wrapper function. The smgrzeroextend() interface remains distinct; the idea was floated of merging that too, but so far without consensus.
Reviewed-by: Heikki Linnakangas Reviewed-by: Andres Freund Discussion: https://postgr.es/m/CA%2BhUKGLx5bLwezZKAYB2O_qHj%3Dov10RpgRVY7e8TSJVE74oVjg%40mail.gmail.com --- src/backend/storage/buffer/bufmgr.c | 10 +-- src/backend/storage/smgr/md.c | 120 src/backend/storage/smgr/smgr.c | 100 ++- src/include/storage/md.h| 4 +- src/include/storage/smgr.h | 19 - 5 files changed, 94 insertions(+), 159 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 44836751b71..5166a839da8 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4011,11 +4011,11 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, * * In recovery, we cache the value returned by the first lseek(SEEK_END) * and the future writes keeps the cached value up-to-date. See - * smgrextend. It is possible that the value of the first lseek is smaller - * than the actual number of existing blocks in the file due to buggy - * Linux kernels that might not have accounted for the recent write. But - * that should be fine because there must not be any buffers after that - * file size. + * smgrzeroextend and smgrwritev. It is possible that the value of the + * first lseek is smaller than the actual number of existing blocks in the + * file due to buggy Linux kernels that might not have accounted for the + * recent write. But that should be fine because there must not be any + * buffers after that file size. */ for (i = 0; i < nforks; i++) { diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index bf0f3ca76d1..73d077ca3ea 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -447,79 +447,10 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo) pfree(path); } -/* - * mdextend() -- Add a block to the specified relation. - * - * The semantics are nearly the same as mdwrite(): write at the - * specified position. 
However, this is to be used for the case of - * extending a relation (i.e., blocknum is at or beyond the current - * EOF). Note that we assume writing a block beyond current EOF - * causes intervening file space to become filled with zeroes. - */ -void -mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, - const void *buffer, bool skipFsync) -{ - off_t seekpos; - int nbytes; - MdfdVec*v; - - /* If this build supports direct I/O, the buffer must be I/O aligned. */ - if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ) - Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)); - - /* This assert is too expensive to have on normally ... */ -#ifdef CHECK_WRITE_VS_EXTEND - Assert(blocknum >= mdnblocks(reln, forknum)); -#endif - - /* - * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any - * more --- we mustn't create a block whose number actually is - * InvalidBlockNumber. (Note that this failure should be unreachable - * because of upstream checks in bufmgr.c.) - */ - if (blocknum == InvalidBlockNumber) - ereport(ERROR, -(e
Streaming relation data out of order
Hi This idea is due to Robert Haas, who complained that he feared that the streaming I/O API already worked like this. It doesn't, but it could! Here is a concept patch to try it out. Normally, read_stream_next_buffer() spits out buffers in the order that the user's callback generated block numbers. This option says that any order would be OK. I had been assuming that this sort of thing would come with real asynchronous I/O: if I/Os complete out of order and the caller explicitly said she doesn't care about block order, we can stream them as the completion events arrive. But even with synchronous I/O, we could stream already-in-cache blocks before ones that require I/O. Letting the caller chew on blocks that are already available maximises the time between fadvise() and preadv() for misses, which minimises the likelihood that the process will have to go to "D" sleep. The patch is pretty trivial: if started with the READ_STREAM_OUT_OF_ORDER flag, "hit" buffers are allowed to jump in front of "miss" buffers in the queue. The attached coding may not be optimal, it's just a proof of concept. ANALYZE benefits from this, for example, with certain kinds of partially cached initial states and fast disks (?). I'm not sure how generally useful it is though. I'm posting it because I wonder if it could be interesting for the streaming bitmap heapscan project, and I wonder how many other things don't care about the order. From d549d3c0bac4a6e609781775e63277b370eb48ff Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 6 Apr 2024 13:28:28 +1300 Subject: [PATCH] Add READ_STREAM_OUT_OF_ORDER. Just a proof-of-concept. 
--- src/backend/storage/aio/read_stream.c | 42 +++ src/include/storage/read_stream.h | 7 + 2 files changed, 43 insertions(+), 6 deletions(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 2e45dcdcd4b..c1b95a42d39 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -115,6 +115,7 @@ struct ReadStream int16 pinned_buffers; int16 distance; bool advice_enabled; + bool out_of_order_enabled; /* * Small buffer of block numbers, useful for 'ungetting' to resolve flow @@ -307,12 +308,38 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice) >buffers[stream->queue_size], sizeof(stream->buffers[0]) * overflow); - /* Compute location of start of next read, without using % operator. */ - buffer_index += nblocks; - if (buffer_index >= stream->queue_size) - buffer_index -= stream->queue_size; - Assert(buffer_index >= 0 && buffer_index < stream->queue_size); - stream->next_buffer_index = buffer_index; + /* + * If this was a hit and there is already I/O in progress, can we + * re-prioritize it so that I/O has longer to complete and the caller can + * consume this buffer immediately? + */ + if (!need_wait && + nblocks == 1 && + stream->ios_in_progress > 0 && + stream->out_of_order_enabled) + { + int16 oldest_buffer_index; + + /* Insert before "oldest" buffer. */ + oldest_buffer_index = stream->oldest_buffer_index - 1; + if (oldest_buffer_index < 0) + oldest_buffer_index += stream->queue_size; + stream->buffers[oldest_buffer_index] = stream->buffers[buffer_index]; + if (stream->per_buffer_data_size > 0) + memcpy(get_per_buffer_data(stream, oldest_buffer_index), + get_per_buffer_data(stream, buffer_index), + stream->per_buffer_data_size); + stream->oldest_buffer_index = oldest_buffer_index; + } + else + { + /* Compute location of start of next read, without using % operator. 
*/ + buffer_index += nblocks; + if (buffer_index >= stream->queue_size) + buffer_index -= stream->queue_size; + Assert(buffer_index >= 0 && buffer_index < stream->queue_size); + stream->next_buffer_index = buffer_index; + } /* Adjust the pending read to cover the remaining portion, if any. */ stream->pending_read_blocknum += nblocks; @@ -515,6 +542,9 @@ read_stream_begin_relation(int flags, stream->advice_enabled = true; #endif + if (flags & READ_STREAM_OUT_OF_ORDER) + stream->out_of_order_enabled = true; + /* * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled * above. If we had real asynchronous I/O we might need a slightly diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h index fae09d2b4cc..d23d51af00c 100644 --- a/src/include/storage/read_stream.h +++ b/src/include/storage/read_stream.h @@ -41,6 +41,13 @@ */ #define READ_STREAM_FULL 0x04 +/* + * We usually stream buffers in the order the callback generates block + * numbers, but if the caller can cope with it, there are sometimes + * opportunities to reorder blocks to reduce I/O stalls. + */ +#define READ_STREAM_OUT_OF_ORDER 0x08 + struct ReadStream; typedef struct ReadStream ReadStream; -- 2.44.0
Experimental prefetching of buffer memory
FUN([PGAC_CHECK_BUILTIN_VOID_FUNC], +[AC_CACHE_CHECK(for $1, pgac_cv$1, +[AC_LINK_IFELSE([AC_LANG_PROGRAM([ +void +call$1($2) +{ +$1(x); +}], [])], +[pgac_cv$1=yes], +[pgac_cv$1=no])]) +if test x"${pgac_cv$1}" = xyes ; then +AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1, + [Define to 1 if your compiler understands $1.]) +fi])# PGAC_CHECK_BUILTIN_VOID_FUNC # PGAC_CHECK_BUILTIN_FUNC_PTR diff --git a/configure b/configure index cfbd2a096f4..9d9ed38fe57 100755 --- a/configure +++ b/configure @@ -15546,6 +15546,46 @@ _ACEOF fi +# Can we use a built-in to prefetch memory? +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5 +$as_echo_n "checking for __builtin_prefetch... " >&6; } +if ${pgac_cv__builtin_prefetch+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +void +call__builtin_prefetch(void *x) +{ +__builtin_prefetch(x); +} +int +main () +{ + + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv__builtin_prefetch=yes +else + pgac_cv__builtin_prefetch=no +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5 +$as_echo "$pgac_cv__builtin_prefetch" >&6; } +if test x"${pgac_cv__builtin_prefetch}" = xyes ; then + +cat >>confdefs.h <<_ACEOF +#define HAVE__BUILTIN_PREFETCH 1 +_ACEOF + +fi + # We require 64-bit fseeko() to be available, but run this check anyway # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that. { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5 diff --git a/configure.ac b/configure.ac index 67e738d92b1..e991e56613c 100644 --- a/configure.ac +++ b/configure.ac @@ -1781,6 +1781,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x]) # so it needs a different test function. 
PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0]) +# Can we use a built-in to prefetch memory? +PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x]) + # We require 64-bit fseeko() to be available, but run this check anyway # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that. AC_FUNC_FSEEKO diff --git a/meson.build b/meson.build index 5acf083ce3c..eecd8aa6f45 100644 --- a/meson.build +++ b/meson.build @@ -1707,6 +1707,7 @@ builtins = [ 'constant_p', 'frame_address', 'popcount', + 'prefetch', 'unreachable', ] diff --git a/src/include/c.h b/src/include/c.h index dc1841346cd..1cb191b02a4 100644 --- a/src/include/c.h +++ b/src/include/c.h @@ -425,6 +425,14 @@ typedef void (*pg_funcptr_t) (void); #define HAVE_PRAGMA_GCC_SYSTEM_HEADER 1 #endif +/* Do we have support for prefetching memory? */ +#if defined(HAVE__BUILTIN_PREFETCH) +#define pg_prefetch_mem(a) __builtin_prefetch(a) +#elif defined(_MSC_VER) +#define pg_prefetch_mem(a) _m_prefetch(a) +#else +#define pg_prefetch_mem(a) +#endif /* *Section 2: bool, true, false diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index f8d3e3b6b84..6f0b6a51417 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -546,6 +546,9 @@ /* Define to 1 if your compiler understands __builtin_popcount. */ #undef HAVE__BUILTIN_POPCOUNT +/* Define to 1 if your compiler understands __builtin_prefetch. */ +#undef HAVE__BUILTIN_PREFETCH + /* Define to 1 if your compiler understands __builtin_types_compatible_p. */ #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P -- 2.44.0 From 0f1a87954e27cd6e59e3ef45b610677b13a3985b Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 15:06:32 +1300 Subject: [PATCH 2/2] Prefetch page header memory when streaming relations. read_stream.c can always see at least one page ahead of the one the caller is accessing. Take the opportunity to prefetch the cache line that holds the next page's header. 
For some scans, that can generate a decent speedup, though real world results will depend on how much work the CPU actually does before it gets around to accessing the next page. Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index f54dacdd914..2e45dcdcd4b 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -617,7 +617,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) stream->advice_enabled ? READ_BUFFERS_ISSUE_ADVICE : 0))) { -/* Fast return.
Re: post-freeze damage control
On Tue, Apr 9, 2024 at 7:47 AM Robert Haas wrote:
> - The streaming read API stuff was all committed very last minute. I
> think this should have been committed much sooner. It's probably not
> going to break the world; it's more likely to have performance
> consequences. But if it had gone in sooner, we'd have had more time to
> figure that out.

OK, let me give an update on this work stream (pun intended). One reason for the delay in committing was precisely that we were fretting about regression risks. We tried pretty hard to identify and grind down every regression we could find, and cases with problems that were not fully understood or examined have been booted into the next cycle for more work: streaming bitmap heapscan, several streaming vacuum patches, and more, basically things that seem to have more complex interactions with other machinery.

The only three places using streaming I/O that went in were:

041b9680: Use streaming I/O in ANALYZE.
b7b0f3f2: Use streaming I/O in sequential scans.
3a352df0: Use streaming I/O in pg_prewarm.

The first is a good first exercise in streaming random blocks; hopefully no one would be too upset about an unexpected small regression in ANALYZE, but as it happens it goes faster hot and cold according to all reports. The second is a good first exercise in streaming sequential blocks, and it ranges from faster to no regression, according to testing and reports. The third is less important, but it also goes faster.
Of those, streaming seq scan is clearly the most relevant to real workloads that someone might be upset about, and I made a couple of choices that you might say had damage control in mind:

* A conservative choice not to get into the business of issuing new hints to the kernel for random jumps in cold scans, even though we think we probably should for better performance: more research needed precisely to avoid unexpected interactions (cf the booted bitmap heapscan where that sort of thing seems to be afoot).

* A GUC to turn off I/O combining if it somehow upsets your storage in ways we didn't foresee (io_combine_limit=1).

For fully cached hot scans, it does seem to be quite sensitive to tiny changes in a hot code path that I and others spent a lot of time optimising and testing during the CF. Perhaps it is possible that someone else's microarchitecture or compiler could show a regression that I don't see, and I will certainly look into it with vim and vigour if so. In that case we could consider a tiny micro-optimisation that I've shared already (it seemed a little novel so I'd rather propose it in the new cycle if I can), or, if it comes to it based on evidence and an inability to address a problem quickly, reverting just b7b0f3f2, which itself is a very small patch.

An aspect you didn't mention is correctness. I don't actually know how to prove that buffer manager protocols are correct beyond thinking and torture testing, i.e. what kind of new test harness machinery could be used to cross-check more things about buffer pool state explicitly, and that is a weakness I'm planning to look into.

I realise that "these are the good ones, you should see all the stuff we decided not to commit!" is not an argument, I'm just laying out how I see the patches that went in and why I thought they were good. It's almost an architectural change, but done in tiny footsteps.
I appreciate that people would have liked to see those particular tiny footsteps in some of the other fine months available for patching the tree, and some of the earlier underpinning patches that were part of the same patch series did go in around New Year, but clearly my "commit spreading" didn't go as well as planned after that (not helped by Jan/Feb summer vacation season down here). Mr Paquier this year announced his personal code freeze a few weeks back on social media, which seemed like an interesting idea I might adopt. Perhaps that is what some other people are doing without saying so, and perhaps the time they are using for that is the end of the calendar year. I might still be naturally inclined to crunch-like behaviour, but it wouldn't be at the same time as everyone else, except all the people who follow the same advice.
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On Mon, Apr 8, 2024 at 11:53 AM Melanie Plageman wrote: > I've reviewed v6. I think you should mention in the docs that it only > works for shared buffers -- so specifically not buffers containing > blocks of temp tables. Thanks for looking! The whole pg_buffercache extension is for working with shared buffers only, as mentioned at the top. I have tried to improve that paragraph though, as it only mentioned examining them. > In the function pg_buffercache_invalidate(), why not use the > BufferIsValid() function? > > - if (buf < 1 || buf > NBuffers) > + if (!BufferIsValid(buf) || buf > NBuffers) It doesn't check the range (it has assertions, not errors). > I thought the below would be more clear for the comment above > InvalidateUnpinnedBuffer(). > > - * Returns true if the buffer was valid and it has now been made invalid. > - * Returns false if the wasn't valid, or it couldn't be evicted due to a pin, > - * or if the buffer becomes dirty again while we're trying to write it out. > + * Returns true if the buffer was valid and has now been made invalid. > Returns > + * false if it wasn't valid, if it couldn't be evicted due to a pin, or if > the > + * buffer becomes dirty again while we're trying to write it out. Fixed. > Some of that probably applies for the docs too (i.e. you have some > similar wording in the docs). There is actually one typo in your > version, so even if you don't adopt my suggestion, you should fix that > typo. Yeah, thanks, improved similarly there. > I didn't notice anything else out of place. I tried it and it worked > as expected. I'm excited to have this feature! Thanks! > I didn't read through this whole thread, but was there any talk of > adding other functions to let me invalidate a bunch of buffers at once > or even some options -- like invalidate every 3rd buffer or whatever? > (Not the concern of this patch, but just wondering because that would > be a useful future enhancement IMO). 
TBH I tried to resist people steering in that direction because you can also just define a SQL function to do that built on this, and if you had specialised functions they'd never be quite right. IMHO we succeeded in minimising the engineering and maximising flexibility, 'cause it's for hackers. Crude, but already able to express a wide range of stuff by punting the problem to SQL. Thanks to Palak for the patch. Pushed.
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On Mon, Apr 8, 2024 at 12:10 PM Andres Freund wrote: > On 2024-04-07 11:07:58 +1200, Thomas Munro wrote: > > I thought of a better name for the bufmgr.c function though: > > InvalidateUnpinnedBuffer(). That name seemed better to me after I > > festooned it with warnings about why exactly it's inherently racy and > > only for testing use. > > I still dislike that, fwiw, due to the naming similarity to > InvalidateBuffer(), which throws away dirty buffer contents too. Which > obviously isn't acceptable from "userspace". I'd just name it > pg_buffercache_evict() - given that the commit message's first paragraph uses > "it is useful to be able to evict arbitrary blocks" that seems to describe > things at least as well as "invalidate"? Alright, sold. I'll go with EvictUnpinnedBuffer() in bufmgr.c and pg_buffercache_evict() in the contrib module.
Re: Use streaming read API in ANALYZE
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman wrote: > On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote: > > On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote: > > > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001 > > > From: Melanie Plageman > > > Date: Sun, 7 Apr 2024 15:38:41 -0400 > > > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore() > > > > > > A previous commit stopped using BlockSampler_HasMore() for flow control > > > in acquire_sample_rows(). There seems little use now for > > > BlockSampler_HasMore(). It should be sufficient to return > > > InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore() > > > would have returned false. Remove BlockSampler_HasMore(). > > > > > > Author: Melanie Plageman, Nazir Bilal Yavuz > > > Discussion: > > > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com > > > > The justification here seems somewhat odd. Sure, the previous commit stopped > > using BlockSampler_HasMore in acquire_sample_rows - but only because it was > > moved to block_sampling_streaming_read_next()? > > It didn't stop using it. It stopped being useful. The reason it existed, > as far as I can tell, was to use it as the while() loop condition in > acquire_sample_rows(). I think it makes much more sense for > BlockSampler_Next() to return InvalidBlockNumber when there are no more > blocks -- not to assert you don't call it when there aren't any more > blocks. > > I didn't want to change BlockSampler_Next() in the same commit as the > streaming read user and we can't remove BlockSampler_HasMore() without > changing BlockSampler_Next(). I agree that the code looks useless if one condition implies the other, but isn't it good to keep that cross-check, perhaps reformulated as an assertion? I didn't look too hard at the maths, I just saw the words "It is not obvious that this code matches Knuth's Algorithm S ..." 
and realised I'm not sure I have time to develop a good opinion about this today. So I'll leave the 0002 change out for now, as it's a tidy-up that can easily be applied in the next cycle. From c3b8df8e4720d8b0dfb4c892c0aa3ddaef8f401f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 8 Apr 2024 14:38:58 +1200 Subject: [PATCH v12] Remove obsolete BlockSampler_HasMore(). Commit 041b9680 stopped using BlockSampler_HasMore() for flow control in acquire_sample_rows(). There seems to be little use for it now. We can just return InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore() would have returned false. Author: Melanie Plageman Author: Nazir Bilal Yavuz Reviewed-by: Andres Freund Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com --- src/backend/commands/analyze.c| 4 +--- src/backend/utils/misc/sampling.c | 10 +++--- src/include/utils/sampling.h | 1 - 3 files changed, 4 insertions(+), 11 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index da27a13a3f..e9fa3470cf 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -,9 +,7 @@ block_sampling_read_stream_next(ReadStream *stream, void *callback_private_data, void *per_buffer_data) { - BlockSamplerData *bs = callback_private_data; - - return BlockSampler_HasMore(bs) ? 
BlockSampler_Next(bs) : InvalidBlockNumber; + return BlockSampler_Next(callback_private_data); } /* diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c index 933db06702..6e2bca9739 100644 --- a/src/backend/utils/misc/sampling.c +++ b/src/backend/utils/misc/sampling.c @@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize, return Min(bs->n, bs->N); } -bool -BlockSampler_HasMore(BlockSampler bs) -{ - return (bs->t < bs->N) && (bs->m < bs->n); -} - BlockNumber BlockSampler_Next(BlockSampler bs) { @@ -68,7 +62,9 @@ BlockSampler_Next(BlockSampler bs) double p;/* probability to skip block */ double V;/* random */ - Assert(BlockSampler_HasMore(bs)); /* hence K > 0 and k > 0 */ + /* Return if no remaining blocks or no blocks to sample */ + if (K <= 0 || k <= 0) + return InvalidBlockNumber; if ((BlockNumber) k >= K) { diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h index be48ee52ba..fb5d6820a2 100644 --- a/src/include/utils/sampling.h +++ b/src/include/utils/sampling.h @@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler; extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize, uint32 randseed); -extern bool BlockSampler_HasMore(BlockSampler bs); extern BlockNumber BlockSampler_Next(BlockSampler bs); /* Reservoir sampling methods */ -- 2.44.0
Re: Use streaming read API in ANALYZE
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman wrote:
> On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> > > src/backend/commands/analyze.c | 89 ++
> > > 1 file changed, 26 insertions(+), 63 deletions(-)
> >
> > That's a very nice demonstration of how this makes good prefetching
> > easier...
>
> Agreed. Yay streaming read API and Bilal!

+1

I found a few comments to tweak, just a couple of places that hadn't got the memo after we renamed "read stream", and an obsolete mention of pinning buffers. I adjusted those directly. I ran some tests on a random basic Linux/ARM cloud box with a 7.6GB table, and I got:

                                    cold    hot
master:                             9025ms  199ms
patched, io_combine_limit=1:        9025ms  191ms
patched, io_combine_limit=default:  8729ms  191ms

Despite being random, occasionally some I/Os must get merged, allowing slightly better random throughput when accessing disk blocks through a 3000 IOPS drinking straw. Looking at strace, I see 29144 pread* calls instead of 30071, which fits that theory. Let's see... if you roll a fair 973452-sided die 30071 times, how many times do you expect to roll consecutive numbers? Each time you roll there is a 1/973452 chance that you get the last number + 1, and we have 30071 tries giving 30071/973452 = ~3%. 9025ms minus 3% is 8754ms. Seems about right.

I am not sure why the hot number is faster exactly. (Anecdotally, I did notice that in the cases that beat master semi-unexpectedly like this, my software memory prefetch patch doesn't help or hurt, while in some cases and on some CPUs there is little difference, and then that patch seems to get a speed-up like this, which might be a clue. *Shrug*, investigation needed.)

Pushed. Thanks Bilal and reviewers!
Re: Streaming read-ready sequential scan code
On Sun, Apr 7, 2024 at 1:34 PM Melanie Plageman wrote: > Attached v13 0001 is your fix and 0002 is a new version of the > sequential scan streaming read user. Off-list Andres mentioned that I > really ought to separate the parallel and serial sequential scan users > into two different callbacks. I've done that in the attached. It > actually makes the code used by the callbacks nicer and more readable > anyway (putting aside performance). I was able to measure a small > performance difference as well. Thanks. I changed a couple of very trivial things before pushing. +BlockNumber (*cb) (ReadStream *pgsr, void *private_data, + void *per_buffer_data); This type has a friendly name: ReadStreamBlockNumberCB. +scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL, I've been on the fence about that flag for sequential scan... Some days I want to consider changing to READ_STREAM_DEFAULT and relying on our "anti-heuristics" to suppress advice, which would work out the same in most cases but might occasionally win big. It might also hurt, I dunno, so I suspect we'd have to make it better first, which my patch in [1] was a first swing at, but I haven't researched that enough. So, kept this way! - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buffer before saving it in the scan descriptor. Changed to: * Read the next block of the scan relation from the read stream and save it * in the scan descriptor. It is already pinned. +static BlockNumber +heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data, +void *per_buffer_data) Changed argument names to match the function pointer type definition, "stream" and "callback_private_data". 
BTW looking at the branching in read-stream user patches that have an initialisation step like yours, I wonder if it might ever make sense to be able to change the callback on the fly from inside the callback, so that you finish up with a branchless one doing most of the work. I have no idea if it's worth it... [1] https://www.postgresql.org/message-id/CA%2BhUKGLLFvou5rx5FDhm-Pc9r4STQTFFmrx6SUV%2Bvk8fwMbreA%40mail.gmail.com
Re: Streaming read-ready sequential scan code
I found a bug in read_stream.c that could be hit with Melanie's streaming seq scan patch with parallelism enabled and certain buffer pool conditions. Short version: there is an edge case where an "if" needed to be a "while", or we could lose a few blocks. Here's the fix for that, longer explanation in commit message. From 43cef2d58141ba048e9349b0027afff148be5553 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 7 Apr 2024 12:36:44 +1200 Subject: [PATCH] Fix bug in read_stream.c. When we determine that a wanted block can't be combined with the current pending read, it's time to start that pending read to get it out of the way. An "if" in that code path should have been a "while", because it might take more than one go to get that job done. Otherwise the remaining part of a partially started read could be clobbered and we could lose some blocks. This was only broken for smaller ranges, as the more common case of io_combine_limit-sized ranges is handled earlier in the code and knows how to loop. Discovered while testing parallel sequential scans of partially cached tables. They have a ramp-down phase with ever smaller ranges of contiguous blocks, to be fair to parallel workers as the work runs out. Defect in commit b5a9b18c. --- src/backend/storage/aio/read_stream.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 9a70a81f7ae..f54dacdd914 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -363,7 +363,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice) } /* We have to start the pending read before we can build another. */ - if (stream->pending_read_nblocks > 0) + while (stream->pending_read_nblocks > 0) { read_stream_start_pending_read(stream, suppress_advice); suppress_advice = false; -- 2.44.0
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On second thoughts, I think the original "invalidate" terminology was fine, no need to invent a new term. I thought of a better name for the bufmgr.c function though: InvalidateUnpinnedBuffer(). That name seemed better to me after I festooned it with warnings about why exactly it's inherently racy and only for testing use. I suppose someone could propose an additional function pg_buffercache_invalidate(db, tbspc, rel, fork, blocknum) that would be slightly better in the sense that it couldn't accidentally evict some innocent block that happened to replace the real target just before it runs, but I don't think it matters much for this purpose and it would still be racy on return (vacuum decides to load your block back in) so I don't think it's worth bothering with. So this is the version I plan to commit. From 6a13349b788c8539b2d349f9553706d7c563f8f8 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 7 Apr 2024 09:13:17 +1200 Subject: [PATCH v6] Add pg_buffercache_invalidate() function for testing. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When testing buffer pool logic, it is useful to be able to evict arbitrary blocks. This function can be used in SQL queries over the pg_buffercache view to set up a wide range of buffer pool states. Of course, buffer mappings might change concurrently so you might evict a block other than the one you had in mind, and another session might bring it back in at any time. That's OK for the intended purpose of setting up developer testing scenarios, and more complicated interlocking schemes to give stronger guarantees about that would likely be less flexible for actual testing work anyway. Superuser-only.
Author: Palak Chaturvedi Author: Thomas Munro (docs, small tweaks) Reviewed-by: Nitin Jadhav Reviewed-by: Andres Freund Reviewed-by: Cary Huang Reviewed-by: Cédric Villemain Reviewed-by: Jim Nasby Reviewed-by: Maxim Orlov Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/calfch19pw48zwwzuorspsav9hqt0upyabpc4boz4w+c7ff5...@mail.gmail.com --- contrib/pg_buffercache/Makefile | 2 +- contrib/pg_buffercache/meson.build| 1 + .../pg_buffercache--1.4--1.5.sql | 6 ++ contrib/pg_buffercache/pg_buffercache.control | 2 +- contrib/pg_buffercache/pg_buffercache_pages.c | 20 ++ doc/src/sgml/pgbuffercache.sgml | 37 +-- src/backend/storage/buffer/bufmgr.c | 63 +++ src/include/storage/bufmgr.h | 2 + 8 files changed, 125 insertions(+), 8 deletions(-) create mode 100644 contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile index d6b58d4da9..eae65ead9e 100644 --- a/contrib/pg_buffercache/Makefile +++ b/contrib/pg_buffercache/Makefile @@ -8,7 +8,7 @@ OBJS = \ EXTENSION = pg_buffercache DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \ pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \ - pg_buffercache--1.3--1.4.sql + pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time" REGRESS = pg_buffercache diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build index c86e33cc95..1ca3452918 100644 --- a/contrib/pg_buffercache/meson.build +++ b/contrib/pg_buffercache/meson.build @@ -22,6 +22,7 @@ install_data( 'pg_buffercache--1.2--1.3.sql', 'pg_buffercache--1.2.sql', 'pg_buffercache--1.3--1.4.sql', + 'pg_buffercache--1.4--1.5.sql', 'pg_buffercache.control', kwargs: contrib_data_args, ) diff --git a/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql new file mode 100644 index 00..92c530bc19 --- /dev/null +++ 
b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql @@ -0,0 +1,6 @@ +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.5'" to load this file. \quit + +CREATE FUNCTION pg_buffercache_invalidate(IN int) +RETURNS bool +AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate' +LANGUAGE C PARALLEL SAFE VOLATILE STRICT; diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control index a82ae5f9bb..5ee875f77d 100644 --- a/contrib/pg_buffercache/pg_buffercache.control +++ b/contrib/pg_buffercache/pg_buffercache.control @@ -1,5 +1,5 @@ # pg_buffercache extension comment = 'examine the shared buffer cache' -default_version = '1.4' +default_version = '1.5' module_pathname = '$libdir/pg_buffercache' relocatable = true diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 3316732365..9617bfa47b 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -63,6 +63,7 @@ typedef struct PG_FUNCTION_INFO_V1(pg_buffercach
Re: LogwrtResult contended spinlock
On Sat, Apr 6, 2024 at 6:55 AM Alvaro Herrera wrote: > Pushed 0001. Could that be related to the 3 failures on parula that look like this?

TRAP: failed Assert("node->next == 0 && node->prev == 0"), File: "../../../../src/include/storage/proclist.h", Line: 63, PID: 29119
2024-04-05 16:16:26.812 UTC [29114:15] pg_regress/drop_operator LOG: statement: DROP OPERATOR <|(bigint, bigint);
postgres: postgres regression [local] CREATE ROLE(ExceptionalCondition+0x4c)[0x9b3fdc]
postgres: postgres regression [local] CREATE ROLE[0x8529e4]
postgres: postgres regression [local] CREATE ROLE(LWLockWaitForVar+0xec)[0x8538fc]
postgres: postgres regression [local] CREATE ROLE[0x54c7d4]
postgres: postgres regression [local] CREATE ROLE(XLogFlush+0xf0)[0x552600]
postgres: postgres regression [local] CREATE ROLE[0x54a9b0]
postgres: postgres regression [local] CREATE ROLE[0x54bbdc]

Hmm, the comments for LWLockWaitForVar say:

 * Be aware that LWLockConflictsWithVar() does not include a memory barrier,
 * hence the caller of this function may want to rely on an explicit barrier or
 * an implied barrier via spinlock or LWLock to avoid memory ordering issues.

But that seems to be more likely to make LWLockWaitForVar suffer data races (ie hang), not break assertions about LWLock sanity, so I don't know what's going on there. I happened to have a shell on a Graviton box, but I couldn't reproduce it after a while...
Re: Streaming read-ready sequential scan code
On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman wrote: > On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro wrote: > > The interesting column is hot. The 200ms->211ms regression is due to > > the extra bookkeeping in the slow path. The rejiggered fastpath code > > fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can > > you reproduce that? > > I am able to reproduce the fast path solving the issue using Heikki's > example here [1] but in shared buffers (hot). > > master: 25 ms > stream read: 29 ms > stream read + fast path: 25 ms Great, thanks. > I haven't looked into or reviewed the memory prefetching part. > > While reviewing 0002, I realized that I don't quite see how > read_stream_get_block() will be used in the fastpath -- which it > claims in its comments. Comments improved. > Oh and why does READ_STREAM_DISABLE_FAST_PATH macro exist? Just for testing purposes. Behaviour should be identical to external observers either way, it's just a hand-rolled specialisation for certain parameters, and it's useful to be able to verify that and measure the speedup. I think it's OK to leave a bit of developer/testing scaffolding in the tree if it's likely to be useful again and especially if like this case it doesn't create any dead code. (Perhaps in later work we might find the right way to get the compiler to do the specialisation? It's so simple though.) The occasional CI problem I mentioned turned out to be read_stream_reset() remembering a little too much state across rescans. Fixed. Thanks both for the feedback on the ring buffer tweaks. Comments updated. Here is the full stack of patches I would like to commit very soon, though I may leave the memory prefetching part out for a bit longer to see if I can find any downside. From 54efd755040507b55672092907d53b4db30f5a06 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 12:08:24 +1300 Subject: [PATCH v12 1/6] Improve read_stream.c's fast path. 
Unfortunately the "fast path" for cached scans that don't do any I/O was accidentally coded in a way that could only be triggered by pg_prewarm's usage pattern. We really need it to work for the streaming sequential scan patch. Refactor. Reviewed-by: Melanie Plageman Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 75 +++ 1 file changed, 31 insertions(+), 44 deletions(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5e..b9e11a28312 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -169,7 +169,7 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index) /* * Ask the callback which block it would like us to read next, with a small * buffer in front to allow read_stream_unget_block() to work and to allow the - * fast path to work in batches. + * fast path to skip this function and work directly from the array. */ static inline BlockNumber read_stream_get_block(ReadStream *stream, void *per_buffer_data) @@ -578,13 +578,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) if (likely(stream->fast_path)) { BlockNumber next_blocknum; - bool need_wait; /* Fast path assumptions. */ Assert(stream->ios_in_progress == 0); Assert(stream->pinned_buffers == 1); Assert(stream->distance == 1); - Assert(stream->pending_read_nblocks == 1); + Assert(stream->pending_read_nblocks == 0); Assert(stream->per_buffer_data_size == 0); /* We're going to return the buffer we pinned last time. */ @@ -594,40 +593,29 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) buffer = stream->buffers[oldest_buffer_index]; Assert(buffer != InvalidBuffer); - /* - * Pin a buffer for the next call. Same buffer entry, and arbitrary - * I/O entry (they're all free).
- */ - need_wait = StartReadBuffer(&stream->ios[0].op, - &stream->buffers[oldest_buffer_index], - stream->pending_read_blocknum, - stream->advice_enabled ? - READ_BUFFERS_ISSUE_ADVICE : 0); - - /* Choose the block the next call will pin. */ + /* Choose the next block to pin. */ if (unlikely(stream->blocknums_next == stream->blocknums_count)) read_stream_fill_blocknums(stream); next_blocknum = stream->blocknums[stream->blocknums_next++]; - /* - * Fast return if the next call doesn't require I/O for the buffer we - * just pinned, and we have a block number to give it as a pending - * read. - */ - if (likely(!need_wait && next_blocknum != InvalidBlockNumber)) + if (likely(next_blocknum != InvalidBlockNumber)) { - stream->pending_read_blocknum = next_blocknum; - return buffer; - } - - /* - * For an
Re: broken JIT support on Fedora 40
On Sun, Mar 31, 2024 at 12:49 PM Thomas Munro wrote: > https://github.com/llvm/llvm-project/pull/87093 Oh, with those clues, I think I might see... It is a bit strange that we copy attributes from AttributeTemplate(), a function that returns Datum, to our void deform function. It works (I mean doesn't crash) if you just comment this line out: llvm_copy_attributes(AttributeTemplate, v_deform_fn); ... but I guess that disables inlining of the deform function? So perhaps we just need to teach that thing not to try to copy the return value's attributes, which also seems to work here: diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c index ec0fdd49324..92b4993a98a 100644 --- a/src/backend/jit/llvm/llvmjit.c +++ b/src/backend/jit/llvm/llvmjit.c @@ -552,8 +552,11 @@ llvm_copy_attributes(LLVMValueRef v_from, LLVMValueRef v_to) /* copy function attributes */ llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeFunctionIndex); - /* and the return value attributes */ - llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeReturnIndex); + if (LLVMGetTypeKind(LLVMGetFunctionReturnType(v_to)) != LLVMVoidTypeKind) + { + /* and the return value attributes */ + llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeReturnIndex); + }
Re: Streaming read-ready sequential scan code
Yeah, I plead benchmarking myopia, sorry. The fastpath as committed is only reached when distance goes 2->1, as pg_prewarm does. Oops. With the attached minor rearrangement, it works fine. I also poked some more at that memory prefetcher. Here are the numbers I got on a desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled, cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe, GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t select g, repeat('x', 900) from generate_series(1, 56) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;
select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                   cold    warm   hot
 master                            2479ms  886ms  200ms
 seqscan                           2498ms  716ms  211ms  <-- regression
 seqscan + fastpath                2493ms  711ms  200ms  <-- fixed, I think?
 seqscan + memprefetch             2499ms  716ms  182ms
 seqscan + fastpath + memprefetch  2505ms  710ms  170ms  <-- \O/

Cold has no difference. That's just my disk demonstrating Linux RA at 128kB (default); random I/O is obviously a more interesting story. It's consistently a smidgen faster with Linux RA set to 2MB (as in blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect probably also increases on fancier faster storage than what I have on hand:

                                   cold
 master                            1775ms
 seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data kernel->userspace). The interesting column is hot. The 200ms->211ms regression is due to the extra bookkeeping in the slow path. The rejiggered fastpath code fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can you reproduce that? The memory prefetching trick, on top of that, seems to be a good optimisation so far.
Note that that's not an entirely independent trick, it's something we can only do now that we can see into the future; it's the next level up of prefetching, worth doing around 60ns before you need the data I guess. Who knows how thrashed the cache might be before the caller gets around to accessing that page, but there doesn't seem to be much of a cost or downside to this bet. We know there are many more opportunities like that[1] but I don't want to second-guess the AM here, I'm just betting that the caller is going to look at the header. Unfortunately there seems to be a subtle bug hiding somewhere in here, visible on macOS on CI. Looking into that, going to find my Mac... [1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com From 74b8cde45a8babcec7b52b06bdb8ea046a0a966f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 13:32:14 +1300 Subject: [PATCH v10 1/4] Use streaming I/O in heapam sequential scan. 
--- src/backend/access/heap/heapam.c | 100 +-- src/include/access/heapam.h | 15 + 2 files changed, 97 insertions(+), 18 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index dada2ecd1e3..f7946a39fd9 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +static BlockNumber +heap_scan_stream_read_next(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* + * Initialize to ForwardScanDirection because it is most common and heap + * scans usually must go forwards before going backward. + */ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM. * - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next b
Re: Built-in CTYPE provider
Hi,

+command_ok(
+	[
+		'initdb', '--no-sync',
+		'--locale-provider=builtin', '-E UTF-8',
+		'--builtin-locale=C.UTF-8', "$tempdir/data8"
+	],
+	'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');

This Sun animal recently turned on --enable-tap-tests, and that ↑ failed[1]:

# Running: initdb --no-sync --locale-provider=builtin -E UTF-8 --builtin-locale=C.UTF-8 /home/marcel/build-farm-15/buildroot/HEAD/pgsql.build/src/bin/initdb/tmp_check/tmp_test_XvK1/data8
The files belonging to this database system will be owned by user "marcel".
This user must also own the server process.
The database cluster will be initialized with this locale configuration:
  locale provider: builtin
  default collation: C.UTF-8
  LC_COLLATE: en_US
  LC_CTYPE: en_US
  LC_MESSAGES: C
  LC_MONETARY: en_US
  LC_NUMERIC: en_US
  LC_TIME: en_US
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that the selected locale uses (LATIN1) do not match. This would lead to misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination.
[14:04:12.462](0.036s) not ok 28 - locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-04-04%2011%3A42%3A40
Re: Streaming read-ready sequential scan code
On Fri, Apr 5, 2024 at 4:20 AM Melanie Plageman wrote: > So, sequential scan does not have per-buffer data. I did some logging > and the reason most fully-in-SB sequential scans don't use the fast > path is because read_stream->pending_read_nblocks is always 0. Hnghghghgh... right, sorry I guessed the wrong reason, it turns out that I made a fast path just a little too specialised for pg_prewarm. Working on it...
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 8:02 PM David Rowley wrote: > 3a4a3537a > latency average = 34.497 ms > latency average = 34.538 ms > > 3a4a3537a + read_stream_for_seqscans.patch > latency average = 40.923 ms > latency average = 41.415 ms > > i.e. no meaningful change from the refactor, but a regression from a > cached workload that changes the page often without doing much work in > between with the read stream patch. I ran Heikki's test except I ran the "insert" 4 times to get a table of 4376MB according to \d+. On my random cloud ARM server (SB=8GB, huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms when the data is in Linux cache and PG is not prewarmed, roughly as he reported. Good. If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying things out to see what works, I got that down to 243ms (ie beat master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
+	__builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to more than make up the difference in this case, and it may be questionable whether that cache line will survive long enough to help you, but this one-tuple-per-page test likes it... Hmm, to get a more realistic table than the one-tuple-per-page one, I tried doubling a tenk1 table until it reached 2759MB and then I got a post-prewarm regression 702ms -> 721ms, and again I can beat master by memory prefetching: 689ms. Annoyingly, with the same table I see no difference between the actual pg_prewarm('giga') time: around 155ms for both. pg_prewarm is able to use the 'fast path' I made pretty much just to be able to minimise regression in that (probably stupid) all-cached test that doesn't even look at the page contents.
Unfortunately seq scan can't use it because it has per-buffer data, which is one of the things it can't do (because of a space management issue). Maybe I should try to find a way to fix that. > I'm happy to run further benchmarks, but for the remainder of the > committer responsibility for the next patches, I'm going to leave that > to Thomas. Thanks!
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro wrote: > Alright what about this? Forgot to git add a change, new version. From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 20:31:26 +1300 Subject: [PATCH v2 1/2] Allow BufferAccessStrategy to limit pin count. When pinning extra buffers to look ahead, users of a strategy are in danger of pinning a lot of the buffers in the ring, or even more than the ring size. For some strategies, that means "escaping" from the ring, and in others it means forcing dirty data to disk very frequently with associated WAL flushing. Since external code has no insight into any of that, allow individual strategy types to expose a clamp that should be applied when deciding how many buffers to pin at once. Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 5 src/backend/storage/buffer/freelist.c | 35 +++ src/include/storage/bufmgr.h | 1 + 3 files changed, 41 insertions(+) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5..eab87f6f73 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -419,6 +419,7 @@ read_stream_begin_relation(int flags, size_t size; int16 queue_size; int16 max_ios; + int strategy_pin_limit; uint32 max_pinned_buffers; Oid tablespace_id; SMgrRelation smgr; @@ -460,6 +461,10 @@ read_stream_begin_relation(int flags, max_pinned_buffers = Min(max_pinned_buffers, PG_INT16_MAX - io_combine_limit - 1); + /* Give the strategy a chance to limit the number of buffers we pin. */ + strategy_pin_limit = GetAccessStrategyPinLimit(strategy); + max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers); + /* Don't allow this backend to pin more than its share of buffers. 
*/ if (SmgrIsTemp(smgr)) LimitAdditionalLocalPins(&max_pinned_buffers); diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 3611357fa3..c69590d6d8 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy) return strategy->nbuffers; } +/* + * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned + * + * Strategies can specify the maximum number of buffers that a user should pin + * at once when performing look-ahead. Callers should combine this number with + * other relevant limits and take the minimum. + */ +int +GetAccessStrategyPinLimit(BufferAccessStrategy strategy) +{ + if (strategy == NULL) + return NBuffers; + + switch (strategy->btype) + { + case BAS_BULKREAD: + + /* +* Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers +* shouldn't be a problem and the caller is free to pin up to the +* entire ring at once. +*/ + return strategy->nbuffers; + + default: + + /* +* Tell caller not to pin more than half the buffers in the ring. +* This is a trade-off between look ahead distance and deferring +* writeback and associated WAL traffic.
+*/ + return strategy->nbuffers / 2; + } +} + /* * FreeAccessStrategy -- release a BufferAccessStrategy object * diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index f380f9d9a6..07ba1a6050 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype, int ring_size_kb); extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy); +extern int GetAccessStrategyPinLimit(BufferAccessStrategy strategy); extern void FreeAccessStrategy(BufferAccessStrategy strategy); -- 2.39.3 (Apple Git-146) From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 21:11:06 +1300 Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB. The BAS_VACUUM ring size has been 256kB since commit d526575f. Commit 1cbbee03 made it configurable but retained
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 11:13 AM Andres Freund wrote: > The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so, > should we apply that only if the strategy doesn't use the > StrategyRejectBuffer() logic? Hmm, I don't really know, but that sounds plausible. What do you think about the attached? > I think for VACUUM we should probably go a bit further. There's no comparable > L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more > expensive, compared to a seqscan. I'd go for at least 4x-8x. Alright what about this? From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 20:31:26 +1300 Subject: [PATCH 1/2] Allow BufferAccessStrategy to limit pin count. When pinning extra buffers to look ahead, users of a strategy are in danger of pinning a lot of the buffers in the ring, or even more than the ring size. For some strategies, that means "escaping" from the ring, and in others it means forcing dirty data to disk very frequently with associated WAL flushing. Since external code has no insight into any of that, allow individual strategy types to expose a clamp that should be applied when deciding how many buffers to pin at once.
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 5 src/backend/storage/buffer/freelist.c | 35 +++ src/include/storage/bufmgr.h | 1 + 3 files changed, 41 insertions(+) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5..eab87f6f73 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -419,6 +419,7 @@ read_stream_begin_relation(int flags, size_t size; int16 queue_size; int16 max_ios; + int strategy_pin_limit; uint32 max_pinned_buffers; Oid tablespace_id; SMgrRelation smgr; @@ -460,6 +461,10 @@ read_stream_begin_relation(int flags, max_pinned_buffers = Min(max_pinned_buffers, PG_INT16_MAX - io_combine_limit - 1); + /* Give the strategy a chance to limit the number of buffers we pin. */ + strategy_pin_limit = GetAccessStrategyPinLimit(strategy); + max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers); + /* Don't allow this backend to pin more than its share of buffers. */ if (SmgrIsTemp(smgr)) LimitAdditionalLocalPins(&max_pinned_buffers); diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 3611357fa3..c69590d6d8 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy) return strategy->nbuffers; } +/* + * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned + * + * Strategies can specify the maximum number of buffers that a user should pin + * at once when performing look-ahead. Callers should combine this number with + * other relevant limits and take the minimum.
+ */ +int +GetAccessStrategyPinLimit(BufferAccessStrategy strategy) +{ + if (strategy == NULL) + return NBuffers; + + switch (strategy->btype) + { + case BAS_BULKREAD: + + /* +* Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers +* shouldn't be a problem and the caller is free to pin up to the +* entire ring at once. +*/ + return strategy->nbuffers; + + default: + + /* +* Tell caller not to pin more than half the buffers in the ring. +* This is a trade-off between look ahead distance and deferring +* writeback and associated WAL traffic. +*/ + return strategy->nbuffers / 2; + } +} + /* * FreeAccessStrategy -- release a BufferAccessStrategy object * diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index f380f9d9a6..07ba1a6050 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype, int ring_size_kb); extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy); +extern int GetAcc
WIP: Vectored writeback
Hi, Here are some vectored writeback patches I worked on in the 17 cycle and posted as part of various patch sets, but didn't get into a good enough shape to take further. They "push" vectored writes out, but I think what they need is to be turned inside out and converted into users of a new hypothetical write_stream.c, so that we have a model that will survive contact with asynchronous I/O and would "pull" writes from a stream that controls I/O concurrency. That all seemed a lot less urgent to work on than reads, hence leaving them on ice for now. There is a lot of code that reads, and a small finite amount that writes. I think the patches show some aspects of the problem-space though, and they certainly make checkpointing faster. They cover 2 out of 5ish ways we write relation data: checkpointing, and strategies AKA ring buffers. They make checkpoints look like this, respecting io_combine_limit, instead of lots of 8kB writes: pwritev(9,[...],2,0x0) = 131072 (0x20000) pwrite(9,...,131072,0x20000) = 131072 (0x20000) pwrite(9,...,131072,0x40000) = 131072 (0x20000) pwrite(9,...,131072,0x60000) = 131072 (0x20000) pwrite(9,...,131072,0x80000) = 131072 (0x20000) ... Two more ways data gets written back are: bgwriter and regular BAS_NORMAL buffer eviction, but they are not such natural candidates for write combining. Well, if you know you're going to write out a buffer, *maybe* it's worth probing the buffer pool to see if adjacent block numbers are also present and dirty? I don't know. Before and after? Or maybe it's better to wait for the tree-based mapping table of legend first so it becomes cheaper to navigate in block number order. The 5th way is raw file copy that doesn't go through the buffer pool, such as CREATE DATABASE ... STRATEGY=FILE_COPY, which already works with big writes, and CREATE INDEX via bulk_write.c which is easily converted to vectored writes, and I plan to push the patches for that shortly. I think those should ultimately become stream-based too. 
Anyway, I wanted to share these uncommitfest patches, having rebased them over relevant recent commits, so I could leave them in working state in case anyone is interested in this file I/O-level stuff... From c6d227678c586387a49c30c4f9a61f62c9b04b1c Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 13 Mar 2024 17:02:42 +1300 Subject: [PATCH v5 1/3] Provide vectored variant of FlushBuffer(). FlushBuffers() is just like FlushBuffer() except that it tries to write out multiple consecutive disk blocks at a time with smgrwritev(). The traditional simple FlushBuffer() call is now implemented in terms of it. --- src/backend/storage/buffer/bufmgr.c | 240 src/include/storage/buf_internals.h | 10 ++ 2 files changed, 185 insertions(+), 65 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 06e9ffd2b00..a815e58e307 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -57,6 +57,7 @@ #include "storage/smgr.h" #include "storage/standby.h" #include "utils/memdebug.h" +#include "utils/memutils.h" #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/resowner.h" @@ -526,6 +527,8 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr, static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context); static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object, IOContext io_context); +static int FlushBuffers(BufferDesc **bufs, int nbuffers, SMgrRelation reln, + IOObject io_object, IOContext io_context); static void FindAndDropRelationBuffers(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber nForkBlock, @@ -3705,9 +3708,19 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum, *blknum = bufHdr->tag.blockNum; } +struct shared_buffer_write_error_info +{ + BufferDesc *buf; + int nblocks; +}; + /* - * FlushBuffer - * Physically write out a shared buffer. 
+ * FlushBuffers + * Physically write out shared buffers. + * + * The buffers do not have to be consecutive in memory but must refer to + * consecutive blocks of the same relation fork in increasing order of block + * number. * * NOTE: this actually just passes the buffer contents to the kernel; the * real write to disk won't happen until the kernel feels like it. This @@ -3715,66 +3728,149 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum, * However, we will need to force the changes to disk via fsync before * we can checkpoint WAL. * - * The caller must hold a pin on the buffer and have share-locked the + * The caller must hold a pin on the buffers and have share-locked the * buffer contents. (Note: a share-lock does not prevent updates of * hint bits in the buffer, so the page could change while t
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On Fri, Mar 8, 2024 at 6:20 AM Maxim Orlov wrote: > Quite an interesting patch, in my opinion. I've decided to work on it a bit, > did some refactoring (sorry) and added > basic tests. Also, I tried to take into account as much as possible the notes on > the patch mentioned by Cédric Villemain. Thanks! Unfortunately I don't think it's possible to include a regression test that looks at the output, because it'd be non-deterministic. Any other backend could pin or dirty the buffer you try to evict, changing the behaviour. > > and maybe better to go with FlushOneBuffer() ? > It's a good idea, but I'm not sure at the moment. I'll try to dig some > deeper into it. At least, FlushOneBuffer does > not work for local buffers. So, we have to decide whether > pg_buffercache_invalidate should or should not > work for local buffers. For now, I don't see why it should not. Maybe I > miss something? I think it's OK to ignore local buffers for now. pg_buffercache generally doesn't support/show them so I don't feel inclined to support them for this. I removed a few traces of local support. It didn't seem appropriate to use the pg_monitor role for this, so I made it superuser-only. I don't think it makes much sense to use this on any kind of production system so I don't think we need a new role for it, and existing roles don't seem too appropriate. pageinspect et al use the same approach. I added a VOLATILE qualifier to the function. I added some documentation. I changed the name to pg_buffercache_evict(). I got rid of the 'force' flag which was used to say 'I only want to evict this buffer if it is clean'. I don't really see the point in that, we might as well keep it simple. You could filter buffers on "isdirty" if you want. I added comments to scare anyone off using EvictBuffer() for anything much, and marked it as something for developer convenience. 
(I am aware of an experimental patch that uses this same function as part of a buffer pool resizing operation, but that has other infrastructure to make that safe and would adjust those remarks accordingly.) I wondered whether it should really be testing for BM_TAG_VALID rather than BM_VALID. Arguably, but it doesn't seem important for now. The distinction would arise if someone had tried to read in a buffer, got an I/O error and abandoned ship, leaving a buffer with a valid tag but not valid contents. Anyone who tries to ReadBuffer() it will then try to read it again, but in the meantime this function won't be able to evict it (it'll just return false). Doesn't seem that obvious to me that this obscure case needs to be handled. That doesn't happen *during* a non-error case, because then it's pinned and we already return false in this code for pins. I contemplated whether InvalidateBuffer() or InvalidateVictimBuffer() would be better here and realised that Andres's intuition was probably right when he suggested the latter up-thread. It is designed with the right sort of arbitrary concurrent activity in mind, where the former assumes things about locking and dropping, which could get us into trouble if not now maybe in the future. I ran the following diabolical buffer blaster loop while repeatedly running installcheck: do $$ begin loop perform pg_buffercache_evict(bufferid) from pg_buffercache where random() <= 0.25; end loop; End; $$; The only ill-effect was a hot laptop. Thoughts, objections, etc? Very simple example of use: create or replace function uncache_relation(name text) returns boolean begin atomic; select bool_and(pg_buffercache_evict(bufferid)) from pg_buffercache where reldatabase = (select oid from pg_database where datname = current_database()) and relfilenode = pg_relation_filenode(name); end; More interesting for those of us hacking on streaming I/O stuff was the ability to evict just parts of things and see how the I/O merging and I/O depth react. 
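For that kind of experiment, something like the following untested sketch evicts only every second block of a relation, leaving holes for the I/O merging code to contend with (it assumes this patch's pg_buffercache_evict() is installed and uses the standard pg_buffercache view columns; the table name is hypothetical):

```sql
-- Evict the even-numbered blocks of one relation in the current database,
-- so a following sequential scan exercises read combining across the gaps.
select count(*) filter (where pg_buffercache_evict(bufferid)) as evicted
from pg_buffercache
where reldatabase = (select oid from pg_database
                     where datname = current_database())
  and relfilenode = pg_relation_filenode('my_table')  -- hypothetical name
  and relblocknumber % 2 = 0;
```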
v5-0001-Add-pg_buffercache_evict-function-for-testing.patch Description: Binary data
Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?
On Thu, Apr 4, 2024 at 11:51 AM Peter Eisentraut wrote: > On 30.03.24 22:27, Thomas Munro wrote: > > Hmm, OK so it doesn't have 3 available in parallel from base repos. > > But it's also about to reach end of "full support" in 2 months[1], so > > if we applied the policies we discussed in the LLVM-vacuuming thread > > (to wit: build farm - EOL'd OSes), then... One question I'm unclear > > on is whether v17 will be packaged for RHEL8. > > The rest of the thread talks about the end of support of RHEL 7, but you > are here talking about RHEL 8. It is true that "full support" for RHEL > 8 ended in May 2024, but that is not the one we are tracking. We > are tracking the 10-year one, which I suppose is now called "maintenance > support". I might have confused myself with the two EOLs and some wishful thinking. I am a lot less worked up about this general topic now that RHEL has moved to "rolling" LLVM updates in minor releases, removing a physical-pain-inducing 10-year vacuuming horizon (that's 20 LLVM major releases and they only fix bugs in one...). I will leave openssl discussions to those more knowledgeable about that. > So if the above package list is correct, then we ought to keep > supporting openssl 1.1.* until 2029. That's a shame. But it sounds like the developer burden isn't so different from 1.1.1 to 3.x, so maybe it's not such a big deal from our point of view. (I have no opinion on the security ramifications of upstream's EOL, but as a layman it sounds completely bonkers to use it. I wonder why the packaging community wouldn't just arrange to have a supported-by-upstream 3.x package in their RPM repo when they supply the newest PostgreSQL versions for the oldest RHEL, but again not my area so I'll shut up).
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman wrote: > On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas wrote: > > On 01/04/2024 22:58, Melanie Plageman wrote: > > > Attached v7 has version 14 of the streaming read API as well as a few > > > small tweaks to comments and code. > > > > I saw benchmarks in this thread to show that there's no regression when > > the data is in cache, but I didn't see any benchmarks demonstrating the > > benefit of this. So I ran this quick test: > > Good point! It would be good to show why we would actually want this > patch. Attached v8 is rebased over current master (which now has the > streaming read API). Anecdotally, by all reports I've seen so far, all-in-cache seems to be consistently a touch faster than master if anything, for streaming seq scan and streaming ANALYZE. That's great, but it doesn't seem to be due to intentional changes. No efficiency is coming from batching anything. Perhaps it has to do with CPU pipelining effects: though it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work itself is cut up into stages in a kind of pipeline: read_stream_next_buffer() chooses page n + 2, pins page n + 1 and returns page n each time you call it, so maybe we get more CPU parallelism due to spreading the data dependencies out? (Makes me wonder what happens if you insert a memory prefetch for the page header into that production line, and if there are more opportunities to spread dependencies out eg hashing the buffer tag ahead of time.) > On the topic of BAS_BULKREAD buffer access strategy, I think the least > we could do is add an assert like this to read_stream_begin_relation() > after calculating max_pinned_buffers. > > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); > > Perhaps we should do more? I think with a ring size of 16 MB, large > SELECTs are safe for now. 
But I know future developers will make > changes and it would be good not to assume they will understand that > pinning more buffers than the size of the ring effectively invalidates > the ring. Yeah I think we should clamp max_pinned_buffers if we see a strategy. What do you think about: if (strategy) { int strategy_buffers = GetAccessStrategyBufferCount(strategy); max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers); } I just don't know where to get that '2'. The idea would be to hopefully never actually be constrained by it in practice, except in tiny/toy setups, so we can't go too wrong with our number '2' there. Then we should increase the default ring sizes for BAS_BULKREAD and BAS_VACUUM to make the previous statement true. The size of main memory and L2 cache have increased dramatically since those strategies were invented. I think we should at least double them, and more likely quadruple them. I realise you already made them configurable per command in commit 1cbbee033, but I mean the hard coded default 256 in freelist.c. It's not only to get more space for our prefetching plans, it's also to give the system more chance of flushing WAL asynchronously/in some other backend before you crash into dirty data; as you discovered, prefetching accidentally makes that effect worse in a BAS_VACUUM strategy, by taking away space that is effectively deferring WAL flushes, so I'd at least like to double the size if we use "/ 2" above.
Re: Streaming I/O, vectored I/O (WIP)
On Tue, Apr 2, 2024 at 9:39 PM Thomas Munro wrote: > So this is the version I'm going to commit shortly, barring objections. And done, after fixing a small snafu with smgr-only reads coming from CreateAndCopyRelationData() (BM_PERMANENT would be incorrectly/unnecessarily set for unlogged tables). Here are the remaining patches discussed in this thread. They give tablespace-specific io_combine_limit, effective_io_readahead_window (is this useful?), and up-to-1MB io_combine_limit (is this useful?). I think the first two would probably require teaching reloption.c how to use guc.c's parse_int() and unit flags, but I won't have time to look at that for this release so I'll just leave these here. On the subject of guc.c, this is a terrible error message... did I do something wrong? postgres=# set io_combine_limit = '42MB'; ERROR: 5376 8kB is outside the valid range for parameter "io_combine_limit" (1 .. 32) From 84b8280481312cdd1efcb7efa1182d4647cbe00a Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 30 Mar 2024 19:09:44 +1300 Subject: [PATCH v16 1/4] ALTER TABLESPACE ... SET (io_combine_limit = ...). This is the per-tablespace version of the GUC of the same name. XXX reloptions.c lacks the ability to accept units eg '64kB'! Which is why I haven't included it with the main feature commit. 
Suggested-by: Tomas Vondra https://postgr.es/m/f603ac51-a7ff-496a-99c1-76673635692e%40enterprisedb.com --- doc/src/sgml/ref/alter_tablespace.sgml | 9 +++--- src/backend/access/common/reloptions.c | 12 +++- src/backend/storage/aio/read_stream.c | 39 -- src/backend/utils/cache/spccache.c | 14 + src/bin/psql/tab-complete.c| 3 +- src/include/commands/tablespace.h | 1 + src/include/utils/spccache.h | 1 + 7 files changed, 58 insertions(+), 21 deletions(-) diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml index 6ec863400d1..faf0c6e7fbc 100644 --- a/doc/src/sgml/ref/alter_tablespace.sgml +++ b/doc/src/sgml/ref/alter_tablespace.sgml @@ -84,16 +84,17 @@ ALTER TABLESPACE name RESET ( , , , - ). This may be useful if + ), + ). This may be useful if one tablespace is located on a disk which is faster or slower than the remainder of the I/O subsystem. diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index d6eb5d85599..1e1c611fab2 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -371,6 +371,15 @@ static relopt_int intRelOpts[] = 0, 0, 0 #endif }, + { + { + "io_combine_limit", + "Limit on the size of data reads and writes.", + RELOPT_KIND_TABLESPACE, + ShareUpdateExclusiveLock + }, + -1, 1, MAX_IO_COMBINE_LIMIT + }, { { "parallel_workers", @@ -2089,7 +2098,8 @@ tablespace_reloptions(Datum reloptions, bool validate) {"random_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, random_page_cost)}, {"seq_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, seq_page_cost)}, {"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)}, - {"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)} + {"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)}, + {"io_combine_limit", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, 
io_combine_limit)}, }; return (bytea *) build_reloptions(reloptions, validate, diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5e..907c80e6bf9 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -114,6 +114,7 @@ struct ReadStream int16 max_pinned_buffers; int16 pinned_buffers; int16 distance; + int16 io_combine_limit; bool advice_enabled; /* @@ -241,7 +242,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice) /* This should only be called with a pending read. */ Assert(stream->pending_read_nblocks > 0); - Assert(stream->pending_read_nblocks <= io_combine_limit); + Assert(stream->pending_read_nblocks <= stream->io_combine_limit); /* We had better not exceed the pin limit by starting this read. */ Assert(stream->pinned_buffers + stream->pending_read_nblocks <= @@ -329,7 +330,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice) int16 buffer_index; void *per_buffer_data; - if (stream->pending_read_nblocks == io_combine_limit) + if (stream->pending_read_nblocks == stream->io_combine_limit) { read_stream_start_pending_read(stream, suppress_advice); suppress_advice = false; @@ -389,7 +390,7 @@ read_stream_look_ahead(ReadStream *stream
Re: Building with musl in CI and the build farm
On Wed, Mar 27, 2024 at 11:27 AM Wolfgang Walther wrote: > The animal runs in a docker container via GitHub Actions in [2]. Great idea :-)