Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-08 Thread Thomas Munro
On Fri, Jun 7, 2024 at 8:05 PM Alvaro Herrera  wrote:
> In passing, I noticed that WaitReadBuffers has zero comments, which
> seems an insufficient number of them.

Ack.  Here is a patch for that.  I guess I hadn't put a comment there
because it's hard to write anything without sort of duplicating what
is already said by the StartReadBuffers() comment and doubling down on
descriptions of future plans... but, in for a penny, in for a pound as
they say.
From db5e56825e4b4c595885373c413011c50fedd3e8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 9 Jun 2024 09:51:02 +1200
Subject: [PATCH] Add comment for WaitReadBuffers().

Per complaint from Alvaro while reviewing something else.

Reported-by: Alvaro Herrera 
Discussion: https://postgr.es/m/202406070805.icofqromovvn%40alvherre.pgsql
---
 src/backend/storage/buffer/bufmgr.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a8e3377263f..409210a6ed2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1371,6 +1371,17 @@ WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
 		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
 }
 
+/*
+ * The second step of a StartReadBuffers(), WaitReadBuffers() sequence.  It is
+ * only necessary to call this if StartReadBuffers() returned true, indicating
+ * that I/O was necessary.
+ *
+ * Currently, this function performs synchronous I/O system calls.  The earlier
+ * StartReadBuffers() call might have issued advice to give the OS a head
+ * start so that it completes quickly.  In future work, a true asynchronous I/O
+ * could be started by StartReadBuffers(), and this function would wait for it
+ * to complete.
+ */
 void
 WaitReadBuffers(ReadBuffersOperation *operation)
 {
-- 
2.45.1
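
The intended two-step call sequence looks roughly like this (a sketch
only, assuming the v17-era declarations in bufmgr.h; the caller-filled
fields of the operation are elided):

    ReadBuffersOperation op;
    Buffer               buffers[MAX_IO_COMBINE_LIMIT];
    BlockNumber          blocknum = 0;  /* first block of the range */
    int                  nblocks = 4;   /* in: wanted, out: combined */

    /* ... fill in op.rel, op.smgr, op.forknum etc. (see bufmgr.h) ... */

    if (StartReadBuffers(&op, buffers, blocknum, &nblocks, 0))
        WaitReadBuffers(&op);           /* only if I/O was necessary */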



Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-08 Thread Thomas Munro
New version.  Same code as v2, but comments improved to explain the
reasoning, with reference to README's buffer access rules.  I'm
planning to push this soon if there are no objections.
From 1fa26f407622cd69d82f3b4ea68fddf2c357cf06 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH v3] Fix RBM_ZERO_AND_LOCK.

Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool.  It should always lock the page, but it should only
zero pages that were not already found as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet.  This is done with the standard
BM_IO_IN_PROGRESS infrastructure to test for BM_VALID and wake
concurrent waiters without races (as it was in earlier releases).

Reported-by: Noah Misch 
Reported-by: Alexander Lakhin 
Reviewed-by: Alvaro Herrera  (earlier version)
Reviewed-by: Robert Haas  (earlier version)
Discussion: https://postgr.es/m/20240512171658.7e.nmi...@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 64 -
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..a8e3377263f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1010,43 +1010,69 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 }
 
 /*
- * Zero a buffer and lock it, as part of the implementation of
+ * Lock and optionally zero a buffer, as part of the implementation of
  * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
- * pinned.  It does not have to be valid, but it is valid and locked on
- * return.
+ * pinned.  If the buffer is not already valid, it is zeroed and made valid.
  */
 static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 {
 	BufferDesc *bufHdr;
-	uint32		buf_state;
+	bool		need_to_zero;
 
 	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 
-	if (BufferIsLocal(buffer))
+	if (already_valid)
+	{
+		/*
+		 * If the caller already knew the buffer was valid, we can skip some
+		 * header interaction.  The caller just wants to lock the buffer.
+		 */
+		need_to_zero = false;
+	}
+	else if (BufferIsLocal(buffer))
+	{
+		/* Simple case for non-shared buffers. */
 		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+		need_to_zero = (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
 	else
 	{
+		/*
+		 * Take BM_IO_IN_PROGRESS, or discover that BM_VALID has been set
+		 * concurrently.  Even though we aren't doing I/O, that protocol
+		 * ensures that we don't zero a page that someone else has pinned.  An
+		 * exclusive content lock wouldn't be enough, because readers are
+		 * allowed to drop the content lock after determining that a tuple is
+		 * visible (see buffer access rules in README).
+		 */
 		bufHdr = GetBufferDescriptor(buffer - 1);
+		need_to_zero = StartBufferIO(bufHdr, true, false);
+	}
+
+	if (need_to_zero)
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	/* Acquire the requested lock before setting BM_VALID. */
+	if (!BufferIsLocal(buffer))
+	{
 		if (mode == RBM_ZERO_AND_LOCK)
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		else
 			LockBufferForCleanup(buffer);
 	}
 
-	memset(BufferGetPage(buffer), 0, BLCKSZ);
-
-	if (BufferIsLocal(buffer))
-	{
-		buf_state = pg_atomic_read_u32(&bufHdr->state);
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
+	/*
+	 * Set BM_VALID, and wake anyone waiting for BM_IO_IN_PROGRESS to be
+	 * cleared.
+	 */
+	if (need_to_zero)
 	{
-		buf_state = LockBufHdr(bufHdr);
-		buf_state |= BM_VALID;
-		UnlockBufHdr(bufHdr, buf_state);
+		if (BufferIsLocal(buffer))
+			pg_atomic_write_u32(&bufHdr->state,
+								pg_atomic_read_u32(&bufHdr->state) | BM_VALID);
+		else
+			TerminateBufferIO(bufHdr, false, BM_VALID, true);
 	}
 }
 
@@ -1185,7 +1211,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 
 		buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
   forkNum, blockNum, strategy, &found);
-		ZeroBuffer(buffer, mode);
+		ZeroAndLockBuffer(buffer, mode, found);
 		return buffer;
 	}
 
-- 
2.45.1



Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Thomas Munro
On Sat, Jun 8, 2024 at 12:47 AM Robert Haas  wrote:
>
> On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera  wrote:
> > >  static void
> > > -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> > > +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
> >
> > This change makes the API very strange.  Should the function be called
> > ZeroAndLockBuffer() instead?  Then the addition of a "bool zero"
> > argument makes a lot more sense.
>
> I agree that's better, but it still looks a bit weird. You have to
> realize that 'bool zero' means 'is already zeroed' here -- or at
> least, I guess that's the intention. But then I wonder why you'd call
> a function called ZeroAndLockBuffer if all you need to do is
> LockBuffer.

The name weirdness comes directly from RBM_ZERO_AND_LOCK (the fact
that it doesn't always zero despite shouting ZERO is probably what
temporarily confused me).  But coming up with a better name is hard
and I certainly don't propose to change it now.  I think it's
reasonable for this internal helper function to have that matching
name as Alvaro suggested, with a good comment about that.

Even though that quick-demonstration change fixed the two reported
repros, I think it is still probably racy (or if it isn't, it relies
on higher level interlocking that I don't want to rely on).  This case
really should be using the standard StartBufferIO/TerminateBufferIO
infrastructure as it was before.  I had moved that around to deal with
multi-block I/O, but dropped the ball on the zero case... sorry.

Here's a version like that.  The "zero" argument (yeah that was not a
good name) is now inverted and called "already_valid", but it's only a
sort of optimisation for the case where we already know for sure that
it's valid.  If it isn't, we do the standard
BM_IO_IN_PROGRESS/BM_VALID/CV dance, for correct interaction with any
concurrent read or zero operation.
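
Condensed from the v3 patch above, the shared-buffer path of that dance
is shaped like this (a sketch only; the local-buffer and already-valid
shortcuts are omitted):

    bufHdr = GetBufferDescriptor(buffer - 1);

    /* Win the right to "do the I/O" (zero), or see BM_VALID set. */
    need_to_zero = StartBufferIO(bufHdr, true, false);

    if (need_to_zero)
        memset(BufferGetPage(buffer), 0, BLCKSZ);

    /* Acquire the requested lock before setting BM_VALID. */
    if (mode == RBM_ZERO_AND_LOCK)
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    else
        LockBufferForCleanup(buffer);

    /* Set BM_VALID and wake anyone waiting on BM_IO_IN_PROGRESS. */
    if (need_to_zero)
        TerminateBufferIO(bufHdr, false, BM_VALID, true);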


v2-0001-Fix-RBM_ZERO_AND_LOCK.patch
Description: Binary data


Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Thomas Munro
On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro  wrote:
> On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin  wrote:
> > My bisect run ended with:
> > 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit
> >
> > Author: Thomas Munro 
> > Date:   Wed Apr 3 00:03:08 2024 +1300
> >
> >  Provide vectored variant of ReadBuffer().
> >
> > Other buildfarm failures with this Assert I could find kind of confirm this:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
> (presumably a first failure of this sort)
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08
>
> Looking...

What Noah described[1] is what should be happening already, I think,
but 210622c6 unconditionally zeroed the page.  Oops.  The attached
seems to cure his repro for me.  Does it also cure your test?  I
couldn't see that variant myself for some reason, but it seems to make
sense as the explanation.  I would probably adjust the function name
or perhaps consider refactoring slightly, but first let's confirm that
this is the same issue and fix.

[1] 
https://www.postgresql.org/message-id/flat/20240512171658.7e.nmi...@google.com
From f3bb1d69a57bea820895efaf366371463e62235d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 7 Jun 2024 17:49:19 +1200
Subject: [PATCH] Fix RBM_ZERO_AND_LOCK.

Commit 210622c6 accidentally zeroed out pages even if they were found in
the buffer pool.  It should always lock the page, but it should only
zero pages that were not already found as an optimization to avoid I/O.
Otherwise, concurrent readers that hold only a pin might see corrupted
page contents changing under their feet.

Reported-by: Noah Misch 
Reported-by: Alexander Lakhin 
Discussion: https://postgr.es/m/20240512171658.7e.nmi...@google.com
Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..238fc0e3547 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1016,7 +1016,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * return.
  */
 static void
-ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
 {
 	BufferDesc *bufHdr;
 	uint32		buf_state;
@@ -1034,7 +1034,8 @@ ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 			LockBufferForCleanup(buffer);
 	}
 
-	memset(BufferGetPage(buffer), 0, BLCKSZ);
+	if (zero)
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
 
 	if (BufferIsLocal(buffer))
 	{
@@ -1185,7 +1186,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 
 		buffer = PinBufferForBlock(rel, smgr, smgr_persistence,
   forkNum, blockNum, strategy, &found);
-		ZeroBuffer(buffer, mode);
+		ZeroBuffer(buffer, mode, !found);
 		return buffer;
 	}
 
-- 
2.45.1



Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-06 Thread Thomas Munro
On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin  wrote:
> My bisect run ended with:
> 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit
>
> Author: Thomas Munro 
> Date:   Wed Apr 3 00:03:08 2024 +1300
>
>  Provide vectored variant of ReadBuffer().
>
> Other buildfarm failures with this Assert I could find kind of confirm this:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
> (presumably a first failure of this sort)
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08

Looking...




Re: Streaming I/O, vectored I/O (WIP)

2024-05-24 Thread Thomas Munro
On Wed, Nov 29, 2023 at 1:17 AM Thomas Munro  wrote:
> Done.  I like it, I just feel a bit bad about moving the p*v()
> replacement functions around a couple of times already!  I figured it
> might as well be static inline even if we use the fallback (= Solaris
> and Windows).

Just for the record, since I'd said things like the above a few times
while writing about this stuff: Solaris 11.4.69 has gained preadv()
and pwritev().  That's interesting because it means that there will
soon be no live Unixoid operating systems left without them, and the
fallback code in src/include/port/pg_iovec.h will, in practice, be
only for Windows.  I wondered if that might have implications for how
we code or comment stuff like that, but it still seems to make sense
as we have it.
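
That fallback is essentially a pread() emulation loop along these lines
(a sketch, not the exact header contents):

    static inline ssize_t
    pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
    {
        ssize_t sum = 0;

        for (int i = 0; i < iovcnt; i++)
        {
            ssize_t part = pread(fd, iov[i].iov_base, iov[i].iov_len, offset);

            if (part < 0)
                return i == 0 ? -1 : sum;
            sum += part;
            offset += part;
            if ((size_t) part < iov[i].iov_len)
                return sum;     /* short read: stop here */
        }
        return sum;
    }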

(I don't think Windows can have a real synchronous implementation; the
kernel knows how to do scatter/gather, a feature implemented
specifically for databases, but only in asynchronous ("overlapped") +
direct I/O mode, a difference I don't know how to hide at this level.
In later AIO work we should be able to use it as intended, but not by
pretending to be Unix like this.)




Re: processes stuck in shutdown following OOM/recovery

2024-05-22 Thread Thomas Munro
On Thu, May 23, 2024 at 9:58 AM Martijn Wallet  wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:not tested
>
> Hi, I somehow fail to be able to mark all checkboxes on this review page...
> However, build and tested with all passed successfully on Rocky Linux release 
> 8.9 (Green Obsidian).
> Not sure if more reviewing is needed on other Operating Systems since this is 
> only my second review.

Thanks!

I'm also hoping to get review of the rather finickity state machine
logic involved from people familiar with that; I think it's right, but
I'd hate to break some other edge case...

> nb: second mail to see spf is fixed and Thomas receives this message.

FTR 171641337152.1103.7326466732639994038.p...@coridan.postgresql.org
and 171641505305.1105.9868637944637520353.p...@coridan.postgresql.org
both showed up in my inbox, and they both have headers "Received-SPF:
pass ...".




Re: Streaming read-ready sequential scan code

2024-05-20 Thread Thomas Munro
On Tue, May 21, 2024 at 9:11 AM Melanie Plageman
 wrote:
> So, if you are seeing the slow-down mostly go away by reducing
> blocknums array size, does the regression only appear when the scan
> data is fully in shared buffers? Or is this blocknums other use
> (dealing with short reads)?

That must be true (that blocknums array is normally only "filled" in
the "fast path", where all buffers are found in cache).

> Is your theory that one worker ends up reading 16 blocks that should
> have been distributed across multiple workers?

Yes, it just jiggles the odds around a bit, introducing a bit of extra
unfairness by calling the callback in a tighter loop to build a little
batch, revealing a pre-existing problem.

The mistake in PHJ (problem #2 above) is that, once a worker decides
it would like all workers to stop inserting so it can increase the
number of buckets, it sets a flag to ask them to do that, and waits
for them to see it, but if there is a worker filtering all tuples out,
it never checks the "growth" flag.  So it scans all the way to the end
while the other guy waits.  Normally it checks that flag when it is
time to allocate a new chunk of memory, which seemed to make sense to
me at the time: if we've hit the needs-more-buckets (or
needs-more-batches) logic, then surely workers are inserting tuples
and will soon allocate a new chunk!  But, of course, here is the edge
case where that isn't true: we had bad estimates so hash table too
small (problem #1), we got lots of tuples clustered over a few heap
pages and decided to expand the hash table, but right at that moment,
matching tuples ran out so somebody had to finish the whole scan
without ever checking the flag (problem #2), and that someone happened
to have all the rest of the pages because we made the lookahead a bit
less fair (problem #3).  Nice confluence of problems.  I expect #2 and
#3 to be easy to fix, and I didn't look at the estimation problem #1
at all (perhaps a stats puzzle designed by the TPC to trip us up?).
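
In pseudocode, the shape of problem #2 is roughly this (hypothetical
helper names, heavily condensed from the real code):

    for (;;)
    {
        tuple = get_next_tuple_from_scan();     /* hypothetical */
        if (tuple == NULL)
            break;              /* scanned to the end without ever... */
        if (!passes_quals(tuple))
            continue;           /* ...reaching the allocation below */

        if (current_chunk_is_full())
        {
            /* The growth flag is only noticed here, on allocation. */
            if (pstate->growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
                help_increase_buckets();
            allocate_new_chunk();
        }
        insert_into_hash_table(tuple);
    }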




Re: commitfest.postgresql.org is no longer fit for purpose

2024-05-19 Thread Thomas Munro
On Mon, May 20, 2024 at 1:09 PM Tom Lane  wrote:
> (The cfbot tends to discourage this, since as soon as one of the
> patches is committed it no longer knows how to apply the rest.
> Can we improve on that tooling somehow?)

Cfbot currently applies patches with (GNU) patch
--no-backup-if-mismatch -p1 -V none -f.  The -f means that any
question of the form "Reversed (or previously applied) patch
detected!  Assume -R? [y]" is answered with "yes", and the operation
fails.  I wondered if --forward would be better.  It's the right idea,
but it seems to be useless in practice for this purpose because the
command still fails:

tmunro@phonebox postgresql % patch --forward -p1 < x.patch || echo XXX failed $?
patching file 'src/backend/postmaster/postmaster.c'
Ignoring previously applied (or reversed) patch.
1 out of 1 hunks ignored--saving rejects to
'src/backend/postmaster/postmaster.c.rej'
XXX failed 1

I wondered if it might be distinguishable from other kinds of failure
that should stop progress, but nope:

   patch's exit status is 0 if all hunks are applied successfully, 1
   if some hunks cannot be applied or there were merge conflicts,
   and 2 if there is more serious trouble.  When applying a set of
   patches in a loop it behooves you to check this exit status so
   you don't apply a later patch to a partially patched file.

I guess I could parse stdout or whatever that is and detect
all-hunks-ignored condition, but that doesn't sound like fun...

Perhaps cfbot should test explicitly for patches that have already
been applied with something like "git apply --reverse --check", and
skip them.  That would work for exact patches, but of course it would
be confused by any tweaks made before committing.  If going that way,
it might make sense to switch to git apply/am (instead of GNU patch),
to avoid contradictory conclusions.

The reason I was using GNU patch in the first place is that it is/was
a little more tolerant of some of the patches people used to send a
few years back, but now I expect everyone uses git format-patch and
would be prepared to change their ways if not.  In the past we had a
couple of cases of the reverse, that is, GNU patch couldn't apply
something that format-patch produced (some edge case of renaming,
IIRC) and I'm sorry that I never got around to changing that.

Sometimes I question the sanity of the whole thing.  Considering
cfbot's original "zero-effort CI" goal (or I guess "zero-extra-effort"
would be better), I was curious about what other projects had the same
idea, or whether we're really just starting at the "wrong end", and
came up with:

https://github.com/getpatchwork/patchwork
http://vger.kernel.org/bpfconf2022_material/lsfmmbpf2022-bpf-ci.pdf  <-- example user
https://github.com/patchew-project/patchew

Actually cfbot requires more effort than those, because it's driven
first by Commitfest app registration.  Those projects are extremists
IIUC: just write to a mailing list, no other bureaucracy at all (at
least for most participants, presumably administrators can adjust the
status in some database when things go wrong?).  We're actually
halfway to Gitlab et al already, with a web account and interaction
required to start the process of submitting a patch for consideration.
What I'm less clear on is who else has come up with the "bitrot" test
idea, either at the mailing list or web extremist ends of the scale.
Those are also generic tools, and cfbot obviously knows lots of things
about PostgreSQL, like the "highlights" and probably more things I'm
forgetting.




Re: Streaming read-ready sequential scan code

2024-05-18 Thread Thomas Munro
On Sun, May 19, 2024 at 7:00 AM Alexander Lakhin  wrote:
> With blocknums[1], timing is changed, but the effect is not persistent.
> 10 query15 executions in a row, b7b0f3f27:
> 277.932 ms
> 281.805 ms
> 278.335 ms
> 281.565 ms
> 284.167 ms
> 283.171 ms
> 281.165 ms
> 281.615 ms
> 285.394 ms
> 277.301 ms

The bad time 10/10.

> b7b0f3f27~1:
> 159.789 ms
> 165.407 ms
> 160.893 ms
> 159.343 ms
> 160.936 ms
> 161.577 ms
> 161.637 ms
> 163.421 ms
> 163.143 ms
> 167.109 ms

The good time 10/10.

> b7b0f3f27 + blocknums[1]:
> 164.133 ms
> 280.920 ms
> 160.748 ms
> 163.182 ms
> 161.709 ms
> 161.998 ms
> 161.239 ms
> 276.256 ms
> 161.601 ms
> 160.384 ms

The good time 8/10, the bad time 2/10.

Thanks for checking!  I bet all branches can show that flip/flop
instability in these adverse conditions, depending on random
scheduling details.  I will start a new thread with a patch for the
root cause of that, ie problem #2 (this will need back-patching), and
post a fix for #3 (v17 blocknums[N] tweak affecting
fairness/likelihood, which was probably basically a bit of ill-advised
premature optimisation) here in a few days.




Re: Requiring LLVM 14+ in PostgreSQL 18

2024-05-18 Thread Thomas Munro
On Sun, May 19, 2024 at 10:46 AM Ole Peder Brandtzæg
 wrote:
> On Wed, May 15, 2024 at 07:20:09AM +0200, Peter Eisentraut wrote:
> > Yes, let's get that v3-0001 patch into PG17.
>
> Upon seeing this get committed in 4dd29b6833, I noticed that the docs
> still advertise the llvm-config-$version search dance. That's still
> correct for Meson-based builds since we use their config-tool machinery,
> but no longer holds for configure-based builds. The attached patch
> updates the docs accordingly.

Oops, right I didn't know we had that documented.  Thanks.  Will hold
off doing anything until the thaw.

Hmm, I also didn't know that Meson had its own list like our just-removed one:

https://github.com/mesonbuild/meson/blob/master/mesonbuild/environment.py#L183

Unsurprisingly, it suffers from maintenance lag, priority issues etc
(new major versions pop out every 6 months):

https://github.com/mesonbuild/meson/issues/10483




Re: speed up a logical replica setup

2024-05-18 Thread Thomas Munro
040_pg_createsubscriber.pl seems to be failing occasionally on
culicidae near "--dry-run on node S".  I couldn't immediately see why.
That animal is using EXEC_BACKEND and I also saw a one-off failure a
bit like that on my own local Linux + EXEC_BACKEND test run
(sorry I didn't keep the details around).  Coincidence?




Re: race condition when writing pg_control

2024-05-17 Thread Thomas Munro
On Fri, May 17, 2024 at 4:46 PM Thomas Munro  wrote:
> The specific problem here is that LocalProcessControlFile() runs in
> every launched child for EXEC_BACKEND builds.  Windows uses
> EXEC_BACKEND, and Windows' NTFS file system is one of the two file
> systems known to this list to have the concurrent read/write data
> mashing problem (the other being ext4).

Phngh... this is surprisingly difficult to fix.

Things that don't work: We "just" need to acquire ControlFileLock
while reading the file or examining the object in shared memory, or
get a copy of it, passed through the EXEC_BACKEND BackendParameters
that was acquired while holding the lock, but the current location of
this code in child startup is too early to use LWLocks, and the
postmaster can't acquire locks either so it can't even safely take a
copy to pass on.  You could reorder startup so that we are allowed to
acquire LWLocks in children at that point, but then you'd need to
convince yourself that there is no danger of breaking some ordering
requirement in external preload libraries, and figure out what to do
about children that don't even attach to shared memory.  Maybe that's
possible, but that doesn't sound like a good idea to back-patch.

First idea idea I've come up with to avoid all of that: pass a copy of
the "proto-controlfile", to coin a term for the one read early in
postmaster startup by LocalProcessControlFile().  As far as I know,
the only reason we need it is to suck some settings out of it that
don't change while a cluster is running (mostly can't change after
initdb, and checksums can only be {en,dis}abled while down).  Right?
Children can just "import" that sucker instead of calling
LocalProcessControlFile() to figure out the size of WAL segments yada
yada, I think?  Later they will attach to the real one in shared
memory for all future purposes, once normal interlocking is allowed.

I dunno.  Draft patch attached.  Better plans welcome.  This passes CI
on Linux systems afflicted by EXEC_BACKEND, and Windows.  Thoughts?
From 48c2de14bd9368b4708a99ecbb75452dc327e608 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 18 May 2024 13:41:09 +1200
Subject: [PATCH v1 1/3] Fix pg_control corruption in EXEC_BACKEND startup.

When backend processes were launched in EXEC_BACKEND builds, they would
run LocalProcessControlFile() to read in pg_control and extract several
important settings.

This happens too early to acquire ControlFileLock, and the postmaster is
also not allowed to acquire ControlFileLock, so it can't safely take a
copy to give to the child.

Instead, pass down the "proto-controlfile" that was read by the
postmaster in LocalProcessControlFile().  Introduce functions
ExportProtoControlFile() and ImportProtoControlFile() to allow that.
Subprocesses will extract information from that, and then later attach
to the current control file in shared memory.

Reported-by: Melanie Plageman  per Windows CI failure
Discussion: https://postgr.es/m/CAAKRu_YNGwEYrorQYza_W8tU%2B%3DtoXRHG8HpyHC-KDbZqA_ZVSA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c   | 46 +++--
 src/backend/postmaster/launch_backend.c | 19 ++
 src/include/access/xlog.h   |  5 +++
 3 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f2..b69a0d95af9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -568,6 +568,10 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
  */
 static ControlFileData *ControlFile = NULL;
 
+#ifdef EXEC_BACKEND
+static ControlFileData *ProtoControlFile = NULL;
+#endif
+
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -686,6 +690,7 @@ static bool PerformRecoveryXLogAction(void);
 static void InitControlFile(uint64 sysidentifier);
 static void WriteControlFile(void);
 static void ReadControlFile(void);
+static void ScanControlFile(void);
 static void UpdateControlFile(void);
 static char *str_time(pg_time_t tnow);
 
@@ -4309,9 +4314,7 @@ WriteControlFile(void)
 static void
 ReadControlFile(void)
 {
-	pg_crc32c	crc;
 	int			fd;
-	static char wal_segsz_str[20];
 	int			r;
 
 	/*
@@ -4344,6 +4347,15 @@ ReadControlFile(void)
 
 	close(fd);
 
+	ScanControlFile();
+}
+
+static void
+ScanControlFile(void)
+{
+	static char wal_segsz_str[20];
+	pg_crc32c	crc;
+
 	/*
 	 * Check for expected pg_control format version.  If this is wrong, the
 	 * CRC check will likely fail because we'll be checking the wrong number
@@ -4815,8 +4827,33 @@ LocalProcessControlFile(bool reset)
 	Assert(reset || ControlFile == NULL);
 	ControlFile = palloc(sizeof(ControlFileData));
 	ReadControlFile();
+
+#ifdef EXEC_BACKEND
+	/* We need to be able to give this to subprocesses. */
+	ProtoControlFile = ControlFile;
+#endif
+}
+
+#ifdef EXEC_BACKEND
+void
+E

Re: Refactoring backend fork+exec code

2024-05-17 Thread Thomas Munro
On Mon, Mar 18, 2024 at 10:41 PM Heikki Linnakangas  wrote:
> Committed, with some final cosmetic cleanups. Thanks everyone!

Nitpicking from UBSan with EXEC_BACKEND on Linux (line numbers may be
a bit off, from a branch of mine):

../src/backend/postmaster/launch_backend.c:772:2: runtime error: null
pointer passed as argument 2, which is declared to never be null
==13303==Using libbacktrace symbolizer.
#0 0x564b0202 in save_backend_variables
../src/backend/postmaster/launch_backend.c:772
#1 0x564b0242 in internal_forkexec
../src/backend/postmaster/launch_backend.c:311
#2 0x564b0bdd in postmaster_child_launch
../src/backend/postmaster/launch_backend.c:244
#3 0x564b3121 in StartChildProcess
../src/backend/postmaster/postmaster.c:3928
#4 0x564b933a in PostmasterMain
../src/backend/postmaster/postmaster.c:1357
#5 0x562de4ad in main ../src/backend/main/main.c:197
#6 0x7667ad09 in __libc_start_main
(/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
#7 0x55e34279 in _start
(/tmp/cirrus-ci-build/build/tmp_install/usr/local/pgsql/bin/postgres+0x8e0279)

This silences it:

-   memcpy(param->startup_data, startup_data, startup_data_len);
+   if (startup_data_len > 0)
+   memcpy(param->startup_data, startup_data, startup_data_len);

(I found that out by testing EXEC_BACKEND on CI.  I also learned that
the Mac and FreeBSD tasks fail with EXEC_BACKEND because of SysV shmem
bleating.  We probably should go and crank up the relevant sysctls in
the .cirrus.tasks.yml...)




Re: Streaming read-ready sequential scan code

2024-05-17 Thread Thomas Munro
On Sat, May 18, 2024 at 11:30 AM Thomas Munro  wrote:
> Andres happened to have TPC-DS handy, and reproduced that regression
> in q15.  We tried some stuff and figured out that it requires
> parallel_leader_participation=on, ie that this looks like some kind of
> parallel fairness and/or timing problem.  It seems to be a question of
> which worker finishes up processing matching rows, and the leader gets
> a ~10ms head start but may be a little more greedy with the new
> streaming code.  He tried reordering the table contents and then saw
> 17 beat 16.  So for q15, initial indications are that this isn't a
> fundamental regression, it's just a test that is sensitive to some
> arbitrary conditions.
>
> I'll try to figure out some more details about that, ie is it being
> too greedy on small-ish tables,

After more debugging, we learned a lot more things...

1.  That query produces spectacularly bad estimates, so we finish up
having to increase the number of buckets in a parallel hash join many
times.  That is quite interesting, but unrelated to new code.
2.  Parallel hash join is quite slow at negotiating an increase in the
number of hash buckets, if all of the input tuples are being filtered
out by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS.  That could be improved quite easily I
think.  I have put that on my todo list 'cause that's also my code,
but it's not a new issue it's just one that is now highlighted...
3.  This bit of read_stream.c is exacerbating unfairness in the
underlying scan, so that 1 and 2 come together and produce a nasty
slowdown, which goes away if you change it like so:

-   BlockNumber blocknums[16];
+   BlockNumber blocknums[1];

I will follow up after some more study.
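
To spell that out (illustrative pseudocode with hypothetical names): in
the all-cached fast path, the stream calls the callback in a tight loop
to fill the whole batch before yielding any buffers, so with 16 entries
one worker can claim 16 consecutive blocks from the shared scan in a
single burst, while with 1 entry the workers' claims interleave more
fairly:

    while (nblocknums < lengthof(blocknums))
    {
        BlockNumber blkno = next_block_callback(stream);   /* hypothetical */

        if (blkno == InvalidBlockNumber)
            break;
        blocknums[nblocknums++] = blkno;
    }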




Re: Streaming read-ready sequential scan code

2024-05-17 Thread Thomas Munro
On Sat, May 18, 2024 at 8:09 AM Thomas Munro  wrote:
> On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin  wrote:
> > I decided to compare v17 vs v16 performance (as I did the last year [1])
> > and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> > benchmark, query15 (and several others, but I focused on this one):
> > Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 
> > 151.03): pg_tpcds.query15
> > Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 
> > > 152.64): pg_tpcds.query15
> > Please look at the full html report attached in case you're interested.
> >
> > (I used my pg-mark tool to measure/analyze performance, but I believe the
> > same results can be seen without it.)
>
> Will investigate, but if it's easy for you to rerun, does it help if
> you increase Linux readahead, eg blockdev --setra setting?

Andres happened to have TPC-DS handy, and reproduced that regression
in q15.  We tried some stuff and figured out that it requires
parallel_leader_participation=on, ie that this looks like some kind of
parallel fairness and/or timing problem.  It seems to be a question of
which worker finishes up processing matching rows, and the leader gets
a ~10ms head start but may be a little more greedy with the new
streaming code.  He tried reordering the table contents and then saw
17 beat 16.  So for q15, initial indications are that this isn't a
fundamental regression, it's just a test that is sensitive to some
arbitrary conditions.

I'll try to figure out some more details about that, ie is it being
too greedy on small-ish tables, and generally I do wonder about the
interactions between the heuristics and batching working at different
levels (OS, seq scan, read stream, hence my earlier ra question which
is likely a red herring) and how there might be unintended
consequences/interference patterns, but this particular case seems
more data dependent.




Re: Streaming read-ready sequential scan code

2024-05-17 Thread Thomas Munro
On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin  wrote:
> I decided to compare v17 vs v16 performance (as I did the last year [1])
> and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> benchmark, query15 (and several others, but I focused on this one):
> Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 
> 151.03): pg_tpcds.query15
> Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 
> 152.64): pg_tpcds.query15
> Please look at the full html report attached in case you're interested.
>
> (I used my pg-mark tool to measure/analyze performance, but I believe the
> same results can be seen without it.)

Will investigate, but if it's easy for you to rerun, does it help if
you increase Linux readahead, eg blockdev --setra setting?




Re: race condition when writing pg_control

2024-05-16 Thread Thomas Munro
The specific problem here is that LocalProcessControlFile() runs in
every launched child for EXEC_BACKEND builds.  Windows uses
EXEC_BACKEND, and Windows' NTFS file system is one of the two file
systems known to this list to have the concurrent read/write data
mashing problem (the other being ext4).




Re: Potential stack overflow in incremental base backup

2024-05-16 Thread Thomas Munro
On Tue, Apr 16, 2024 at 4:10 AM Robert Haas  wrote:
> On Wed, Apr 10, 2024 at 9:55 PM Thomas Munro  wrote:
> > To rescue my initdb --rel-segsize project[1] for v18, I will have a go
> > at making that dynamic.  It looks like we don't actually need to
> > allocate that until we get to the GetFileBackupMethod() call, and at
> > that point we have the file size.  If I understand correctly,
> > statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
> > block number buffer there if required, right?
>
> Yes.

Here is a first attempt at that.  Better factoring welcome.  New
observations made along the way: the current coding can exceed
MaxAllocSize and error out, or overflow 32 bit size_t and allocate
nonsense.  Do you think it'd be better to raise an error explaining
that, or silently fall back to full backup (tried that way in the
attached), or that + log messages?  Another option would be to use
huge allocations, so we only have to deal with that sort of question
for 32 bit systems (i.e. effectively hypothetical/non-relevant
scenarios), but I don't love that idea.
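
For instance, a guard of roughly this shape avoids both the
MaxAllocSize error and the 32 bit overflow (an illustrative sketch
only, not the attached patch's exact code; use_full_backup is a
hypothetical name):

    uint64      nblocks = (uint64) statbuf.st_size / BLCKSZ;

    if (nblocks > MaxAllocSize / sizeof(BlockNumber))
        use_full_backup = true; /* hypothetical: too many blocks to track */
    else
        relative_block_numbers = palloc(nblocks * sizeof(BlockNumber));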

> ...
> I do understand that a 1GB segment size is not that big in 2024, and
> that filesystems with a 2GB limit are thought to have died out a long
> time ago, and I'm not against using larger segments. I do think,
> though, that increasing the segment size by 32768x in one shot is
> likely to be overdoing it.

My intuition is that the primary interesting lines to cross are at 2GB
and 4GB due to data type stuff.  I defend against the most basic
problem in my proposal: I don't let you exceed your off_t type, but
that doesn't mean we don't have off_t/ssize_t/size_t/long snafus
lurking in our code that could bite a 32 bit system with large files.
If you make it past those magic numbers and your tools are all happy,
I think you should be home free until you hit file system limits,
which are effectively unhittable on most systems except ext4's already
bemoaned 16TB limit AFAIK.  But all the same, I'm contemplating
limiting the range to 1TB in the first version, not because of general
fear of unknown unknowns, but just because it means we don't need to
use "huge" allocations for this known place, maybe until we can
address that.
From 30efb6d39c83e9d4f338e10e1b8944c64f8799c5 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 15 May 2024 17:23:21 +1200
Subject: [PATCH v1] Limit block number buffer size in incremental backup.

Previously, basebackup.c would allocate an array big enough to hold a
block number for every block in a full sized md.c segment file.  That
works out to 512kB by default, which should be no problem.  However,
users can change the segment size at compile time, and the required
space for the array could be many gigabytes, posing problems:

1. For segment sizes over 2TB, MaxAllocSize would be exceeded,
   raising an error.
2. For segment sizes over 8TB, size_t arithmetic would overflow on 32 bit
   systems, leading to the wrong size allocation.
3. For any very large segment size, it's non-ideal to allocate a huge
   amount of memory if you're not actually going to use it.

This isn't a fundamental fix for the high memory requirement of the
algorithm as coded, but it seems like a good idea to avoid those limits
with a fallback strategy, and defer allocating until we see how big the
segments actually are in the cluster being backed up, upsizing as
required.

These are mostly theoretical problems at the moment, because it is so
hard to change the segment size in practice that people don't do it.  A
new proposal would make it changeable at run time, hence interest in
tidying these rough edges up.

Discussion: https://postgr.es/m/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 9a2bf59e84e..0703b77af94 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -34,6 +34,7 @@
 #include "pgstat.h"
 #include "pgtar.h"
 #include "port.h"
+#include "port/pg_bitutils.h"
 #include "postmaster/syslogger.h"
 #include "postmaster/walsummarizer.h"
 #include "replication/walsender.h"
@@ -45,6 +46,7 @@
 #include "storage/reinit.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/relcache.h"
 #include "utils/resowner.h"
@@ -1198,13 +1200,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 	bool		isGlobalDir = false;
 	Oid			dboid = InvalidOid;
 	BlockNumber *relative_block_numbers = NULL;
-
-	/*
-	 * Since this array is relatively large, avoid putting it on the stack.
-	 * But we don't need it at all if this is not an incremental backup.
-	 */
-	if (ib != NULL)
-		relative_block_numbers 

Re: Requiring LLVM 14+ in PostgreSQL 18

2024-05-16 Thread Thomas Munro
On Fri, May 17, 2024 at 3:17 AM Nazir Bilal Yavuz  wrote:
> Actually, 32 bit builds are working but the Perl version needs to be
> updated to 'perl5.36-i386-linux-gnu' in .cirrus.tasks.yml. I changed
> 0001 with the working version of 32 bit builds [1] and the rest is the
> same. All tests pass now [2].

Ahh, right, thanks!  I will look at committing your CI/fixup patches.




Re: Requiring LLVM 14+ in PostgreSQL 18

2024-05-15 Thread Thomas Munro
On Wed, May 15, 2024 at 5:20 PM Peter Eisentraut  wrote:
> Yes, let's get that v3-0001 patch into PG17.

Done.

Bilal recently created the CI images for Debian Bookworm[1].  You can
try them with s/bullseye/bookworm/ in .cirrus.tasks.yml, but it looks
like he is still wrestling with a perl installation problem[2] in the
32 bit build, so here is a temporary patch to do that and also delete
the 32 bit tests for now.  This way cfbot should succeed with the
remaining patches.  Parked here for v18.

[1] 
https://github.com/anarazel/pg-vm-images/commit/685ca7ccb7b3adecb11d948ac677d54cd9599e6c
[2] https://cirrus-ci.com/task/5459439048720384
From c0a05c2929e03558c730b148bdeb5d301dbc4312 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 16 May 2024 14:10:09 +1200
Subject: [PATCH v4 1/3] XXX CI kludge: bullseye->bookworm

Temporarily removed 32 bit tests, as the CI image is not fully baked
yet.
---
 .cirrus.tasks.yml | 37 +
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index a2388cd5036..5ff6b6a0556 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -65,7 +65,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 8
 TEST_JOBS: 8
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 CCACHE_DIR: ${CIRRUS_WORKING_DIR}/ccache_dir
 # no options enabled, should be small
 CCACHE_MAXSIZE: "150M"
@@ -243,7 +243,7 @@ task:
 CPUS: 4
 BUILD_JOBS: 4
 TEST_JOBS: 8 # experimentally derived to be a decent choice
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 CCACHE_DIR: /tmp/ccache_dir
 DEBUGINFOD_URLS: "https://debuginfod.debian.net"
@@ -314,7 +314,7 @@ task:
 #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
 
   matrix:
-- name: Linux - Debian Bullseye - Autoconf
+- name: Linux - Debian Bookworm - Autoconf
 
   env:
 SANITIZER_FLAGS: -fsanitize=address
@@ -348,7 +348,7 @@ task:
   on_failure:
 <<: *on_failure_ac
 
-- name: Linux - Debian Bullseye - Meson
+- name: Linux - Debian Bookworm - Meson
 
   env:
 CCACHE_MAXSIZE: "400M" # tests two different builds
@@ -364,24 +364,7 @@ task:
 build
 EOF
 
-  # Also build & test in a 32bit build - it's gotten rare to test that
-  # locally.
-  configure_32_script: |
-su postgres <<-EOF
-  export CC='ccache gcc -m32'
-  meson setup \
---buildtype=debug \
--Dcassert=true -Dinjection_points=true \
-${LINUX_MESON_FEATURES} \
--Dllvm=disabled \
---pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
--DPERL=perl5.32-i386-linux-gnu \
--DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
-build-32
-EOF
-
   build_script: su postgres -c 'ninja -C build -j${BUILD_JOBS}'
-  build_32_script: su postgres -c 'ninja -C build-32 -j${BUILD_JOBS}'
 
   upload_caches: ccache
 
@@ -393,16 +376,6 @@ task:
 # so that we don't upload 64bit logs if 32bit fails
 rm -rf build/
 
-  # There's currently no coverage of icu with LANG=C in the buildfarm. We
-  # can easily provide some here by running one of the sets of tests that
-  # way. Newer versions of python insist on changing the LC_CTYPE away
-  # from C, prevent that with PYTHONCOERCECLOCALE.
-  test_world_32_script: |
-su postgres <<-EOF
-  ulimit -c unlimited
-  PYTHONCOERCECLOCALE=0 LANG=C meson test $MTEST_ARGS -C build-32 --num-processes ${TEST_JOBS}
-EOF
-
   on_failure:
 <<: *on_failure_meson
 
@@ -652,7 +625,7 @@ task:
   env:
 CPUS: 4
 BUILD_JOBS: 4
-IMAGE_FAMILY: pg-ci-bullseye
+IMAGE_FAMILY: pg-ci-bookworm
 
 # Use larger ccache cache, as this task compiles with multiple compilers /
 # flag combinations
-- 
2.39.2

From bcdb3453b50cd37077ad71f012d7f92c9842495f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 19 Oct 2023 04:45:46 +1300
Subject: [PATCH v4 2/3] jit: Require at least LLVM 14, if enabled.

Remove support for LLVM versions 10-13.  The default on all non-EOL'd
OSes represented in our build farm will be at least LLVM 14 when
PostgreSQL 18 ships.

Reviewed-by: Peter Eisentraut 
Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com
---
 config/llvm.m4  |   4 +-
 configure   |   4 +-
 doc/src/sgml/installation.sgml  |   4 +-
 meson.build |   2 +-
 src/backend/jit/llvm/llvmjit.c  | 101 
 src/backend/jit/llvm/llvmjit_error.cpp  |  25 --
 src/backend/jit/llvm/llvmjit_inline.cpp |  13 ---
 src/backend/jit/llvm/llvmjit_wrap.cpp   |   4 -
 8 files changed, 7 insertions(+), 150 deletions(-)

diff --git a/config/llvm.m4 b/conf

Re: Why is citext/regress failing on hamerkop?

2024-05-15 Thread Thomas Munro
On Thu, May 16, 2024 at 10:43 AM Thomas Munro  wrote:
> Any chance you could test this version please Alexander?

Sorry, cancel that.  v3 is not good.  I assume it fixes the GSSAPI
thing and is superficially better, but it doesn't handle code that
calls twice in a row and ignores the first result (I know that
PostgreSQL does that occasionally in a few places), and it's also
broken if someone gets recv() = 0 (EOF), and then decides to wait
anyway.  The only ways I can think of to get full reliable poll()-like
semantics is to do that peek every time, OR the complicated patch
(per-socket-workspace + intercepting recv etc).  So I'm back to v2.




Re: Why is citext/regress failing on hamerkop?

2024-05-15 Thread Thomas Munro
On Thu, May 16, 2024 at 9:46 AM Thomas Munro  wrote:
> Alright, unless anyone has an objection or ideas for improvements, I'm
> going to go ahead and back-patch this slightly tidied up version
> everywhere.

Of course as soon as I wrote that I thought of a useful improvement
myself: as far as I can tell, you only need to do the extra poll on
the first wait for WL_SOCKET_READABLE for any given WaitEventSet.  I
don't think it's needed for later waits done by long-lived
WaitEventSet objects, so we can track that with a flag.  That avoids
adding new overhead for regular backend socket waits after
authentication, it's just in these code paths that do a bunch of
WaitLatchOrSocket() calls that we need to consider FD_CLOSE events
lost between the cracks.

I also don't know if the condition should include "&& received == 0".
It probably doesn't make much difference, but by leaving that off we
don't have to wonder how peeking interacts with events, ie if it's a
problem that we didn't do the "reset" step.  Thinking about that, I
realised that I should probably set reset = true in this new return
path, just like the "normal" WL_SOCKET_READABLE notification path,
just to be paranoid.  (Programming computers you don't have requires
extra paranoia.)

Any chance you could test this version please Alexander?


v3-0001-Fix-FD_CLOSE-socket-event-hangs-on-Windows.patch
Description: Binary data


Re: Why is citext/regress failing on hamerkop?

2024-05-15 Thread Thomas Munro
On Wed, May 15, 2024 at 6:00 PM Alexander Lakhin  wrote:
> 15.05.2024 01:26, Thomas Munro wrote:
> > OK, so we know what the problem is here.  Here is the simplest
> > solution I know of for that problem.  I have proposed this in the past
> > and received negative feedback because it's a really gross hack.  But
> > I don't personally know what else to do about the back-branches (or
> > even if that complex solution is the right way forward for master).
> > The attached kludge at least has the [de]merit of being a mirror image
> > of the kludge that follows it for the "opposite" event.  Does this fix
> > it?
>
> Yes, I see that abandoned GSS connections are closed immediately, as
> expected. I have also confirmed that `meson test` with the basic
> configuration passes on REL_16_STABLE. So from the outside, the fix
> looks good to me.

Alright, unless anyone has an objection or ideas for improvements, I'm
going to go ahead and back-patch this slightly tidied up version
everywhere.


v2-0001-Add-kludge-to-make-FD_CLOSE-level-triggered.patch
Description: Binary data


Re: Requiring LLVM 14+ in PostgreSQL 18

2024-05-14 Thread Thomas Munro
On Mon, May 13, 2024 at 2:33 AM Peter Eisentraut  wrote:
> These patches look fine to me.  The new cut-off makes sense, and it does
> save quite a bit of code.  We do need to get the Cirrus CI Debian images
> updated first, as you had already written.

Thanks for looking!

> As part of this patch, you also sneak in support for LLVM 18
> (llvm-config-18, clang-18 in configure).  Should this be a separate patch?

Yeah, right, I didn't really think too hard about why we have that,
and now that you question it...

> And as I'm looking up how this was previously handled, I notice that
> this list of clang-NN versions was last updated equally sneakily as part
> of your patch to trim off LLVM <10 (820b5af73dc).  I wonder if the
> original intention of that configure code was that maintaining the
> versioned list above clang-7/llvm-config-7 was not needed, because the
> unversioning programs could be used, or maybe because pkg-config could
> be used.  It would be nice if we could get rid of having to update that.

I probably misunderstood why we were doing that, perhaps something to
do with the way some distro (Debian?) was doing things with older
versions, and yeah I see that we went a long time after 7 without
touching it and nobody cared.  Yeah, it would be nice to get rid of
it.  Here's a patch.  Meson didn't have that.
From 025f8b0106821b5b6f2ab7992da388404e3e406c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 15 May 2024 15:51:28 +1200
Subject: [PATCH v3 1/3] jit: Remove configure probes for {lvm-config,clang}-N.

Previously we searched for llvm-config-N and clang-N as well as the
unversioned names, and maintained a list of expected values of N.  There
doesn't seem to be any reason to think that the default llvm-config and
clang won't be good enough, and if they aren't, they can be overridden
with LLVM_CONFIG and CLANG, so let's stop maintaining that list.

The Meson build system didn't do that, so this brings them into sync.

Suggested-by: Peter Eisentraut 
Discussion: https://postgr.es/m/CA%2BhUKG%2BSOP-aR%3DYF_n0dtXGWeCy6x%2BCn-RMWURU5ySQdmeKW1Q%40mail.gmail.com
---
 config/llvm.m4 | 4 ++--
 configure  | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/config/llvm.m4 b/config/llvm.m4
index 44769d819a0..c6cf8858f64 100644
--- a/config/llvm.m4
+++ b/config/llvm.m4
@@ -13,7 +13,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
   AC_REQUIRE([AC_PROG_AWK])
 
   AC_ARG_VAR(LLVM_CONFIG, [path to llvm-config command])
-  PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 llvm-config-13 llvm-config-12 llvm-config-11 llvm-config-10)
+  PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config)
 
   # no point continuing if llvm wasn't found
   if test -z "$LLVM_CONFIG"; then
@@ -32,7 +32,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
 
   # need clang to create some bitcode files
   AC_ARG_VAR(CLANG, [path to clang compiler to generate bitcode])
-  PGAC_PATH_PROGS(CLANG, clang clang-17 clang-16 clang-15 clang-14 clang-13 clang-12 clang-11 clang-10)
+  PGAC_PATH_PROGS(CLANG, clang)
   if test -z "$CLANG"; then
 AC_MSG_ERROR([clang not found, but required when compiling --with-llvm, specify with CLANG=])
   fi
diff --git a/configure b/configure
index 89644f2249e..8e7704d54bd 100755
--- a/configure
+++ b/configure
@@ -5065,7 +5065,7 @@ if test "$with_llvm" = yes; then :
 
 
   if test -z "$LLVM_CONFIG"; then
-  for ac_prog in llvm-config llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 llvm-config-13 llvm-config-12 llvm-config-11 llvm-config-10
+  for ac_prog in llvm-config
 do
   # Extract the first word of "$ac_prog", so it can be a program name with args.
 set dummy $ac_prog; ac_word=$2
@@ -5138,7 +5138,7 @@ $as_echo "$as_me: using llvm $pgac_llvm_version" >&6;}
   # need clang to create some bitcode files
 
   if test -z "$CLANG"; then
-  for ac_prog in clang clang-17 clang-16 clang-15 clang-14 clang-13 clang-12 clang-11 clang-10
+  for ac_prog in clang
 do
   # Extract the first word of "$ac_prog", so it can be a program name with args.
 set dummy $ac_prog; ac_word=$2
-- 
2.39.2

From ef1c38dbb1ec8a6a762c88ee50c55a87693d44f6 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 19 Oct 2023 04:45:46 +1300
Subject: [PATCH v3 2/3] jit: Require at least LLVM 14, if enabled.

Remove support for LLVM versions 10-13.  The default on all non-EOL'd
OSes represented in our build farm will be at least LLVM 14 when
PostgreSQL 18 ships.

Reviewed-by: Peter Eisentraut 
Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com
---
 config/llvm.m4  |   4 +-
 configure   |   4 +-
 doc/src/sgml/installation.sgml  |   4 +-
 meson.build |   2 +-
 src/backend/jit/llvm/llvmjit.c  | 101 

Re: CFbot does not recognize patch contents

2024-05-14 Thread Thomas Munro
On Sun, May 12, 2024 at 10:11 PM Tatsuo Ishii  wrote:
> > I am able to apply all your patches. I found that a similar thing
> > happened before [0] and I guess your case is similar. Adding Thomas to
> > CC, he may be able to help more.
>
> Ok. Thanks for the info.

This obviously fixed itself automatically soon after this message, but
I figured out what happened:  I had not actually fixed that referenced
bug in cfbot :-(.  It was checking for HTTP error codes correctly in
the place that reads emails from the archives, but not the place that
downloads patches, so in this case I think when it tried to follow the
link[1] to download the patch, I guess it must have pulled down a
transient Varnish error message (I don't know what, I don't store it
anywhere), and tried to apply that as a patch.  Oops.  Fixed[2].

[1] 
https://www.postgresql.org/message-id/attachment/160138/v18-0001-Row-pattern-recognition-patch-for-raw-parser.patch
[2] 
https://github.com/macdice/cfbot/commit/ec33a65a877a88befc29ea220e87b98c89b27dcc




Re: Why is citext/regress failing on hamerkop?

2024-05-14 Thread Thomas Munro
On Tue, May 14, 2024 at 9:00 PM Alexander Lakhin  wrote:
> 14.05.2024 03:38, Thomas Munro wrote:
> > I was beginning to suspect that lingering odour myself.  I haven't
> > look at the GSS code but I was imagining that what we have here is
> > perhaps not unsent data dropped on the floor due to linger policy
> > (unclean socket close on process exist), but rather that the server
> > didn't see the socket as ready to read because it lost track of the
> > FD_CLOSE somewhere because the client closed it gracefully, and our
> > server-side FD_CLOSE handling has always been a bit suspect.  I wonder
> > if the GSS code is somehow more prone to brokenness.  One thing we
> > learned in earlier problems was that abortive/error disconnections
> > generate FD_CLOSE repeatedly, while graceful ones give you only one.
> > In other words, if the other end politely calls closesocket(), the
> > server had better not miss the FD_CLOSE event, because it won't come
> > again.   That's what
> >
> > https://commitfest.postgresql.org/46/3523/
> >
> > is intended to fix.  Does it help here?  Unfortunately that's
> > unpleasantly complicated and unbackpatchable (keeping a side-table of
> > socket FDs and event handles, so we don't lose events between the
> > cracks).
>
> Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
> state when that patch set can be applied).

Thanks for checking, and generally for your infinite patience with all
these horrible Windows problems.

OK, so we know what the problem is here.  Here is the simplest
solution I know of for that problem.  I have proposed this in the past
and received negative feedback because it's a really gross hack.  But
I don't personally know what else to do about the back-branches (or
even if that complex solution is the right way forward for master).
The attached kludge at least has the [de]merit of being a mirror image
of the kludge that follows it for the "opposite" event.  Does this fix
it?
From cbe4680c3e561b26c0bb49fc39dc8a6f40e84134 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 15 May 2024 10:01:19 +1200
Subject: [PATCH] Add kludge to make FD_CLOSE level-triggered.

Winsock only signals an FD_CLOSE event once, if the other end of the
socket shuts down gracefully.  Because event WaitLatchOrSocket()
constructs and destroys a new event handle, we can miss the FD_CLOSE
event that is signaled just as we're destroying the handle.  The next
WaitLatchOrSocket() will not see it.

Fix that race with some extra polling.

We wouldn't need this if we had long-lived event handles for sockets,
which has been proposed and tested and shown to work, but it's far too
complex to back-patch.
---
 src/backend/storage/ipc/latch.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index a7d88ebb048..9a5273f17d6 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1999,6 +1999,43 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			cur_event->reset = false;
 		}
 
+		/*
+		 * Because we associate the socket with different event handles at
+		 * different times, and because FD_CLOSE is only generated once the
+		 * other end closes gracefully, we might miss an FD_CLOSE event that
+		 * was signaled already on a handle we've already closed.  We close
+		 * that race by synchronously polling for EOF, after adjusting the
+		 * event above and before sleeping below.
+		 *
+		 * XXX If we arranged to have one event handle for the lifetime of a
+		 * socket, we wouldn't need this.
+		 */
+		if (cur_event->events & WL_SOCKET_READABLE)
+		{
+			char		c;
+			WSABUF		buf;
+			DWORD		received;
+			DWORD		flags;
+			int			r;
+
+			buf.buf = &c;
+			buf.len = 1;
+
+			/*
+			 * Peek to see if EOF condition is true.  Don't worry about error
+			 * handling or pending data, just be careful not to consume it.
+			 */
+			flags = MSG_PEEK;
+			if (WSARecv(cur_event->fd, &buf, 1, &received, &flags, NULL, NULL) == 0)
+			{
+				occurred_events->pos = cur_event->pos;
+				occurred_events->user_data = cur_event->user_data;
+				occurred_events->events = WL_SOCKET_READABLE;
+				occurred_events->fd = cur_event->fd;
+				return 1;
+			}
+		}
+
 		/*
 		 * Windows does not guarantee to log an FD_WRITE network event
 		 * indicating that more data can be sent unless the previous send()
-- 
2.44.0



Re: Why is citext/regress failing on hamerkop?

2024-05-13 Thread Thomas Munro
On Tue, May 14, 2024 at 8:17 AM Tom Lane  wrote:
> I'm not sure whether we've got unsent data pending in the problematic
> condition, nor why it'd remain unsent if we do (shouldn't the backend
> consume it anyway?).  But this has the right odor for an explanation.
>
> I'm pretty hesitant to touch this area myself, because it looks an
> awful lot like commits 6051857fc and ed52c3707, which eventually
> had to be reverted.  I think we need a deeper understanding of
> exactly what Winsock is doing or not doing before we try to fix it.

I was beginning to suspect that lingering odour myself.  I haven't
looked at the GSS code but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect.  I wonder
if the GSS code is somehow more prone to brokenness.  One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again.   That's what

https://commitfest.postgresql.org/46/3523/

is intended to fix.  Does it help here?  Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).




Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-12 Thread Thomas Munro
On Sat, May 11, 2024 at 5:00 PM Alexander Lakhin  wrote:
> 11.05.2024 07:25, Thomas Munro wrote:
> > On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin  
> > wrote:
> >> 11.05.2024 06:26, Thomas Munro wrote:
> >>> Perhaps a no-image, no-change registered buffer should not be
> >>> including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
> >>> useless for consistency checking too I guess, this issue aside,
> >>> because it doesn't change anything so there is nothing to check.

> >> Yes, I think something wrong is here. I've reduced the reproducer to:

> > Does it reproduce if you do this?
> >
> > -   include_image = needs_backup || (info &
> > XLR_CHECK_CONSISTENCY) != 0;
> > +   include_image = needs_backup ||
> > +   ((info & XLR_CHECK_CONSISTENCY) != 0 &&
> > +(regbuf->flags & REGBUF_NO_CHANGE) == 0);
>
> No, it doesn't (at least with the latter, more targeted reproducer).

OK so that seems like a candidate fix, but ...

> > Unfortunately the back branches don't have that new flag from 00d7fb5e
> > so, even if this is the right direction (not sure, I don't understand
> > this clean registered buffer trick) then ... but wait, why are there
> > no failures like this in the back branches (yet at least)?  Does
> > your reproducer work for 16?  I wonder if something relevant changed
> > recently, like f56a9def.  CC'ing Michael and Amit K for info.
>
> Maybe it's hard to hit (autovacuum needs to process the index page in a
> narrow time frame), but locally I could reproduce the issue even on
> ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching
> hash indexes, didn't dig deeper).

... we'd need to figure out how to fix this in the back-branches too.
One idea would be to back-patch REGBUF_NO_CHANGE, and another might be
to deduce that case from other variables.  Let me CC a couple more
people from this thread, which most recently hacked on this stuff, to
see if they have insights:

https://www.postgresql.org/message-id/flat/d2c31606e6bb9b83a02ed4835d65191b38d4ba12.camel%40j-davis.com




Re: Why is citext/regress failing on hamerkop?

2024-05-12 Thread Thomas Munro
On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan  wrote:
> Well, this is more or less where I came in back in about 2002 :-) I've been 
> trying to help support it ever since, mainly motivated by stubborn 
> persistence more than anything else. Still, I agree that the lack of support for 
> the Windows port from Microsoft over the years has been more than 
> disappointing.

I think "state of the Windows port" would make a good discussion topic
at pgconf.dev (with write-up for those who can't be there).  If there
is interest, I could organise that with a short presentation of the
issues I am aware of so far and possible improvements and
smaller-things-we-could-drop-instead-of-dropping-the-whole-port.  I
would focus on technical stuff, not who-should-be-doing-what, 'cause I
can't make anyone do anything.

For citext_utf8, I pushed cff4e5a3.  Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.




Re: Why is citext/regress failing on hamerkop?

2024-05-11 Thread Thomas Munro
On Sat, May 11, 2024 at 1:14 PM Thomas Munro  wrote:
> Either way, it seems like we'll need to skip that test on Windows if
> we want hamerkop to be green.  That can probably be cribbed from
> collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
> prelude... I just don't know how to explain it in the comment 'cause I
> don't know why.

Here's a minimal patch like that.

I don't think it's worth back-patching.  Only 15 and 16 could possibly
be affected, I think, because the test wasn't enabled before that.  I
think this is all just a late-appearing blow-up predicted by the
commit that enabled it:

commit c2e8bd27519f47ff56987b30eb34a01969b9a9e8
Author: Tom Lane 
Date:   Wed Jan 5 13:30:07 2022 -0500

Enable routine running of citext's UTF8-specific test cases.

These test cases have been commented out since citext was invented,
because at the time we had no nice way to deal with tests that
have restrictions such as requiring UTF8 encoding.  But now we do
have a convention for that, ie put them into a separate test file
with an early-exit path.  So let's enable these tests to run when
their prerequisites are satisfied.

(We may have to tighten the prerequisites beyond the "encoding = UTF8
and locale != C" checks made here.  But let's put it on the buildfarm
and see what blows up.)

Hamerkop is already green on the 15 and 16 branches, apparently
because it's using the pre-meson test stuff that I guess just didn't
run the relevant test.  In other words, nobody would notice the
difference anyway, and a master-only fix would be enough to end this
44-day red streak.
From 1f0e1dc21d4055a0e5109bac3b290508e2d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 12 May 2024 10:20:06 +1200
Subject: [PATCH] Skip the citext_utf8 test on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On other Windows build farm animals it is already skipped because they
don't use UTF-8 encoding.  On "hamerkop", UTF-8 is used, and then the
test fails.

It is not clear to me (a non-Windows person looking only at buildfarm
evidence) whether Windows is less sophisticated than other OSes and
doesn't know how to downcase Turkish İ with the standard Unicode
database, or if it is more sophisticated than other systems and uses
locale-specific behavior like ICU does.

Whichever the reason, the result is the same: we need to skip the test
on Windows, just as we already do for ICU, at least until a
Windows-savvy developer comes up with a better idea.  The technique for
detecting the OS is borrowed from collate.windows.win1252.sql.

Discussion: https://postgr.es/m/CA%2BhUKGJ1LeC3aE2qQYTK95rFVON3ZVoTQpTKJqxkHdtEyawH4A%40mail.gmail.com
---
 contrib/citext/expected/citext_utf8.out   | 3 +++
 contrib/citext/expected/citext_utf8_1.out | 3 +++
 contrib/citext/sql/citext_utf8.sql| 3 +++
 3 files changed, 9 insertions(+)

diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 5d988dcd485..19538db674e 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -6,8 +6,11 @@
  * Turkish dotted I is not correct for many ICU locales. citext always
  * uses the default collation, so it's not easy to restrict the test
  * to the "tr-TR-x-icu" collation where it will succeed.
+ *
+ * Also disable for Windows.  It fails similarly, at least in some locales.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
+   version() ~ '(Visual C\+\+|mingw32|windows)' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
 FROM pg_database
 WHERE datname=current_database())
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 7065a5da190..874ec8519e1 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -6,8 +6,11 @@
  * Turkish dotted I is not correct for many ICU locales. citext always
  * uses the default collation, so it's not easy to restrict the test
  * to the "tr-TR-x-icu" collation where it will succeed.
+ *
+ * Also disable for Windows.  It fails similarly, at least in some locales.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
+   version() ~ '(Visual C\+\+|mingw32|windows)' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
 FROM pg_database
 WHERE datname=current_database())
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index 34b232d64e2..ba283320797 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -6,9 +6,12 @@
  * Turkish dotted I is not correct for many ICU locales. citext always
  * uses the default collation, so it's not easy to restrict the test
  * to the "tr-TR-x-icu" collation where it 

Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-10 Thread Thomas Munro
On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin  wrote:
> 11.05.2024 06:26, Thomas Munro wrote:
> > Perhaps a no-image, no-change registered buffer should not be
> > including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
> > useless for consistency checking too I guess, this issue aside,
> > because it doesn't change anything so there is nothing to check.
>
> Yes, I think something wrong is here. I've reduced the reproducer to:

Does it reproduce if you do this?

-   include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
+   include_image = needs_backup ||
+       ((info & XLR_CHECK_CONSISTENCY) != 0 &&
+        (regbuf->flags & REGBUF_NO_CHANGE) == 0);

Unfortunately the back branches don't have that new flag from 00d7fb5e
so, even if this is the right direction (not sure, I don't understand
this clean registered buffer trick) then ... but wait, why are there
no failures like this in the back branches (yet at least)?  Does
your reproducer work for 16?  I wonder if something relevant changed
recently, like f56a9def.  CC'ing Michael and Amit K for info.




Re: WAL record CRC calculated incorrectly because of underlying buffer modification

2024-05-10 Thread Thomas Munro
On Sat, May 11, 2024 at 3:57 AM Andres Freund  wrote:
> On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote:
> > and discovered that XLogRecordAssemble() calculates CRC over a buffer,
> > that might be modified by another process.
>
> If, with "might", you mean that it's legitimate for that buffer to be
> modified, I don't think so.  The bug is that something is modifying the buffer
> despite it being exclusively locked.
>
> I.e. what we likely have here is a bug somewhere in the hash index code.

I don't have a good grip on the higher level locking protocols of
hash.c, but one microscopic thing jumps out:

/*
 * bucket buffer was not changed, but still needs to be
 * registered to ensure that we can acquire a cleanup lock on
 * it during replay.
 */
if (!xlrec.is_primary_bucket_page)
{
uint8		flags = REGBUF_STANDARD |
			REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;

XLogRegisterBuffer(0, bucket_buf, flags);
}

That registers a buffer that is pinned but not content-locked, and we
tell xloginsert.c not to copy its image into the WAL, but it does it
anyway because:

/*
 * If needs_backup is true or WAL checking is enabled for current
 * resource manager, log a full-page write for the current block.
 */
include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;

So I guess it copies the image on dodo, which has:

'PG_TEST_EXTRA' => 'ssl ldap
kerberos wal_consistency_checking libpq_encryption xid_wraparound'

Perhaps a no-image, no-change registered buffer should not be
including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
useless for consistency checking too I guess, this issue aside,
because it doesn't change anything so there is nothing to check.




Why is citext/regress failing on hamerkop?

2024-05-10 Thread Thomas Munro
For example, 'i'::citext = 'İ'::citext fails to be true.

It must now be using UTF-8 (unlike, say, Drongo) and non-C ctype,
given that the test isn't skipped.  This isn't the first time that
we've noticed that Windows doesn't seem to know about İ→i (see [1]),
but I don't think anyone has explained exactly why, yet.  It could be
that it just doesn't know about that in any locale, or that it is
locale-dependent and would only do that for Turkish, the same reason
we skip the test for ICU, or ...

Either way, it seems like we'll need to skip that test on Windows if
we want hamerkop to be green.  That can probably be cribbed from
collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
prelude... I just don't know how to explain it in the comment 'cause I
don't know why.

One new development in Windows-land is that the system now does
actually support UTF-8 in the runtime libraries[2].  You can set it at
a system level, or for an application at build time, or by adding
".UTF-8" to a locale name when opening the locale (apparently much
more like Unix systems, but I don't know what exactly it does).  I
wonder why we see this change now... did hamerkop have that ACP=UTF-8
setting applied on purpose, or if computers in Japan started doing
that by default instead of using Shift-JIS, or if it only started
picking UTF-8 around the time of the Meson change somehow, or the
initdb-template change.  It's a little hard to tell from the logs.
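
For what it's worth, that ".UTF-8" locale-name suffix is used with plain
setlocale(); here's a minimal sketch of the CRT-level mechanism, with an
illustrative locale name (see [2] below):

#include <locale.h>
#include <stdio.h>

int
main(void)
{
	/*
	 * Sketch only: ask the Windows CRT for UTF-8 semantics by appending
	 * ".UTF-8" to the locale name.  "ja-JP" is just an example; on
	 * systems without that locale this call fails and returns NULL.
	 */
	if (setlocale(LC_ALL, "ja-JP.UTF-8") == NULL)
		printf("locale not available\n");
	else
		printf("CRT locale set with UTF-8 code page\n");
	return 0;
}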

[1] 
https://www.postgresql.org/message-id/CAC%2BAXB10p%2BmnJ6wrAEm6jb51%2B8%3DBfYzD%3Dw6ftHRbMjMuSFN3kQ%40mail.gmail.com
[2] 
https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page




Re: First draft of PG 17 release notes

2024-05-09 Thread Thomas Munro
On Thu, May 9, 2024 at 4:04 PM Bruce Momjian  wrote:
> I welcome feedback.  For some reason it was an easier job than usual.

> 2024-01-25 [820b5af73] jit: Require at least LLVM 10.

> Require LLVM version 10 or later (Peter Eisentraut)

Peter reviewed, I authored, and I think you intend to list authors in
parentheses.




Re: cannot abort transaction 2737414167, it was already committed

2024-05-08 Thread Thomas Munro
On Thu, Dec 28, 2023 at 11:42 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > In CommitTransaction() there is a stretch of code beginning s->state =
> > TRANS_COMMIT and ending s->state = TRANS_DEFAULT, from which we call
> > out to various subsystems' AtEOXact_XXX() functions.  There is no way
> > to roll back in that state, so anything that throws ERROR from those
> > routines is going to get something much like $SUBJECT.  Hmm, we'd know
> > which exact code path got that EIO from your smoldering core if we'd
> > put an explicit critical section there (if we're going to PANIC
> > anyway, it might as well not be from a different stack after
> > longjmp()...).
>
> +1, there's basically no hope of debugging this sort of problem
> as things stand.

I was reminded of this thread by Justin's other file system snafu thread.

Naively defining a critical section to match the extent of the
TRANS_COMMIT state doesn't work, as a bunch of code under there uses
palloc().  That reminds me of the nearby RelationTruncate() thread,
and there is possibly even some overlap, plus more in this case...
ugh.

Hmm, AtEOXact_RelationMap() is one of those steps, but lives just
outside the crypto-critical-section created by TRANS_COMMIT, though
has its own normal CS for logging.  I wonder, given that "updating the
map file is effectively commit of the relocation", why wouldn't it
have a variant of the problem solved by DELAY_CHKPT_START for normal
commit records, under diabolical scheduling?  It's a stretch, but: You
log XLOG_RELMAP_UPDATE, a concurrent checkpoint runs with REDO after
that record, you crash before/during durable_rename(), and then you
perform crash recovery.  Now your catalog is still using the old
relfilenode on the primary, but any replica following along replays
XLOG_RELMAP_UPDATE and is using the new relfilenode, frozen in time,
for queries, while replaying changes to the old relfilenode.  Right?




Re: backend stuck in DataFileExtend

2024-05-07 Thread Thomas Munro
On Wed, May 8, 2024 at 6:54 AM Justin Pryzby  wrote:
> On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> > https://github.com/openzfs/zfs/issues/11641
> >
> > I don't know enough to say anything useful about that but it certainly
> > smells similar...
>
> Wow - I'd completely forgotten about that problem report.
> The symptoms are the same, even with a zfs version 3+ years newer.
> I wish the ZFS people would do more with their problem reports.

If I had to guess, my first idea would be that your 1MB or ginormous
16MB recordsize (a relatively new option) combined with PostgreSQL's
8KB block-at-a-time random order I/O patterns are tickling strange
corners and finding a bug that no one has seen before.  I would
imagine that almost everyone in the galaxy who uses very large records
does so with 'settled' data that gets streamed out once sequentially
(for example I think some of the OpenZFS maintainers are at Lawrence
Livermore National Lab where I guess they might pump around petabytes
of data produced by particle physics research or whatever it might be,
probably why they are also adding direct I/O to avoid caches
completely...).  But for us, if we have lots of backends reading,
writing and extending random 8KB fragments of a 16MB page concurrently
(2048 pages per record!), maybe we hit some broken edge...  I'd be
sure to include that sort of detail in any future reports.

Normally I suppress urges to blame problems on kernels, file systems
etc and in the past accusations that ZFS was buggy turned out to be
bugs in PostgreSQL IIRC, but user space sure seems to be off the hook
for this one...




Re: backend stuck in DataFileExtend

2024-05-06 Thread Thomas Munro
On Tue, May 7, 2024 at 6:21 AM Justin Pryzby  wrote:
> FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.
...
> Yes, they're running centos7 with the indicated kernels.

So far we've got:

* spurious EIO when opening a file (your previous report)
* hanging with CPU spinning (?) inside pwritev()
* old kernel, bleeding edge ZFS

From an (uninformed) peek at the ZFS code, if it really is spinning
there, it seems like a pretty low level problem: it's finished the write,
and now is just trying to release (something like our unpin) and
unlock the buffers, which involves various code paths that might touch
various mutexes and spinlocks, and to get stuck like that I guess it's
either corrupted itself or it is deadlocking against something else,
but what?  Do you see any other processes (including kernel threads)
with any stuck stacks that might be a deadlock partner?

While looking around for reported issues I found your abandoned report
against an older ZFS version from a few years ago, same old Linux
version:

https://github.com/openzfs/zfs/issues/11641

I don't know enough to say anything useful about that but it certainly
smells similar...

I see you've been busy reporting lots of issues, which seems to
involve big data, big "recordsize" (= ZFS block sizes), compression
and PostgreSQL:

https://github.com/openzfs/zfs/issues?q=is%3Aissue+author%3Ajustinpryzby




Re: 039_end_of_wal: error in "xl_tot_len zero" test

2024-05-05 Thread Thomas Munro
On Thu, Feb 15, 2024 at 10:40 PM Anton Voloshin
 wrote:
> On 19/01/2024 01:35, Thomas Munro wrote:
> > I don't yet have an opinion on the best way to
> > do it though.  Would it be enough to add emit_message($node, 0) after
> > advance_out_of_record_splitting_zone()?
>
> Yes, indeed that seems to be enough. At least I could not produce any
> more "xl_tot_len zero" failures with that addition.
>
> I like this solution the best.

Oh, it looks like this new build farm animal "skimmer" might be
reminding us about this issue:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skimmer&br=HEAD

I don't know why it changed, but since this is an LSN/page alignment
thing, it could be due to external things like an OS upgrade adding
more locales or something that affects initdb.  Will look soon and
fix.




Re: Streaming I/O, vectored I/O (WIP)

2024-04-30 Thread Thomas Munro
On Wed, May 1, 2024 at 2:51 PM David Rowley  wrote:
> On Wed, 24 Apr 2024 at 14:32, David Rowley  wrote:
> > I've attached a patch with a few typo fixes and what looks like an
> > incorrect type for max_ios. It's an int16 and I think it needs to be
> > an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
> > anything when max_ios is int16.
>
> No feedback, so I'll just push this in a few hours unless anyone has anything.

Patch looks correct, thanks.  Please do.  (Sorry, running a bit behind
on email ATM...  I also have a few more typos around here from an
off-list email from Mr Lakhin, will get to that soon...)




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-29 Thread Thomas Munro
On Tue, Apr 30, 2024 at 7:17 AM Thomas Munro  wrote:
> On Tue, Apr 30, 2024 at 6:47 AM Maksim Milyutin  wrote:
> >> Should not we call at the end the StrategyFreeBuffer() function to add 
> >> target buffer to freelist and not miss it after invalidation?
>
> > Please take a look at this issue, current implementation of 
> > EvictUnpinnedBuffer() IMO is erroneous - evicted buffers are lost 
> > permanently and will not be reused again

I don't think that's true: it is not lost permanently, it'll be found
by the regular clock hand.  Perhaps it should be put on the freelist
so it can be found again quickly, but I'm not sure that's a bug, is
it?  If it were true, even basic testing eg select
count(pg_buffercache_evict(bufferid)) from pg_buffercache would leave
the system non-functional, but it doesn't, the usual CLOCK algorithm
just does its thing.




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-29 Thread Thomas Munro
On Tue, Apr 30, 2024 at 6:47 AM Maksim Milyutin  wrote:
>> Should not we call at the end the StrategyFreeBuffer() function to add 
>> target buffer to freelist and not miss it after invalidation?

> Please take a look at this issue, current implementation of 
> EvictUnpinnedBuffer() IMO is erroneous - evicted buffers are lost permanently 
> and will not be reused again

Hi Maksim,

Oops, thanks, will fix.




Re: WIP: Vectored writeback

2024-04-26 Thread Thomas Munro
tive strategy.  A
ring that starts small, and grows/shrinks in response to dirty data
(instead of "rejecting").  That would have at least superficial
similarities to the ARC algorithm, the "adaptive" bit that controls
ring size (it's interested in recency vs frequency, but here it's more
like "we're willing to waste more memory on dirty data, because we
need to keep it around longer, to avoid flushing the WAL, but not
longer than that" which may be a different dimension to value cached
data on, I'm not sure).

Of course there must be some workloads/machines where using a strategy
(instead of BAS_BULKREAD when it degrades to BAS_NORMAL behaviour)
will be slower because of WAL flushes, but that's not a fair fight:
the flip side of that coin is that you've trashed the buffer pool,
which is an external cost paid by someone else, ie it's anti-social,
BufferAccessStrategy's very raison d'être.

Anyway, in the meantime, I hacked heapam.c to use BAS_BULKWRITE just
to see how it would work with this patch set.  (This causes an
assertion to fail in some test, something about the stats for
different IO contexts that was upset by IOCONTEXT_BULKWRITE, which I
didn't bother to debug, it's only a demo hack.)  Unsurprisingly, it
looks like this:

postgres=# delete from t;

...
pread(25,...,131072,0xc89e000) = 131072 (0x20000)   <-- already committed
pread(25,...,131072,0xc8be000) = 131072 (0x20000)   read-stream behaviour
kill(75954,SIGURG) = 0 (0x0)                        <-- hey WAL writer!
pread(25,...,131072,0xc8de000) = 131072 (0x20000)
pread(25,...,131072,0xc8fe000) = 131072 (0x20000)
...
pwrite(25,...,131072,0x15200000) = 131072 (0x20000) <-- write-behind!
pwrite(25,...,131072,0x151e0000) = 131072 (0x20000)
pwrite(25,...,131072,0x151c0000) = 131072 (0x20000)
...

UPDATE and INSERT conceptually work too, but they suffer from other
stupid page-at-a-time problems around extension so it's more fun to
look at DELETE first.

The whole write-behind notion, and the realisation that we already
have it and should just make it into a "real thing", jumped out at me
while studying Melanie's VACUUM pass 1 and VACUUM pass 2 patches for
adding read streams.  Rebased and attached here.  That required
hacking on the new tidstore.c stuff a bit.  (We failed to get the
VACUUM read stream bits into v17, but the study of that led to the
default BAS_VACUUM size being cranked up to reflect modern realities,
and generally sent me down this rabbit hole for a while.)

Some problems:
* If you wake the WAL writer more often, throughput might actually go
down on high latency storage due to serialisation of WAL flushes.  So
far I have declined to try to write an adaptive algorithm to figure
out whether to do it, and where the threshold should be.  I suspect it
might involve measuring time and hill-climbing...  One option is to
abandon this part (ie just do no worse than master at WAL flushing),
or at least consider that a separate project.
* This might hold too many pins!  It does respect the limit mechanism,
but that can let you have a lot of pins (it's a bit TOCTOU-racy too,
we might need something smarter).  One idea would be to release pins
while writes are in the LSN queue, and reacquire them with
ReadRecentBuffer() as required, since we don't really care if someone
else evicts them in the meantime.
* It seems a bit weird that we *also* have the WritebackContext
machinery.  I could probably subsume that whole mechanism into
write_stream.c.  If you squint, sync_file_range() is a sort of dual of
POSIX_FADV_WILLNEED, which the read counterpart looks after.
* I would like to merge Heikki's bulk write stuff into this somehow,
not yet thought about it much.

The patches are POC-quality only and certainly have bugs/missed edge
cases/etc.  Thoughts, better ideas, references to writing about this
problem space, etc, welcome.
From 63cb8f88fd65ef34536c7d4360b964ca5e6cf62d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 25 Apr 2024 23:45:48 +1200
Subject: [PATCH v2 01/11] Teach WritebackContext to work with block ranges.

Instead of having to feed it one block at a time, allow writeback of
block ranges to be scheduled, in preparation for I/O combining.
---
 src/backend/storage/buffer/bufmgr.c | 28 
 src/include/storage/buf_internals.h |  4 +++-
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f91..d7e434daaf1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1992,7 +1992,7 @@ again:
 		LWLockRelease(content_lock);
 
 		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
+									  &buf_hdr->tag, 1);
 	}
 
 
@@ -3486,7 +3486,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
 	 * IOContext will always be IOCONT

Re: Requiring LLVM 14+ in PostgreSQL 18

2024-04-23 Thread Thomas Munro
Rebased over ca89db5f.

I looked into whether we could drop the "old pass manager" code
too[1].  Almost, but nope, even the C++ API lacks a way to set the
inline threshold before LLVM 16, so that would cause a regression.
Although we just hard-code the threshold to 512 with a comment that
sounds like it's pretty arbitrary, a change to the default (225?)
would be unjustifiable just for code cleanup.  Oh well.

[1] 
https://github.com/macdice/postgres/commit/0d40abdf1feb75210c3a3d2a35e3d6146185974c


v2-0001-jit-Require-at-least-LLVM-14-if-enabled.patch
Description: Binary data


v2-0002-jit-Use-opaque-pointers-in-all-supported-LLVM-ver.patch
Description: Binary data


Re: fix tablespace handling in pg_combinebackup

2024-04-22 Thread Thomas Munro
On Tue, Apr 23, 2024 at 8:05 AM Robert Haas  wrote:
> I reworked the test cases so that they don't (I think) rely on
> symlinks working as they do on normal platforms.

Cool.

(It will remain a mystery for now why perl readlink() can't read the
junction points that PostgreSQL creates (IIUC), but the OS can follow
them and PostgreSQL itself can read them with apparently similar code.
I find myself wondering if symlinks should go on the list of "things
we pretended Windows had out of convenience, that turned out to be
more inconvenient than we expected, and we'd do better to tackle
head-on with a more portable idea".  Perhaps we could just use a
tablespace map file instead to do our own path construction, or
something like that.  I suspect that would be one of those changes
that is technically easy, but community-wise hard as it affects a
load of backup tools and procedures...)




Re: fix tablespace handling in pg_combinebackup

2024-04-21 Thread Thomas Munro
On Mon, Apr 22, 2024 at 12:00 AM Alexander Lakhin  wrote:
>  From what I can see, the following condition (namely, -l):
>  if ($path =~ /^pg_tblspc\/(\d+)$/ && -l "$backup_path/$path")
>  {
>  push @tsoids, $1;
>  return 0;
>  }
>
> is false for junction points on Windows (cf [1]), but the target path is:

Ah, yes, right, -l doesn't like junction points.  Well, we're already
using the Win32API::File package (it looks like a CPAN library, but I
guess the Windows perl distros like Strawberry are all including it
for free?).  See PostgreSQL::Test::Utils::is_symlink(), attached.
That seems to work as expected, but someone who actually knows perl
can surely make it better.  Then I hit the next problem:

readlink 
C:\cirrus\build/testrun/pg_combinebackup/002_compare_backups\data/t_002_compare_backups_primary_data/backup/backup1/pg_tblspc/16415:
Inappropriate I/O control operation at
C:/cirrus/src/test/perl/PostgreSQL/Test/Cluster.pm line 927.

https://cirrus-ci.com/task/5162332353986560

I don't know where exactly that error message is coming from, but
assuming that Strawberry Perl contains this code:

https://github.com/Perl/perl5/blob/f936cd91ee430786a1bb6068a4a7c8362610dd5f/win32/win32.c#L2041
https://github.com/Perl/perl5/blob/f936cd91ee430786a1bb6068a4a7c8362610dd5f/win32/win32.c#L1976

... then it's *very* similar to what we're doing in our own
pgreadlink() code.  I wondered if the buffer size might be too small
for our path, but it doesn't seem so:

https://github.com/Perl/perl5/blob/f936cd91ee430786a1bb6068a4a7c8362610dd5f/win32/win32.c#L1581C1-L1581C35

(I think MAX_PATH is 260 on Windows.)

If there is some unfixable problem with what they're doing in their
readlink(), then I guess it should be possible to read the junction
point directly in Perl using Win32API::File::DeviceIoControl()... but
I can't see what's wrong with it!  Maybe it's not the same code?

Attached are the new test support functions, and the fixup to Robert's
6bf5c42b that uses them.  To be clear, this doesn't work, yet.  It has
got to be close though...


0001-More-Windows-pseudo-symlink-support-for-Perl-code.patch
Description: Binary data


0002-fixup.patch
Description: Binary data


Re: fix tablespace handling in pg_combinebackup

2024-04-19 Thread Thomas Munro
I don't know how to fix 82023d47^ on Windows[1][2], but in case it's
useful, here's a small clue.  I see that Perl's readlink() does in
fact know how to read "junction points" on Windows[3], so if that was
the only problem it might have been very close to working.  One
difference is that our own src/port/dirmod.c's pgreadlink() also
strips \??\ from the returned paths (something to do with the
various forms of NT path), but when I tried that:

my $olddir = readlink("$backup_path/pg_tblspc/$tsoid")
|| die "readlink
$backup_path/pg_tblspc/$tsoid: $!";

+   # strip NT path prefix (see src/port/dirmod.c
pgreadlink())
+   $olddir =~ s/^\\\?\?\\// if
$PostgreSQL::Test::Utils::windows_os;

... it still broke[4].  So I'm not sure what's going on...

[1] https://github.com/postgres/postgres/runs/24040897199
[2] 
https://api.cirrus-ci.com/v1/artifact/task/5550091866472448/testrun/build/testrun/pg_combinebackup/002_compare_backups/log/002_compare_backups_pitr1.log
[3] 
https://github.com/Perl/perl5/blob/f936cd91ee430786a1bb6068a4a7c8362610dd5f/win32/win32.c#L2041
[4] https://cirrus-ci.com/task/6746621638082560




Re: cfbot is failing all tests on FreeBSD/Meson builds

2024-04-18 Thread Thomas Munro
On Fri, Apr 19, 2024 at 10:36 AM Thomas Munro  wrote:
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276535
>
> So perhaps it's time for me to undo what I did before...  looking now.

It turned out that I still needed the previous work-around, but I was
too clever for my own boots last time.  For the record, here is the
new change to the image building script:

https://github.com/anarazel/pg-vm-images/commit/faff91cd40d6af0cbc658f5c11da47e2aa88d332

I should have listened to Bilal's prediction[1] about this the first
time.  But this time, I know that the real fix is coming in the next
package very soon, per bugzilla link above.

One thing I noticed is that 010_tab_completion is actually being
skipped, with that fix.  It used to run, I'm sure.  Not sure why, but
I'll look more seriously when the 1.21 or whatever shows up.  At least
we should soon have green CI again in the meantime.

[1] https://github.com/anarazel/pg-vm-images/pull/92




Re: Allow tests to pass in OpenSSL FIPS mode

2024-04-18 Thread Thomas Munro
On Fri, Apr 19, 2024 at 4:00 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Probably not this thread's fault, but following the breadcrumbs to the
> > last thread to touch the relevant test lines in
> > authentication/001_password, is it expected that we have these
> > warnings?
>
> > psql:<stdin>:1: WARNING:  roles created by regression test cases
> > should have names starting with "regress_"
>
> I think the policy is that we enforce that for cases reachable
> via "make installcheck" (to avoid possibly clobbering global
> objects in a live installation), but not for cases only reachable
> via "make check", such as TAP tests.  So I'm not that concerned
> about this, although if someone is feeling anal enough to rename
> the test role I won't stand in the way.

Got it, thanks.  Not me, just asking.




Re: Allow tests to pass in OpenSSL FIPS mode

2024-04-18 Thread Thomas Munro
On Sat, Nov 18, 2023 at 7:46 AM Peter Eisentraut  wrote:
> All done, thanks.

Probably not this thread's fault, but following the breadcrumbs to the
last thread to touch the relevant test lines in
authentication/001_password, is it expected that we have these
warnings?

psql:<stdin>:1: WARNING:  roles created by regression test cases
should have names starting with "regress_"




Re: AIX support

2024-04-18 Thread Thomas Munro
On Fri, Apr 19, 2024 at 6:01 AM Andres Freund  wrote:
> On 2024-04-18 11:15:43 +, Sriram RK wrote:
> > We (IBM-AIX team) looked into this issue
> >
> > https://www.postgresql.org/message-id/20240225194322...@rfd.leadboat.com
> >
> > This is related to the compiler issue. The compilers xlc(13.1) and gcc(8.0)
> > have issues. But we verified that this issue is resolved with the newer
> > compiler versions openXL(xlc17.1) and gcc(12.0) onwards.
>
> The reason we used these compilers was that those were the only ones we had
> kinda somewhat reasonable access to, via the gcc projects'
> "compile farm" https://portal.cfarm.net/
> We have to rely on whatever the aix machines there provide. They're not
> particularly plentiful resource-wise either.

To be fair, those OSUOSL machines are donated by IBM:

https://osuosl.org/services/powerdev/

It's just that they seem to be mostly focused on supporting Linux on
POWER, with only a couple of AIX hosts (partitions/virtual machines?)
made available via portal.cfarm.net, and they only very recently added
a modern AIX 7.3 host. That's cfarm119, upgraded in September-ish,
long after many threads on this list that between-the-lines threatened
to drop support.

> This is generally one of the big issues with AIX support. There are other
> niche-y OSs that don't have a lot of users, e.g. openbsd, but in contrast to
> AIX I can just start an openbsd VM within a few minutes and iron out whatever
> portability issue I'm dealing with.

Yeah.  It is a known secret that you can run AIX inside Qemu/kvm (it
appears that IBM has made changes to it to make that possible, because
earlier AIX versions didn't like Qemu's POWER emulation or
virtualisation, there are blog posts about it), but IBM doesn't
actually make the images available to non-POWER-hardware owners (you
need a serial number).  If I were an OS vendor and wanted developers
to target my OS for free, at the very top of my TODO list I would
have: provide an easy to use image for developers to be able to spin
something up in minutes and possibly even use in CI systems.  That's
the reason I can fix any minor portability issue on Linux, illumos,
*BSD quickly and Windows with only moderate extra pain.  Even Oracle
knows this, see Solaris CBE.

> > We want to make a note that postgres is used extensively in our IBM product
> > and is being exploited by multiple customers.
>
> To be blunt: Then it'd have been nice to see some investment in that before
> now. Both on the code level and the infrastructure level (i.e. access to
> machines etc).

In the AIX space generally, there were even clues that funding had
been completely cut even for packaging PostgreSQL.  I was aware of two
packaging projects (not sure how they were related):

1.  The ATOS packaging group, who used to show up on our mailing lists
and discuss code changes, which announced it was shutting down:

https://github.com/power-devops/bullfreeware

2.  And last time I looked a few months back, the IBM AIX Toolbox
packaging project only had PostgreSQL 10 or 11 packages, already out
of support by us, meaning that their maintainer had given up, too:

https://www.ibm.com/support/pages/aix-toolbox-open-source-software-downloads-alpha

However I see that recently (last month?) someone has added PostgreSQL
15, so something has only just reawoken there?

There are quite a lot of threads about AIX problems, but they are
almost all just us non-AIX-users trying to unbreak stupid stuff on the
build farm, which at some points began to seem distinctly quixotic:
chivalric hackers valiantly trying to keep the entire Unix family tree
working even though we don't remember why and the versions involved are
out of support even by the vendor.  Of the three old giant commercial
Unixes, HP-UX was dropped without another mention (it really was a
windmill after all), Solaris is somehow easier to deal with (I could
guess it's because it influenced Linux and BSD so much, ELF and linker
details spring to mind), while AIX fails on every dimension:
unrepresented by users, lacking in build farm, unavailable to
non-customers, and unusual among Unixen.




Re: cfbot is failing all tests on FreeBSD/Meson builds

2024-04-18 Thread Thomas Munro
On Thu, Feb 8, 2024 at 3:53 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Tue, Jan 30, 2024 at 5:06 PM Tom Lane  wrote:
> >> Thomas Munro  writes:
> >>> Ahh, there is one: https://github.com/cpan-authors/IO-Tty/issues/38
>
> >> Just for the archives' sake: I hit this today on a fresh install
> >> of FreeBSD 14.0, which has pulled in p5-IO-Tty-1.18.  Annoying...
>
> > FWIW here's what I did to downgrade:
>
> Thanks for the recipe --- this worked for me, although I noticed
> it insisted on installing perl5.34-5.34.3_3 alongside 5.36.
> Doesn't seem to be a problem though --- the main perl installation
> is still 5.36.

Looks like CI is broken in this way again, as of ~13 hours ago.
Looking into that...

> Also, I'm not entirely convinced that the above-cited issue report is
> complaining about the same thing that's biting us.  The reported error
> messages are completely different.

You could be right about that.  It seems there was a clash between an
upstream commit and a patch in FBSD's port tree, which has just been
removed:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276535

So perhaps it's time for me to undo what I did before...  looking now.




Re: Solaris tar issues, or other reason why margay fails 010_pg_basebackup?

2024-04-18 Thread Thomas Munro
On Fri, Apr 19, 2024 at 12:57 AM Marcel Hofstetter
 wrote:
> SKIP_READLINE_TESTS works. margay is now green again.

Great!  FTR there was a third thing revealed by margay since you
enabled the TAP tests: commit e2a23576.

I would guess that the best chance of getting the readline stuff to
actually work would be to interest someone who hacks on
IPC::Run-and-related-stuff (*cough* Noah *cough*) and who has Solaris
access to look at that... I would guess it needs a one-line fix
relating to raw/cooked behaviour, but as the proverbial mechanic said,
most of the fee is for knowing where to hit it...




Re: Cannot find a working 64-bit integer type on Illumos

2024-04-18 Thread Thomas Munro
On Thu, Apr 18, 2024 at 6:09 PM Japin Li  wrote:
> /home/japin/postgres/build/../src/common/config_info.c:198:11: error: 
> comparison of integer expressions of different signedness: 'int' and 'size_t' 
> {aka 'long unsigned int'} [-Werror=sign-compare]
>   198 |  Assert(i == *configdata_len);

Right, PostgreSQL doesn't compile cleanly with the "sign-compare"
warning.  There have been a few threads about that, and someone might
want to think harder about it, but it's a different topic unrelated to
<stdint.h>.

> /home/japin/postgres/build/../src/include/lib/simplehash.h:1138:9: error: 
> format '%llu' expects argument of type 'long long unsigned int', but argument 
> 4 has type 'uint64' {aka 'long unsigned int'} [-Werror=format=]

It seems my v1 patch's configure probe for INT64_FORMAT was broken.
In the v2 patch I tried not doing that probe at all, and instead
inviting <inttypes.h> into our world (that's the standardised way to
produce format strings, which has the slight complication that we are
intercepting printf calls...).  I suspect that'll work better for you.




Re: Cannot find a working 64-bit integer type on Illumos

2024-04-18 Thread Thomas Munro
On Thu, Apr 18, 2024 at 8:47 PM Peter Eisentraut  wrote:
> Maybe this means something like our int64 is long long int but the
> system's int64_t is long int underneath, but I don't see how that would
> matter for the limit macros.

Agreed, so I don't think it's long vs long long (when they have the same width).

I wonder if this comment is a clue:

static char *
inet_net_ntop_ipv6(const u_char *src, int bits, char *dst, size_t size)
{
/*
 * Note that int32_t and int16_t need only be "at least" large enough to
 * contain a value of the specified size.  On some systems, like Crays,
 * there is no such thing as an integer variable with 16 bits. Keep this
 * in mind if you think this function should have been coded to use
 * pointer overlays.  All the world's not a VAX.
 */

I'd seen that claim before somewhere else but I can't recall where.
So there were systems using those names in an ad hoc unspecified way
before C99 nailed this stuff down?  In modern C, int32_t is definitely
an exact width type (but there are other standardised variants like
int_fast32_t to allow for Cray-like systems that would prefer to use a
wider type, ie "at least", 32 bits wide, so I guess that's what
happened to that idea?).
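
To illustrate that "at least" distinction concretely (a standalone
sketch, not PostgreSQL code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int32_t		exact = 42;		/* exactly 32 bits, two's complement */
	int_least32_t least = 42;	/* smallest type with at least 32 bits */
	int_fast32_t fast = 42;		/* "fastest" type with at least 32 bits */

	/* On a Cray-like system the latter two could be wider than 32 bits. */
	printf("%zu %zu %zu\n", sizeof exact, sizeof least, sizeof fast);
	printf("%" PRId32 " %" PRIdLEAST32 " %" PRIdFAST32 "\n",
		   exact, least, fast);
	return 0;
}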

Or perhaps it's referring to worries about the width of char, short,
int or the assumption of two's-complement.  I think if any of that
stuff weren't as assumed we'd have many problems in many places, so
I'm not seeing a problem.  (FTR C23 finally nailed down
two's-complement as a requirement, and although C might not say so,
POSIX says that char is a byte, and our assumption that int = int32_t
is pretty deeply baked into PostgreSQL, so it's almost impossible to
imagine that short has a size other than 16 bits; but these are all
assumptions made by the OLD coding, not by the patch I posted).  In
short, I guess that isn't what was meant.




Re: Cannot find a working 64-bit integer type on Illumos

2024-04-18 Thread Thomas Munro
On Thu, Apr 18, 2024 at 8:47 PM Peter Eisentraut  wrote:
> I'm not sure I understand the problem here.  Do you mean that in theory
> a platform's PRId64 could be something other than "l" or "ll"?

Yes.  I don't know why anyone would do that, and the systems I checked
all have the obvious definitions, eg "ld", "lld" etc.  Perhaps it's an
acceptable risk?  It certainly gives us a tidier result.
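
For reference, the fully standard-conforming usage under discussion
looks like this (a minimal sketch):

#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
	int64_t		v = INT64_C(9223372036854775807);

	/* <inttypes.h> supplies the right length modifier for the platform. */
	printf("v = %" PRId64 "\n", v);
	return 0;
}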

For discussion, here is a variant that fully embraces <inttypes.h> and
the PRI*64 macros.


v2-0001-Use-int64_t-support-from-stdint.h-and-inttypes.h.patch
Description: Binary data


v2-0002-Remove-traces-of-BeOS.patch
Description: Binary data


Re: Cannot find a working 64-bit integer type on Illumos

2024-04-17 Thread Thomas Munro
On Sat, Mar 23, 2024 at 3:23 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > . o O ( int64_t, PRIdi64, etc were standardised a quarter of a century ago )
>
> Yeah.  Now that we require C99 it's probably reasonable to assume
> that those things exist.  I wouldn't be in favor of ripping out our
> existing notations like UINT64CONST, because the code churn would be
> substantial and the gain minimal.  But we could imagine reimplementing
> that stuff atop <stdint.h> and then getting rid of the configure-time
> probes.

I played around with this a bit, but am not quite there yet.

printf() is a little tricky.  The standard wants us to use
<inttypes.h>'s PRId64 etc, but that might confuse our snprintf.c (in
theory, probably not in practice).  "ll" should have the right size on
all systems, but gets warnings from the printf format string checker
on systems where "l" is the right type.  So I think we still need to
probe for INT64_MODIFIER at configure-time.  Here's one way, but I can
see it's not working on Clang/Linux... perhaps instead of that dubious
incantation I should try compiling some actual printfs and check for
warnings/errors.
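
Such a probe might boil down to compiling something like this once per
candidate modifier under -Werror=format and keeping whichever compiles
cleanly (a hypothetical sketch, not the actual configure test):

/*
 * Hypothetical probe body: with -Werror=format, this compiles only on
 * systems where "l" is the correct modifier for int64_t; a second run
 * with "%lld" would cover the "ll" case.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t		v = 0;

	printf("%ld\n", v);
	return 0;
}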

I think INT64CONST should just point to standard INT64_C().
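
That is, something like this sketch (the surrounding PostgreSQL plumbing
omitted):

#include <stdint.h>

#define INT64CONST(x)	INT64_C(x)
#define UINT64CONST(x)	UINT64_C(x)

static const int64_t big = INT64CONST(9000000000000000000);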

For limits, why do we have this:

- * stdint.h limits aren't guaranteed to have compatible types with our fixed
- * width types. So just define our own.

?  I mean, how could they not have compatible types?

I noticed that configure.ac checks if int64 (no "_t") might be defined
already by system header pollution, but meson.build doesn't.  That's
an inconsistency that should be fixed, but which way?  Hmm, commit
15abc7788e6 said that was done for BeOS, which we de-supported.  So
maybe we should get rid of that?


0001-Use-int64_t-and-friends-from-stdint.h.patch
Description: Binary data


0002-Remove-traces-of-BeOS.patch
Description: Binary data


Re: Solaris tar issues, or other reason why margay fails 010_pg_basebackup?

2024-04-17 Thread Thomas Munro
On Thu, Apr 18, 2024 at 1:40 AM Marcel Hofstetter
 wrote:
> Using gnu tar helps to make pg_basebackup work.

Thanks!  I guess that'll remain a mystery.

> It fails now at a later step.

Oh, this rings a bell:

[14:54:58] t/010_tab_completion.pl ..
Dubious, test returned 29 (wstat 7424, 0x1d00)

We had another thread[1] where we figured out that Solaris's termios
defaults include TABDLY=TAB3, meaning "expand tabs to spaces on
output", and that was upsetting our tab-completion test.  Other Unixes
used to vary on this point too, but they all converged on not doing
that, except Solaris, apparently.  Perhaps IPC::Run could fix that by
calling ->set_raw() on the pseudo-terminal, but I'm not very sure
about that.
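
In C terms the fix being speculated about amounts to clearing the
tab-expansion bits on the pty, something like this sketch of the termios
side (TABDLY/TAB0 are the XSI output-delay bits; whether IPC::Run can be
made to do the equivalent is the open question):

#include <termios.h>

/* Sketch: undo the Solaris default of TABDLY=TAB3 on a pty fd. */
static int
disable_tab_expansion(int fd)
{
	struct termios tio;

	if (tcgetattr(fd, &tio) != 0)
		return -1;
	tio.c_oflag &= ~TABDLY;		/* leaves TAB0: pass tabs through */
	return tcsetattr(fd, TCSANOW, &tio);
}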

This test suite is passing on pollock because it doesn't have IO::Pty
installed.  Could you try uninstalling that perl package for now, so
we can see what breaks next?

[06:34:40] t/010_tab_completion.pl .. skipped: IO::Pty is needed to
run this test

[1] 
https://www.postgresql.org/message-id/flat/MEYP282MB1669E2E11495A2DEAECE8736B6A7A%40MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM




Re: Solaris tar issues, or other reason why margay fails 010_pg_basebackup?

2024-04-17 Thread Thomas Munro
On Wed, Apr 17, 2024 at 7:17 PM Marcel Hofstetter
 wrote:
> Is there a way to configure which tar to use?
>
> gnu tar would be available.
>
> -bash-5.1$ ls -l /usr/gnu/bin/tar
> -r-xr-xr-x   1 root bin  1226248 Jul  1  2022 /usr/gnu/bin/tar

Cool.  I guess you could fix the test either by setting
TAR=/usr/gnu/bin/tar or PATH=/usr/gnu/bin:$PATH.

If we want to understand *why* it doesn't work, someone would need to
dig into that.  It's possible that PostgreSQL is using some GNU
extension (if so, apparently the BSDs' tar is OK with it too, and I
guess AIX's and HP-UX's was too in the recent times before we dropped
those OSes).  I vaguely recall (maybe 20 years ago, time flies) that
Solaris tar wasn't able to extract some tarballs but I can't remember
why...  I'm also happy to leave it at "Sun's tar doesn't work for us,
we don't know why" if you are.




Solaris tar issues, or other reason why margay fails 010_pg_basebackup?

2024-04-16 Thread Thomas Munro
Hi,

I noticed that margay (Solaris) has started running more of the tests
lately, but is failing in pg_basebackup/010_pg_basebackup.  It runs
successfully on wrasse (in older branches, Solaris 11.3 is desupported
in 17/master), and also on pollock (illumos, forked from common
ancestor Solaris 10 while it was open source).

Hmm, wrasse is using "/opt/csw/bin/gtar xf ..." and pollock is using
"/usr/gnu/bin/tar xf ...", while margay is using "/usr/bin/tar xf
...".  The tar command is indicating success (it's run by
system_or_bail and it's not bailing), but the replica doesn't want to
come up:

pg_ctl: directory
"/home/marcel/build-farm-15/buildroot/HEAD/pgsql.build/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_replica_data/pgdata"
is not a database cluster directory"

So one idea would be that our tar format is incompatible with Sun tar
in some way that corrupts the output, or there is some still
difference in the nesting of the directory structure it creates, or
something like that.  I wonder if this is already common knowledge in
the repressed memories of this list, but I couldn't find anything
specific.  I'd be curious to know why exactly, if so (in terms of
POSIX conformance etc, who is doing something wrong).




s/shm_mq_iovec/struct iovec/

2024-04-14 Thread Thomas Munro
Hi,

I was grepping for iovec users and noticed that the shm_mq stuff
defines its own iovec struct.  Is there any reason not to use the
standard one, now that we can?  Will add to next commitfest.
From 20b44cab0bb9f6218270aa0ae150ac0e560b49fe Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 15 Apr 2024 10:11:10 +1200
Subject: [PATCH] Use standard iovec in shm_mq.h interface.

Commit 2bd9e412 (2014) introduced "shm_mq_iovec", perhaps because
POSIX struct iovec was missing on Windows.  Commit 13a021f3 (2021)
introduced a standard-conforming replacement for Windows, for use in
various other parts of the tree.  Now that we can, we might as well use
the standard struct in the shared memory queue API too.
---
 src/backend/libpq/pqmq.c | 10 +-
 src/backend/storage/ipc/shm_mq.c | 29 -
 src/include/storage/shm_mq.h | 10 ++
 src/tools/pgindent/typedefs.list |  1 -
 4 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index 00a44ca803f..d737663e230 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -117,7 +117,7 @@ mq_is_send_pending(void)
 static int
 mq_putmessage(char msgtype, const char *s, size_t len)
 {
-	shm_mq_iovec iov[2];
+	struct iovec iov[2];
 	shm_mq_result result;
 
 	/*
@@ -147,10 +147,10 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 
 	pq_mq_busy = true;
 
-	iov[0].data = &msgtype;
-	iov[0].len = 1;
-	iov[1].data = s;
-	iov[1].len = len;
+	iov[0].iov_base = &msgtype;
+	iov[0].iov_len = 1;
+	iov[1].iov_base = unconstify(char *, s);
+	iov[1].iov_len = len;
 
 	Assert(pq_mq_handle != NULL);
 
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 9235fcd08ec..0d73bbf1879 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -21,6 +21,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
+#include "port/pg_iovec.h"
 #include "postmaster/bgworker.h"
 #include "storage/shm_mq.h"
 #include "storage/spin.h"
@@ -329,10 +330,10 @@ shm_mq_result
 shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait,
 			bool force_flush)
 {
-	shm_mq_iovec iov;
+	struct iovec iov;
 
-	iov.data = data;
-	iov.len = nbytes;
+	iov.iov_base = unconstify(void *, data);
+	iov.iov_len = nbytes;
 
	return shm_mq_sendv(mqh, &iov, 1, nowait, force_flush);
 }
@@ -358,7 +359,7 @@ shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait,
  * ring size.
  */
 shm_mq_result
-shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
+shm_mq_sendv(shm_mq_handle *mqh, struct iovec *iov, int iovcnt, bool nowait,
 			 bool force_flush)
 {
 	shm_mq_result res;
@@ -374,7 +375,7 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
 
 	/* Compute total size of write. */
 	for (i = 0; i < iovcnt; ++i)
-		nbytes += iov[i].len;
+		nbytes += iov[i].iov_len;
 
 	/* Prevent writing messages overwhelming the receiver. */
 	if (nbytes > MaxAllocSize)
@@ -423,9 +424,9 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
 		Size		chunksize;
 
 		/* Figure out which bytes need to be sent next. */
-		if (offset >= iov[which_iov].len)
+		if (offset >= iov[which_iov].iov_len)
 		{
-			offset -= iov[which_iov].len;
+			offset -= iov[which_iov].iov_len;
 			++which_iov;
 			if (which_iov >= iovcnt)
 break;
@@ -442,16 +443,17 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
 		 * MAXALIGN'd.
 		 */
 		if (which_iov + 1 < iovcnt &&
-			offset + MAXIMUM_ALIGNOF > iov[which_iov].len)
+			offset + MAXIMUM_ALIGNOF > iov[which_iov].iov_len)
 		{
 			char		tmpbuf[MAXIMUM_ALIGNOF];
 			int			j = 0;
 
 			for (;;)
 			{
-if (offset < iov[which_iov].len)
+if (offset < iov[which_iov].iov_len)
 {
-	tmpbuf[j] = iov[which_iov].data[offset];
+	char *data = iov[which_iov].iov_base;
+	tmpbuf[j] = data[offset];
 	j++;
 	offset++;
 	if (j == MAXIMUM_ALIGNOF)
@@ -459,7 +461,7 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
 }
 else
 {
-	offset -= iov[which_iov].len;
+	offset -= iov[which_iov].iov_len;
 	which_iov++;
 	if (which_iov >= iovcnt)
 		break;
@@ -487,10 +489,11 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
 		 * isn't a multiple of MAXIMUM_ALIGNOF.  Otherwise, we need to
 		 * MAXALIGN_DOWN the write size.
 		 */
-		chunksize = iov[which_iov].len - offset;
+		chunksize = iov[which_iov].iov_len - offset;
 		if (which_iov + 1 < iovcnt)
 			chunksize = MAXALIGN_DOWN(chunksize);
-		res = shm_mq_send_bytes(mqh, chunksize, &iov[which_iov].data[offset],
+		res = shm_mq_send_bytes(mqh, chunksize,
+								(char *) iov[which_iov].iov_base + offset,
 nowait

Re: Cleaning up threading code

2024-04-13 Thread Thomas Munro
On Mon, Jul 3, 2023 at 8:43 PM Heikki Linnakangas  wrote:
> On 10/06/2023 05:23, Thomas Munro wrote:
> > 2.  I don't like the way we have to deal with POSIX vs Windows at
> > every site where we use threads, and each place has a different style
> > of wrappers.  I considered a few different approaches to cleaning this
> > up:
> >
> > * provide centralised and thorough pthread emulation for Windows; I
> > don't like this, I don't even like all of pthreads and there are many
> > details to get lost in
> > * adopt C11 ; unfortunately it is too early, so you'd need
> > to write/borrow replacements for at least 3 of our 11 target systems
> > * invent our own mini-abstraction for a carefully controlled subset of stuff
>
> Google search on "c11 threads on Windows" found some emulation wrappers:
> https://github.com/jtsiomb/c11threads and
> https://github.com/tinycthread/tinycthread, for example. Would either of
> those work for us?
>
> Even if we use an existing emulation wrapper, I wouldn't mind having our
> own pg_* abstration on top of it, to document which subset of the POSIX
> or C11 functions we actually use.

Yeah.  I am still interested in getting our thread API tidied up, and
I intend to do that for PG 18.  The patch I posted on that front,
which can be summarised as a very lightweight subset of standard
 except with pg_ prefixes everywhere mapping to Windows or
POSIX threads, still needs one tiny bit more work: figuring out how to
get the TLS destructors to run on Windows FLS or similar, or
implementing destructors myself (= little destructor callback list
that a thread exit hook would run, work I was hoping to avoid by using
something from the OS libs... I will try again soon).  Andres also
opined upthread that we should think about offering a thread_local
storage class and getting away from TLS with keys.

One thing to note is that the ECPG code is using TLS with destructors
(in fact they are b0rked in master, see git grep "FIXME: destructor",
so ECPG leaks memory on Windows; the thing that I'm trying to fix in
pg_threads.h is actually fixing a long-standing bug), and although
thread_local has destructors in C++ it doesn't in C, so if we decided
to add the storage class but not bother with the tss_create feature,
then ECPG would need to do cleanup another way.  I will look into that
option.

One new development since last time I wrote the above stuff is that
the Microsoft toolchain finally implemented the library components of
C11 :

https://devblogs.microsoft.com/cppblog/c11-threads-in-visual-studio-2022-version-17-8-preview-2/

It seems like it would be a long time before we could contemplate
using that stuff though, especially given our duty of keeping libpq
and ECPG requirements low and reasonable.  However, it seems fairly
straightforward to say that we require C99 + some way to access a
C11-like thread local storage class.  In practice that means a
pg_thread_local macro that points to MSVC __declspec(thread) (which
works since MSVC 2014?) or GCC/Clang __thread_local (which works since
GCC 4.8 in 2013?) or whatever.  Of course every serious toolchain
supports this stuff somehow or other, having been required to for C11.
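
To make that concrete, here is a minimal sketch of the sort of thing I
mean (pg_thread_local is the name from my work-in-progress patch;
treat the details as hypothetical until something is committed):

    #if defined(_MSC_VER)
    #define pg_thread_local __declspec(thread)
    #elif defined(__GNUC__) || defined(__clang__)
    #define pg_thread_local __thread
    #else
    #error no known thread-local storage class for this compiler
    #endif

    /* usage: each thread gets its own copy */
    static pg_thread_local int my_error_state;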

I can't immediately see any build farm animals that would be affected.
Grison's compiler info is out of date; it's really running
8.something.  The old OpenBSD GCC 4.2 animal was upgraded, and antique
AIX got the boot: that's not even a coincidence, those retirements
came about because those systems didn't support arbitrary alignment,
another C11 feature that we now effectively require.  (We could have
worked around it if we had to, but they just weren't reasonable
targets.)

So I'll go ahead and add the storage class to the next version, and
contemplate a couple of different options for the tss stuff, including
perhaps leaving it out if that seems doable.

https://stackoverflow.com/questions/29399494/what-is-the-current-state-of-support-for-thread-local-across-platforms




Re: In MacOS, psql reacts on SIGINT in a strange fashion (Linux is fine)

2024-04-13 Thread Thomas Munro
On Sun, Apr 14, 2024 at 11:49 AM Tom Lane  wrote:
> Dmitry Koterov  writes:
> > I wish it was zsh... I tested it with zsh, but with bash (and with
> > low-level kill syscall), I observed the same effect unfortunately.
>
> > So it's still a puzzle.
>
> > 1. Even more, when I send a kill() low-level syscall using e.g. Perl - perl
> > -e 'kill("INT", 16107)' - it is the same.
> > 2. If any other program but psql is used (e.g. vim or my custom Perl script
> > which ignores SIGINT), the effect is not reproducible.
>
> > Can it be e.g. readline? Or something related to tty or session settings
> > which psql could modify (I did not find any in the source code though).
>
> OK, I tried dtruss'ing psql on macOS.  What I see is that with
> Apple's libedit, the response to SIGINT includes this:
>
> kill(0, 2)   = 0 0
>
> that is, "SIGINT my whole process group".  If I build with libreadline
> from MacPorts, I just see
>
> kill(30902, 2)   = 0 0
>
> (30902 being the process's own PID).  I'm not real sure why either
> library finds it necessary to re-signal the process --- maybe they
> trap SIGINT and then try to hide that by reinstalling the app's
> normal SIGINT handler and re-delivering the signal.  But anyway,
> libedit seems to be vastly exceeding its authority here.  If
> signaling the whole process group were wanted, it would have been
> the responsibility of the original signaller to do that.

Probably to intercept SIGWINCH -- I assume it's trying to propagate
the signal to the "real" signal handler, but it's propagating it a
little too far.

https://github.com/NetBSD/src/blob/1de18f216411bce77e26740327b0782976a89965/lib/libedit/sig.c#L110
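
For anyone following along, the pattern being described looks roughly
like this (a sketch only; app_saved_action is a stand-in for wherever
the library stashed the application's original handler):

    #include <signal.h>

    static struct sigaction app_saved_action;   /* saved by the library */

    static void
    lib_sigint_handler(int sig)
    {
        /* Reinstall the application's handler ... */
        sigaction(sig, &app_saved_action, NULL);

        /* ... and re-deliver the signal.  raise() targets only this
         * process; libedit's kill(0, sig) targets the whole process
         * group, which is the surprising part. */
        raise(sig);
    }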




Re: In MacOS, psql reacts on SIGINT in a strange fashion (Linux is fine)

2024-04-13 Thread Thomas Munro
On Sun, Apr 14, 2024 at 11:18 AM Dmitry Koterov
 wrote:
> Can it be e.g. readline? Or something related to tty or session settings 
> which psql could modify (I did not find any in the source code though).

I was wondering about that.  Are you using libedit or libreadline?
What happens if you build without readline/edit support?  From a quick
glance at libedit, it does a bunch of signal interception, but I
didn't check the details.  It is interested in stuff like SIGWINCH,
the window-resized-by-user signal.




Re: DROP DATABASE is interruptible

2024-04-10 Thread Thomas Munro
On Tue, Mar 12, 2024 at 9:00 PM Alexander Lakhin  wrote:
> I see two backends waiting:
> law  2420132 2420108  0 09:05 ?00:00:00 postgres: node: law 
> postgres [local] DROP DATABASE waiting
> law  2420135 2420108  0 09:05 ?00:00:00 postgres: node: law 
> postgres [local] startup waiting

Ugh.




Re: Requiring LLVM 14+ in PostgreSQL 18

2024-04-10 Thread Thomas Munro
On Wed, Apr 10, 2024 at 1:38 PM Thomas Munro  wrote:
> Therefore, some time after the tree re-opens for hacking, we could rip
> out a bunch of support code for LLVM 10-13, and then rip out support
> for pre-opaque-pointer mode.  Please see attached.

... or of course closer to the end of the cycle if that's what people
prefer for some reason, I don't mind too much as long as it happens.

I added this to the commitfest app, and it promptly failed for cfbot.
That's expected: CI is still using Debian 11 "bullseye", which only
has LLVM 11.  It became what Debian calls "oldstable" last year, and
reaches the end of oldstable in a couple of months from now.  Debian
12 "bookworm" is the current stable release, and it has LLVM 14, so we
should probably go and update those CI images...




Re: Potential stack overflow in incremental base backup

2024-04-10 Thread Thomas Munro
On Thu, Apr 11, 2024 at 12:11 AM Robert Haas  wrote:
> On Wed, Apr 10, 2024 at 6:21 AM Thomas Munro  wrote:
> > Could we just write the blocks directly into the output array, and
> > then transpose them directly in place if start_blkno > 0?  See
> > attached.  I may be missing something, but the only downside I can
> > think of is that the output array is still clobbered even if we decide
> > to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
> > requires a warning in the comment at the top.
>
> Yeah. This approach makes the name "relative_block_numbers" a bit
> confusing, but not running out of memory is nice, too.

Pushed.  That fixes the stack problem.

Of course we still have this:

/*
 * Since this array is relatively large, avoid putting it on the stack.
 * But we don't need it at all if this is not an incremental backup.
 */
if (ib != NULL)
relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);

To rescue my initdb --rel-segsize project[1] for v18, I will have a go
at making that dynamic.  It looks like we don't actually need to
allocate that until we get to the GetFileBackupMethod() call, and at
that point we have the file size.  If I understand correctly,
statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
block number buffer there if required, right?  That way, you would
only need shedloads of virtual memory if you used initdb
--rel-segsize=shedloads and you actually have shedloads of data in a
table/segment.  For example, with initdb --rel-segsize=1TB and a table
that contains 1TB+ of data, that'd allocate 512MB.  It sounds
borderline OK when put that way.  It sounds not OK with
--rel-segsize=infinite and 32TB of data -> palloc(16GB).  I suppose
one weasel-out would be to say we only support segments up to (say)
1TB, until eventually we figure out how to break this code's
dependency on segments.  I guess we'll have to do that one day to
support incremental backups of other smgr implementations that don't
even have segments (segments are a private detail of md.c after all,
not part of the smgr abstraction).
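
A minimal sketch of what I have in mind (the placement is hypothetical;
the point is just to size the allocation from the file size instead of
assuming RELSEG_SIZE blocks):

    /* Allocate only once we know how big this file really is. */
    if (ib != NULL)
    {
        BlockNumber max_blocks = statbuf.st_size / BLCKSZ;

        relative_block_numbers = palloc(sizeof(BlockNumber) * max_blocks);
    }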

[1] https://commitfest.postgresql.org/48/4305/




Re: Potential stack overflow in incremental base backup

2024-04-10 Thread Thomas Munro
On Fri, Mar 8, 2024 at 6:53 AM Robert Haas  wrote:
> ... We could
> avoid transposing relative block numbers to absolute block numbers
> whenever start_blkno is 0,  ...

Could we just write the blocks directly into the output array, and
then transpose them directly in place if start_blkno > 0?  See
attached.  I may be missing something, but the only downside I can
think of is that the output array is still clobbered even if we decide
to return BACK_UP_FILE_FULLY because of the 90% rule, but that just
requires a warning in the comment at the top.


v2-0001-Fix-potential-stack-overflow-in-incremental-baseb.patch
Description: Binary data


Re: Speed up clean meson builds by ~25%

2024-04-10 Thread Thomas Munro
On Wed, Apr 10, 2024 at 5:03 PM Tom Lane  wrote:
> I don't doubt that there are other clang versions where the problem
> bites a lot harder.  What result do you get from the test I tried
> (turning mm_strdup into a no-op macro)?

#define mm_strdup(x) (x) does this:

Apple clang 15: master: 14s -> 9s
MacPorts clang 16, master: 170s -> 10s




Re: Speed up clean meson builds by ~25%

2024-04-09 Thread Thomas Munro
On Wed, Apr 10, 2024 at 11:44 AM Tom Lane  wrote:
> ... On my Mac laptop
> (Apple clang version 15.0.0), the compile time for preproc.o went from
> 6.7sec to 5.5sec.

Having seen multi-minute compile times on FreeBSD (where clang is the
system compiler) and Debian (where I get packages from apt.llvm.org),
I have been quietly waiting for this issue to hit Mac users too (where
a clang with unknown proprietary changes is the system compiler), but
it never did.  Huh.

I tried to understand a bit more about Apple's version soup.  This
seems to be an up-to-date table (though I don't understand their
source of information):

https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_(since_visionOS_support)_2

According to cc -v on my up-to-date MacBook Air, it has "Apple clang
version 15.0.0 (clang-1500.3.9.4)", which, if the table is correct,
means that it's using LLVM 16.0.0 (note, not 16.0.6, the final version
of that branch of [open] LLVM, and the version I saw the issue with on
FreeBSD and Debian).  They relabel everything to match the Xcode
version that shipped it, and they're currently off by one.

I wondered if perhaps the table just wasn't accurate in the final
digits, so I looked for clues in strings in the binary, and sure
enough it contains "LLVM 15.0.0".  My guess would be that they've
clobbered the major version, but not the rest: the Xcode version is
15.3, and I don't see a 3, so I guess this is really derived from LLVM
16.0.0.

One explanation would be that they rebase their proprietary bits and
pieces over the .0 version of each major release, and then cherry-pick
urgent fixes and stuff later, not pulling in the whole minor release;
they also presumably have to maintain it for much longer than the LLVM
project's narrow support window.  Who knows.  So now I wonder if it
could be that LLVM 16.0.6 does this, but LLVM 16.0.0 doesn't.

I installed clang-16 (16.0.6) with MacPorts, and it does show the problem.




Requiring LLVM 14+ in PostgreSQL 18

2024-04-09 Thread Thomas Munro
Hi

PostgreSQL 18 will ship after these vacuum horizon systems reach EOL[1]:

   animal   |  arch   | llvm_version |   os   | os_release | end_of_support
------------+---------+--------------+--------+------------+----------------
 branta     | s390x   | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 splitfin   | aarch64 | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 urutau     | s390x   | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 massasauga | aarch64 | 11.1.0       | Amazon | 2          | 2025-06-30
 snakefly   | aarch64 | 11.1.0       | Amazon | 2          | 2025-06-30

Therefore, some time after the tree re-opens for hacking, we could rip
out a bunch of support code for LLVM 10-13, and then rip out support
for pre-opaque-pointer mode.  Please see attached.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2B-g61yq7Ce4aoZtBDO98b4GXH8Cu3zxVk-Zn1Vh7TKpA%40mail.gmail.com
From f5de5c6535b825033b1829eaf340baacc10ed654 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 19 Oct 2023 04:45:46 +1300
Subject: [PATCH 1/2] jit: Require at least LLVM 14, if enabled.

Remove support for LLVM versions 10-13.  The default on all non-EOL'd
OSes represented in our build farm will be at least LLVM 14 when
PostgreSQL 18 ships.

Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com
---
 config/llvm.m4  |   8 +-
 configure   |   8 +-
 doc/src/sgml/installation.sgml  |   4 +-
 meson.build |   2 +-
 src/backend/jit/llvm/llvmjit.c  | 109 
 src/backend/jit/llvm/llvmjit_error.cpp  |  25 --
 src/backend/jit/llvm/llvmjit_inline.cpp |  13 ---
 src/backend/jit/llvm/llvmjit_wrap.cpp   |   4 -
 8 files changed, 11 insertions(+), 162 deletions(-)

diff --git a/config/llvm.m4 b/config/llvm.m4
index 44769d819a0..6ca088b5a45 100644
--- a/config/llvm.m4
+++ b/config/llvm.m4
@@ -13,7 +13,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
   AC_REQUIRE([AC_PROG_AWK])
 
   AC_ARG_VAR(LLVM_CONFIG, [path to llvm-config command])
-  PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 llvm-config-13 llvm-config-12 llvm-config-11 llvm-config-10)
+  PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-18 llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14)
 
   # no point continuing if llvm wasn't found
   if test -z "$LLVM_CONFIG"; then
@@ -25,14 +25,14 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
 AC_MSG_ERROR([$LLVM_CONFIG does not work])
   fi
   # and whether the version is supported
-  if echo $pgac_llvm_version | $AWK -F '.' '{ if ([$]1 >= 10) exit 1; else exit 0;}';then
-AC_MSG_ERROR([$LLVM_CONFIG version is $pgac_llvm_version but at least 10 is required])
+  if echo $pgac_llvm_version | $AWK -F '.' '{ if ([$]1 >= 14) exit 1; else exit 0;}';then
+AC_MSG_ERROR([$LLVM_CONFIG version is $pgac_llvm_version but at least 14 is required])
   fi
   AC_MSG_NOTICE([using llvm $pgac_llvm_version])
 
   # need clang to create some bitcode files
   AC_ARG_VAR(CLANG, [path to clang compiler to generate bitcode])
-  PGAC_PATH_PROGS(CLANG, clang clang-17 clang-16 clang-15 clang-14 clang-13 clang-12 clang-11 clang-10)
+  PGAC_PATH_PROGS(CLANG, clang clang-18 clang-17 clang-16 clang-15 clang-14)
   if test -z "$CLANG"; then
 AC_MSG_ERROR([clang not found, but required when compiling --with-llvm, specify with CLANG=])
   fi
diff --git a/configure b/configure
index cfbd2a096f4..ba4ce5a5282 100755
--- a/configure
+++ b/configure
@@ -5065,7 +5065,7 @@ if test "$with_llvm" = yes; then :
 
 
   if test -z "$LLVM_CONFIG"; then
-  for ac_prog in llvm-config llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14 llvm-config-13 llvm-config-12 llvm-config-11 llvm-config-10
+  for ac_prog in llvm-config llvm-config-18 llvm-config-17 llvm-config-16 llvm-config-15 llvm-config-14
 do
   # Extract the first word of "$ac_prog", so it can be a program name with args.
 set dummy $ac_prog; ac_word=$2
@@ -5129,8 +5129,8 @@ fi
 as_fn_error $? "$LLVM_CONFIG does not work" "$LINENO" 5
   fi
   # and whether the version is supported
-  if echo $pgac_llvm_version | $AWK -F '.' '{ if ($1 >= 10) exit 1; else exit 0;}';then
-as_fn_error $? "$LLVM_CONFIG version is $pgac_llvm_version but at least 10 is required" "$LINENO" 5
+  if echo $pgac_llvm_version | $AWK -F '.' '{ if ($1 >= 14) exit 1; else exit 0;}';then
+as_fn_error $? "$LLVM_CONFIG version is $pgac_llvm_version but at least 14 is required" "$LINENO" 5
   fi
   { $as_echo "$as_me:${as_lineno-$LINENO}: using llvm $pgac_llvm_version" >&5
 $as_echo "$as_me: using llvm $pgac_llvm_version" >&6;}
@@ -5138,7 +5138,7 @@ $as_echo "$as_me: using llvm $pgac_llvm_version" >&6;}
   # need clang to create some bitcode files
 
   if test -z "$CLANG"; then
-  for ac_prog in clang clang-17 clang

Re: broken JIT support on Fedora 40

2024-04-09 Thread Thomas Munro
On Tue, Apr 9, 2024 at 10:05 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> +   /* In assertion builds, run the LLVM verify pass. */
> +#ifdef USE_ASSERT_CHECKING
> +   LLVMPassBuilderOptionsSetVerifyEach(options, true);
> +#endif

Thanks, that seems nicer.  I think the question is whether it will
slow down build farm/CI/local meson test runs to a degree that exceeds
its value.  Another option would be to have some other opt-in macro,
like the existing #ifdef LLVM_PASS_DEBUG, for people who maintain
JIT-related stuff to turn on.

Supposing we go with USE_ASSERT_CHECKING, I have another question:

-   const char *nm = "llvm.lifetime.end.p0i8";
+   const char *nm = "llvm.lifetime.end.p0";

Was that a mistake, or did the mangling rules change in some version?
I don't currently feel inclined to go and test this on the ancient
versions we claim to support in back-branches.  Perhaps we should just
do this in master, and then it'd be limited to worrying about LLVM
versions 10-18 (see 820b5af7), which have the distinct advantage of
being available in package repositories for testing.  Or I suppose we
could back-patch, but only do it if LLVM_VERSION_MAJOR >= 10.  Or we
could do it unconditionally, and wait for ancient-LLVM build farm
animals to break if they're going to.

I pushed the illegal attribute fix though.  Thanks for the detective work!

(It crossed my mind that perhaps deform functions should have their
own template function, but if someone figures out that that's a good
idea, I think we'll *still* need that change just pushed.)




Re: broken JIT support on Fedora 40

2024-04-09 Thread Thomas Munro
On Sat, Apr 6, 2024 at 5:01 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > > Yep, I think this is it. I've spent some hours trying to understand why
> > > suddenly deform function has noundef ret attribute, when it shouldn't --
> > > this explains it and the proposed change fixes the crash. One thing that
> > > is still not clear to me though is why this copied attribute doesn't
> > > show up in the bitcode dumped right before running inline pass (I've
> > > added this to troubleshoot the issue).
> >
> > One thing to consider in this context is indeed adding "verify" pass as
> > suggested in the PR, at least for the debugging configuration. Without the 
> > fix
> > it immediately returns:
> >
> >   Running analysis: VerifierAnalysis on deform_0_1
> >   Attribute 'noundef' applied to incompatible type!
> >
> >   llvm error: Broken function found, compilation aborted!
>
> Here is what I have in mind. Interestingly enough, it also shows few
> more errors besides "noundef":
>
> Intrinsic name not mangled correctly for type arguments! Should be: 
> llvm.lifetime.end.p0
> ptr @llvm.lifetime.end.p0i8
>
> It refers to the function from create_LifetimeEnd, not sure how
> important is this.

Would it be too slow to run the verify pass always, in assertion
builds?  Here's a patch for the original issue, and a patch to try
that idea + a fix for that other complaint it spits out.  The latter
would only run for LLVM 17+, but that seems OK.
From 57af42d9a1b47b7361c7200a17a210f2ca37bd5d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 9 Apr 2024 18:10:30 +1200
Subject: [PATCH 1/2] Fix illegal attribute LLVM attribute propagation.

Commit 72559438 started copying more attributes from AttributeTemplate
to the functions we generate on the fly.  In the case of deform
functions, which return void, this meant that "noundef", from
AttributeTemplate's return value (a Datum) was copied to a void type.
Older LLVM releases were OK with that, but LLVM 18 crashes.  "noundef"
is rejected for void by the verify pass, which would have alerted us to
this problem (something we'll consider checking automatically in a later
commit).

Update our llvm_copy_attributes() function to skip copying the attribute
for the return value, if the target function returns void.

Thanks to Dmitry Dolgov for help chasing this down.

Back-patch to all supported releases, like 72559438.

Reported-by: Pavel Stehule 
Reviewed-by: Dmitry Dolgov <9erthali...@gmail.com>
Discussion: https://postgr.es/m/CAFj8pRACpVFr7LMdVYENUkScG5FCYMZDDdSGNU-tch%2Bw98OxYg%40mail.gmail.com
---
 src/backend/jit/llvm/llvmjit.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index ec0fdd49324..92b4993a98a 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -552,8 +552,11 @@ llvm_copy_attributes(LLVMValueRef v_from, LLVMValueRef v_to)
 	/* copy function attributes */
 	llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeFunctionIndex);
 
-	/* and the return value attributes */
-	llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeReturnIndex);
+	if (LLVMGetTypeKind(LLVMGetFunctionReturnType(v_to)) != LLVMVoidTypeKind)
+	{
+		/* and the return value attributes */
+		llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeReturnIndex);
+	}
 
 	/* and each function parameter's attribute */
 	param_count = LLVMCountParams(v_from);
-- 
2.44.0

From 91d8b747686aa07341337e1cd0b7c2244e955adb Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 9 Apr 2024 18:57:48 +1200
Subject: [PATCH 2/2] Run LLVM verify pass on all generated IR.

We fixed a recent problem that crashed LLVM 18, but Dmitry pointed out
that we'd have known about that all along if we'd automatically run the
verify pass on the IR we generate.

Turn that on in assertion builds.  That reveals one other complaint
about a name, which is also fixed here.

Suggested-by: Dmitry Dolgov <9erthali...@gmail.com>
Discussion: https://postgr.es/m/CAFj8pRACpVFr7LMdVYENUkScG5FCYMZDDdSGNU-tch%2Bw98OxYg%40mail.gmail.com
---
 src/backend/jit/llvm/llvmjit.c  | 11 +--
 src/backend/jit/llvm/llvmjit_expr.c |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 92b4993a98a..bc429e970f2 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -703,10 +703,17 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 	LLVMErrorRef err;
 	const char *passes;
 
+	/* In assertion builds, run the LLVM verify pass. */
+#ifdef USE_ASSERT_CHECKING
+#define PASSES_PREFIX "verify,"
+#else
+#define PASSES_PREFIX ""
+#endif
+
 	if (context->b

Re: Speed up clean meson builds by ~25%

2024-04-08 Thread Thomas Munro
On Tue, Apr 9, 2024 at 5:01 PM Michael Paquier  wrote:
> On Mon, Apr 08, 2024 at 12:23:56PM +0300, Nazir Bilal Yavuz wrote:
> > On Mon, 8 Apr 2024 at 11:00, Alexander Lakhin  wrote:
> >> As I wrote in [1], I didn't observe the issue with clang-18, so maybe it
> >> is fixed already.
> >> Perhaps it's worth rechecking...
> >>
> >> [1] 
> >> https://www.postgresql.org/message-id/d2bf3727-bae4-3aee-65f6-caec2c4ebaa8%40gmail.com
> >
> > I had this problem on my local computer. My build times are:
> >
> > gcc: 20s
> > clang-15: 24s
> > clang-16: 105s
> > clang-17: 111s
> > clang-18: 25s
>
> Interesting.  A parallel build of ecpg shows similar numbers here:
> clang-16: 101s
> clang-17: 112s
> clang-18: 14s
> gcc: 10s

I don't expect it to get fixed BTW, because it's present in 16.0.6,
and .6 is the terminal release, if I understand their system
correctly.  They're currently only doing bug fixes for 18, and even
there not for much longer. Interesting that not everyone saw this at
first, perhaps the bug arrived in a minor release that some people
didn't have yet?  Or perhaps there is something special required to
trigger it?




Re: Vectored I/O in bulk_write.c

2024-04-08 Thread Thomas Munro
Here's a rebase.  I decided against committing this for v17 in the
end.  There's not much wrong with it AFAIK, except perhaps an
unprincipled chopping up of writes with large io_combine_limit due to
simplistic flow control, and I liked the idea of having a decent user
of smgrwritev() in the tree, and it probably makes CREATE INDEX a bit
faster, but... I'd like to try something more ambitious that
streamifies this and also the "main" writeback paths.  I shared some
patches for that that are counterparts to this, over at[1].

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGK1in4FiWtisXZ%2BJo-cNSbWjmBcPww3w3DBM%2BwhJTABXA%40mail.gmail.com
From 7ee50aae3d4eba0df5bce05c196f411abb0bd9ab Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 8 Apr 2024 18:19:41 +1200
Subject: [PATCH v7 1/3] Use smgrwritev() for both overwriting and extending.

Since mdwrite() and mdextend() were very similar and both needed
vectored variants, merge them into a single interface.  This reduces
duplication and fills out the set of operations.  We still want to be
able to assert that callers know the difference between overwriting and
extending, and to activate slightly different behavior during recovery,
so add a new "flags" argument to control that.

The goal here is to provide the missing vectored interface without which
the new bulk write facility from commit 8af25652 can't write to the file
system in bulk.  A following commit will connect them together.

Like smgrwrite(), the traditional single-block smgrextend() function
with skipFsync boolean argument is now translated to smgrwritev() by
an inlinable wrapper function.

The smgrzeroextend() interface remains distinct; the idea was floated of
merging that too, but so far without consensus.

Reviewed-by: Heikki Linnakangas 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA%2BhUKGLx5bLwezZKAYB2O_qHj%3Dov10RpgRVY7e8TSJVE74oVjg%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c |  10 +--
 src/backend/storage/smgr/md.c   | 120 
 src/backend/storage/smgr/smgr.c | 100 ++-
 src/include/storage/md.h|   4 +-
 src/include/storage/smgr.h  |  19 -
 5 files changed, 94 insertions(+), 159 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 44836751b71..5166a839da8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4011,11 +4011,11 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
 	 *
 	 * In recovery, we cache the value returned by the first lseek(SEEK_END)
 	 * and the future writes keeps the cached value up-to-date. See
-	 * smgrextend. It is possible that the value of the first lseek is smaller
-	 * than the actual number of existing blocks in the file due to buggy
-	 * Linux kernels that might not have accounted for the recent write. But
-	 * that should be fine because there must not be any buffers after that
-	 * file size.
+	 * smgrzeroextend and smgrwritev. It is possible that the value of the
+	 * first lseek is smaller than the actual number of existing blocks in the
+	 * file due to buggy Linux kernels that might not have accounted for the
+	 * recent write. But that should be fine because there must not be any
+	 * buffers after that file size.
 	 */
 	for (i = 0; i < nforks; i++)
 	{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bf0f3ca76d1..73d077ca3ea 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -447,79 +447,10 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 	pfree(path);
 }
 
-/*
- * mdextend() -- Add a block to the specified relation.
- *
- * The semantics are nearly the same as mdwrite(): write at the
- * specified position.  However, this is to be used for the case of
- * extending a relation (i.e., blocknum is at or beyond the current
- * EOF).  Note that we assume writing a block beyond current EOF
- * causes intervening file space to become filled with zeroes.
- */
-void
-mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 const void *buffer, bool skipFsync)
-{
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec*v;
-
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
-	/* This assert is too expensive to have on normally ... */
-#ifdef CHECK_WRITE_VS_EXTEND
-	Assert(blocknum >= mdnblocks(reln, forknum));
-#endif
-
-	/*
-	 * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
-	 * more --- we mustn't create a block whose number actually is
-	 * InvalidBlockNumber.  (Note that this failure should be unreachable
-	 * because of upstream checks in bufmgr.c.)
-	 */
-	if (blocknum == InvalidBlockNumber)
-		ereport(ERROR,
-(e

Streaming relation data out of order

2024-04-08 Thread Thomas Munro
Hi

This idea is due to Robert Haas, who complained that he feared that
the streaming I/O API already worked like this.  It doesn't, but it
could!  Here is a concept patch to try it out.

Normally, read_stream_next_buffer() spits out buffers in the order
that the user's callback generated block numbers.  This option says
that any order would be OK.

I had been assuming that this sort of thing would come with real
asynchronous I/O: if I/Os complete out of order and the caller
explicitly said she doesn't care about block order, we can stream them
as the completion events arrive.  But even with synchronous I/O, we
could stream already-in-cache blocks before ones that require I/O.
Letting the caller chew on blocks that are already available maximises
the time between fadvise() and preadv() for misses, which minimises
the likelihood that the process will have to go to "D" sleep.

The patch is pretty trivial: if started with the
READ_STREAM_OUT_OF_ORDER flag, "hit" buffers are allowed to jump in
front of "miss" buffers in the queue.  The attached coding may not be
optimal, it's just a proof of concept.

ANALYZE benefits from this, for example, with certain kinds of
partially cached initial states and fast disks (?).  I'm not sure how
generally useful it is though.  I'm posting it because I wonder if it
could be interesting for the streaming bitmap heapscan project, and I
wonder how many other things don't care about the order.
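
A caller would opt in like this (a sketch; my_next_block and my_state
are stand-ins, the rest is the existing read stream API plus the new
flag from the patch below):

    ReadStream *stream;
    Buffer      buf;

    stream = read_stream_begin_relation(READ_STREAM_OUT_OF_ORDER,
                                        NULL, rel, MAIN_FORKNUM,
                                        my_next_block, my_state, 0);
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* process the block; must not assume any particular order */
        ReleaseBuffer(buf);
    }
    read_stream_end(stream);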
From d549d3c0bac4a6e609781775e63277b370eb48ff Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 6 Apr 2024 13:28:28 +1300
Subject: [PATCH] Add READ_STREAM_OUT_OF_ORDER.

Just a proof-of-concept.
---
 src/backend/storage/aio/read_stream.c | 42 +++
 src/include/storage/read_stream.h |  7 +
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 2e45dcdcd4b..c1b95a42d39 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -115,6 +115,7 @@ struct ReadStream
 	int16		pinned_buffers;
 	int16		distance;
 	bool		advice_enabled;
+	bool		out_of_order_enabled;
 
 	/*
 	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
@@ -307,12 +308,38 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 				&stream->buffers[stream->queue_size],
 sizeof(stream->buffers[0]) * overflow);
 
-	/* Compute location of start of next read, without using % operator. */
-	buffer_index += nblocks;
-	if (buffer_index >= stream->queue_size)
-		buffer_index -= stream->queue_size;
-	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
-	stream->next_buffer_index = buffer_index;
+	/*
+	 * If this was a hit and there is already I/O in progress, can we
+	 * re-prioritize it so that I/O has longer to complete and the caller can
+	 * consume this buffer immediately?
+	 */
+	if (!need_wait &&
+		nblocks == 1 &&
+		stream->ios_in_progress > 0 &&
+		stream->out_of_order_enabled)
+	{
+		int16		oldest_buffer_index;
+
+		/* Insert before "oldest" buffer. */
+		oldest_buffer_index = stream->oldest_buffer_index - 1;
+		if (oldest_buffer_index < 0)
+			oldest_buffer_index += stream->queue_size;
+		stream->buffers[oldest_buffer_index] = stream->buffers[buffer_index];
+		if (stream->per_buffer_data_size > 0)
+			memcpy(get_per_buffer_data(stream, oldest_buffer_index),
+   get_per_buffer_data(stream, buffer_index),
+   stream->per_buffer_data_size);
+		stream->oldest_buffer_index = oldest_buffer_index;
+	}
+	else
+	{
+		/* Compute location of start of next read, without using % operator. */
+		buffer_index += nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		stream->next_buffer_index = buffer_index;
+	}
 
 	/* Adjust the pending read to cover the remaining portion, if any. */
 	stream->pending_read_blocknum += nblocks;
@@ -515,6 +542,9 @@ read_stream_begin_relation(int flags,
 		stream->advice_enabled = true;
 #endif
 
+	if (flags & READ_STREAM_OUT_OF_ORDER)
+		stream->out_of_order_enabled = true;
+
 	/*
 	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
 	 * above.  If we had real asynchronous I/O we might need a slightly
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index fae09d2b4cc..d23d51af00c 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -41,6 +41,13 @@
  */
 #define READ_STREAM_FULL 0x04
 
+/*
+ * We usually stream buffers in the order the callback generates block
+ * numbers, but if the caller can cope with it, there are sometimes
+ * opportunities to reorder blocks to reduce I/O stalls.
+ */
+#define READ_STREAM_OUT_OF_ORDER 0x08
+
 struct ReadStream;
 typedef struct ReadStream ReadStream;
 
-- 
2.44.0



Experimental prefetching of buffer memory

2024-04-08 Thread Thomas Munro
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+$1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index cfbd2a096f4..9d9ed38fe57 100755
--- a/configure
+++ b/configure
@@ -15546,6 +15546,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+__builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 67e738d92b1..e991e56613c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1781,6 +1781,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 5acf083ce3c..eecd8aa6f45 100644
--- a/meson.build
+++ b/meson.build
@@ -1707,6 +1707,7 @@ builtins = [
   'constant_p',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index dc1841346cd..1cb191b02a4 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -425,6 +425,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* 
  *Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f8d3e3b6b84..6f0b6a51417 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -546,6 +546,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
-- 
2.44.0

From 0f1a87954e27cd6e59e3ef45b610677b13a3985b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 5 Apr 2024 15:06:32 +1300
Subject: [PATCH 2/2] Prefetch page header memory when streaming relations.

read_stream.c can always see at least one page ahead of the one the
caller is accessing.  Take the opportunity to prefetch the cache line
that holds the next page's header.  For some scans, that can generate a
decent speedup, though real world results will depend on how much work
the CPU actually does before it gets around to accessing the next page.

Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..2e45dcdcd4b 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -617,7 +617,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		stream->advice_enabled ?
 		READ_BUFFERS_ISSUE_ADVICE : 0)))
 			{
-/* Fast return. 

Re: post-freeze damage control

2024-04-08 Thread Thomas Munro
On Tue, Apr 9, 2024 at 7:47 AM Robert Haas  wrote:
> - The streaming read API stuff was all committed very last minute. I
> think this should have been committed much sooner. It's probably not
> going to break the world; it's more likely to have performance
> consequences. But if it had gone in sooner, we'd have had more time to
> figure that out.

OK, let me give an update on this work stream (pun intended).

One reason for the delay in committing was precisely that we were
fretting about regressions risks.  We tried pretty hard to identify
and grind down every regression we could find, and cases with
outstanding not-fully-understood or examined problems in that area
have been booted into the next cycle for more work: streaming bitmap
heapscan, several streaming vacuum patches, and more, basically things
that seem to have more complex interactions with other machinery.  The
only three places using streaming I/O that went in were:

041b9680: Use streaming I/O in ANALYZE.
b7b0f3f2: Use streaming I/O in sequential scans.
3a352df0: Use streaming I/O in pg_prewarm.

The first is a good first exercise in streaming random blocks;
hopefully no one would be too upset about an unexpected small
regression in ANALYZE, but as it happens it goes faster hot and cold
according to all reports.  The second is a good first exercise in
streaming sequential blocks, and it ranges from faster to no
regression, according to testing and reports.  The third is less
important, but it also goes faster.

Of those, streaming seq scan is clearly the most relevant to real
workloads that someone might be upset about, and I made a couple of
choices that you might say had damage control in mind:

* A conservative choice not to get into the business of the issuing
new hints to the kernel for random jumps in cold scans, even though we
think we probably should for better performance: more research needed
precisely to avoid unexpected interactions (cf the booted bitmap
heapscan where that sort of thing seems to be afoot).
* A GUC to turn off I/O combining if it somehow upsets your storage in
ways we didn't foresee (io_combine_limit=1).

For fully cached hot scans, it does seem to be quite sensitive to tiny
changes in a hot code path that I and others spent a lot of time
optimising and testing during the CF.  Perhaps it is possible that
someone else's microarchitecture or compiler could show a regression
that I don't see, and I will certainly look into it with vim and
vigour if so.  In that case we could consider a tiny
micro-optimisation that I've shared already (it seemed a little novel
so I'd rather propose it in the new cycle if I can), or, if it comes
to it based on evidence and inability to address a problem quickly,
reverting just b7b0f3f2 which itself is a very small patch.

An aspect you didn't mention is correctness.  I don't actually know
how to prove that buffer manager protocols are correct beyond thinking
and torture testing, ie what kind of new test harness machinery could
be used to cross-check more things about buffer pool state explicitly,
and that is a weakness I'm planning to look into.

I realise that "these are the good ones, you should see all the stuff
we decided not to commit!" is not an argument, I'm just laying out how
I see the patches that went in and why I thought they were good.  It's
almost an architectural change, but done in tiny footsteps.  I
appreciate that people would have liked to see those particular tiny
footsteps in some of the other fine months available for patching the
tree, and some of the earlier underpinning patches that were part of
the same patch series did go in around New Year, but clearly my
"commit spreading" didn't go as well as planned after that (not helped
by Jan/Feb summer vacation season down here).

Mr Paquier this year announced his personal code freeze a few weeks
back on social media, which seemed like an interesting idea I might
adopt.  Perhaps that is what some other people are doing without
saying so, and perhaps the time they are using for that is the end of
the calendar year.  I might still be naturally inclined to crunch-like
behaviour, but it wouldn't be at the same time as everyone else,
except all the people who follow the same advice.




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 11:53 AM Melanie Plageman
 wrote:
> I've reviewed v6. I think you should mention in the docs that it only
> works for shared buffers -- so specifically not buffers containing
> blocks of temp tables.

Thanks for looking!  The whole pg_buffercache extension is for working
with shared buffers only, as mentioned at the top.  I have tried to
improve that paragraph though, as it only mentioned examining them.

> In the function pg_buffercache_invalidate(), why not use the
> BufferIsValid() function?
>
> -   if (buf < 1 || buf > NBuffers)
> +   if (!BufferIsValid(buf) || buf > NBuffers)

It doesn't check the range (it has assertions, not errors).

> I thought the below would be more clear for the comment above
> InvalidateUnpinnedBuffer().
>
> - * Returns true if the buffer was valid and it has now been made invalid.
> - * Returns false if the wasn't valid, or it couldn't be evicted due to a pin,
> - * or if the buffer becomes dirty again while we're trying to write it out.
> + * Returns true if the buffer was valid and has now been made invalid. 
> Returns
> + * false if it wasn't valid, if it couldn't be evicted due to a pin, or if 
> the
> + * buffer becomes dirty again while we're trying to write it out.

Fixed.

> Some of that probably applies for the docs too (i.e. you have some
> similar wording in the docs). There is actually one typo in your
> version, so even if you don't adopt my suggestion, you should fix that
> typo.

Yeah, thanks, improved similarly there.

> I didn't notice anything else out of place. I tried it and it worked
> as expected. I'm excited to have this feature!

Thanks!

> I didn't read through this whole thread, but was there any talk of
> adding other functions to let me invalidate a bunch of buffers at once
> or even some options -- like invalidate every 3rd buffer or whatever?
> (Not the concern of this patch, but just wondering because that would
> be a useful future enhancement IMO).

TBH I tried to resist people steering in that direction because you
can also just define a SQL function to do that built on this, and if
you had specialised functions they'd never be quite right.  IMHO we
succeeded in minimising the engineering and maximising flexibility,
'cause it's for hackers.  Crude, but already able to express a wide
range of stuff by punting the problem to SQL.
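
For example, bulk eviction is already expressible: at the SQL level you
can select bufferid values from the pg_buffercache view and call
pg_buffercache_evict() on each, and at the C level the new bufmgr
helper makes the brute-force version a two-liner (sketch only, and racy
by design, like everything here):

    /* Try to evict every unpinned shared buffer. */
    for (Buffer buf = 1; buf <= NBuffers; buf++)
        (void) EvictUnpinnedBuffer(buf);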

Thanks to Palak for the patch.  Pushed.




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 12:10 PM Andres Freund  wrote:
> On 2024-04-07 11:07:58 +1200, Thomas Munro wrote:
> > I thought of a better name for the bufmgr.c function though:
> > InvalidateUnpinnedBuffer().  That name seemed better to me after I
> > festooned it with warnings about why exactly it's inherently racy and
> > only for testing use.
>
> I still dislike that, fwiw, due to the naming similarity to
> InvalidateBuffer(), which throws away dirty buffer contents too. Which
> obviously isn't acceptable from "userspace".  I'd just name it
> pg_buffercache_evict() - given that the commit message's first paragraph uses
> "it is useful to be able to evict arbitrary blocks" that seems to describe
> things at least as well as "invalidate"?

Alright, sold.  I'll go with EvictUnpinnedBuffer() in bufmgr.c and
pg_buffercache_evict() in the contrib module.




Re: Use streaming read API in ANALYZE

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman
 wrote:
> On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> > On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:
> > > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman 
> > > Date: Sun, 7 Apr 2024 15:38:41 -0400
> > > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()
> > >
> > > A previous commit stopped using BlockSampler_HasMore() for flow control
> > > in acquire_sample_rows(). There seems little use now for
> > > BlockSampler_HasMore(). It should be sufficient to return
> > > InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
> > > would have returned false. Remove BlockSampler_HasMore().
> > >
> > > Author: Melanie Plageman, Nazir Bilal Yavuz
> > > Discussion: 
> > > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> >
> > The justification here seems somewhat odd. Sure, the previous commit stopped
> > using BlockSampler_HasMore in acquire_sample_rows - but only because it was
> > moved to block_sampling_streaming_read_next()?
>
> It didn't stop using it. It stopped being useful. The reason it existed,
> as far as I can tell, was to use it as the while() loop condition in
> acquire_sample_rows(). I think it makes much more sense for
> BlockSampler_Next() to return InvalidBlockNumber when there are no more
> blocks -- not to assert you don't call it when there aren't any more
> blocks.
>
> I didn't want to change BlockSampler_Next() in the same commit as the
> streaming read user and we can't remove BlockSampler_HasMore() without
> changing BlockSampler_Next().

I agree that the code looks useless if one condition implies the
other, but isn't it good to keep that cross-check, perhaps
reformulated as an assertion?  I didn't look too hard at the maths, I
just saw the words "It is not obvious that this code matches Knuth's
Algorithm S ..." and realised I'm not sure I have time to develop a
good opinion about this today.  So I'll leave the 0002 change out for
now, as it's a tidy-up that can easily be applied in the next cycle.
From c3b8df8e4720d8b0dfb4c892c0aa3ddaef8f401f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 8 Apr 2024 14:38:58 +1200
Subject: [PATCH v12] Remove obsolete BlockSampler_HasMore().

Commit 041b9680 stopped using BlockSampler_HasMore() for flow control in
acquire_sample_rows(). There seems to be little use for it now. We can
just return InvalidBlockNumber from BlockSampler_Next() when
BlockSample_HasMore() would have returned false.

Author: Melanie Plageman 
Author: Nazir Bilal Yavuz 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c|  4 +---
 src/backend/utils/misc/sampling.c | 10 +++---
 src/include/utils/sampling.h  |  1 -
 3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index da27a13a3f..e9fa3470cf 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -,9 +,7 @@ block_sampling_read_stream_next(ReadStream *stream,
 void *callback_private_data,
 void *per_buffer_data)
 {
-	BlockSamplerData *bs = callback_private_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(callback_private_data);
 }
 
 /*
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702..6e2bca9739 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,9 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;/* probability to skip block */
 	double		V;/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+	/* Return if no remaining blocks or no blocks to sample */
+	if (K <= 0 || k <= 0)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52ba..fb5d6820a2 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 	 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
-- 
2.44.0



Re: Use streaming read API in ANALYZE

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman
 wrote:
> On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> > >  src/backend/commands/analyze.c | 89 ++
> > >  1 file changed, 26 insertions(+), 63 deletions(-)
> >
> > That's a very nice demonstration of how this makes good prefetching 
> > easier...
>
> Agreed. Yay streaming read API and Bilal!

+1

I found a few comments to tweak, just a couple of places that hadn't
got the memo after we renamed "read stream", and an obsolete mention
of pinning buffers.  I adjusted those directly.

I ran some tests on a random basic Linux/ARM cloud box with a 7.6GB
table, and I got:

                                    cold    hot
master:                             9025ms  199ms
patched, io_combine_limit=1:        9025ms  191ms
patched, io_combine_limit=default:  8729ms  191ms

Despite being random, occasionally some I/Os must get merged, allowing
slightly better random throughput when accessing disk blocks through a
3000 IOPS drinking straw.  Looking at strace, I see 29144 pread* calls
instead of 30071, which fits that theory.  Let's see... if you roll a
fair 973452-sided dice 30071 times, how many times do you expect to
roll consecutive numbers?  Each time you roll there is a 1/973452
chance that you get the last number + 1, and we have 30071 tries
giving 30071/973452 = ~3%.  9025ms minus 3% is 8754ms.  Seems about
right.

I am not sure why the hot number is faster exactly.  (Anecdotally, I
did notice that in the cases that beat master semi-unexpectedly like
this, my software memory prefetch patch doesn't help or hurt, while in
some cases and on some CPUs there is little difference, and then that
patch seems to get a speed-up like this, which might be a clue.
*Shrug*, investigation needed.)

Pushed.  Thanks Bilal and reviewers!




Re: Streaming read-ready sequential scan code

2024-04-07 Thread Thomas Munro
On Sun, Apr 7, 2024 at 1:34 PM Melanie Plageman
 wrote:
> Attached v13 0001 is your fix and 0002 is a new version of the
> sequential scan streaming read user. Off-list Andres mentioned that I
> really ought to separate the parallel and serial sequential scan users
> into two different callbacks. I've done that in the attached. It
> actually makes the code used by the callbacks nicer and more readable
> anyway (putting aside performance). I was able to measure a small
> performance difference as well.

Thanks.  I changed a couple of very trivial things before pushing.

+BlockNumber (*cb) (ReadStream *pgsr, void *private_data,
+   void *per_buffer_data);

This type has a friendly name: ReadStreamBlockNumberCB.

+scan->rs_read_stream =
read_stream_begin_relation(READ_STREAM_SEQUENTIAL,

I've been on the fence about that flag for sequential scan...  Some
days I want to consider changing to READ_STREAM_DEFAULT and relying on
our "anti-heuristics" to suppress advice, which would work out the
same in most cases but might occasionally win big.  It might also
hurt, I dunno, so I suspect we'd have to make it better first, which
my patch in [1] was a first swing at, but I haven't researched that
enough.  So, kept this way!

- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.

Changed to:

 * Read the next block of the scan relation from the read stream and save it
 * in the scan descriptor.  It is already pinned.

+static BlockNumber
+heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data,
+void *per_buffer_data)

Changed argument names to match the function pointer type definition,
"stream" and "callback_private_data".

BTW looking at the branching in read-stream user patches that have an
initialisation step like yours, I wonder if it might ever make sense
to be able to change the callback on the fly from inside the callback,
so that you finish up with a branchless one doing most of the work.  I
have no idea if it's worth it...
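
To sketch the idea: hypothetically, with a read_stream_set_callback()
(which doesn't exist today) and a made-up MyScan struct and
my_scan_initial_block(), both invented for illustration, it could look
something like:

static BlockNumber
my_scan_next_block(ReadStream *stream, void *callback_private_data,
				   void *per_buffer_data)
{
	MyScan	   *scan = (MyScan *) callback_private_data;

	/* Branchless steady state: just advance until we run out. */
	if (scan->next_block >= scan->nblocks)
		return InvalidBlockNumber;
	return scan->next_block++;
}

static BlockNumber
my_scan_first_block(ReadStream *stream, void *callback_private_data,
					void *per_buffer_data)
{
	MyScan	   *scan = (MyScan *) callback_private_data;

	/* One-time initialisation, then switch to the branchless callback. */
	scan->next_block = my_scan_initial_block(scan);
	read_stream_set_callback(stream, my_scan_next_block);
	return my_scan_next_block(stream, callback_private_data,
							  per_buffer_data);
}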

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGLLFvou5rx5FDhm-Pc9r4STQTFFmrx6SUV%2Bvk8fwMbreA%40mail.gmail.com




Re: Streaming read-ready sequential scan code

2024-04-06 Thread Thomas Munro
I found a bug in read_stream.c that could be hit with Melanie's
streaming seq scan patch with parallelism enabled and certain buffer
pool conditions.  Short version: there is an edge case where an "if"
needed to be a "while", or we could lose a few blocks.  Here's the fix
for that, longer explanation in commit message.
From 43cef2d58141ba048e9349b0027afff148be5553 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 7 Apr 2024 12:36:44 +1200
Subject: [PATCH] Fix bug in read_stream.c.

When we determine that a wanted block can't be combined with the current
pending read, it's time to start that pending read to get it out of the
way.  An "if" in that code path should have been a "while", because it
might take more than one go to get that job done.  Otherwise the
remaining part of a partially started read could be clobbered and we
could lose some blocks.  This was only broken for smaller ranges, as the
more common case of io_combine_limit-sized ranges is handled earlier in
the code and knows how to loop.

Discovered while testing parallel sequential scans of partially cached
tables.  They have a ramp-down phase with ever smaller ranges of
contiguous blocks, to be fair to parallel workers as the work runs out.

Defect in commit b5a9b18c.
---
 src/backend/storage/aio/read_stream.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 9a70a81f7ae..f54dacdd914 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -363,7 +363,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		}
 
 		/* We have to start the pending read before we can build another. */
-		if (stream->pending_read_nblocks > 0)
+		while (stream->pending_read_nblocks > 0)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
-- 
2.44.0



Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-06 Thread Thomas Munro
On second thoughts, I think the original "invalidate" terminology was
fine, no need to invent a new term.

I thought of a better name for the bufmgr.c function though:
InvalidateUnpinnedBuffer().  That name seemed better to me after I
festooned it with warnings about why exactly it's inherently racy and
only for testing use.

I suppose someone could propose an additional function
pg_buffercache_invalidate(db, tbspc, rel, fork, blocknum) that would
be slightly better in the sense that it couldn't accidentally evict
some innocent block that happened to replace the real target just
before it runs, but I don't think it matters much for this purpose and
it would still be racy on return (vacuum decides to load your block
back in) so I don't think it's worth bothering with.

So this is the version I plan to commit.
From 6a13349b788c8539b2d349f9553706d7c563f8f8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 7 Apr 2024 09:13:17 +1200
Subject: [PATCH v6] Add pg_buffercache_invalidate() function for testing.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When testing buffer pool logic, it is useful to be able to evict
arbitrary blocks.  This function can be used in SQL queries over the
pg_buffercache view to set up a wide range of buffer pool states.  Of
course, buffer mappings might change concurrently so you might evict a
block other than the one you had in mind, and another session might
bring it back in at any time.  That's OK for the intended purpose of
setting up developer testing scenarios, and more complicated interlocking
schemes to give stronger guarantees about that would likely be less
flexible for actual testing work anyway.  Superuser-only.

Author: Palak Chaturvedi 
Author: Thomas Munro  (docs, small tweaks)
Reviewed-by: Nitin Jadhav 
Reviewed-by: Andres Freund 
Reviewed-by: Cary Huang 
Reviewed-by: Cédric Villemain 
Reviewed-by: Jim Nasby 
Reviewed-by: Maxim Orlov 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/calfch19pw48zwwzuorspsav9hqt0upyabpc4boz4w+c7ff5...@mail.gmail.com
---
 contrib/pg_buffercache/Makefile   |  2 +-
 contrib/pg_buffercache/meson.build|  1 +
 .../pg_buffercache--1.4--1.5.sql  |  6 ++
 contrib/pg_buffercache/pg_buffercache.control |  2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c | 20 ++
 doc/src/sgml/pgbuffercache.sgml   | 37 +--
 src/backend/storage/buffer/bufmgr.c   | 63 +++
 src/include/storage/bufmgr.h  |  2 +
 8 files changed, 125 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql

diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index d6b58d4da9..eae65ead9e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,7 @@ OBJS = \
 EXTENSION = pg_buffercache
 DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
 	pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
-	pg_buffercache--1.3--1.4.sql
+	pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
 REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index c86e33cc95..1ca3452918 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -22,6 +22,7 @@ install_data(
   'pg_buffercache--1.2--1.3.sql',
   'pg_buffercache--1.2.sql',
   'pg_buffercache--1.3--1.4.sql',
+  'pg_buffercache--1.4--1.5.sql',
   'pg_buffercache.control',
   kwargs: contrib_data_args,
 )
diff --git a/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql
new file mode 100644
index 00..92c530bc19
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql
@@ -0,0 +1,6 @@
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.5'" to load this file. \quit
+
+CREATE FUNCTION pg_buffercache_invalidate(IN int)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
+LANGUAGE C PARALLEL SAFE VOLATILE STRICT;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index a82ae5f9bb..5ee875f77d 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
 # pg_buffercache extension
 comment = 'examine the shared buffer cache'
-default_version = '1.4'
+default_version = '1.5'
 module_pathname = '$libdir/pg_buffercache'
 relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3316732365..9617bfa47b 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -63,6 +63,7 @@ typedef struct
 PG_FUNCTION_INFO_V1(pg_buffercach

Re: LogwrtResult contended spinlock

2024-04-05 Thread Thomas Munro
On Sat, Apr 6, 2024 at 6:55 AM Alvaro Herrera  wrote:
> Pushed 0001.

Could that be related to the 3 failures on parula that look like this?

TRAP: failed Assert("node->next == 0 && node->prev == 0"), File:
"../../../../src/include/storage/proclist.h", Line: 63, PID: 29119
2024-04-05 16:16:26.812 UTC [29114:15] pg_regress/drop_operator LOG:
statement: DROP OPERATOR <|(bigint, bigint);
postgres: postgres regression [local] CREATE
ROLE(ExceptionalCondition+0x4c)[0x9b3fdc]
postgres: postgres regression [local] CREATE ROLE[0x8529e4]
postgres: postgres regression [local] CREATE
ROLE(LWLockWaitForVar+0xec)[0x8538fc]
postgres: postgres regression [local] CREATE ROLE[0x54c7d4]
postgres: postgres regression [local] CREATE ROLE(XLogFlush+0xf0)[0x552600]
postgres: postgres regression [local] CREATE ROLE[0x54a9b0]
postgres: postgres regression [local] CREATE ROLE[0x54bbdc]

Hmm, the comments for LWLockWaitForVar say:

 * Be aware that LWLockConflictsWithVar() does not include a memory barrier,
 * hence the caller of this function may want to rely on an explicit barrier or
 * an implied barrier via spinlock or LWLock to avoid memory ordering issues.

But that seems to be more likely to make LWLockWaitForVar suffer data
races (ie hang), not break assertions about LWLock sanity, so I don't
know what's going on there.  I happened to have a shell on a Graviton
box, but I couldn't reproduce it after a while...




Re: Streaming read-ready sequential scan code

2024-04-05 Thread Thomas Munro
On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman
 wrote:
> On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro  wrote:
> > The interesting column is hot.  The 200ms->211ms regression is due to
> > the extra bookkeeping in the slow path.  The rejiggered fastpath code
> > fixes it for me, or maybe sometimes shows an extra 1ms.  Phew.  Can
> > you reproduce that?
>
> I am able to reproduce the fast path solving the issue using Heikki's
> example here [1] but in shared buffers (hot).
>
> master:   25 ms
> stream read:   29 ms
> stream read + fast path: 25 ms

Great, thanks.

> I haven't looked into or reviewed the memory prefetching part.
>
> While reviewing 0002, I realized that I don't quite see how
> read_stream_get_block() will be used in the fastpath -- which it
> claims in its comments.

Comments improved.

> Oh and why does READ_STREAM_DISABLE_FAST_PATH macro exist?

Just for testing purposes.  Behaviour should be identical to external
observers either way, it's just a hand-rolled specialisation for
certain parameters, and it's useful to be able to verify that and
measure the speedup.  I think it's OK to leave a bit of
developer/testing scaffolding in the tree if it's likely to be useful
again and especially if like this case it doesn't create any dead
code.  (Perhaps in later work we might find the right way to get the
compiler to do the specialisation?  It's so simple though.)

The occasional CI problem I mentioned turned out to be
read_stream_reset() remembering a little too much state across
rescans.  Fixed.

Thanks both for the feedback on the ring buffer tweaks.  Comments
updated.  Here is the full stack of patches I would like to commit
very soon, though I may leave the memory prefetching part out for a
bit longer to see if I can find any downside.
From 54efd755040507b55672092907d53b4db30f5a06 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 5 Apr 2024 12:08:24 +1300
Subject: [PATCH v12 1/6] Improve read_stream.c's fast path.

Unfortunately the "fast path" for cached scans that don't do any I/O was
accidentally coded in a way that could only be triggered by pg_prewarm's
usage pattern.  We really need it to work for the streaming sequential
scan patch.  Refactor.

Reviewed-by: Melanie Plageman 
Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 75 +++
 1 file changed, 31 insertions(+), 44 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5e..b9e11a28312 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -169,7 +169,7 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 /*
  * Ask the callback which block it would like us to read next, with a small
  * buffer in front to allow read_stream_unget_block() to work and to allow the
- * fast path to work in batches.
+ * fast path to skip this function and work directly from the array.
  */
 static inline BlockNumber
 read_stream_get_block(ReadStream *stream, void *per_buffer_data)
@@ -578,13 +578,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	if (likely(stream->fast_path))
 	{
 		BlockNumber next_blocknum;
-		bool		need_wait;
 
 		/* Fast path assumptions. */
 		Assert(stream->ios_in_progress == 0);
 		Assert(stream->pinned_buffers == 1);
 		Assert(stream->distance == 1);
-		Assert(stream->pending_read_nblocks == 1);
+		Assert(stream->pending_read_nblocks == 0);
 		Assert(stream->per_buffer_data_size == 0);
 
 		/* We're going to return the buffer we pinned last time. */
@@ -594,40 +593,29 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		buffer = stream->buffers[oldest_buffer_index];
 		Assert(buffer != InvalidBuffer);
 
-		/*
-		 * Pin a buffer for the next call.  Same buffer entry, and arbitrary
-		 * I/O entry (they're all free).
-		 */
-		need_wait = StartReadBuffer(&stream->ios[0].op,
-									&stream->buffers[oldest_buffer_index],
-									stream->pending_read_blocknum,
-									stream->advice_enabled ?
-									READ_BUFFERS_ISSUE_ADVICE : 0);
-
-		/* Choose the block the next call will pin. */
+		/* Choose the next block to pin. */
 		if (unlikely(stream->blocknums_next == stream->blocknums_count))
 			read_stream_fill_blocknums(stream);
 		next_blocknum = stream->blocknums[stream->blocknums_next++];
 
-		/*
-		 * Fast return if the next call doesn't require I/O for the buffer we
-		 * just pinned, and we have a block number to give it as a pending
-		 * read.
-		 */
-		if (likely(!need_wait && next_blocknum != InvalidBlockNumber))
+		if (likely(next_blocknum != InvalidBlockNumber))
 		{
-			stream->pending_read_blocknum = next_blocknum;
-			return buffer;
-		}
-
-		/*
-		 * For an

Re: broken JIT support on Fedora 40

2024-04-05 Thread Thomas Munro
On Sun, Mar 31, 2024 at 12:49 PM Thomas Munro  wrote:
> https://github.com/llvm/llvm-project/pull/87093

Oh, with those clues, I think I might see...  It is a bit strange that
we copy attributes from AttributeTemplate(), a function that returns
Datum, to our void deform function.  It works (I mean doesn't crash)
if you just comment this line out:

llvm_copy_attributes(AttributeTemplate, v_deform_fn);

... but I guess that disables inlining of the deform function?  So
perhaps we just need to teach that thing not to try to copy the return
value's attributes, which also seems to work here:

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index ec0fdd49324..92b4993a98a 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -552,8 +552,11 @@ llvm_copy_attributes(LLVMValueRef v_from,
LLVMValueRef v_to)
/* copy function attributes */
llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeFunctionIndex);

-   /* and the return value attributes */
-   llvm_copy_attributes_at_index(v_from, v_to, LLVMAttributeReturnIndex);
+   if (LLVMGetTypeKind(LLVMGetFunctionReturnType(v_to)) !=
LLVMVoidTypeKind)
+   {
+   /* and the return value attributes */
+   llvm_copy_attributes_at_index(v_from, v_to,
LLVMAttributeReturnIndex);
+   }




Re: Streaming read-ready sequential scan code

2024-04-04 Thread Thomas Munro
Yeah, I plead benchmarking myopia, sorry.  The fastpath as committed
is only reached when distance goes 2->1, as pg_prewarm does.  Oops.
With the attached minor rearrangement, it works fine.  I also poked
some more at that memory prefetcher.  Here are the numbers I got on a
desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled,
cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NMVe,
GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t
select g, repeat('x', 900) from generate_series(1, 56) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                  cold    warm   hot
master                            2479ms  886ms  200ms
seqscan                           2498ms  716ms  211ms <-- regression
seqscan + fastpath                2493ms  711ms  200ms <-- fixed, I think?
seqscan + memprefetch             2499ms  716ms  182ms
seqscan + fastpath + memprefetch  2505ms  710ms  170ms <-- \O/

Cold has no difference.  That's just my disk demonstrating Linux RA at
128kB (default); random I/O is obviously a more interesting story.
It's consistently a smidgen faster with Linux RA set to 2MB (as in
blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect
probably also increases on fancier faster storage than what I have on
hand:

                                  cold
master                            1775ms
seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data
kernel->userspace).

The interesting column is hot.  The 200ms->211ms regression is due to
the extra bookkeeping in the slow path.  The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms.  Phew.  Can
you reproduce that?

The memory prefetching trick, on top of that, seems to be a good
optimisation so far.  Note that that's not an entirely independent
trick, it's something we can only do now that we can see into the
future; it's the next level up of prefetching, worth doing around 60ns
before you need the data I guess.  Who knows how thrashed the cache
might be before the caller gets around to accessing that page, but
there doesn't seem to be much of a cost or downside to this bet.  We
know there are many more opportunities like that[1] but I don't want
to second-guess the AM here, I'm just betting that the caller is going
to look at the header.

Unfortunately there seems to be a subtle bug hiding somewhere in here,
visible on macOS on CI.  Looking into that, going to find my Mac...

[1] 
https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com
From 74b8cde45a8babcec7b52b06bdb8ea046a0a966f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 5 Apr 2024 13:32:14 +1300
Subject: [PATCH v10 1/4] Use streaming I/O in heapam sequential scan.


---
 src/backend/access/heap/heapam.c | 100 +--
 src/include/access/heapam.h  |  15 +
 2 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dada2ecd1e3..f7946a39fd9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+		   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+		   scan->rs_prefetch_block,
+		   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan)
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next b

Re: Built-in CTYPE provider

2024-04-04 Thread Thomas Munro
Hi,

+command_ok(
+   [
+   'initdb', '--no-sync',
+   '--locale-provider=builtin', '-E UTF-8',
+   '--builtin-locale=C.UTF-8', "$tempdir/data8"
+   ],
+   'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');

This Sun animal recently turned on --enable-tap-tests, and that ↑ failed[1]:

# Running: initdb --no-sync --locale-provider=builtin -E UTF-8
--builtin-locale=C.UTF-8
/home/marcel/build-farm-15/buildroot/HEAD/pgsql.build/src/bin/initdb/tmp_check/tmp_test_XvK1/data8
The files belonging to this database system will be owned by user "marcel".
This user must also own the server process.

The database cluster will be initialized with this locale configuration:
  locale provider:   builtin
  default collation: C.UTF-8
  LC_COLLATE:  en_US
  LC_CTYPE:en_US
  LC_MESSAGES: C
  LC_MONETARY: en_US
  LC_NUMERIC:  en_US
  LC_TIME: en_US
initdb: error: encoding mismatch
initdb: detail: The encoding you selected (UTF8) and the encoding that
the selected locale uses (LATIN1) do not match. This would lead to
misbehavior in various character string processing functions.
initdb: hint: Rerun initdb and either do not specify an encoding
explicitly, or choose a matching combination.
[14:04:12.462](0.036s) not ok 28 - locale provider builtin with -E
UTF-8 --builtin-locale=C.UTF-8

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-04-04%2011%3A42%3A40




Re: Streaming read-ready sequential scan code

2024-04-04 Thread Thomas Munro
On Fri, Apr 5, 2024 at 4:20 AM Melanie Plageman
 wrote:
> So, sequential scan does not have per-buffer data. I did some logging
> and the reason most fully-in-SB sequential scans don't use the fast
> path is because read_stream->pending_read_nblocks is always 0.

Hnghghghgh... right, sorry I guessed the wrong reason, it turns out
that I made a fast path just a little too specialised for pg_prewarm.
Working on it...




Re: Streaming read-ready sequential scan code

2024-04-04 Thread Thomas Munro
On Thu, Apr 4, 2024 at 8:02 PM David Rowley  wrote:
> 3a4a3537a
> latency average = 34.497 ms
> latency average = 34.538 ms
>
> 3a4a3537a + read_stream_for_seqscans.patch
> latency average = 40.923 ms
> latency average = 41.415 ms
>
> i.e. no meaningful change from the refactor, but a regression from a
> cached workload that changes the page often without doing much work in
> between with the read stream patch.

I ran Heikki's test except I ran the "insert" 4 times to get a table
of 4376MB according to \d+.  On my random cloud ARM server (SB=8GB,
huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms
when the data is in Linux cache and PG is not prewarmed, roughly as he
reported.  Good.

If I pg_prewarm first, I see that slowdown 260ms -> 294ms.  Trying
things out to see what works, I got that down to 243ms (ie beat
master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
 
+	__builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to
more than make up the difference in this case, and it may be
questionable whether that cache line will survive long enough to help
you, but this one-tuple-per-page test likes it...  Hmm, to get a more
realistic table than the one-tuple-per-page one, I tried doubling a
tenk1 table until it reached 2759MB and then I got a post-prewarm
regression 702ms -> 721ms, and again I can beat master by memory
prefetching: 689ms.

Annoyingly, with the same table I see no difference between the actual
pg_prewarm('giga') time: around 155ms for both.  pg_prewarm is able to
use the 'fast path' I made pretty much just to be able to minimise
regression in that (probably stupid) all-cached test that doesn't even
look at the page contents.  Unfortunately seq scan can't use it
because it has per-buffer data, which is one of the things it can't do
(because of a space management issue).  Maybe I should try to find a
way to fix that.

> I'm happy to run further benchmarks, but for the remainder of the
> committer responsibility for the next patches, I'm going to leave that
> to Thomas.

Thanks!




Re: Streaming read-ready sequential scan code

2024-04-04 Thread Thomas Munro
On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro  wrote:
> Alright what about this?

Forgot to git add a change, new version.
From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 4 Apr 2024 20:31:26 +1300
Subject: [PATCH v2 1/2] Allow BufferAccessStrategy to limit pin count.

When pinning extra buffers to look ahead, users of a strategy are in
danger of pinning a lot of the buffers in the ring, or even more than
the ring size.  For some strategies, that means "escaping" from the
ring, and in others it means forcing dirty data to disk very frequently
with associated WAL flushing.  Since external code has no insight into
any of that, allow individual strategy types to expose a clamp that
should be applied when deciding how many buffers to pin at once.

Discussion: 
https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c |  5 
 src/backend/storage/buffer/freelist.c | 35 +++
 src/include/storage/bufmgr.h  |  1 +
 3 files changed, 41 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5..eab87f6f73 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -419,6 +419,7 @@ read_stream_begin_relation(int flags,
size_t  size;
int16   queue_size;
int16   max_ios;
+   int strategy_pin_limit;
uint32  max_pinned_buffers;
Oid tablespace_id;
SMgrRelation smgr;
@@ -460,6 +461,10 @@ read_stream_begin_relation(int flags,
 	max_pinned_buffers = Min(max_pinned_buffers,
 							 PG_INT16_MAX - io_combine_limit - 1);
 
+   /* Give the strategy a chance to limit the number of buffers we pin. */
+   strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+   max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers);
+
/* Don't allow this backend to pin more than its share of buffers. */
if (SmgrIsTemp(smgr))
 		LimitAdditionalLocalPins(&max_pinned_buffers);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3611357fa3..c69590d6d8 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy)
return strategy->nbuffers;
 }
 
+/*
+ * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned
+ *
+ * Strategies can specify the maximum number of buffers that a user should pin
+ * at once when performing look-ahead.  Callers should combine this number with
+ * other relevant limits and take the minimum.
+ */
+int
+GetAccessStrategyPinLimit(BufferAccessStrategy strategy)
+{
+	if (strategy == NULL)
+		return NBuffers;
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKREAD:
+
+			/*
+			 * Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers
+			 * shouldn't be a problem and the caller is free to pin up to the
+			 * entire ring at once.
+			 */
+			return strategy->nbuffers;
+
+		default:
+
+			/*
+			 * Tell caller not to pin more than half the buffers in the ring.
+			 * This is a trade-off between look ahead distance and deferring
+			 * writeback and associated WAL traffic.
+			 */
+			return strategy->nbuffers / 2;
+	}
+}
+
 /*
  * FreeAccessStrategy -- release a BufferAccessStrategy object
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f380f9d9a6..07ba1a6050 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype,
 													   int ring_size_kb);
 extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy);
+extern int GetAccessStrategyPinLimit(BufferAccessStrategy strategy);
 
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 
-- 
2.39.3 (Apple Git-146)

From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 4 Apr 2024 21:11:06 +1300
Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB.

The BAS_VACUUM ring size has been 256kB since commit d526575f.  Commit
1cbbee03 made it configurable but retaine

Re: Streaming read-ready sequential scan code

2024-04-04 Thread Thomas Munro
On Thu, Apr 4, 2024 at 11:13 AM Andres Freund  wrote:
> The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so,
> should we apply that only if the ring the strategy doesn't use the
> StrategyRejectBuffer() logic?

Hmm, I don't really know, but that sounds plausible.  What do you
think about the attached?

> I think for VACUUM we should probably go a bit further. There's no comparable
> L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more
> expensive, compared to a seqscan. I'd go or at lest 4x-8x.

Alright what about this?
From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 4 Apr 2024 20:31:26 +1300
Subject: [PATCH 1/2] Allow BufferAccessStrategy to limit pin count.

When pinning extra buffers to look ahead, users of a strategy are in
danger of pinning a lot of the buffers in the ring, or even more than
the ring size.  For some strategies, that means "escaping" from the
ring, and in others it means forcing dirty data to disk very frequently
with associated WAL flushing.  Since external code has no insight into
any of that, allow individual strategy types to expose a clamp that
should be applied when deciding how many buffers to pin at once.

Discussion: 
https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c |  5 
 src/backend/storage/buffer/freelist.c | 35 +++
 src/include/storage/bufmgr.h  |  1 +
 3 files changed, 41 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5..eab87f6f73 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -419,6 +419,7 @@ read_stream_begin_relation(int flags,
size_t  size;
int16   queue_size;
int16   max_ios;
+   int strategy_pin_limit;
uint32  max_pinned_buffers;
Oid tablespace_id;
SMgrRelation smgr;
@@ -460,6 +461,10 @@ read_stream_begin_relation(int flags,
 	max_pinned_buffers = Min(max_pinned_buffers,
 							 PG_INT16_MAX - io_combine_limit - 1);
 
+   /* Give the strategy a chance to limit the number of buffers we pin. */
+   strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+   max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers);
+
/* Don't allow this backend to pin more than its share of buffers. */
if (SmgrIsTemp(smgr))
 		LimitAdditionalLocalPins(&max_pinned_buffers);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3611357fa3..c69590d6d8 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy)
return strategy->nbuffers;
 }
 
+/*
+ * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned
+ *
+ * Strategies can specify the maximum number of buffers that a user should pin
+ * at once when performing look-ahead.  Callers should combine this number with
+ * other relevant limits and take the minimum.
+ */
+int
+GetAccessStrategyPinLimit(BufferAccessStrategy strategy)
+{
+	if (strategy == NULL)
+		return NBuffers;
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKREAD:
+
+			/*
+			 * Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers
+			 * shouldn't be a problem and the caller is free to pin up to the
+			 * entire ring at once.
+			 */
+			return strategy->nbuffers;
+
+		default:
+
+			/*
+			 * Tell caller not to pin more than half the buffers in the ring.
+			 * This is a trade-off between look ahead distance and deferring
+			 * writeback and associated WAL traffic.
+			 */
+			return strategy->nbuffers / 2;
+	}
+}
+
 /*
  * FreeAccessStrategy -- release a BufferAccessStrategy object
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f380f9d9a6..07ba1a6050 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype,
 													   int ring_size_kb);
 extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy);
+extern int GetAcc

WIP: Vectored writeback

2024-04-04 Thread Thomas Munro
Hi,

Here are some vectored writeback patches I worked on in the 17 cycle
and posted as part of various patch sets, but didn't get into a good
enough shape to take further.  They "push" vectored writes out, but I
think what they need is to be turned inside out and converted into
users of a new hypothetical write_stream.c, so that we have a model
that will survive contact with asynchronous I/O and would "pull"
writes from a stream that controls I/O concurrency.  That all seemed a
lot less urgent to work on than reads, hence leaving on ice for now.
There is a lot of code that reads, and a small finite amount that
writes.  I think the patches show some aspects of the problem-space
though, and they certainly make checkpointing faster.  They cover 2
out of 5ish ways we write relation data: checkpointing, and strategies
AKA ring buffers.
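
To make the shape concrete, here is a hypothetical write_stream.h by
analogy with read_stream.h -- none of this exists, it's just the
direction implied above, with names invented for illustration:

/* Hypothetical interface, modelled on read_stream.h. */
typedef struct WriteStream WriteStream;

/* Callback supplies the next dirty buffer to write, or InvalidBuffer. */
typedef Buffer (*WriteStreamBufferCB) (WriteStream *stream,
									   void *callback_private_data);

extern WriteStream *write_stream_begin_relation(int flags,
												BufferAccessStrategy strategy,
												Relation rel,
												ForkNumber forknum,
												WriteStreamBufferCB callback,
												void *callback_private_data);

/* Pull buffers from the callback, combining writes and capping I/O depth. */
extern void write_stream_run(WriteStream *stream);
extern void write_stream_end(WriteStream *stream);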

They make checkpoints look like this, respecting io_combine_limit,
instead of lots of 8kB writes:

pwritev(9,[...],2,0x0) = 131072 (0x20000)
pwrite(9,...,131072,0x20000) = 131072 (0x20000)
pwrite(9,...,131072,0x40000) = 131072 (0x20000)
pwrite(9,...,131072,0x60000) = 131072 (0x20000)
pwrite(9,...,131072,0x80000) = 131072 (0x20000)
...

Two more ways data gets written back are: bgwriter and regular
BAS_NORMAL buffer eviction, but they are not such natural candidates
for write combining.  Well, if you know you're going to write out a
buffer, *maybe* it's worth probing the buffer pool to see if adjacent
block numbers are also present and dirty?  I don't know.  Before and
after?  Or maybe it's better to wait for the tree-based mapping table
of legend first so it becomes cheaper to navigate in block number
order.
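
Just to sketch what such a probe could look like -- illustrative only,
glossing over pinning, and the dirty check is deliberately an unlocked
(racy) peek that could only ever be treated as a hint:

/* Assumes storage/buf_internals.h.  Is block tag.blockNum + 1 dirty? */
static bool
neighbour_is_dirty(BufferTag tag)
{
	uint32		hash;
	LWLock	   *partition_lock;
	int			buf_id;
	bool		dirty = false;

	tag.blockNum++;				/* probe the next block number */
	hash = BufTableHashCode(&tag);
	partition_lock = BufMappingPartitionLock(hash);

	LWLockAcquire(partition_lock, LW_SHARED);
	buf_id = BufTableLookup(&tag, hash);
	if (buf_id >= 0)
	{
		BufferDesc *hdr = GetBufferDescriptor(buf_id);

		/* Unlocked read: the answer can change immediately, hint only. */
		dirty = (pg_atomic_read_u32(&hdr->state) & BM_DIRTY) != 0;
	}
	LWLockRelease(partition_lock);

	return dirty;
}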

The 5th way is raw file copy that doesn't go through the buffer pool,
such as CREATE DATABASE ... STRATEGY=FILE_COPY, which already works
with big writes, and CREATE INDEX via bulk_write.c which is easily
converted to vectored writes, and I plan to push the patches for that
shortly.  I think those should ultimately become stream-based too.

Anyway, I wanted to share these uncommitfest patches, having rebased
them over relevant recent commits, so I could leave them in working
state in case anyone is interested in this file I/O-level stuff...
From c6d227678c586387a49c30c4f9a61f62c9b04b1c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 13 Mar 2024 17:02:42 +1300
Subject: [PATCH v5 1/3] Provide vectored variant of FlushBuffer().

FlushBuffers() is just like FlushBuffer() except that it tries to write
out multiple consecutive disk blocks at a time with smgrwritev().  The
traditional simple FlushBuffer() call is now implemented in terms of it.
---
 src/backend/storage/buffer/bufmgr.c | 240 
 src/include/storage/buf_internals.h |  10 ++
 2 files changed, 185 insertions(+), 65 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..a815e58e307 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -57,6 +57,7 @@
 #include "storage/smgr.h"
 #include "storage/standby.h"
 #include "utils/memdebug.h"
+#include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/resowner.h"
@@ -526,6 +527,8 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 		IOObject io_object, IOContext io_context);
+static int	FlushBuffers(BufferDesc **bufs, int nbuffers, SMgrRelation reln,
+		 IOObject io_object, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 	   ForkNumber forkNum,
 	   BlockNumber nForkBlock,
@@ -3705,9 +3708,19 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
 	*blknum = bufHdr->tag.blockNum;
 }
 
+struct shared_buffer_write_error_info
+{
+	BufferDesc *buf;
+	int			nblocks;
+};
+
 /*
- * FlushBuffer
- *		Physically write out a shared buffer.
+ * FlushBuffers
+ *		Physically write out shared buffers.
+ *
+ * The buffers do not have to be consecutive in memory but must refer to
+ * consecutive blocks of the same relation fork in increasing order of block
+ * number.
  *
  * NOTE: this actually just passes the buffer contents to the kernel; the
  * real write to disk won't happen until the kernel feels like it.  This
@@ -3715,66 +3728,149 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  * However, we will need to force the changes to disk via fsync before
  * we can checkpoint WAL.
  *
- * The caller must hold a pin on the buffer and have share-locked the
+ * The caller must hold a pin on the buffers and have share-locked the
  * buffer contents.  (Note: a share-lock does not prevent updates of
  * hint bits in the buffer, so the page could change while t

Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-03 Thread Thomas Munro
On Fri, Mar 8, 2024 at 6:20 AM Maxim Orlov  wrote:
> Quite an interesting patch, in my opinion.  I've decided to work on it a bit, 
> did some refactoring (sorry) and add
> basic tests. Also, I try to take into account as much as possible notes on 
> the patch, mentioned by Cédric Villemain.

Thanks!  Unfortunately I don't think it's possible to include a
regression test that looks at the output, because it'd be
non-deterministic.  Any other backend could pin or dirty the buffer
you try to evict, changing the behaviour.

> > and maybe better to go with FlushOneBuffer() ?
> It's a good idea, but I'm not sure at the moment.  I'll try to dig some 
> deeper into it.  At least, FlushOneBuffer does
> not work for a local buffers.  So, we have to decide whatever 
> pg_buffercache_invalidate should or should not
> work for local buffers.  For now, I don't see why it should not.  Maybe I 
> miss something?

I think it's OK to ignore local buffers for now.  pg_buffercache
generally doesn't support/show them so I don't feel inclined to
support them for this.  I removed a few traces of local support.

It didn't seem appropriate to use the pg_monitor role for this, so I
made it superuser-only.  I don't think it makes much sense to use this
on any kind of production system so I don't think we need a new role
for it, and existing roles don't seem too appropriate.  pageinspect et
al use the same approach.

I added a VOLATILE qualifier to the function.

I added some documentation.

I changed the name to pg_buffercache_evict().

I got rid of the 'force' flag which was used to say 'I only want to
evict this buffer if it is clean'.  I don't really see the point in that,
we might as well keep it simple.  You could filter buffers on
"isdirty" if you want.

I added comments to scare anyone off using EvictBuffer() for anything
much, and marking it as something for developer convenience.  (I am
aware of an experimental patch that uses this same function as part of
a buffer pool resizing operation, but that has other infrastructure to
make that safe and would adjust those remarks accordingly.)

I wondered whether it should really be testing for  BM_TAG_VALID
rather than BM_VALID.  Arguably, but it doesn't seem important for
now.  The distinction would arise if someone had tried to read in a
buffer, got an I/O error and abandoned ship, leaving a buffer with a
valid tag but not valid contents.  Anyone who tries to ReadBuffer() it
will then try to read it again, but in the meantime this function
won't be able to evict it (it'll just return false).  Doesn't seem
that obvious to me that this obscure case needs to be handled.  That
doesn't happen *during* a non-error case, because then it's pinned and
we already return false in this code for pins.

I contemplated whether InvalidateBuffer() or InvalidateVictimBuffer()
would be better here and realised that Andres's intuition was probably
right when he suggested the latter up-thread.  It is designed with the
right sort of arbitrary concurrent activity in mind, where the former
assumes things about locking and dropping, which could get us into
trouble if not now maybe in the future.

I ran the following diabolical buffer blaster loop while repeatedly
running installcheck:

do
$$
begin
  loop
    perform pg_buffercache_evict(bufferid)
      from pg_buffercache
     where random() <= 0.25;
  end loop;
end;
$$;

The only ill-effect was a hot laptop.

Thoughts, objections, etc?

Very simple example of use:

create or replace function uncache_relation(name text)
returns boolean
begin atomic
  select bool_and(pg_buffercache_evict(bufferid))
    from pg_buffercache
   where reldatabase = (select oid
                          from pg_database
                         where datname = current_database())
     and relfilenode = pg_relation_filenode(name);
end;

More interesting for those of us hacking on streaming I/O stuff was
the ability to evict just parts of things and see how the I/O merging
and I/O depth react.


v5-0001-Add-pg_buffercache_evict-function-for-testing.patch
Description: Binary data


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-03 Thread Thomas Munro
On Thu, Apr 4, 2024 at 11:51 AM Peter Eisentraut  wrote:
> On 30.03.24 22:27, Thomas Munro wrote:
> > Hmm, OK so it doesn't have 3 available in parallel from base repos.
> > But it's also about to reach end of "full support" in 2 months[1], so
> > if we applied the policies we discussed in the LLVM-vacuuming thread
> > (to wit: build farm - EOL'd OSes), then...  One question I'm unclear
> > on is whether v17 will be packaged for RHEL8.
>
> The rest of the thread talks about the end of support of RHEL 7, but you
> are here talking about RHEL 8.   It is true that "full support" for RHEL
> 8 ended in May 2024, but that is the not the one we are tracking.  We
> are tracking the 10-year one, which I suppose is now called "maintenance
> support".

I might have confused myself with the two EOLs and some wishful
thinking.  I am a lot less worked up about this general topic now that
RHEL has moved to "rolling" LLVM updates in minor releases, removing a
physical-pain-inducing 10-year vacuuming horizon (that's 20 LLVM major
releases and they only fix bugs in one...).  I will leave openssl
discussions to those more knowledgeable about that.

> So if the above package list is correct, then we ought to keep
> supporting openssl 1.1.* until 2029.

That's a shame.  But it sounds like the developer burden isn't so
different from 1.1.1 to 3.x, so maybe it's not such a big deal from
our point of view.  (I have no opinion on the security ramifications
of upstream's EOL, but as a layman it sounds completely bonkers to use
it.  I wonder why the packaging community wouldn't just arrange to
have a supported-by-upstream 3.x package in their RPM repo when they
supply the newest PostgreSQL versions for the oldest RHEL, but again
not my area so I'll shut up).




Re: Streaming read-ready sequential scan code

2024-04-03 Thread Thomas Munro
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
 wrote:
> On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas  wrote:
> > On 01/04/2024 22:58, Melanie Plageman wrote:
> > > Attached v7 has version 14 of the streaming read API as well as a few
> > > small tweaks to comments and code.
> >
> > I saw benchmarks in this thread to show that there's no regression when
> > the data is in cache, but I didn't see any benchmarks demonstrating the
> > benefit of this. So I ran this quick test:
>
> Good point! It would be good to show why we would actually want this
> patch. Attached v8 is rebased over current master (which now has the
> streaming read API).

Anecdotally by all reports I've seen so far, all-in-cache seems to be
consistently a touch faster than master if anything, for streaming seq
scan and streaming ANALYZE.  That's great!  But it doesn't seem to be
due to intentional changes.  No efficiency is coming from batching
anything.  Perhaps it has to do with CPU pipelining effects: though
it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
itself is cut up into stages in a kind of pipeline:
read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
returns page n each time you call it, so maybe we get more CPU
parallelism due to spreading the data dependencies out?  (Makes me
wonder what happens if you insert a memory prefetch for the page
header into that production line, and if there are more opportunities
to spread dependencies out eg hashing the buffer tag ahead of time.)
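
Here's a toy model of that production line -- not the real code, with
toy_pin() and toy_choose_next() standing in for the buffer pinning and
look-ahead/callback logic:

typedef struct ToyStream
{
	Buffer		slots[2];		/* pages n and n + 1 */
	int			head;
	BlockNumber next_blocknum;	/* page n + 2 */
} ToyStream;

static Buffer
toy_stream_next(ToyStream *stream)
{
	Buffer		result = stream->slots[stream->head];	/* return page n */

	/* Pin page n + 1 and choose page n + 2 for the next call. */
	stream->slots[stream->head ^ 1] = toy_pin(stream->next_blocknum);
	stream->next_blocknum = toy_choose_next(stream);
	stream->head ^= 1;

	return result;
}

Each call's result only depends on work started two calls earlier, so
the buffer mapping lookup for page n + 1 can overlap with whatever the
caller does to page n.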

> On the topic of BAS_BULKREAD buffer access strategy, I think the least
> we could do is add an assert like this to read_stream_begin_relation()
> after calculating max_pinned_buffers.
>
> Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
>
> Perhaps we should do more? I think with a ring size of 16 MB, large
> SELECTs are safe for now. But I know future developers will make
> changes and it would be good not to assume they will understand that
> pinning more buffers than the size of the ring effectively invalidates
> the ring.

Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:

    if (strategy)
    {
        int strategy_buffers = GetAccessStrategyBufferCount(strategy);

        max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
    }

I just don't know where to get that '2'.  The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.

Then we should increase the default ring sizes for BAS_BULKREAD and
BAS_VACUUM to make the previous statement true.  The size of main
memory and L2 cache have increased dramatically since those strategies
were invented.  I think we should at least double them, and more
likely quadruple them.  I realise you already made them configurable
per command in commit 1cbbee033, but I mean the hard coded default 256
in freelist.c.  It's not only to get more space for our prefetching
plans, it's also to give the system more chance of flushing WAL
asynchronously/in some other backend before you crash into dirty data;
as you discovered, prefetching accidentally makes that effect worse
in a BAS_VACUUM strategy, by taking away space that is effectively
deferring WAL flushes, so I'd at least like to double the size for if
we use "/ 2" above.




Re: Streaming I/O, vectored I/O (WIP)

2024-04-02 Thread Thomas Munro
On Tue, Apr 2, 2024 at 9:39 PM Thomas Munro  wrote:
> So this is the version I'm going to commit shortly, barring objections.

And done, after fixing a small snafu with smgr-only reads coming from
CreateAndCopyRelationData() (BM_PERMANENT would be
incorrectly/unnecessarily set for unlogged tables).

Here are the remaining patches discussed in this thread.  They give
tablespace-specific io_combine_limit, effective_io_readahead_window
(is this useful?), and up-to-1MB io_combine_limit (is this useful?).
I think the first two would probably require teaching reloption.c how
to use guc.c's parse_int() and unit flags, but I won't have time to
look at that for this release so I'll just leave these here.

On the subject of guc.c, this is a terrible error message... did I do
something wrong?

postgres=# set io_combine_limit = '42MB';
ERROR:  5376 8kB is outside the valid range for parameter
"io_combine_limit" (1 .. 32)
From 84b8280481312cdd1efcb7efa1182d4647cbe00a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 30 Mar 2024 19:09:44 +1300
Subject: [PATCH v16 1/4] ALTER TABLESPACE ... SET (io_combine_limit = ...).

This is the per-tablespace version of the GUC of the same name.

XXX reloptions.c lacks the ability to accept units eg '64kB'!  Which is
why I haven't included it with the main feature commit.

Suggested-by: Tomas Vondra 
Discussion: https://postgr.es/m/f603ac51-a7ff-496a-99c1-76673635692e%40enterprisedb.com
---
 doc/src/sgml/ref/alter_tablespace.sgml |  9 +++---
 src/backend/access/common/reloptions.c | 12 +++-
 src/backend/storage/aio/read_stream.c  | 39 --
 src/backend/utils/cache/spccache.c | 14 +
 src/bin/psql/tab-complete.c|  3 +-
 src/include/commands/tablespace.h  |  1 +
 src/include/utils/spccache.h   |  1 +
 7 files changed, 58 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index 6ec863400d1..faf0c6e7fbc 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -84,16 +84,17 @@ ALTER TABLESPACE name RESET ( ,
   ,
   ,
-  ).  This may be useful if
+  ),
+  ).  This may be useful if
   one tablespace is located on a disk which is faster or slower than the
   remainder of the I/O subsystem.
  
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..1e1c611fab2 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -371,6 +371,15 @@ static relopt_int intRelOpts[] =
 		0, 0, 0
 #endif
 	},
+	{
+		{
+			"io_combine_limit",
+			"Limit on the size of data reads and writes.",
+			RELOPT_KIND_TABLESPACE,
+			ShareUpdateExclusiveLock
+		},
+		-1, 1, MAX_IO_COMBINE_LIMIT
+	},
 	{
 		{
 			"parallel_workers",
@@ -2089,7 +2098,8 @@ tablespace_reloptions(Datum reloptions, bool validate)
 		{"random_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, random_page_cost)},
 		{"seq_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, seq_page_cost)},
 		{"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
-		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)}
+		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)},
+		{"io_combine_limit", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, io_combine_limit)},
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5e..907c80e6bf9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -114,6 +114,7 @@ struct ReadStream
 	int16		max_pinned_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	int16		io_combine_limit;
 	bool		advice_enabled;
 
 	/*
@@ -241,7 +242,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -329,7 +330,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -389,7 +390,7 @@ read_stream_look_ahead(ReadStream *stream

Re: Building with musl in CI and the build farm

2024-04-02 Thread Thomas Munro
On Wed, Mar 27, 2024 at 11:27 AM Wolfgang Walther
 wrote:
> The animal runs in a docker container via GitHub Actions in [2].

Great idea :-)



