Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-10-01 Thread Noah Misch
On Tue, Oct 01, 2024 at 11:55:48AM -0700, Masahiko Sawada wrote:
> On Mon, Sep 30, 2024 at 11:49 AM Noah Misch  wrote:
> > This has initdb set the field to the current C implementation's signedness.  I
> > wonder if, instead, initdb should set signedness=true unconditionally.  Then
> > the only source of signedness=false will be pg_upgrade.
> >
> > Advantage: signedness=false will get rarer over time.  If we had known about
> > this problem during the last development cycle that forced initdb (v8.3), we
> > would have made all clusters signed or all clusters unsigned.  Making
> > pg_upgrade the only source of signedness=false will cause the population of
> > database clusters to converge toward that retrospective ideal.
> 
> I think it's a good idea. Being able to treat one case as a rarer one
> rather than treating both cases equally may have various advantages in
> the future, for example when developing or investigating a problem.
> 
> > Disadvantage: testing signedness=false will require more planning.
> 
> If testing signedness=false always requires pg_upgrade, there might be
> some cumbersomeness. Inventing a testing-purpose-only tool (e.g., a
> CLI program) that changes the signedness might make tests easier.
> 
> > What other merits should we consider as part of deciding?
> 
> Considering that the population of database cluster signedness will
> converge to signedness=true in the future, we can consider using
> -fsigned-char to prevent similar problems in the future. We need to
> think about possible side-effects as well, though.

It's good to think about -fsigned-char.  While I find it tempting, several
things would need to hold for us to benefit from it:

- Every supported compiler has to offer it or an equivalent.
- The non-compiler parts of every supported C implementation need to
  cooperate.  For example, CHAR_MIN must change in response to the flag.  See
  the first comment in cash_in().
- Libraries we depend on can't do anything incompatible with it.

Given that, I would lean toward not using -fsigned-char.  It's unlikely all
three things will hold.  Even if they do, the benefit is not large.
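
To illustrate the CHAR_MIN point, here's a minimal standalone program (mine,
not from anyone's patch) showing how <limits.h> encodes the signedness choice;
-fsigned-char helps only if the implementation's headers react to the flag:

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        char    c = (char) 0xFF;    /* -1 if char is signed, 255 if unsigned */

    #if CHAR_MIN != 0
        printf("char is signed; c = %d\n", (int) c);    /* prints -1 */
    #else
        printf("char is unsigned; c = %d\n", (int) c);  /* prints 255 */
    #endif
        return 0;
    }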




Re: Inval reliability, especially for inplace updates

2024-09-30 Thread Noah Misch
Rebased.
Author: Noah Misch 
Commit: Noah Misch 

For inplace update, send nontransactional invalidations.

The inplace update survives ROLLBACK.  The inval didn't, so another
backend's DDL could then update the row without incorporating the
inplace update.  In the test this fixes, a mix of CREATE INDEX and ALTER
TABLE resulted in a table with an index, yet relhasindex=f.  That is a
source of index corruption.

Core code no longer needs XLOG_INVALIDATIONS, but this leaves removing
it for future work.  Extensions could be relying on that mechanism, so
that removal would not be back-patch material.  Back-patch to v12 (all
supported versions).

Reviewed by FIXME and (in earlier versions) Andres Freund.

Discussion: https://postgr.es/m/20240523000548.58.nmi...@google.com

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index da5e656..7c82a95 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6327,6 +6327,9 @@ heap_inplace_update_and_unlock(Relation relation,
HeapTupleHeader htup = oldtup->t_data;
uint32  oldlen;
uint32  newlen;
+   int nmsgs = 0;
+   SharedInvalidationMessage *invalMessages = NULL;
+   boolRelcacheInitFileInval = false;
 
Assert(ItemPointerEquals(&oldtup->t_self, &tuple->t_self));
oldlen = oldtup->t_len - htup->t_hoff;
@@ -6334,6 +6337,29 @@ heap_inplace_update_and_unlock(Relation relation,
if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
elog(ERROR, "wrong tuple length");
 
+   /*
+* Construct shared cache inval if necessary.  Note that because we only
+* pass the new version of the tuple, this mustn't be used for any
+* operations that could change catcache lookup keys.  But we aren't
+* bothering with index updates either, so that's true a fortiori.
+*/
+   CacheInvalidateHeapTupleInplace(relation, tuple, NULL);
+
+   /* Like RecordTransactionCommit(), log only if needed */
+   if (XLogStandbyInfoActive())
+   nmsgs = inplaceGetInvalidationMessages(&invalMessages,
+          &RelcacheInitFileInval);
+
+   /*
+* Unlink relcache init files as needed.  If unlinking, acquire
+* RelCacheInitLock until after associated invalidations.  By doing this
+* in advance, if we checkpoint and then crash between inplace
+* XLogInsert() and inval, we don't rely on StartupXLOG() ->
+* RelationCacheInitFileRemove().  That uses elevel==LOG, so replay would
+* neglect to PANIC on EIO.
+*/
+   PreInplace_Inval();
+
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
 
@@ -6363,9 +6389,16 @@ heap_inplace_update_and_unlock(Relation relation,
XLogRecPtr  recptr;
 
xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
+   xlrec.dbId = MyDatabaseId;
+   xlrec.tsId = MyDatabaseTableSpace;
+   xlrec.relcacheInitFileInval = RelcacheInitFileInval;
+   xlrec.nmsgs = nmsgs;
 
XLogBeginInsert();
-   XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);
+   XLogRegisterData((char *) &xlrec, MinSizeOfHeapInplace);
+   if (nmsgs != 0)
+   XLogRegisterData((char *) invalMessages,
+    nmsgs * sizeof(SharedInvalidationMessage));
 
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);
@@ -6377,17 +6410,28 @@ heap_inplace_update_and_unlock(Relation relation,
PageSetLSN(BufferGetPage(buffer), recptr);
}
 
-   END_CRIT_SECTION();
-
-   heap_inplace_unlock(relation, oldtup, buffer);
+   LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
/*
-* Send out shared cache inval if necessary.  Note that because we only
-* pass the new version of the tuple, this mustn't be used for any
-* operations that could change catcache lookup keys.  But we aren't
-* bothering with index updates either, so that's true a fortiori.
+* Send invalidations to shared queue.  SearchSysCacheLocked1() assumes we
+* do this before UnlockTuple().
 *
-* XXX ROLLBACK discards the invalidation.  See test inplace-inval.spec.
+* If we're mutating a tuple visible only to this transaction, there's an
+* equivalent transactional inval from the action that created the tuple,
+* and this inval is 

Re: AIO v2.0

2024-09-30 Thread Noah Misch
On Mon, Sep 30, 2024 at 10:49:17AM -0400, Andres Freund wrote:
> We also discussed the topic at 
> https://postgr.es/m/20240925020022.c5.nmisch%40google.com
> > ... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
> > decision.  From what I've heard so far of the performance effects, if it were
> > me, I would keep the bounce buffers.  I'd pursue BM_SETTING_HINTS and bounce
> > buffer removal as a distinct project after the main AIO capability.  Bounce
> > buffers have an implementation.  They aren't harming other design decisions.
> > The AIO project is big, so I'd want to err on the side of not designating
> > other projects as its prerequisites.
> 
> Given the issues that modifying pages while in flight causes, not just with PG
> level checksums, but also filesystem level checksum, I don't feel like it's a
> particularly promising approach.
> 
> However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
> to be completed before we can move forward with AIO. If I split out the write
> portion from the read portion a bit further, the main AIO changes and the
> shared-buffer read user can be merged before there's a dependency on the hint
> bit stuff being done.
> 
> Does that seem reasonable?

Yes.




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-30 Thread Noah Misch
On Wed, Sep 25, 2024 at 03:46:27PM -0700, Masahiko Sawada wrote:
> On Wed, Sep 18, 2024 at 6:46 PM Noah Misch  wrote:
> > On Tue, Sep 17, 2024 at 09:43:41PM -0700, Masahiko Sawada wrote:
> > > On Mon, Sep 16, 2024 at 9:24 AM Noah Misch  wrote:
> > > > On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
> > > > > On Tue, Sep 10, 2024 at 3:05 PM Noah Misch  wrote:
> > > > > > On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> > > > > > > Got it.  So now I'm wondering if we need all the complexity of storing
> > > > > > > stuff in the GIN metapages.  Could we simply read the (primary's)
> > > > > > > signedness out of pg_control and use that?

> > > Thanks. I like the simplicity of this approach. If we agree with this
> > > approach, I'd like to proceed with it.
> >
> > Works for me.
> >
> > > Regardless of what approach we take, I wanted to provide some
> > > regression tests for these changes, but I could not come up with a
> > > reasonable idea. It would be great if we could do something like
> > > 027_stream_regress.pl on cross-architecture replication. But just
> > > executing 027_stream_regress.pl on cross-architecture replication
> > > could not be sufficient since we would like to confirm query results
> > > as well. If we could have a reasonable tool or way, it would also help
> > > find other similar bugs related to architectural differences.
> >
> > Perhaps add a regress.c function that changes the control file flag and
> > flushes the change to disk?
> 
> I've tried this idea but found out that while extensions can flush the
> controlfile to disk after flipping the char signedness value, they
> cannot update the controlfile data in shared memory. And when the
> server shuts down, the on-disk controlfile is overwritten with the
> in-memory controlfile data, so the char signedness is reverted.
> 
> We could add a function to set the char signedness of the in-memory
> controlfile data, or add a new option to pg_resetwal to change the char
> signedness of the on-disk controlfile data, but both seem like overkill
> to me, and I'm concerned about such an option or function being misused.

Before the project is over, pg_upgrade will have some way to set the control
file signedness to the source cluster signedness.  I think pg_upgrade should
also take an option for overriding the source cluster signedness for source
clusters v17 and older.  That way, a user knowing they copied the v17 source
cluster from x86 to ARM can pg_upgrade properly without the prerequisite of
acquiring an x86 VM.  Once that all exists, perhaps it will be clearer how
best to change signedness for testing.

> While this change does not encourage the use of 'char' without
> explicitly specifying its signedness, it provides a hint to ensure
> consistent behavior for indexes (e.g., GIN and GiST) and tables that
> already store data sorted by the 'char' type on disk, especially in
> cross-platform replication scenarios.

Let's put that sort of information in the GetDefaultCharSignedness() header
comment.  New code should use explicit "signed char" or "unsigned char" when
it matters for data file compatibility.  GetDefaultCharSignedness() exists for
code that deals with pre-v18 data files, where we accidentally let C
implementation signedness affect persistent data.
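
For concreteness, the header comment I have in mind would read something like
this (wording and the function body are illustrative, not committed code):

    /*
     * GetDefaultCharSignedness
     *      Return the default "char" signedness recorded in the control file.
     *
     * New code should use explicit "signed char" or "unsigned char" when
     * signedness matters for data file compatibility.  This exists for code
     * that deals with pre-v18 data files, where we accidentally let C
     * implementation signedness affect persistent data.
     */
    bool
    GetDefaultCharSignedness(void)
    {
        return ControlFile->default_char_signedness;
    }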

> @@ -4256,6 +4257,18 @@ WriteControlFile(void)
>  
>   ControlFile->float8ByVal = FLOAT8PASSBYVAL;
>  
> + /*
> +  * The signedness of the char type is implementation-defined. For example,
> +  * on x86 architecture CPUs, the char data type is typically treated as
> +  * signed by default, whereas on aarch64 architecture CPUs it is typically
> +  * treated as unsigned by default.
> +  */
> +#if CHAR_MIN != 0
> + ControlFile->default_char_signedness = true;
> +#else
> + ControlFile->default_char_signedness = false;
> +#endif

This has initdb set the field to the current C implementation's signedness.  I
wonder if, instead, initdb should set signedness=true unconditionally.  Then
the only source of signedness=false will be pg_upgrade.
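
A sketch of that alternative in WriteControlFile(), replacing the #if block
quoted above (illustrative only):

    /* Always record signedness=true; only pg_upgrade can carry over false. */
    ControlFile->default_char_signedness = true;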

Advantage: signedness=false will get rarer over time.  If we had known about
this problem during the last development cycle that forced initdb (v8.3), we
would have made all clusters signed or all clusters unsigned.  Making
pg_upgrade the only source of signedness=false will cause the population of
database clusters to converge toward that retrospective ideal.

Disadvantage: testing signedness=false will require more planning.

What other merits should we consider as part of deciding?




Re: Primary and standby setting cross-checks

2024-09-24 Thread Noah Misch
On Thu, Aug 29, 2024 at 09:52:06PM +0300, Heikki Linnakangas wrote:
> Currently, if you configure a hot standby server with a smaller
> max_connections setting than the primary, the server refuses to start up:
> 
> LOG:  entering standby mode
> FATAL:  recovery aborted because of insufficient parameter settings
> DETAIL:  max_connections = 10 is a lower setting than on the primary server,
> where its value was 100.

> happen anyway:
> 
> 2024-08-29 21:44:32.634 EEST [668327] FATAL:  out of shared memory
> 2024-08-29 21:44:32.634 EEST [668327] HINT:  You might need to increase
> "max_locks_per_transaction".
> 2024-08-29 21:44:32.634 EEST [668327] CONTEXT:  WAL redo at 2/FD40FCC8 for
> Standby/LOCK: xid 996 db 5 rel 154045
> 2024-08-29 21:44:32.634 EEST [668327] WARNING:  you don't own a lock of type
> AccessExclusiveLock
> 2024-08-29 21:44:32.634 EEST [668327] LOG:  RecoveryLockHash contains entry
> for lock no longer recorded by lock manager: xid 996 database 5 relation
> 154045
> TRAP: failed Assert("false"), File: "../src/backend/storage/ipc/standby.c",

> Granted, if you restart the server, it will probably succeed because
> restarting the server will kill all the other queries that were holding
> locks. But yuck.

Agreed.

> So how to improve this? I see a few options:
> 
> a) Downgrade the error at startup to a warning, and allow starting the
> standby with smaller settings in standby. At least with a smaller
> max_locks_per_transactions. The other settings also affect the size of
> known-assigned XIDs array, but if the CSN snapshots get committed, that will
> get fixed. In most cases there is enough lock memory anyway, and it will be
> fine. Just fix the assertion failure so that the error message is a little
> nicer.
> 
> b) If you run out of lock space, kill running queries, and prevent new ones
> from starting. Track the locks in startup process' private memory until
> there is enough space in the lock manager, and then re-open for queries. In
> essence, go from hot standby mode to warm standby, until it's possible to go
> back to hot standby mode again.

Either seems fine.  Having never encountered actual lock exhaustion from this,
I'd lean toward (a) for simplicity.

> Thoughts, better ideas?

I worry about future code assuming a MaxBackends-sized array suffices for
something.  That could work almost all the time, breaking only when a standby
replays WAL from a server having a larger array.  What could we do now to
catch that future mistake promptly?  As a start, 027_stream_regress.pl could
use low settings on its standby.




Re: AIO writes vs hint bits vs checksums

2024-09-24 Thread Noah Misch
On Tue, Sep 24, 2024 at 04:30:25PM -0400, Andres Freund wrote:
> On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
> > On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> > > Besides that, the need to copy the buffers makes checkpoints with AIO
> > > noticeably slower when checksums are enabled - it's not the checksum but the
> > > copy that's the biggest source of the slowdown.

How big is that copy's contribution to the slowdown there?  A measurable CPU
overhead on writes likely does outweigh the unmeasurable overhead on index
scans, but ...

> > > Does this sound like a reasonable idea?  Counterpoints?

> > How should we think about comparing the distributed cost of the buffer
> > header manipulations during index scans vs. the costs of bounce buffers?
> 
> Well, the cost of bounce buffers would be borne as long as postgres is up,
> whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
> really show up.

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision.  From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers.  I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability.  Bounce
buffers have an implementation.  They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.

> Zooming out (a lot) more: I like the idea of having a way to get the
> permission to perform some kinds of modifications on a page without an
> exlusive lock. While obviously a lot more work, I do think there's some
> potential to have some fast-paths that perform work on a page level without
> blocking out readers. E.g. some simple cases of insert could correctly be done
> without blocking out readers (by ordering the update of the max page offset

True.




Re: race condition in pg_class

2024-09-24 Thread Noah Misch
On Thu, Sep 19, 2024 at 02:33:46PM -0700, Noah Misch wrote:
> On Mon, Sep 09, 2024 at 10:55:32AM +0530, Nitin Motiani wrote:
> > On Sat, Sep 7, 2024 at 12:25 AM Noah Misch  wrote:
> > > https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".
> > > When someone moves it to status="Ready for Committer", I will commit
> > > inplace090, inplace110, and inplace120 patches.  If one of you is comfortable
> > > with that, please modify the status.
> > 
> > Done.
> 
> FYI, here are the branch-specific patches.  I plan to push these after the v17
> release freeze lifts next week.

Pushed, but the pushes contained at least one defect:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=akepa&dt=2024-09-24%2022%3A29%3A02

I will act on that and other buildfarm failures that show up.




Re: Recovering from detoast-related catcache invalidations

2024-09-24 Thread Noah Misch
On Sun, Jan 14, 2024 at 12:14:11PM -0800, Noah Misch wrote:
> On Fri, Jan 12, 2024 at 03:47:13PM -0500, Tom Lane wrote:
> > Oh!  After nosing around a bit more I remembered systable_recheck_tuple,
> > which is meant for exactly this purpose.  So v4 attached.
> 
> systable_recheck_tuple() is blind to heap_inplace_update(), so it's not a
> general proxy for invalidation messages.  The commit for $SUBJECT (ad98fb1)
> doesn't create any new malfunctions, but I expect the systable_recheck_tuple()
> part will change again before the heap_inplace_update() story is over
> (https://postgr.es/m/flat/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com).

Commit f9f47f0 (2024-06-27) addressed inplace updates here.




Re: AIO writes vs hint bits vs checksums

2024-09-24 Thread Noah Misch
On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
> So far the AIO patchset has solved this by introducing a set of "bounce
> buffers", which can be acquired and used as the source/target of IO when doing
> it in-place into shared buffers isn't viable.
> 
> I am worried about that solution however, as either acquisition of bounce
> buffers becomes a performance issue (that's how I did it at first, it was hard
> to avoid regressions) or we reserve bounce buffers for each backend, in which
> case the memory overhead for instances with relatively small amount of
> shared_buffers and/or many connections can be significant.

> But: We can address this and improve performance over the status quo! Today we
> determine tuple visibility one-by-one, even when checking the
> visibility of an entire page worth of tuples. That's not exactly free. I've
> prototyped checking visibility of an entire page of tuples at once and it
> indeed speeds up visibility checks substantially (in some cases seqscans are
> over 20% faster!).

Nice!  It sounds like you refactored the relationship between
heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
bit setting upward or the iterate-over-tuples downward.  Is that about right?

> Once we have page-level visibility checks we can get the right to set hint
> bits once for an entire page instead of doing it for every tuple - with that
> in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
> wins.

How did page-level+BM_SETTING_HINTS performance compare to performance of the
page-level change w/o the BM_SETTING_HINTS change?

> Having a page level approach to setting hint bits has other advantages:
> 
> E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
> on the page and we don't mark a page dirty on hot standby. Which often will
> result in hint bits not persistently set on replicas until the page is frozen.

Nice way to improve that.

> Does this sound like a reasonable idea?  Counterpoints?

I guess the main part left to discuss is index scans or other scan types where
we'd either not do page-level visibility or we'd do page-level visibility
including tuples we wouldn't otherwise use.  BM_SETTING_HINTS likely won't
show up so readily in index scan profiles, but the cost is still there.  How
should we think about comparing the distributed cost of the buffer header
manipulations during index scans vs. the costs of bounce buffers?

Thanks,
nm




Re: pgsql: Don't enter parallel mode when holding interrupts.

2024-09-20 Thread Noah Misch
On Thu, Sep 19, 2024 at 09:25:05AM -0400, Robert Haas wrote:
> On Wed, Sep 18, 2024 at 3:27 AM Laurenz Albe  wrote:
> > On Wed, 2024-09-18 at 02:58 +0000, Noah Misch wrote:
> > > Don't enter parallel mode when holding interrupts.
> > >
> > > Doing so caused the leader to hang in wait_event=ParallelFinish, which
> > > required an immediate shutdown to resolve.  Back-patch to v12 (all
> > > supported versions).
> > >
> > > Francesco Degrassi
> > >
> > > Discussion: 
> > > https://postgr.es/m/CAC-SaSzHUKT=vzj8mpxydc_urpfax+yoa1hktcf4roz_q6z...@mail.gmail.com
> >
> > Does that warrant mention on this page?
> > https://www.postgresql.org/docs/current/when-can-parallel-query-be-used.html
> 
> IMHO, no. This seems too low-level and too odd to mention.

Agreed.  If I were documenting it, I would document it with the material for
writing opclasses.  It's probably too esoteric to document even there.

> TBH, I'm kind of surprised to learn that it's possible to start
> executing a query while holding an LWLock. I see Tom is expressing
> some doubts on the original thread, too. I wonder if we should instead
> be erroring out if an LWLock is held at the start of query execution
> -- or even earlier, like when we try to call a plpgsql function while
> holding one. Leaving parallel query aside, what would prevent us from
> attempting to reacquire the exact same LWLock that we already hold and
> self-deadlocking? Or attempting to acquire some other LWLock and
> deadlocking that way? I don't really feel like this is a parallel
> query problem. I don't think we should be trying to run any
> user-defined code while holding an LWLock, unless that code is written
> in C (or C++, Rust, etc.). Trying to run procedural code at that point
> doesn't seem reasonable.

Nothing prevents those lwlock deadlocks.  If you think it's worth breaking the
things folks use today (see original thread) in order to prevent that, please
do share that on the original thread.  I'm fine either way.  Given infinite
resources across both postgresql.org and all extension maintainers, I would do
what you're thinking in v18, while in back branches I would change
"erroring out" to "warn when assertions are enabled".  I also think it's a
low-priority bug, given the only known ways to reach it are C code or a custom
opclass.  Since resources aren't infinite, I'm inclined toward one of (a) stop
here or (b) all branches "warn when assertions are enabled" and maybe block
the plancache route discussed on the original thread.
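
For the back-branch "warn when assertions are enabled" variant, one way to
approximate the check is via interrupt holdoff state, since holding an LWLock
also holds interrupts.  A sketch (condition and placement illustrative, not a
finished patch):

    #ifdef USE_ASSERT_CHECKING
        if (!INTERRUPTS_CAN_BE_PROCESSED())
            elog(WARNING, "entering query execution with interrupts held");
    #endif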




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-18 Thread Noah Misch
On Tue, Sep 17, 2024 at 09:43:41PM -0700, Masahiko Sawada wrote:
> On Mon, Sep 16, 2024 at 9:24 AM Noah Misch  wrote:
> > On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
> > > On Tue, Sep 10, 2024 at 3:05 PM Noah Misch  wrote:
> > > > On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> > > > > Got it.  So now I'm wondering if we need all the complexity of storing
> > > > > stuff in the GIN metapages.  Could we simply read the (primary's)
> > > > > signedness out of pg_control and use that?
> >
> > > I've attached a PoC patch for this idea. We write the default char
> > > signedness to the control file at initdb time. Then when comparing two
> > > trgms, pg_trgm opclasses use a comparison function based on the char
> > > signedness of the cluster. I've confirmed that the patch fixes the
> > > reported case at least.
> >
> > I agree that proves the concept.
> 
> Thanks. I like the simplicity of this approach. If we agree with this
> approach, I'd like to proceed with it.

Works for me.

> Regardless of what approach we take, I wanted to provide some
> regression tests for these changes, but I could not come up with a
> reasonable idea. It would be great if we could do something like
> 027_stream_regress.pl on cross-architecture replication. But just
> executing 027_stream_regress.pl on cross-architecture replication
> could not be sufficient since we would like to confirm query results
> as well. If we could have a reasonable tool or way, it would also help
> find other similar bugs related to architectural differences.

Perhaps add a regress.c function that changes the control file flag and
flushes the change to disk?




Re: AIO v2.0

2024-09-17 Thread Noah Misch
On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

> > Reattaching descriptors and memory in each child may work, or one could just
> > block io_method=io_uring under EXEC_BACKEND.
> 
> I think the latter option is saner

Works for me.

> > > +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > > +{
> 
> > > + if (ret == -EINTR)
> > > + {
> > > + elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> > > + continue;
> > > + }
> >
> > Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
> > retry on EAGAIN.  "man io_uring_enter" has:
> >
> > EAGAIN The kernel was unable to allocate memory for the request, or
> > otherwise ran out of resources to handle it. The application should wait
> > for some completions and try again.
> 
> Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
> in flight for each uring instance. That's different to a use of uring to
> e.g. wait for incoming network data on thousands of sockets, where you could
> have essentially unbounded amount of requests outstanding.
> 
> What would we wait for? What if we were holding a critical lock in that
> moment? Would it be safe to just block for some completions? What if there's
> actually no IO in progress?

I'd try the following.  First, scan for all IOs of all processes at
AHS_DEFINED and later, advancing them to AHS_COMPLETED_SHARED.  This might be
unsafe today, but discovering why it's unsafe likely will inform design beyond
EAGAIN returns.  I don't specifically know of a way it's unsafe.  Do just one
pass of that; there may be newer IOs in progress afterward.  If submit still
gets EAGAIN, sleep a bit and retry.  Like we do in pgwin32_open_handle(), fail
after a fixed number of iterations.  This isn't great if we hold a critical
lock, but it beats the alternative of PANIC on the first EAGAIN.
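
In code form, roughly (the drain helper name and iteration cap are
hypothetical, and "ring" stands for the submitting backend's io_uring
instance):

    int     ret;

    for (int attempt = 0;; attempt++)
    {
        ret = io_uring_submit(ring);
        if (ret != -EAGAIN)
            break;              /* submitted, or a different error */

        if (attempt == 0)
            pgaio_drain_defined_ios();  /* hypothetical: one pass advancing
                                         * all IOs at AHS_DEFINED or later
                                         * to AHS_COMPLETED_SHARED */
        else if (attempt < 300)
            pg_usleep(10000L);  /* sleep a bit, then retry */
        else
            elog(PANIC, "io_uring_submit: persistent EAGAIN");
    }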

> > > +FileStartWriteV(struct PgAioHandle *ioh, File file,
> > > + int iovcnt, off_t offset,
> > > + uint32 wait_event_info)
> > > +{
> > > ...
> >
> > FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
> > name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).
> 
> Hm - that doesn't necessarily seem right to me. I don't think the caller
> should assume that the IO will just be prepared and not already completed by
> the time FileStartWriteV() returns - we might actually do the IO
> synchronously.

Yes.  Even if it doesn't become synchronous IO, some other process may advance
the IO to AHS_COMPLETED_SHARED by the next wake-up of the process that defined
the IO.  Still, I think this shouldn't use the term "Start" while no state
name uses that term.  What else could remove that mismatch?

> > Is there any specific decision you'd like to settle before patch 6 exits
> > WIP?
> 
> Patch 6 specifically? That I really mainly kept separate for review - it

No.  I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"

> doesn't seem particularly interesting to commit it earlier than 7, or do you
> think differently?

No, I agree a lone commit of 6 isn't a win.  Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit.  I've not thought through that
grouping much.

> In case you mean 6+7 or 6 to ~11, I can think of the following:
> 
> - I am worried about the need for bounce buffers for writes of checksummed
>   buffers. That quickly ends up being a significant chunk of memory,
>   particularly when using a small shared_buffers with a higher than default
>   number of connections. I'm currently hacking up a prototype that'd prevent us
>   from setting hint bits with just a share lock. I'm planning to start a
>   separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker.  I recommend
dealing with the worry by reducing the limit initially (128 blocks?).  Can
always raise it later.

> - The header split doesn't yet quite seem right yet

I won't have a strong opinion on that one.  The aio.c/aio_io.c split did catch
my attention.  I made a note to check it again once those files get header
comments.

> - I'd like to implement retries in the later patches, to m

Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-16 Thread Noah Misch
On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
> On Tue, Sep 10, 2024 at 3:05 PM Noah Misch  wrote:
> > On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> > > Got it.  So now I'm wondering if we need all the complexity of storing
> > > stuff in the GIN metapages.  Could we simply read the (primary's)
> > > signedness out of pg_control and use that?

> I've attached a PoC patch for this idea. We write the default char
> signedness to the control file at initdb time. Then when comparing two
> trgms, pg_trgm opclasses use a comparison function based on the char
> signedness of the cluster. I've confirmed that the patch fixes the
> reported case at least.

I agree that proves the concept.




Re: AIO v2.0

2024-09-16 Thread Noah Misch
On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> There's plenty more to do, but I thought this would be a useful checkpoint.

I find patches 1-5 are Ready for Committer.

> +typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else.  I recommend referring to it in README.md.  Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?

> +{
> + /* not in use */
> + AHS_IDLE = 0,
> +
> + /* returned by pgaio_io_get() */
> + AHS_HANDED_OUT,
> +
> + /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
> + AHS_DEFINED,
> +
> + /* subject's prepare() callback has been called */
> + AHS_PREPARED,
> +
> + /* IO is being executed */
> + AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach.  For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

> +
> + /* IO finished, but result has not yet been processed */
> + AHS_REAPED,
> +
> + /* IO completed, shared completion has been called */
> + AHS_COMPLETED_SHARED,
> +
> + /* IO completed, local completion has been called */
> + AHS_COMPLETED_LOCAL,
> +} PgAioHandleState;

> +void
> +pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
> +{
> + PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
> +
> + Assert(ioh->resowner);
> +
> + ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
> + ioh->resowner = NULL;
> +
> + switch (ioh->state)
> + {
> + case AHS_IDLE:
> + elog(ERROR, "unexpected");
> + break;
> + case AHS_HANDED_OUT:
> + Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
> +
> + if (ioh == my_aio->handed_out_io)
> + {
> + my_aio->handed_out_io = NULL;
> + if (!on_error)
> + elog(WARNING, "leaked AIO handle");
> + }
> +
> + pgaio_io_reclaim(ioh);
> + break;
> + case AHS_DEFINED:
> + case AHS_PREPARED:
> + /* XXX: Should we warn about this when is_commit? */

Yes.

> + pgaio_submit_staged();
> + break;
> + case AHS_IN_FLIGHT:
> + case AHS_REAPED:
> + case AHS_COMPLETED_SHARED:
> + /* this is expected to happen */
> + break;
> + case AHS_COMPLETED_LOCAL:
> + /* XXX: unclear if this ought to be possible? */
> + pgaio_io_reclaim(ioh);
> + break;
> + }

> +void
> +pgaio_io_ref_wait(PgAioHandleRef *ior)
> +{
> + uint64  ref_generation;
> + PgAioHandleState state;
> + boolam_owner;
> + PgAioHandle *ioh;
> +
> + ioh = pgaio_io_from_ref(ior, &ref_generation);
> +
> + am_owner = ioh->owner_procno == MyProcNumber;
> +
> +
> + if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> + return;
> +
> + if (am_owner)
> + {
> + if (state == AHS_DEFINED || state == AHS_PREPARED)
> + {
> + /* XXX: Arguably this should be prevented by callers? */
> + pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both.  AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to clean up.  Even so,
the next point might remove the need here:

> +void
> +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> +{
> + Assert(ioh->state == AHS_HANDED_OUT);
> + Assert(pgaio_io_has_subject(ioh));
> +
> + ioh->op = op;
> + ioh->state = AHS_DEFINED;
> + ioh->result = 0;
> +
> + /* allow a new IO to be staged */
> + my_aio->handed_out_io = NULL;
> +
> + pgaio_io_prepare_subject(ioh);
> +
> + ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here.  This code already needs to be safe for that (per
README.md).  When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common().  I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.
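
Concretely, the sketch I have in mind (based on the code quoted above; not a
finished patch):

    ioh->op = op;
    ioh->result = 0;

    /*
     * Defense in depth: between AHS_DEFINED and AHS_PREPARED, an ERROR in
     * a subject callback could leak resources such as the lwlock disowned
     * in shared_buffer_prepare_common(), so promote any such ERROR to PANIC.
     */
    START_CRIT_SECTION();
    ioh->state = AHS_DEFINED;

    /* allow a new IO to be staged */
    my_aio->handed_out_io = NULL;

    pgaio_io_prepare_subject(ioh);

    ioh->state = AHS_PREPARED;
    END_CRIT_SECTION();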

> +if test "$with_liburing" = yes; then
> +  PKG_CHECK_MODULES(LIBURING, liburing)
> +fi

I used the attached makefile patch to build w/ liburing.

> +pgaio_uring_shmem_init(bool first_time)
> +{
> + uint32  TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
> + bool   

Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-13 Thread Noah Misch
On Wed, Sep 11, 2024 at 09:00:33AM -0400, Robert Haas wrote:
> On Tue, Sep 10, 2024 at 4:58 PM Noah Misch  wrote:
> > ... a rule of "each wait event appears in one
> > pgstat_report_wait_start()" would be a rule I don't want.
> 
> As the original committer of the wait event stuff, I intended for the
> rule that you do not want to be the actual rule. However, I see that I
> didn't spell that out anywhere in the commit message, or the commit
> itself.
> 
> > I see this level of fine-grained naming
> > as making the event name a sort of stable proxy for FILE:LINE.  I'd value
> > exposing such a proxy, all else being equal, but I don't think wait event
> > names like AuthLdapBindLdapbinddn/AuthLdapBindUser are the right way.  Wait
> > event names should be more independent of today's code-level details.
> 
> I don't agree with that. One of the most difficult parts of supporting
> PostgreSQL, in my experience, is that it's often very difficult to
> find out what has gone wrong when a system starts behaving badly. It
> is often necessary to ask customers to install a debugger and do stuff
> with it, or give them an instrumented build, in order to determine the
> root cause of a problem that in some cases is not even particularly
> complicated. While needing to refer to specific source code details
> may not be a common experience for the typical end user, it is
> extremely common for me. This problem commonly arises with error
> messages

That is a problem.  Half the time, error verbosity doesn't disambiguate enough
for me, and I need backtrace_functions.  I now find it hard to believe how
long we coped without backtrace_functions.

I withdraw the objection to "each wait event appears in one
pgstat_report_wait_start()".




Re: Use read streams in pg_visibility

2024-09-11 Thread Noah Misch
On Wed, Sep 11, 2024 at 09:19:09AM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 11 Sept 2024 at 01:38, Noah Misch  wrote:
> > I also fixed the mix of tabs and spaces inside test file string literals.
> 
> I ran both pgindent and pgperltidy but they didn't catch them. Is
> there an automated way to catch them?

git diff --check




Re: Windows socket problems, interesting connection to AIO

2024-09-10 Thread Noah Misch
On Mon, Sep 02, 2024 at 09:20:21PM +1200, Thomas Munro wrote:
> 2.  If a Windows client tries to send() and gets an ECONNRESET/EPIPE
> error, then the network stack seems to drop already received data, so
> a following recv() will never see it.  In other words, it depends on
> whether the application-level protocol is strictly request/response
> based, or has sequence points at which both ends might send().  AFAIK
> the main consequence for real users is that FATAL recovery conflict,
> idle termination, etc messages are not delivered to clients, leaving
> just "server closed the connection unexpectedly".

> The new thought I had about the second category of problem is: if you
> use asynchronous networking APIs, then the kernel *can't* throw your
> data out, because it doesn't even have it.  If the server's FATAL
> message arrives before the client calls send(), then the data is
> already written to user space memory and the I/O is marked as
> complete.

Good point.

> just wanted to share this observation.

Thanks for sharing that and the test program.




Re: Use read streams in pg_visibility

2024-09-10 Thread Noah Misch
On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:
> Your patch is correct. I wrongly assumed it would catch the blockno bug;
> the attached version catches it. I made blockno = 0 invisible and not
> frozen before copying the vm file. So, in the blockno buggy version;
> callback will skip that block but the main loop in the
> collect_corrupt_items() will not skip it. I tested it with your patch
> and there is exactly 1 blockno difference between expected and result
> output.

Pushed.  I added autovacuum=off so auto-analyze of a system catalog can't take
a snapshot that blocks VACUUM updating the vismap.  I doubt that could happen
under default settings, but this lets us disregard the possibility entirely.

I also fixed the mix of tabs and spaces inside test file string literals.




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-10 Thread Noah Misch
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> Masahiko Sawada  writes:
> > On Tue, Sep 10, 2024 at 11:57 AM Tom Lane  wrote:
> >> Yeah, that seems like it could work.  But are we sure that replicas
> >> get a copy of the primary's control file rather than creating their
> >> own?
> 
> > Yes, I think so. Since at least the system identifiers of primary and
> > replicas must be identical for physical replication, if replicas use
> > their own control files then they cannot start the replication.
> 
> Got it.  So now I'm wondering if we need all the complexity of storing
> stuff in the GIN metapages.  Could we simply read the (primary's)
> signedness out of pg_control and use that?

Yes.




Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-10 Thread Noah Misch
On Tue, Sep 10, 2024 at 02:51:23PM -0400, Robert Haas wrote:
> On Tue, Sep 10, 2024 at 1:27 PM Noah Misch  wrote:
> > On Tue, Sep 10, 2024 at 02:29:57PM +0900, Michael Paquier wrote:
> > > You are adding twelve event points with only 5
> > > new wait names.  Couldn't it be better to have a one-one mapping
> > > instead, adding twelve entries in wait_event_names.txt?
> >
> > No, I think the patch's level of detail is better.  One shouldn't expect the
> > two ldap_simple_bind_s() calls to have different-enough performance
> > characteristics to justify exposing that level of detail to the DBA.
> > ldap_search_s() and InitializeLDAPConnection() differ more, but the DBA mostly
> > just needs to know the scale of their LDAP responsiveness problem.
> >
> > (Someday, it might be good to expose the file:line and/or backtrace associated
> > with a wait, like we do for ereport().  As a way to satisfy rare needs for
> > more detail, I'd prefer that over giving every pgstat_report_wait_start() a
> > different name.)
> 
> I think unique names are a good idea. If a user doesn't care about the
> difference between sdgjsA and sdjgsB, they can easily ignore the
> trailing suffix, and IME, people typically do that without really
> stopping to think about it. If on the other hand the two are lumped
> together as sdjgs and a user needs to distinguish them, they can't. So
> I see unique names as having much more upside than downside.

I agree a person can ignore the distinction, but that requires the person to
be consuming the raw event list.  It's reasonable to tell your monitoring tool
to give you the top N wait events.  Individual AuthnLdap* events may all miss
the cut even though their aggregate would have made the cut.  Before you know
to teach that monitoring tool to group AuthnLdap* together, it won't show you
any of those names.

I felt commit c789f0f also chose sub-optimally in this respect, particularly
with the DblinkGetConnect/DblinkConnect pair.  I didn't feel strongly enough
to complain at the time, but a rule of "each wait event appears in one
pgstat_report_wait_start()" would be a rule I don't want.  One needs
familiarity with the dblink implementation internals to grasp the
DblinkGetConnect/DblinkConnect distinction, and a plausible refactor of dblink
would make those names cease to fit.  I see this level of fine-grained naming
as making the event name a sort of stable proxy for FILE:LINE.  I'd value
exposing such a proxy, all else being equal, but I don't think wait event
names like AuthLdapBindLdapbinddn/AuthLdapBindUser are the right way.  Wait
event names should be more independent of today's code-level details.




Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-10 Thread Noah Misch
On Tue, Sep 10, 2024 at 02:29:57PM +0900, Michael Paquier wrote:
> You are adding twelve event points with only 5
> new wait names.  Couldn't it be better to have a one-one mapping
> instead, adding twelve entries in wait_event_names.txt?

No, I think the patch's level of detail is better.  One shouldn't expect the
two ldap_simple_bind_s() calls to have different-enough performance
characteristics to justify exposing that level of detail to the DBA.
ldap_search_s() and InitializeLDAPConnection() differ more, but the DBA mostly
just needs to know the scale of their LDAP responsiveness problem.

(Someday, it might be good to expose the file:line and/or backtrace associated
with a wait, like we do for ereport().  As a way to satisfy rare needs for
more detail, I'd prefer that over giving every pgstat_report_wait_start() a
different name.)




Re: Use read streams in pg_visibility

2024-09-09 Thread Noah Misch
On Mon, Sep 09, 2024 at 06:25:07PM +0300, Nazir Bilal Yavuz wrote:
> On Thu, 5 Sept 2024 at 18:54, Noah Misch  wrote:
> > On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:
> > > On Wed, 4 Sept 2024 at 21:43, Noah Misch  wrote:
> > > > https://postgr.es/m/caeudqaozv3wty5tv2t29jcwpydbmkbiwqkzd42s2ogzdixp...@mail.gmail.com
> > > > then observed that collect_corrupt_items() was now guaranteed to never detect
> > > > corruption.  I have pushed revert ddfc556 of the pg_visibility.c changes.  For
> > > > the next try, could you add that testing we discussed?
> > >
> > > Do you think that corrupting the visibility map and then seeing if
> > > pg_check_visible() and pg_check_frozen() report something is enough?
> >
> > I think so.  Please check that it would have caught both the blkno bug and the
> > above bug.
> 
> The test and updated patch files are attached. In that test I
> overwrite the visibility map file with its older state. Something like
> this:
> 
> 1- Create the table, then run VACUUM FREEZE on the table.
> 2- Shutdown the server, create a copy of the vm file, restart the server.
> 3- Run the DELETE command on some $random_tuples.
> 4- Shutdown the server, overwrite the vm file with the #2 vm file,
> restart the server.
> 5- pg_check_visible and pg_check_frozen must report $random_tuples as corrupted.

Copying the vm file is a good technique.  In my test runs, this does catch the
"never detect" bug, but it doesn't catch the blkno bug.  Can you make it catch
both?  It's possible my hand-patching to recreate the blkno bug is what went
wrong, so I'm attaching what I used.  It consists of
v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patch plus these
fixes for the "never detect" bug from your v6-0002:

--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -724,4 +724,4 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
{
-   boolcheck_frozen = false;
-   boolcheck_visible = false;
+   boolcheck_frozen = all_frozen;
+   boolcheck_visible = all_visible;
Buffer  buffer;
@@ -757,5 +757,5 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 */
-   if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+   if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
check_frozen = false;
-   if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+   if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
check_visible = false;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92..ac575a1 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -41,6 +41,20 @@ typedef struct corrupt_items
ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_corrupt_items() function.
+ */
+struct collect_corrupt_items_read_stream_private
+{
+   boolall_frozen;
+   boolall_visible;
+   BlockNumber blocknum;
+   BlockNumber nblocks;
+   Relationrel;
+   Buffer *vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -611,6 +625,35 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 }
 
 /*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+                                             void *callback_private_data,
+                                             void *per_buffer_data)
+{
+   struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+   for (; p->blocknum < p->nblocks; p->blocknum++)
+   {
+   boolcheck_frozen = false;
+   boolcheck_visible = false;
+
+   if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+   check_frozen = true;
+   if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+   check_visible = true;
+   if (!check_visible && !check_frozen)
+   continue;
+
+   re

Re: Yet another way for pg_ctl stop to fail on Windows

2024-09-08 Thread Noah Misch
On Sun, Sep 08, 2024 at 06:00:00PM +0300, Alexander Lakhin wrote:
> 07.09.2024 21:11, Noah Misch wrote:

> > > Noah, what do you think of handling this error in line with handling of
> > > ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?
> > > 
> > > I tried the following change:
> > >      switch (GetLastError())
> > >      {
> > >      case ERROR_BROKEN_PIPE:
> > >      case ERROR_BAD_PIPE:
> > > +   case ERROR_PIPE_BUSY:
> > > and saw no issues.
> > That would be a strict improvement over returning EINVAL like we do today.  We
> > do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
> > process exit and ENOMEM-like situations.  While that change is the best thing
> > if the process is exiting, it could silently drop the signal in ENOMEM-like
> > situations.  Consider the following alternative.  If sig==0, just return 0
> > like you propose, because the process isn't completely gone.  Otherwise, sleep
> > and retry the signal, like pgwin32_open_handle() retries after certain errors.
> > What do you think of that?

> I agree with your approach. It looks like Microsoft recommends looping on
> ERROR_PIPE_BUSY: [1] (they say "Calling CallNamedPipe is equivalent to
> calling the CreateFile ..." at [2]).

I see Microsoft suggests WaitNamedPipeA() as opposed to just polling.
WaitNamedPipeA() should be more responsive.  Given how rare this has been, it
likely doesn't matter whether we use WaitNamedPipeA() or polling.  I'd lean
toward whichever makes the code simpler, probably polling.

> So if we aim to not only fix "pg_ctl stop", but to make pgkill() robust,
> it's the way to go, IMHO. I'm not sure about an infinite loop they show,
> I'd vote for a loop with the same characteristics as in
> pgwin32_open_handle().

I agree with bounding the total time of each kill(), like
pgwin32_open_handle() does for open().

> [1] https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipe-client
> [2] https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-callnamedpipea




Re: Yet another way for pg_ctl stop to fail on Windows

2024-09-07 Thread Noah Misch
On Sat, Sep 07, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> With extra logging added, I got:
> ### Stopping node "CIC_2PC_test" using mode fast
> # Running: pg_ctl -D 
> C:\src\postgresql\build/testrun/amcheck_3/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata
> -m fast stop
> waiting for server to shut down..!!!pgkill| GetLastError(): 231
> postmaster (9596) died untimely? res: -1, errno: 22
>  failed
> 
> Thus, CallNamedPipe() in pgkill() returned ERROR_PIPE_BUSY (All pipe
> instances are busy) and it was handled as an unexpected error.
> (The error code 231 returned 10 times out of 10 failures of this ilk for
> me.)

Thanks for discovering that.

> Noah, what do you think of handling this error in line with handling of
> ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?
> 
> I tried the following change:
>     switch (GetLastError())
>     {
>     case ERROR_BROKEN_PIPE:
>     case ERROR_BAD_PIPE:
> +   case ERROR_PIPE_BUSY:
> and saw no issues.

That would be a strict improvement over returning EINVAL like we do today.  We
do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
process exit and ENOMEM-like situations.  While that change is the best thing
if the process is exiting, it could silently drop the signal in ENOMEM-like
situations.  Consider the following alternative.  If sig==0, just return 0
like you propose, because the process isn't completely gone.  Otherwise, sleep
and retry the signal, like pgwin32_open_handle() retries after certain errors.
What do you think of that?
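
In pgkill() terms, that alternative would look roughly like this (the retry
count, delay, busy_retries variable, and retry label are illustrative, not
tested code):

    case ERROR_PIPE_BUSY:
        if (sig == 0)
            return 0;           /* pipe exists, so the process does too */
        if (++busy_retries < 30)
        {
            pg_usleep(100000);  /* 100ms, as pgwin32_open_handle() does */
            goto retry;         /* reissue the CallNamedPipe() */
        }
        errno = EINVAL;
        return -1;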




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-06 Thread Noah Misch
On Fri, Sep 06, 2024 at 07:37:09PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Fri, Sep 06, 2024 at 06:36:53PM -0400, Tom Lane wrote:
> >> I feel like all of these are leaning heavily on users to get it right,
> 
> > What do you have in mind?  I see things for the pg_upgrade implementation to
> > get right, but I don't see things for pg_upgrade users to get right.
> 
> Well, yeah, if you are willing to put pg_upgrade in charge of
> executing ALTER EXTENSION UPDATE, then that would be a reasonably
> omission-proof path.  But we have always intended the pg_upgrade
> process to *not* auto-update extensions, so I'm concerned about
> the potential side-effects of drilling a hole in that policy.
> As an example: even if we ensure that pg_trgm 1.6 to 1.7 is totally
> transparent except for this fix, what happens if the user's old
> database is still on some pre-1.6 version?  Is it okay to force an
> update that includes feature upgrades?

Those are disadvantages of some of the options.  I think it could be okay to
inject the mandatory upgrade here or just tell the user to update to 1.7
before attempting the upgrade (if not at 1.7, pg_upgrade would fail with an
error message to that effect).  Your counterproposal avoids the issue, and I'm
good with that design.




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-06 Thread Noah Misch
On Fri, Sep 06, 2024 at 06:36:53PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > Yes, that's one way to make it work.  If we do it that way, it would be
> > critical to make the ALTER EXTENSION UPDATE run before anything uses the
> > index.  Otherwise, we'll run new v18 "signed char" code on a v17 "unsigned
> > char" data file.  Running ALTER EXTENSION UPDATE early enough should be
> > feasible, so that's fine.  Some other options:
> 
> > - If v18 "pg_dump -b" decides to emit CREATE INDEX ... gin_trgm_ops_unsigned,
> >   then make it also emit the statements to create the opclass.
> 
> > - Ship pg_trgm--1.6--1.7.sql in back branches.  If the upgrade wants to use
> >   gin_trgm_ops_unsigned, require the user to ALTER EXTENSION UPDATE first.
> >   (In back branches, the C code behind gin_trgm_ops_unsigned could just raise
> >   an error if called.)
> 
> I feel like all of these are leaning heavily on users to get it right,

What do you have in mind?  I see things for the pg_upgrade implementation to
get right, but I don't see things for pg_upgrade users to get right.

The last option is disruptive for users of unsigned char platforms, since it
requires a two-step upgrade if upgrading from v11 or earlier.  The other two
don't have that problem.

> and therefore have a significant chance of breaking use-cases that
> were perfectly fine before.
> 
> Remind me of why we can't do something like this:
> 
> * Add APIs that allow opclasses to read/write some space in the GIN
> metapage.  (We could get ambitious and add such APIs for other AMs
> too, but doing it just for GIN is probably a prudent start.)  Ensure
> that this space is initially zeroed.
> 
> * In gin_trgm_ops, read a byte of this space and interpret it as
>   0 = unset
>   1 = signed chars
>   2 = unsigned chars
> If the value is zero, set the byte on the basis of the native
> char-signedness of the current platform.  (Obviously, a
> secondary server couldn't write the byte, and would have to
> wait for the primary to update the value.  In the meantime,
> it's no more broken than today.)
> 
> * Subsequently, use either signed or unsigned comparisons
> based on that value.
> 
> This would automatically do the right thing in all cases that
> work today, and it'd provide the ability for cross-platform
> replication to work in future.  You can envision cases where
> transferring a pre-existing index to a platform of the other
> stripe would misbehave, but they're the same cases that fail
> today, and the fix remains the same: reindex.

No reason you couldn't do it that way.  Works for me.
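
For concreteness, the read side of that might look roughly like this (the
metapage-space accessors are hypothetical names for the proposed API; the
0/1/2 encoding comes from the proposal above):

    uint8   sign_byte = GinOpclassGetByte(index);   /* hypothetical API */

    if (sign_byte == 0 && !RecoveryInProgress())
    {
        /* unset: record this platform's native char signedness */
    #if CHAR_MIN != 0
        sign_byte = 1;          /* signed chars */
    #else
        sign_byte = 2;          /* unsigned chars */
    #endif
        GinOpclassSetByte(index, sign_byte);        /* hypothetical API */
    }

    bool    use_signed_compare = (sign_byte == 1);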





Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-06 Thread Noah Misch
On Fri, Sep 06, 2024 at 02:07:10PM -0700, Masahiko Sawada wrote:
> On Fri, Aug 30, 2024 at 8:10 PM Noah Misch  wrote:
> > On Thu, Aug 29, 2024 at 03:48:53PM -0500, Masahiko Sawada wrote:
> > > On Sun, May 19, 2024 at 6:46 AM Noah Misch  wrote:
> > > > If I were standardizing pg_trgm on one or the other notion of "char", I would
> > > > choose signed char, since I think it's still the majority.  More broadly, I
> > > > see these options to fix pg_trgm:
> > > >
> > > > 1. Change to signed char.  Every arm64 system needs to scan pg_trgm indexes.
> > > > 2. Change to unsigned char.  Every x86 system needs to scan pg_trgm indexes.
> > >
> > > Even though it's true that signed char systems are the majority, it
> > > would not be acceptable to force the need to scan pg_trgm indexes on
> > > unsigned char systems.
> > >
> > > > 3. Offer both, as an upgrade path.  For example, pg_trgm could have separate
> > > >    operator classes gin_trgm_ops and gin_trgm_ops_unsigned.  Running
> > > >    pg_upgrade on an unsigned-char system would automatically map v17
> > > >    gin_trgm_ops to v18 gin_trgm_ops_unsigned.  This avoids penalizing any
> > > >    architecture with upgrade-time scans.
> > >
> > > Very interesting idea. How can new v18 users use the correct operator
> > > class? I don't want to require users to specify the correct signed or
> > > unsigned operator classes when creating a GIN index. Maybe we need to
> >
> > In brief, it wouldn't matter which operator class new v18 indexes use.  The
> > documentation would focus on gin_trgm_ops and also say something like:
> >
> >   There's an additional operator class, gin_trgm_ops_unsigned.  It behaves
> >   exactly like gin_trgm_ops, but it uses a deprecated on-disk representation.
> >   Use gin_trgm_ops in new indexes, but there's no disadvantage from continuing
> >   to use gin_trgm_ops_unsigned.  Before PostgreSQL 18, gin_trgm_ops used a
> >   platform-dependent representation.  pg_upgrade automatically uses
> >   gin_trgm_ops_unsigned when upgrading from source data that used the
> >   deprecated representation.
> >
> > What concerns might users have, then?  (Neither operator class would use plain
> > "char" in a context that affects on-disk state.  They'll use "signed char" and
> > "unsigned char".)
> 
> I think I understand your idea now. Since gin_trgm_ops will use
> "signed char", there is no impact for v18 users -- they can continue
> using gin_trgm_ops.

Right.

> But how does pg_upgrade use gin_trgm_ops_unsigned? This opclass will
> be created by executing the pg_trgm script for v18, but it isn't
> executed during pg_upgrade. Another way would be to do these opclass
> replacement when executing the pg_trgm's update script (i.e., 1.6 to
> 1.7).

Yes, that's one way to make it work.  If we do it that way, it would be
critical to make the ALTER EXTENSION UPDATE run before anything uses the
index.  Otherwise, we'll run new v18 "signed char" code on a v17 "unsigned
char" data file.  Running ALTER EXTENSION UPDATE early enough should be
feasible, so that's fine.  Some other options:

- If v18 "pg_dump -b" decides to emit CREATE INDEX ... gin_trgm_ops_unsigned,
  then make it also emit the statements to create the opclass.

- Ship pg_trgm--1.6--1.7.sql in back branches.  If the upgrade wants to use
  gin_trgm_ops_unsigned, require the user to ALTER EXTENSION UPDATE first.
  (In back branches, the C code behind gin_trgm_ops_unsigned could just raise
  an error if called.)

What other options exist?  I bet there are more.
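
For the back-branch option, the stub could be as small as this sketch (the
function name is illustrative, not an existing symbol):

	Datum
	gin_trgm_ops_unsigned_stub(PG_FUNCTION_ARGS)
	{
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("gin_trgm_ops_unsigned requires ALTER EXTENSION pg_trgm UPDATE")));
		PG_RETURN_VOID();		/* keep compiler quiet */
	}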




Re: race condition in pg_class

2024-09-06 Thread Noah Misch
On Fri, Sep 06, 2024 at 03:22:48PM +0530, Nitin Motiani wrote:
> Thanks. I have no other comments.

https://commitfest.postgresql.org/49/5090/ remains in status="Needs review".
When someone moves it to status="Ready for Committer", I will commit
inplace090, inplace110, and inplace120 patches.  If one of you is comfortable
with that, please modify the status.




Re: race condition in pg_class

2024-09-05 Thread Noah Misch
On Thu, Sep 05, 2024 at 07:10:04PM +0530, Nitin Motiani wrote:
> On Thu, Sep 5, 2024 at 1:27 AM Noah Misch  wrote:
> > On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> > >  Assert(rel->ri_needsLockTagTuple == IsInplaceUpdateRelation(rel->relationDesc));
> > >
> > > This can safeguard against users of ResultRelInfo missing this field.
> >
> > v10 does the rename and adds that assertion.  This question remains open:
> 
> Looks good. A couple of minor comments :
> 1. In the inplace110 commit message, there are still references to
> heap_inplace_update. Should it be clarified that the function has been
> renamed?

PGXN has only one caller of this function, so I think that wouldn't help
readers enough.  If someone gets a compiler error about the old name, they'll
figure it out without commit log guidance.  If a person doesn't get a compiler
error, they didn't need to read about the fact of the rename.

> 2. Should there be a comment above the ri_needLockTagTuple definition in
> execNodes.h that we are caching this value to avoid function calls to
> IsInPlaceUpdateRelation for every tuple? Similar to how the comment
> above ri_TrigFunctions mentions that it is cached lookup info.

Current comment:

/* updates do LockTuple() before oldtup read; see README.tuplock */
bool		ri_needLockTagTuple;

Once the comment doesn't fit in one line, pgindent rules make it take a
minimum of four lines.  I don't think words about avoiding function calls
would add enough value to justify the vertical space, because a person
starting to remove it would see where it's called.  That's not to say the
addition would be negligent.  If someone else were writing the patch and had
included that, I wouldn't be deleting the material.




Re: Use read streams in pg_visibility

2024-09-05 Thread Noah Misch
On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 4 Sept 2024 at 21:43, Noah Misch  wrote:
> > https://postgr.es/m/caeudqaozv3wty5tv2t29jcwpydbmkbiwqkzd42s2ogzdixp...@mail.gmail.com
> > then observed that collect_corrupt_items() was now guaranteed to never detect
> > corruption.  I have pushed revert ddfc556 of the pg_visibility.c changes.  For
> > the next try, could you add that testing we discussed?
> 
> Do you think that corrupting the visibility map and then seeing if
> pg_check_visible() and pg_check_frozen() report something is enough?

I think so.  Please check that it would have caught both the blkno bug and the
above bug.




Re: race condition in pg_class

2024-09-04 Thread Noah Misch
On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> How about this alternative then? The tuple length check
> and the elog(ERROR) get their own function. Something like
> heap_inplace_update_validate or
> heap_inplace_update_validate_tuple_length. So in that case, it would
> look like this:
> 
>   genam.c:systable_inplace_update_finish
>     heapam.c:heap_inplace_update_validate/heap_inplace_update_precheck
>     PreInplace_Inval
>     START_CRIT_SECTION
>     heapam.c:heap_inplace_update
>     BUFFER_LOCK_UNLOCK
>     AtInplace_Inval
>     END_CRIT_SECTION
>     UnlockTuple
>     AcceptInvalidationMessages
> 
> This is starting to get complicated though so I don't have any issues
> with just renaming the heap_inplace_update to
> heap_inplace_update_and_unlock.

Complexity aside, I don't see the _precheck design qualifying as a modularity
improvement.

>  Assert(rel->ri_needsLockTagTuple == IsInplaceUpdateRelation(rel->relationDesc));
> 
> This can safeguard against users of ResultRelInfo missing this field.

v10 does the rename and adds that assertion.  This question remains open:

On Thu, Aug 22, 2024 at 12:32:00AM -0700, Noah Misch wrote:
> On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > How many of those for RELKIND_INDEX vs tables? I'm thinking if we should
> > always require a tuple lock on indexes, if that would make a difference.
> 
> Three sites.  See attached inplace125 patch.  Is it a net improvement?  If so,
> I'll squash it into inplace120.

If nobody has an opinion, I'll discard inplace125.  I feel it's not a net
improvement, but either way is fine with me.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by Heikki Linnakangas.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 83b99a9..51d52c4 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2255,6 +2255,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 			locallock->numLockOwners = 0;
 		}
 
+#ifdef USE_ASSERT_CHECKING
+
+		/*
+		 * Tuple locks are currently held only for short durations within a
+		 * transaction. Check that we didn't forget to release one.
+		 */
+		if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+			elog(WARNING, "tuple lock held at commit");
+#endif
+
 		/*
 		 * If the lock or proclock pointers are NULL, this lock was taken via
 		 * the relation fast-path (and is not known to have been transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by Heikki Linnakangas, Nitin Motiani and Alexander Lakhin.

Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8b..ddb2def 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,14 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Reading inplace-updated columns
+-------------------------------
+
+Inplace updates create an exception to the rule that tuple data won't change
+under a reader holding a pin.  A reader of a heap_fetch() result 

Re: GetRelationPath() vs critical sections

2024-09-04 Thread Noah Misch
On Wed, Sep 04, 2024 at 11:58:33AM -0400, Andres Freund wrote:
> In general emitting a LOG inside a critical section isn't a huge issue - we
> made sure that elog.c has a reserve of memory to be able to log without
> crashing.
> 
> However, the current message for buffer IO issues use relpath*() (ending up in
> a call to GetRelationPath()). Which in turn uses psprintf() to generate the
> path. Which in turn violates the no-memory-allocations-in-critical-sections
> rule, as the containing memory context will typically not have
> ->allowInCritSection == true.
> 
> It's not obvious to me what the best way to deal with this is.
> 
> One idea I had was to add an errrelpath() that switches to
> edata->assoc_context before calling relpath(), but that would end up leaking
> memory, as FreeErrorDataContents() wouldn't know about the allocation.
> 
> Obviously we could add a version of GetRelationPath() that just prints into a
> caller provided buffer - but that's somewhat awkward API wise.

Agreed.

> A third approach would be to have a dedicated memory context for this kind of
> thing that's reset after logging the message - but that comes awkwardly close
> to duplicating ErrorContext.

That's how I'd try to do it.  Today's ErrorContext is the context for
allocations FreeErrorDataContents() knows how to find.  The new context would
be for calling into arbitrary code unknown to FreeErrorDataContents().  Most
of the time, we'd avoid reset work for the new context, since it would have no
allocations.

Ideally, errstart() would switch to the new context before returning true, and
errfinish() would switch back.  That way, you could just call relpath*()
without an errrelpath() to help.  This does need functions called in ereport()
argument lists to not use CurrentMemoryContext for allocations that need to
survive longer.  I'd not be concerned about imposing that in a major release.
What obstacles would arise if we did that?

> I wonder if we're lacking a bit of infrastructure here...

Conceptually, the ereport() argument list should be a closure that runs in a
suitable mcxt.  I think we're not far from the goal.
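
A rough sketch of that, where ErrorArgContext and the saved_context field are
hypothetical additions and the existing function bodies are elided:

	bool
	errstart(int elevel, const char *domain)
	{
		/* [existing early exits and stack-depth checks] */
		errordata[errordata_stack_depth].saved_context =
			MemoryContextSwitchTo(ErrorArgContext);		/* hypothetical */
		return true;
	}

	void
	errfinish(const char *filename, int lineno, const char *funcname)
	{
		MemoryContextSwitchTo(errordata[errordata_stack_depth].saved_context);
		/* [existing assembly and emission of the report] */
		MemoryContextReset(ErrorArgContext);	/* usually a no-op */
	}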




Re: Use read streams in pg_visibility

2024-09-04 Thread Noah Misch
On Tue, Sep 03, 2024 at 10:46:34PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 3 Sept 2024 at 22:20, Noah Misch  wrote:
> > On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:
> > > Pushed with some other cosmetic changes.
> 
> Thanks!
> 
> > I see I pushed something unacceptable under ASAN.  I will look into that:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20
> 
> I think it is related to the scope of BlockRangeReadStreamPrivate in
> the collect_visibility_data() function. Attached a small fix for that.

Right.

https://postgr.es/m/caeudqaozv3wty5tv2t29jcwpydbmkbiwqkzd42s2ogzdixp...@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption.  I have pushed revert ddfc556 of the pg_visibility.c changes.  For
the next try, could you add that testing we discussed?




Re: race condition in pg_class

2024-09-03 Thread Noah Misch
On Tue, Sep 03, 2024 at 09:24:52PM +0530, Nitin Motiani wrote:
> On Sat, Aug 31, 2024 at 6:40 AM Noah Misch  wrote:
> > On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> > > On Thu, Aug 29, 2024 at 8:11 PM Noah Misch  wrote:
> > > > - In the cancel case, call both systable_inplace_update_cancel and
> > > >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> > > >   _end would own systable_endscan().
> > > >
> > > What happens to CacheInvalidateHeapTuple() in this approach?  I think
> > > it will still need to be brought to the genam.c layer if we are
> > > releasing the lock in systable_inplace_update_finish.

> My understanding is that the code in this approach would look like below:
> if (dirty)
>    systable_inplace_update_finish(inplace_state, tup);
> else
>    systable_inplace_update_cancel(inplace_state);
> systable_inplace_update_end(inplace_state);
> 
> And that in this structure, both _finish and _cancel will call
> heap_inplace_unlock and then _end will call systable_endscan. So even
> with this structure, the invalidation has to happen inside _finish
> after the unlock.

Right.

> So this also pulls the invalidation to the genam.c
> layer. Am I understanding this correctly?

Compared to the v9 patch, the "call both" alternative would just move the
systable_endscan() call to a new systable_inplace_update_end().  It wouldn't
move anything across the genam:heapam boundary.
systable_inplace_update_finish() would remain a thin wrapper around a heapam
function.

> > > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > > >   tolerable now, this gets less attractive after the inplace160 patch from
> > > >   https://postgr.es/m/flat/20240523000548.58.nmi...@google.com
> > > >
> > > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > > becomes less attractive with the patch. I see there is a new
> > > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > > while holding the lock. And then there is an
> > > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > > layer too. Is the concern that one invalidation call will be in the
> > > heapam layer and the other will be in the genam layer?
> >
> > That, or a critical section would start in heapam.c, then end in genam.c.
> > Current call tree at inplace160 v4:

> How about this alternative?
> 
>  genam.c:systable_inplace_update_finish
>    PreInplace_Inval
>    START_CRIT_SECTION
>    heapam.c:heap_inplace_update
>    BUFFER_LOCK_UNLOCK
>    AtInplace_Inval
>    END_CRIT_SECTION
>    UnlockTuple
>    AcceptInvalidationMessages

> Looking at inplace160, it seems that the start of the critical section
> is right after PreInplace_Inval. So why not pull START_CRIT_SECTION
> and END_CRIT_SECTION out to the genam.c layer?

heap_inplace_update() has an elog(ERROR) that needs to happen outside any
critical section.  Since the condition for that elog deals with tuple header
internals, it belongs at the heapam layer more than the systable layer.

> Alternatively since
> heap_inplace_update is commented as a subroutine of
> systable_inplace_update_finish, should everything just be moved to the
> genam.c layer? Although it looks like you already considered and
> rejected this approach.

Calling XLogInsert(RM_HEAP_ID) in genam.c would be a worse modularity
violation than the one that led to the changes between v8 and v9.  I think
even calling CacheInvalidateHeapTuple() in genam.c would be a worse modularity
violation than the one attributed to v8.  Modularity would have the
heap_inplace function resemble heap_update() handling of invals.

> If the above alternatives are not possible, it's probably fine to go
> ahead with the current patch with the function renamed to
> heap_inplace_update_and_unlock (or something similar) as mentioned
> earlier?

I like that name.  The next version will use it.

> > > I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> > > we can just call
> > > IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> > > big advantage of storing a separate bool field? Also there is another
> >
> > No, not a big advantage.  I felt it was more in line with the typical style of
> > src/backend/executor.

> just find it a smell that there are two fields which are not
> independent and they go out of sync. And that's why my preference is
> to not have a dependent field unless there is a specific adv

Re: Use read streams in pg_visibility

2024-09-03 Thread Noah Misch
On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:
> On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:
> > Thanks, updated patches are attached.
> 
> > +/*
> > + * Ask the callback which block it would like us to read next, with a small
> > + * buffer in front to allow read_stream_unget_block() to work and to allow the
> > + * fast path to skip this function and work directly from the array.
> >   */
> >  static inline BlockNumber
> >  read_stream_get_block(ReadStream *stream, void *per_buffer_data)
> 
> v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
> this update to the read_stream_get_block() header comment, but we're not
> changing read_stream_get_block() here.  I reverted this.
> 
> Pushed with some other cosmetic changes.

I see I pushed something unacceptable under ASAN.  I will look into that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20




Re: Use read streams in pg_visibility

2024-09-03 Thread Noah Misch
On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:
> Thanks, updated patches are attached.

> +/*
> + * Ask the callback which block it would like us to read next, with a small
> + * buffer in front to allow read_stream_unget_block() to work and to allow the
> + * fast path to skip this function and work directly from the array.
>   */
>  static inline BlockNumber
>  read_stream_get_block(ReadStream *stream, void *per_buffer_data)

v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
this update to the read_stream_get_block() header comment, but we're not
changing read_stream_get_block() here.  I reverted this.

Pushed with some other cosmetic changes.




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-08-30 Thread Noah Misch
On Thu, Aug 29, 2024 at 03:48:53PM -0500, Masahiko Sawada wrote:
> On Sun, May 19, 2024 at 6:46 AM Noah Misch  wrote:
> > If I were standardizing pg_trgm on one or the other notion of "char", I would
> > choose signed char, since I think it's still the majority.  More broadly, I
> > see these options to fix pg_trgm:
> >
> > 1. Change to signed char.  Every arm64 system needs to scan pg_trgm indexes.
> > 2. Change to unsigned char.  Every x86 system needs to scan pg_trgm indexes.
> 
> Even though it's true that signed char systems are the majority, it
> would not be acceptable to force the need to scan pg_trgm indexes on
> unsigned char systems.
> 
> > 3. Offer both, as an upgrade path.  For example, pg_trgm could have separate
> >    operator classes gin_trgm_ops and gin_trgm_ops_unsigned.  Running
> >    pg_upgrade on an unsigned-char system would automatically map v17
> >    gin_trgm_ops to v18 gin_trgm_ops_unsigned.  This avoids penalizing any
> >    architecture with upgrade-time scans.
> 
> Very interesting idea. How can new v18 users use the correct operator
> class? I don't want to require users to specify the correct signed or
> unsigned operator classes when creating a GIN index. Maybe we need to

In brief, it wouldn't matter which operator class new v18 indexes use.  The
documentation would focus on gin_trgm_ops and also say something like:

  There's an additional operator class, gin_trgm_ops_unsigned.  It behaves
  exactly like gin_trgm_ops, but it uses a deprecated on-disk representation.
  Use gin_trgm_ops in new indexes, but there's no disadvantage from continuing
  to use gin_trgm_ops_unsigned.  Before PostgreSQL 18, gin_trgm_ops used a
  platform-dependent representation.  pg_upgrade automatically uses
  gin_trgm_ops_unsigned when upgrading from source data that used the
  deprecated representation.

What concerns might users have, then?  (Neither operator class would use plain
"char" in a context that affects on-disk state.  They'll use "signed char" and
"unsigned char".)

> dynamically use the correct compare function for the same operator
> class depending on the char signedness. But is it possible to do it on
> the extension (e.g. pg_trgm) side?

No, I don't think the extension can do that cleanly.  It would need to store
the signedness in the index somehow, and GIN doesn't call the opclass at a
time facilitating that.  That would need help from the core server.




Re: race condition in pg_class

2024-08-30 Thread Noah Misch
On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> On Thu, Aug 29, 2024 at 8:11 PM Noah Misch  wrote:
> > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > My order of preference is: 2, 1, 3.
> >
> > I kept tuple locking responsibility in heapam.c.  That's simpler and better
> > for modularity, but it does mean we release+acquire after any xmax wait.
> > Before, we avoided that if the next genam.c scan found the same TID.  (If 
> > the
> > next scan finds the same TID, the xmax probably aborted.)  I think DDL 
> > aborts
> > are rare enough to justify simplifying as this version does.  I don't expect
> > anyone to notice the starvation outside of tests built to show it.  (With
> > previous versions, one can show it with a purpose-built test that commits
> > instead of aborting, like the "001_pgbench_grant@9" test.)
> >
> > This move also loses the optimization of unpinning before 
> > XactLockTableWait().
> > heap_update() doesn't optimize that way, so that's fine.
> >
> > The move ended up more like (1), though I did do
> > s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
> > felt that worked better than (2) to achieve lock release before
> > CacheInvalidateHeapTuple().  Alternatives that could be fine:
> >
> From a consistency point of view, I find it cleaner if we can have all
> the heap_inplace_lock and heap_inplace_unlock in the same set of
> functions. So here those would be the systable_inplace_* functions.

That will technically be the case after inplace160, and I could make it so
here by inlining heap_inplace_unlock() into its heapam.c caller.  Would that
be cleaner or less clean?

> > - In the cancel case, call both systable_inplace_update_cancel and
> >   systable_inplace_update_end.  _finish or _cancel would own unlock, while
> >   _end would own systable_endscan().
> >
> What happens to CacheInvalidateHeapTuple() in this approach?  I think
> it will still need to be brought to the genam.c layer if we are
> releasing the lock in systable_inplace_update_finish.

Cancel scenarios don't do invalidation.  (Same under other alternatives.)

> > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> >   tolerable now, this gets less attractive after the inplace160 patch from
> >   https://postgr.es/m/flat/20240523000548.58.nmi...@google.com
> >
> I skimmed through the inplace160 patch. It wasn't clear to me why this
> becomes less attractive with the patch. I see there is a new
> CacheInvalidateHeapTupleInPlace but that looks like it would be called
> while holding the lock. And then there is an
> AcceptInvalidationMessages which can perhaps be moved to the genam.c
> layer too. Is the concern that one invalidation call will be in the
> heapam layer and the other will be in the genam layer?

That, or a critical section would start in heapam.c, then end in genam.c.
Current call tree at inplace160 v4:

genam.c:systable_inplace_update_finish
 heapam.c:heap_inplace_update
  PreInplace_Inval
  START_CRIT_SECTION
  BUFFER_LOCK_UNLOCK
  AtInplace_Inval
  END_CRIT_SECTION
  UnlockTuple
  AcceptInvalidationMessages

If we hoisted all of invalidation up to the genam.c layer, a critical section
that starts in heapam.c would end in genam.c:

genam.c:systable_inplace_update_finish
 PreInplace_Inval
 heapam.c:heap_inplace_update
  START_CRIT_SECTION
 BUFFER_LOCK_UNLOCK
 AtInplace_Inval
 END_CRIT_SECTION
 UnlockTuple
 AcceptInvalidationMessages

If we didn't accept splitting the critical section but did accept splitting
invalidation responsibilities, one gets perhaps:

genam.c:systable_inplace_update_finish
 PreInplace_Inval
 heapam.c:heap_inplace_update
  START_CRIT_SECTION
  BUFFER_LOCK_UNLOCK
  AtInplace_Inval
  END_CRIT_SECTION
  UnlockTuple
 AcceptInvalidationMessages

That's how I ended up at inplace120 v9's design.

> Also I have a small question from inplace120.
> 
> I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> we can just call
> IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> big advantage of storing a separate bool field? Also there is another

No, not a big advantage.  I felt it was more in line with the typical style of
src/backend/executor.

> write to ri_RelationDesc in CatalogOpenIndexes in
> src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to
> be set there also to keep it consistent with ri_RelationDesc. Please
> let me know if I am missing something about the usage of the new
> field.

Can you say more about consequences you found?

Only the full executor reads the field, doing so when it fetches the most
recent versio

Re: Inval reliability, especially for inplace updates

2024-08-30 Thread Noah Misch
On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:
> On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:
> > On 2024-06-17 16:58:54 -0700, Noah Misch wrote:
> > > On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:
> > > > On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:
> > > > > https://postgr.es/m/20240512232923.aa.nmi...@google.com wrote:
> > > > > > Separable, nontrivial things not fixed in the attached patch stack:
> > > > > >
> > > > > > - Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
> > > > > >   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
> > > > > >   still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.
> > > > >
> > > > > I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
> > > > > inside the critical section.  Send it in heap_xlog_inplace(), too.
> > 
> > I'm worried this might cause its own set of bugs, e.g. if there are any places
> > that, possibly accidentally, rely on the invalidation from the inplace update
> > to also cover separate changes.
> 
> Good point.  I do have index_update_stats() still doing an ideally-superfluous
> relcache update for that reason.  Taking that further, it would be cheap
> insurance to have the inplace update do a transactional inval in addition to
> its immediate inval.  Future master-only work could remove the transactional
> one.  How about that?

Restoring the transactional inval seemed good to me, so I've rebased and
included that.  This applies on top of three patches from
https://postgr.es/m/20240822073200.4f.nmi...@google.com.  I'm attaching those
to placate cfbot, but this thread is for review of the last patch only.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..e5e7ab5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
 			locallock->numLockOwners = 0;
 		}
 
+#ifdef USE_ASSERT_CHECKING
+
+		/*
+		 * Tuple locks are currently held only for short durations within a
+		 * transaction. Check that we didn't forget to release one.
+		 */
+		if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+			elog(WARNING, "tuple lock held at commit");
+#endif
+
 		/*
 		 * If the lock or proclock pointers are NULL, this lock was taken via
 		 * the relation fast-path (and is not known to have been transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by Heikki Linnakangas and Alexander Lakhin.

Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8b..ddb2def 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,14 @@ The following infomask bits are applicable:

Re: Use read streams in pg_visibility

2024-08-30 Thread Noah Misch
On Tue, Aug 27, 2024 at 10:49:19AM +0300, Nazir Bilal Yavuz wrote:
> On Fri, 23 Aug 2024 at 22:01, Noah Misch  wrote:
> > On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> > > On Tue, 20 Aug 2024 at 21:47, Noah Misch  wrote:
> > > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

> I liked the block_range_read_stream_cb. Attached patches for that
> (first 3 patches). I chose an nblocks way instead of last_blocks in
> the struct.

To read blocks 10 and 11, I would expect to initialize the struct with one of:

{ .first=10, .nblocks=2 }
{ .first=10, .last_inclusive=11 }
{ .first=10, .last_exclusive=12 }

With the patch's API, I would need {.first=10,.nblocks=12}.  The struct field
named "nblocks" behaves like a last_block_exclusive.  Please either make the
behavior an "nblocks" behavior or change the field name to replace the term
"nblocks" with something matching the behavior.  (I used longer field names in
my examples here, to disambiguate those examples.  It's okay if the final
field names aren't those, as long as the field names and the behavior align.)
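
Concretely, one self-consistent shape would be (field names illustrative):

	struct BlockRangeReadStreamPrivate
	{
		BlockNumber current_blocknum;
		BlockNumber last_exclusive;
	};

	static BlockNumber
	block_range_read_stream_cb(ReadStream *stream,
							   void *callback_private_data,
							   void *per_buffer_data)
	{
		struct BlockRangeReadStreamPrivate *p = callback_private_data;

		if (p->current_blocknum < p->last_exclusive)
			return p->current_blocknum++;
		return InvalidBlockNumber;
	}

With that, reading blocks 10 and 11 is {.current_blocknum = 10,
.last_exclusive = 12}, and the field names match the behavior.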

> > > > The callback doesn't return blocks having zero vm bits, so the blkno variable
> > > > is not accurate.  I didn't test, but I think the loop's "Recheck to avoid
> > > > returning spurious results." looks at the bit for the wrong block.  If that's
> > > > what v1 does, could you expand the test file to catch that?  For example, make
> > > > a two-block table with only the second block all-visible.
> > >
> > > Yes, it was not accurate. I am getting blockno from the buffer now. I
> > > checked and confirmed it is working as expected by manually logging
> > > blocknos returned from the read stream. I am not sure how to add a
> > > test case for this.
> >
> > VACUUM FREEZE makes an all-visible, all-frozen table.  DELETE of a particular
> > TID, even if rolled back, clears both vm bits for the TID's page.  Past tests
> > like that had instability problems.  One cause is a concurrent session's XID
> > or snapshot, which can prevent VACUUM setting vm bits.  Using a TEMP table may
> > have been one of the countermeasures, but I don't remember clearly.  Hence,
> > please search the archives or the existing pg_visibility tests for how we
> > dealt with that.  It may not be a problem for this particular test.
> 
> Thanks for the information, I will check these. What I still do not
> understand is how to make sure that only the second block is processed
> and the first one is skipped. pg_check_visible() and pg_check_frozen()
> returns TIDs that cause corruption in the visibility map, there is no
> information about block numbers.

I see what you're saying.  collect_corrupt_items() needs a corrupt table to
report anything; all corruption-free tables get the same output.  Testing this
would need extra C code or techniques like corrupt_page_checksum() to create
the corrupt state.  That wouldn't be a bad thing to have, but it's big enough
that I'll consider it out of scope for $SUBJECT.  With the callback change
above, I'll be ready to push all this.




Re: Use read streams in pg_visibility

2024-08-23 Thread Noah Misch
On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 20 Aug 2024 at 21:47, Noah Misch  wrote:
> > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > > The downside of this approach is there are too many "vmbuffer is valid
> > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > > vmbuffer needs to be read again" cases in the read stream version (700
> > > vs 20 for the 6 GB table). This is caused by the callback function of
> > > the read stream reading a new vmbuf while getting next block numbers.
> > > So, vmbuf is wrong when we are checking visibility map bits that might
> > > have changed while we were acquiring the page lock.
> >
> > Did the test that found 700 "read again" use a different procedure than the
> > one shared as setup.sh down-thread?  Using that script alone, none of the vm
> > bits are set, so I'd not expect any heap page reads.
> 
> Yes, it is basically the same script but the query is 'SELECT
> pg_check_visible('${TABLE}'::regclass);'.

I gather the scripts deal exclusively in tables with no vm bits set, so
pg_visibility did no heap page reads under those scripts.  But the '700 "read
again"' result suggests either I'm guessing wrong, or that result came from a
different test procedure.  Thoughts?

> > The "vmbuffer needs to be read again" regression fits what I would expect 
> > the
> > v1 patch to do with a table having many vm bits set.  In general, I think 
> > the
> > fix is to keep two heap Buffer vars, so the callback can work on one 
> > vmbuffer
> > while collect_corrupt_items() works on another vmbuffer.  Much of the time,
> > they'll be the same buffer.  It could be as simple as that, but you could
> > consider further optimizations like these:
> 
> Thanks for the suggestion. Keeping two vmbuffers lowered the count of
> "read-again" cases to ~40. I ran the test again and the timing
> improvement is ~5% now.

> I think the number of "read-again" cases are low enough now.

Agreed.  The increase from 20 to 40 is probably an increase in buffer mapping
lookups, not an increase in I/O.
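
For the archives, the two-Buffer shape under discussion is roughly this
(sketch; names illustrative):

	/* callback side: scans ahead using its own vm buffer */
	if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, &p->vmbuffer))
		return p->blocknum++;

	/*
	 * consumer side: rechecks under the heap page lock with a separate vm
	 * buffer, which usually maps to the same vm page
	 */
	if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
		check_frozen = false;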

> Does creating something like
> pg_general_read_stream_next_block() in read_stream code and exporting
> it makes sense?

To me, that name isn't conveying the function's behavior, and the words "pg_"
and "general_" aren't adding information there.  How about one of these names
or similar:

seq_read_stream_cb
sequential_read_stream_cb
block_range_read_stream_cb

> > The callback doesn't return blocks having zero vm bits, so the blkno variable
> > is not accurate.  I didn't test, but I think the loop's "Recheck to avoid
> > returning spurious results." looks at the bit for the wrong block.  If that's
> > what v1 does, could you expand the test file to catch that?  For example, make
> > a two-block table with only the second block all-visible.
> 
> Yes, it was not accurate. I am getting blockno from the buffer now. I
> checked and confirmed it is working as expected by manually logging
> blocknos returned from the read stream. I am not sure how to add a
> test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table.  DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page.  Past tests
like that had instability problems.  One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits.  Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly.  Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that.  It may not be a problem for this particular test.

> There is one behavioral change introduced with the patches. It could
> happen when collect_corrupt_items() is called with both all_visible
> and all_frozen being true.
> -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
> false in callback. Callback returns this block number because
> VM_ALL_FROZEN() is true.
> -> At the /* Recheck to avoid returning spurious results. */ part, we
> should only check VM_ALL_FROZEN() again as this was returning true in
> the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
> VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
> (vice versa of callback).
> -> We were skipping this block in the master but the patched version
> does not skip that.
> 
> Is this a problem? If this is a problem, a solution might be to
> callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
> per_buffer_data.

No, I don't consider that a problem.  That's not making us do additional I/O,
just testing more bits within the pages we're already loading.  The difference
in behavior is user-visible, but it's the same behavior change the user would
get if the timing had been a bit different.




Re: race condition in pg_class

2024-08-22 Thread Noah Misch
On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> On 17/08/2024 07:07, Noah Misch wrote:
> > On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
> > > I wonder if the functions should be called "systable_*" and placed in
> > > genam.c rather than in heapam.c. The interface looks more like the existing
> > > systable functions. It feels like a modularity violation for a function in
> > > heapam.c to take an argument like "indexId", and call back into systable_*
> > > functions.
> > 
> > Yes, _scan() and _cancel() especially are wrappers around systable.  Some API
> > options follow.  Any preference or other ideas?
> > 
> >  direct s/heap_/systable_/ rename [option 1]
> > 
> >   systable_inplace_update_scan([...], &tup, &inplace_state);
> >   if (!HeapTupleIsValid(tup))
> >      elog(ERROR, [...]);
> >   ... [buffer is exclusive-locked; mutate "tup"] ...
> >   if (dirty)
> >      systable_inplace_update_finish(inplace_state, tup);
> >   else
> >      systable_inplace_update_cancel(inplace_state);
> > 
> >  make the first and last steps more systable-like [option 2]
> > 
> >   systable_inplace_update_begin([...], &tup, &inplace_state);
> >   if (!HeapTupleIsValid(tup))
> >      elog(ERROR, [...]);
> >   ... [buffer is exclusive-locked; mutate "tup"] ...
> >   if (dirty)
> >      systable_inplace_update(inplace_state, tup);
> >   systable_inplace_update_end(inplace_state);

> My order of preference is: 2, 1, 3.

I kept tuple locking responsibility in heapam.c.  That's simpler and better
for modularity, but it does mean we release+acquire after any xmax wait.
Before, we avoided that if the next genam.c scan found the same TID.  (If the
next scan finds the same TID, the xmax probably aborted.)  I think DDL aborts
are rare enough to justify simplifying as this version does.  I don't expect
anyone to notice the starvation outside of tests built to show it.  (With
previous versions, one can show it with a purpose-built test that commits
instead of aborting, like the "001_pgbench_grant@9" test.)

This move also loses the optimization of unpinning before XactLockTableWait().
heap_update() doesn't optimize that way, so that's fine.

The move ended up more like (1), though I did do
s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2).  I
felt that worked better than (2) to achieve lock release before
CacheInvalidateHeapTuple().  Alternatives that could be fine:

- In the cancel case, call both systable_inplace_update_cancel and
  systable_inplace_update_end.  _finish or _cancel would own unlock, while
  _end would own systable_endscan().

- Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
  tolerable now, this gets less attractive after the inplace160 patch from
  https://postgr.es/m/flat/20240523000548.58.nmi...@google.com

I made the other changes we discussed, also.

> > > Could we just stipulate that you must always hold LOCKTAG_TUPLE when you
> > > call heap_update() on pg_class or pg_database? That'd make the rule simple.
> > 
> > We could.  That would change more code sites.  Rough estimate:
> > 
> > $ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
> > 23

> How many of those for RELKIND_INDEX vs tables? I'm thinking if we should
> always require a tuple lock on indexes, if that would make a difference.

Three sites.  See attached inplace125 patch.  Is it a net improvement?  If so,
I'll squash it into inplace120.

> > Another option here would be to preface that README section with a simplified
> > view, something like, "If a warning brought you here, take a tuple lock.  The
> > rest of this section is just for people needing to understand the conditions
> > for --enable-casserts emitting that warning."  How about that instead of
> > simplifying the rules?
> 
> Works for me. Or perhaps the rules could just be explained more succinctly.
> Something like:
> 
> -

I largely used your text instead.


While doing these updates, I found an intra-grant-inplace.spec permutation
being flaky on inplace110 but stable on inplace120.  That turned out not to be
v9-specific.  As of patch v1, I now see it was already flaky (~5% failure
here).  I've now added to inplace110 a minimal tweak to stabilize that spec,
which inplace120 removes.

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A 

Re: Use read streams in pg_visibility

2024-08-20 Thread Noah Misch
On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> 2- collect_corrupt_items()
> 
> This one is more complicated. The read stream callback function loops
> until it finds a suitable block to read. So, if the callback returns
> an InvalidBlockNumber; it means that the stream processed all possible
> blocks and the stream should be finished. There is ~3% timing
> improvement with this change. I started the server with the default
> settings and created a 6 GB table. Then run 100 times
> pg_check_visible() by clearing the OS cache between each run.
> 
> The downside of this approach is there are too many "vmbuffer is valid
> but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> vmbuffer needs to be read again" cases in the read stream version (700
> vs 20 for the 6 GB table). This is caused by the callback function of
> the read stream reading a new vmbuf while getting next block numbers.
> So, vmbuf is wrong when we are checking visibility map bits that might
> have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread?  Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set.  In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer.  Much of the time,
they'll be the same buffer.  It could be as simple as that, but you could
consider further optimizations like these:

- Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
  lookup when the other Buffer var already has the block you want.

- [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
  to stop calling the ReadStreamBlockNumberCB for awhile.  This could help if
  nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
  block.  For example, if one block out of every 33000 is all-visible, every
  heap block we visit has a different vmbuffer.  It's likely not optimal for
  the callback to pin and unpin 20 vmbuffers, then have
  collect_corrupt_items() pin and unpin the same 20 vmbuffers.  pg_visibility
  could notice this trend and request a stop of the callbacks until more of
  the heap block work completes.  If pg_visibility is going to be the only
  place in the code with this use case, it's probably not worth carrying the
  extra API just for pg_visibility.  However, if we get a stronger use case
  later, pg_visibility could be another beneficiary.

> +/*
> + * Callback function to get next block for read stream object used in
> + * collect_visibility_data() function.
> + */
> +static BlockNumber
> +collect_visibility_data_read_stream_next_block(ReadStream *stream,
> +                                               void *callback_private_data,
> +                                               void *per_buffer_data)
> +{
> +	struct collect_visibility_data_read_stream_private *p = callback_private_data;
> +
> +	if (p->blocknum < p->nblocks)
> +		return p->blocknum++;
> +
> +	return InvalidBlockNumber;

This is the third callback that just iterates over a block range, after
pg_prewarm_read_stream_next_block() and
copy_storage_using_buffer_read_stream_next_block().  While not a big problem,
I think it's time to have a general-use callback for block range scans.  The
quantity of duplicate code is low, but the existing function names are long
and less informative than a behavior-based name.

> +static BlockNumber
> +collect_corrupt_items_read_stream_next_block(ReadStream *stream,
> +                                             void *callback_private_data,
> +                                             void *per_buffer_data)
> +{
> +	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
> +
> +	for (; p->blocknum < p->nblocks; p->blocknum++)
> +	{
> +		bool		check_frozen = false;
> +		bool		check_visible = false;
> +
> +		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
> +			check_frozen = true;
> +		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
> +			check_visible = true;
> +		if (!check_visible && !check_frozen)
> +			continue;

If a vm has zero bits set, this loop will scan the entire vm before returning.
Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how
VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS().
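
That is, something like this at the top of the loop (sketch):

	for (; p->blocknum < p->nblocks; p->blocknum++)
	{
		/* The whole vm may have zero bits set; stay interruptible. */
		CHECK_FOR_INTERRUPTS();

		/* [VM_ALL_FROZEN/VM_ALL_VISIBLE checks as above] */
	}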

> @@ -687,6 +734,20 @@ collect_corrupt_i

Re: race condition in pg_class

2024-08-16 Thread Noah Misch
Thanks for reviewing.

On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
> On 14/07/2024 20:48, Noah Misch wrote:
> > I've pushed the two patches for your reports.  To placate cfbot, I'm 
> > attaching
> > the remaining patches.
> 
> inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would be in
> order, it looks pretty random as it is. Something like:
> 
> /*
>  * Tuple locks are currently only held for short durations within a
>  * transaction. Check that we didn't forget to release one.
>  */

Will add.

> inplace110-successors-v8.patch: Makes sense.
> 
> The README changes would be better as part of the third patch, as this patch
> doesn't actually do any of the new locking described in the README, and it
> fixes the "inplace update updates wrong tuple" bug even without those tuple
> locks.

That should work.  Will confirm.

> > + * ... [any slow preparation not requiring oldtup] ...
> > + * heap_inplace_update_scan([...], &tup, &inplace_state);
> > + * if (!HeapTupleIsValid(tup))
> > + * elog(ERROR, [...]);
> > + * ... [buffer is exclusive-locked; mutate "tup"] ...
> > + * if (dirty)
> > + * heap_inplace_update_finish(inplace_state, tup);
> > + * else
> > + * heap_inplace_update_cancel(inplace_state);
> 
> I wonder if the functions should be called "systable_*" and placed in
> genam.c rather than in heapam.c. The interface looks more like the existing
> systable functions. It feels like a modularity violation for a function in
> heapam.c to take an argument like "indexId", and call back into systable_*
> functions.

Yes, _scan() and _cancel() especially are wrappers around systable.  Some API
options follow.  Any preference or other ideas?

 direct s/heap_/systable_/ rename

 systable_inplace_update_scan([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
    elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
    systable_inplace_update_finish(inplace_state, tup);
 else
    systable_inplace_update_cancel(inplace_state);

 make the first and last steps more systable-like

 systable_inplace_update_begin([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
    elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
    systable_inplace_update(inplace_state, tup);
 systable_inplace_update_end(inplace_state);

 no systable_ wrapper for middle step, more like CatalogTupleUpdate

 systable_inplace_update_begin([...], &tup, &inplace_state);
 if (!HeapTupleIsValid(tup))
    elog(ERROR, [...]);
 ... [buffer is exclusive-locked; mutate "tup"] ...
 if (dirty)
    heap_inplace_update(relation,
                        systable_inplace_old_tuple(inplace_state),
                        tup,
                        systable_inplace_buffer(inplace_state));
 systable_inplace_update_end(inplace_state);

> > /*----------
> >  * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
> >  *
> >  * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
> >  * ["R" is a VACUUM tbl]
> >  * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
> >  * D: systable_getnext() returns pg_class tuple of tbl
> >  * R: memcpy() into pg_class tuple of tbl
> >  * D: raise pg_database.datfrozenxid, XLogInsert(), finish
> >  * [crash]
> >  * [recovery restores datfrozenxid w/o relfrozenxid]
> >  */
> 
> Hmm, that's a tight race, but feels bad to leave it unfixed. One approach
> would be to modify the tuple on the buffer only after WAL-logging it. That
> way, D cannot read the updated value before it has been WAL logged. Just
> need to make sure that the change still gets included in the WAL record.
> Maybe something like:
> 
> if (RelationNeedsWAL(relation))
> {
> /*
>  * Make a temporary copy of the page that includes the change, in
>  * case a full-page image is logged
>  */
> PGAlignedBlock tmppage;
> 
> memcpy(tmppage.data, page, BLCKSZ);
> 
> /* copy the tuple to the temporary copy */
> memcpy(...);
> 
> XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
> XLogInsert();
> }
> 
> /* copy the tuple to the buffer */
> memcpy(...);

Yes, that's the essence of
inplace180-datfrozenxid-overtakes-relfrozenxid-v1.patch from
https://postgr.es/m/flat/20240620012908.92.nmi...@google.com.

> >   pg_class heap_inplace_update_scan() callers: before the call, acquire
> >   L

Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2024-08-10 Thread Noah Misch
On Sat, Jul 27, 2024 at 07:24:33AM +0900, Michael Paquier wrote:
> I've just applied now the remaining pieces down to 17.

Comparing commit c9e2457 to the patch in zqgvzsbw5tgkq...@paquier.xyz, the
commit lacks the slru.c portion.




Re: Don't overwrite scan key in systable_beginscan()

2024-08-08 Thread Noah Misch
On Thu, Aug 08, 2024 at 08:46:35AM +0200, Peter Eisentraut wrote:
> When systable_beginscan() and systable_beginscan_ordered() choose an index
> scan, they remap the attribute numbers in the passed-in scan keys to the
> attribute numbers of the index, and then write those remapped attribute
> numbers back into the scan key passed by the caller.  This second part is
> surprising and gratuitous.  It means that a scan key cannot safely be used
> more than once (but it might sometimes work, depending on circumstances).
> Also, there is no value in providing these remapped attribute numbers back
> to the caller, since they can't do anything with that.
> 
> I propose to fix that by making a copy of the scan keys passed by the caller
> and make the modifications there.

No objection, but this would obsolete at least some of these comments (the
catcache.c ones if nothing else):

$ git grep -in scankey | grep -i copy
src/backend/access/gist/gistscan.c:257:  * Copy consistent support function to ScanKey structure instead
src/backend/access/gist/gistscan.c:330:  * Copy distance support function to ScanKey structure instead of
src/backend/access/nbtree/nbtutils.c:281:   ScanKey arrayKeyData;   /* modified copy of scan->keyData */
src/backend/access/nbtree/nbtutils.c:3303: * We copy the appropriate indoption value into the scankey sk_flags
src/backend/access/nbtree/nbtutils.c:3318: * It's a bit ugly to modify the caller's copy of the scankey but in practice
src/backend/access/spgist/spgscan.c:385:/* copy scankeys into local storage */
src/backend/utils/cache/catcache.c:1474: * Ok, need to make a lookup in the relation, copy the scankey and
src/backend/utils/cache/catcache.c:1794: * Ok, need to make a lookup in the relation, copy the scankey and
src/backend/utils/cache/relfilenumbermap.c:192: /* copy scankey to local copy, it will be modified during the scan */

> In order to prove to myself that there are no other cases where
> caller-provided scan keys are modified, I went through and const-qualified
> all the APIs.  This works out correctly.  Several levels down in the stack,
> the access methods make their own copy of the scan keys that they store in
> their scan descriptors, and they use those in non-const-clean ways, but
> that's ok, that's their business.  As far as the top-level callers are
> concerned, they can rely on their scan keys to be const after this.

We do still have code mutating IndexScanDescData.keyData, but that's fine.  We
must be copying function args to form IndexScanDescData.keyData, or else your
proof patch would have noticed an assignment of const to non-const.
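
For reference, the copying approach amounts to something like this inside
systable_beginscan() (sketch, not the committed wording):

	if (irel)
	{
		ScanKey		idxkey;
		int			i;

		/* Copy the caller's keys; remap attribute numbers in the copy only. */
		idxkey = palloc_array(ScanKeyData, nkeys);
		memcpy(idxkey, key, sizeof(ScanKeyData) * nkeys);
		for (i = 0; i < nkeys; i++)
		{
			int			j;

			for (j = 0; j < IndexRelationGetNumberOfAttributes(irel); j++)
			{
				if (key[i].sk_attno == irel->rd_index->indkey.values[j])
				{
					idxkey[i].sk_attno = j + 1;
					break;
				}
			}
		}
		sysscan->iscan = index_beginscan(heapRelation, irel,
										 snapshot, nkeys, 0);
		index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0);
	}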




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-08-08 Thread Noah Misch
On Sun, Apr 07, 2024 at 01:22:51AM +0300, Alexander Korotkov wrote:
> I've pushed 0001 and 0002

The partition MERGE (1adf16b8f) and SPLIT (87c21bb94) v17 patches introduced
createPartitionTable() with this code:

createStmt->relation = newPartName;
...
wrapper->utilityStmt = (Node *) createStmt;
...
ProcessUtility(wrapper,
...
newRel = table_openrv(newPartName, NoLock);

This breaks from the CVE-2014-0062 (commit 5f17304) principle of not repeating
name lookups.  The attached demo uses this defect to make one partition have
two parents.
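
One way to retain that principle here is to capture the new relation's OID at
creation time and open by OID, e.g. (sketch):

	ObjectAddress address;

	/* create the partition, remembering its OID ... */
	address = DefineRelation(createStmt, RELKIND_RELATION, ownerId,
							 NULL, queryString);

	/* ... then open by OID, immune to a concurrent search_path swap */
	newRel = table_open(address.objectId, NoLock);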
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index ae2efdc..654b502 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -3495,7 +3495,11 @@ StorePartitionBound(Relation rel, Relation parent, PartitionBoundSpec *bound)
elog(ERROR, "cache lookup failed for relation %u",
 RelationGetRelid(rel));
 
-#ifdef USE_ASSERT_CHECKING
+   /*
+    * Assertion fails during partition getting multiple parents.  Disable the
+    * assertion, to see what non-assert builds experience.
+    */
+#if 0
{
Form_pg_class classForm;
boolisnull;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8fcb188..48207f9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -95,6 +95,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/injection_point.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -15825,7 +15826,11 @@ MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel, bool ispart
 */
if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
-   Assert(child_att->attinhcount == 1);
+   /*
+    * Assertion fails during partition getting multiple parents.
+    * Disable, to see what non-assert builds experience.
+    */
+   /* Assert(child_att->attinhcount == 1); */
child_att->attislocal = false;
}
 
@@ -20388,7 +20393,14 @@ createPartitionTable(RangeVar *newPartName, Relation modelRel,
 * Open the new partition with no lock, because we already have
 * AccessExclusiveLock placed there after creation.
 */
-   newRel = table_openrv(newPartName, NoLock);
+   INJECTION_POINT("merge-after-create");
+   /*
+* For testing, switch to taking a lock.  This solves two problems.
+* First, it gets an AcceptInvalidationMessages(), so we actually
+* invalidate the search path.  Second, it avoids an assertion failure
+* from our lack of lock, so we see what non-assert builds experience.
+*/
+   newRel = table_openrv(newPartName, AccessExclusiveLock);
 
/*
 * We intended to create the partition with the same persistence as the
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 2ffd2f7..9af45bf 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -9,7 +9,7 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = inplace
+ISOLATION = inplace merge
 
 # The injection points are cluster-wide, so disable installcheck
 NO_INSTALLCHECK = 1
diff --git a/src/test/modules/injection_points/expected/merge.out b/src/test/modules/injection_points/expected/merge.out
new file mode 100644
index 000..04920e7
--- /dev/null
+++ b/src/test/modules/injection_points/expected/merge.out
@@ -0,0 +1,60 @@
+Parsed test spec with 3 sessions
+
+starting permutation: merge1 ct2 detach3
+injection_points_attach
+---
+   
+(1 row)
+
+step merge1: 
+   SET search_path = target;
+   ALTER TABLE parted MERGE PARTITIONS (part1, part3) INTO part_all;
+   -- only one of *parted should have rows, but both do
+   SELECT a AS new_target_parted FROM target.parted ORDER BY 1;
+   SELECT a AS old_target_parted FROM old_target.parted ORDER BY 1;
+   SELECT a AS new_target_part_all FROM target.part_all ORDER BY 1;
+   SELECT a AS old_target_part_all FROM old_target.part_all ORDER BY 1;
+ 
+step ct2: 
+   ALTER SCHEMA target RENAME TO old_target;
+   CREATE SCHEMA target
+   CREATE TABLE parted (a int) partition by list (a)
+   CREATE TABLE part_all PARTITION OF parted FOR VALUES IN (1, 2, 3, 4)
+
+step detach3: 
+   SELECT injection_points_detach('merge-after-create');
+   SELECT in

Re: Inval reliability, especially for inplace updates

2024-08-06 Thread Noah Misch
On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:
> On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:
> > On 2024-06-17 16:58:54 -0700, Noah Misch wrote:
> > > That inplace150 patch turned out to be unnecessary.  Contrary to the
> > > "noncritical resource releasing" comment some lines above
> > > AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
> > > PANIC.  An ERROR just before or after sending invals becomes PANIC, 
> > > "cannot
> > > abort transaction %u, it was already committed".
> > 
> > Relying on that, instead of explicit critical sections, seems fragile to me.
> > IIRC some of the behaviour around errors around transaction commit/abort has
> > changed a bunch of times. Tying correctness into something that could be
> > changed for unrelated reasons doesn't seem great.
> 
> Fair enough.  It could still be a good idea for master, but given I missed a
> bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones
> $SUBJECT fixes, let's not risk it in back branches.

What are your thoughts on whether a change to explicit critical sections
should be master-only vs. back-patched?  I have a feeling your comment pointed
to something I'm still missing, but I don't know where to look next.




Re: meson "experimental"?

2024-08-02 Thread Noah Misch
On Thu, May 30, 2024 at 01:32:18PM +0300, Aleksander Alekseev wrote:
> > > [*] What is the correct name for this?
> >
> > I believe in this section it should be "Visual Studio" as we specify
> > elsewhere [1][2]. In [2] we name specific required versions. Maybe we
> > should reference this section.
> >
> > While on it, in [2] section 17.7.5 is named "Visual". I don't think
> > such a product exists and find it confusing. Maybe we should rename
> > the section to "Visual Studio".

Agreed.  I've pushed your patch for that:

> Here are corresponding patches.

The  ID is new in v17, so I also renamed it like
s/installation-notes-visual/installation-notes-visual-studio/.

(I didn't examine or push your other patch, which was about $SUBJECT.)




Re: The stats.sql test is failing sporadically in v14- on POWER7/AIX 7.1 buildfarm animals

2024-07-31 Thread Noah Misch
On Wed, Jul 31, 2024 at 02:00:00PM +0300, Alexander Lakhin wrote:
> --- /home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/expected/stats.out  2022-03-30 01:18:17.0 +
> +++ /home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/results/stats.out  2024-07-30 09:49:39.0 +
> @@ -165,11 +165,11 @@
>    WHERE relname like 'trunc_stats_test%' order by relname;
>     relname  | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup
> ---+---+---+---+---+---
> -  trunc_stats_test  | 3 | 0 | 0 | 0 |  0
> -  trunc_stats_test1 | 4 | 2 | 1 | 1 |  0
> -  trunc_stats_test2 | 1 | 0 | 0 | 1 |  0
> -  trunc_stats_test3 | 4 | 0 | 0 | 2 |  2
> -  trunc_stats_test4 | 2 | 0 | 0 | 0 |  2
> +  trunc_stats_test  | 0 | 0 | 0 | 0 |  0
> +  trunc_stats_test1 | 0 | 0 | 0 | 0 |  0
> +  trunc_stats_test2 | 0 | 0 | 0 | 0 |  0
> +  trunc_stats_test3 | 0 | 0 | 0 | 0 |  0
> +  trunc_stats_test4 | 0 | 0 | 0 | 0 |  0
> ...
> 
> inst/logfile contains:
> 2024-07-30 09:25:11.225 UTC [63307946:1] LOG:  using stale statistics instead of current ones because stats collector is not responding
> 2024-07-30 09:25:11.345 UTC [11206724:559] pg_regress/create_index LOG:  using stale statistics instead of current ones because stats collector is not responding
> ...

> I could not find such failures coming from POWER8 animals: hoverfly
> (running AIX 7200-04-03-2038), ayu, boa, chub, and I did not encounter such
> anomalies on x86 nor ARM platforms.

The animals you list as affected share a filesystem.  The failure arises from
the slow filesystem metadata operations of that filesystem.

> Thus, it looks like this stats collector issue is only happening on this
> concrete platform, and given [11], I think such failures perhaps should
> be just ignored for the next two years (until v14 EOL) unless AIX 7.1
> will be upgraded and we see them on a vendor-supported OS version.

This has happened on non-POWER, I/O-constrained machines.  Still, I have been
ignoring these failures.  The stats subsystem was designed to drop stats
updates at times, which was always at odds with the need for stable tests.  So
the failures witness a defect of the test, not a defect of the backend.
Stabilizing this test was a known benefit of the new stats implementation.




Re: race condition when writing pg_control

2024-07-30 Thread Noah Misch
On Mon, Jul 15, 2024 at 03:44:48PM +1200, Thomas Munro wrote:
> On Fri, Jul 12, 2024 at 11:43 PM Noah Misch  wrote:
> > On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
> > > On Fri, May 17, 2024 at 4:46 PM Thomas Munro  
> > > wrote:
> > > > The specific problem here is that LocalProcessControlFile() runs in
> > > > every launched child for EXEC_BACKEND builds.  Windows uses
> > > > EXEC_BACKEND, and Windows' NTFS file system is one of the two file
> > > > systems known to this list to have the concurrent read/write data
> > > > mashing problem (the other being ext4).
> >
> > > First idea I've come up with to avoid all of that: pass a copy of
> > > the "proto-controlfile", to coin a term for the one read early in
> > > postmaster startup by LocalProcessControlFile().  As far as I know,
> > > the only reason we need it is to suck some settings out of it that
> > > don't change while a cluster is running (mostly can't change after
> > > initdb, and checksums can only be {en,dis}abled while down).  Right?
> > > Children can just "import" that sucker instead of calling
> > > LocalProcessControlFile() to figure out the size of WAL segments yada
> > > yada, I think?  Later they will attach to the real one in shared
> > > memory for all future purposes, once normal interlocking is allowed.
> >
> > I like that strategy, particularly because it recreates what !EXEC_BACKEND
> > backends inherit from the postmaster.  It might prevent future bugs that 
> > would
> > have been specific to EXEC_BACKEND.
> 
> Thanks for looking!  Yeah, that is a good way to put it.

Oops, the way I put it turned out to be false.  Postmaster has ControlFile
pointing to shared memory before forking backends, so !EXEC_BACKEND children
are born that way.  In the postmaster, ControlFile->checkPointCopy->redo does
change after each checkpoint.

> The only other idea I can think of is that the Postmaster could take
> all of the things that LocalProcessControlFile() wants to extract from
> the file, and transfer them via that struct used for EXEC_BACKEND as
> individual variables, instead of this new proto-controlfile copy.  I
> think it would be a bigger change with no obvious-to-me additional
> benefit, so I didn't try it.

Yeah, that would be more future-proof but a bigger change.  One could argue
for a yet-larger refactor so even the !EXEC_BACKEND case doesn't read those
fields from ControlFile memory.  Then we could get rid of ControlFile ever
being set to something other than NULL or a shmem pointer.  ControlFileData's
mix of initdb-time fields, postmaster-start-time fields, and changes-anytime
fields is inconvenient here.

The unknown is the value of that future proofing.  Much EXEC_BACKEND early
startup code is shared with postmaster startup, which can assume it's the only
process.  I can't rule out a future bug where that shared code does a read
that's harmless in postmaster startup but harmful when an EXEC_BACKEND child
runs the same read.  For a changes-anytime field, the code would already be
subtly buggy in EXEC_BACKEND today, since it would be reading without an
LWLock.  For a postmaster-start-time field, things should be okay so long as
we capture the proto ControlFileData after the last change to such fields.
That invariant is not trivial to achieve, but it's not gravely hard either.

A possible middle option would be to use the proto control file, but
explicitly set its changes-anytime fields to bogus values.
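
For illustration, a minimal sketch of that middle option (the poisoned fields
here are examples, not an exhaustive list):

    ControlFileData proto;

    memcpy(&proto, ControlFile, sizeof(ControlFileData));
    /* Changes-anytime fields: poison so children can't rely on them. */
    memset(&proto.checkPointCopy, 0x7f, sizeof(proto.checkPointCopy));
    proto.checkPoint = InvalidXLogRecPtr;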

What's your preference?  I don't think any of these would be bad decisions.
It could be clearer after enumerating how many ControlFile fields get used
this early.




Re: race condition in pg_class

2024-07-28 Thread Noah Misch
On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> >> A recent buildfarm test failure [1] showed that the
> >> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> 
> >> But as the test is going to be modified by the inplace110-successors-v8.patch
> >> and the modified test (with all three latest patches applied) passes
> >> reliably in the same conditions, maybe this failure doesn't deserve a
> >> deeper exploration.
> 
> > Agreed.  Let's just wait for code review of the actual bug fix, not develop 
> > a
> > separate change to stabilize the test.  One flake in three weeks is low 
> > enough
> > to make that okay.
> 
> It's now up to three similar failures in the past ten days

> Is it time to worry yet?  If this were HEAD only, I'd not be too
> concerned; but two of these three are on allegedly-stable branches.
> And we have releases coming up fast.

I don't know; neither decision feels terrible to me.  A bug fix that would
address both the data corruption causes and those buildfarm failures has been
awaiting review on this thread for 77 days.  The data corruption causes are
more problematic than 0.03% of buildfarm runs getting noise failures.  Two
wrongs don't make a right, but a commit masking that level of buildfarm noise
also feels like sending the wrong message.




Re: Built-in CTYPE provider

2024-07-26 Thread Noah Misch
On Wed, Jul 24, 2024 at 08:19:13AM -0700, Noah Misch wrote:
> On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote:
> > vote one way or the other on the question in
> > https://postgr.es/m/20240706195129...@rfd.leadboat.com?
> 
> I'll keep the voting open for another 24 hours from now
> or 36 hours after the last vote, whichever comes last.

I count 4.5 or 5 votes for "okay" and 2 votes for "not okay".  I've moved the
open item to "Non-bugs".

On Wed, Jul 17, 2024 at 11:06:43PM -0700, Jeff Davis wrote:
> You haven't established that any problem actually exists in version 17,
> and your arguments have been a moving target throughout this subthread.

I can understand that experience of yours.  It wasn't my intent to make a
moving target.  To be candid, I entered the thread with no doubts that you'd
agree with the problem.  When you and Tom instead shared a different view, I
switched to pursuing the votes to recognize the problem.  (Voting then held
that pg_c_utf8 is okay as-is.)

On Thu, Jul 18, 2024 at 09:52:44AM -0700, Jeff Davis wrote:
> On Thu, 2024-07-18 at 07:00 -0700, Noah Misch wrote:
> > Given all the messages on this thread, if the feature remains in
> > PostgreSQL, I
> > advise you to be ready to tolerate PostgreSQL "routinely updating the
> > built-in
> > provider to adopt any changes that Unicode makes".
> 
> You mean messages from me, like:
> 
>   * "I have no intention force a Unicode update" [1]
>   * "While nothing needs to be changed for 17, I agree that we may need
> to be careful in future releases not to break things." [2]
>   * "...you are right that we may need to freeze Unicode updates or be
> more precise about versioning..." [2]
>   * "If you are proposing that Unicode updates should not be performed
> if they affect the results of any IMMUTABLE function...I am neither
> endorsing nor opposing..." [3]
> 
> ?

Those, plus all the other messages.

On Fri, Jul 19, 2024 at 08:50:41AM -0700, Jeff Davis wrote:
> Consider:
> 
> a. Some user creates an expression index on NORMALIZE(); vs.
> b. Some user chooses the builtin "C.UTF-8" locale and creates a partial
> index with a predicate like "string ~ '[[:alpha:]]'" (or an expression
> index on LOWER())
> 
> Both cases create a risk if we update Unicode in some future version.
> Why are you unconcerned about case (a), but highly concerned about case
> (b)?

I am not unconcerned about (a), but the v17 beta process gave an opportunity
to do something about (b) that it didn't give for (a).  Also, I have never
handled a user report involving NORMALIZE().  I have handled user reports
around regexp index inconsistency, e.g. the one at
https://www.youtube.com/watch?v=kNH94tmpUus&t=1490s




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2024-07-25 Thread Noah Misch
On Thu, Jul 25, 2024 at 10:52:13AM +0900, Michael Paquier wrote:
> On Wed, Jul 24, 2024 at 06:00:59AM -0700, Noah Misch wrote:
> > I'm still seeing need for s/int/int64/ at:

> I am attaching a patch for all these you have spotted, switching the
> logs to use %llx.  Does that look fine for you?

Yes.  I think that completes the project.




Re: Built-in CTYPE provider

2024-07-24 Thread Noah Misch
On Wed, Jul 17, 2024 at 03:03:26PM -0700, Noah Misch wrote:
> On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote:
> > On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote:
> > > > This is still marked as an open item for 17, but you've already
> > > > acknowledged[1] that no code changes are necessary in version 17.
> > > 
> > > Later posts on the thread made that obsolete.  The next step is to
> > > settle the
> > > question at https://postgr.es/m/20240706195129...@rfd.leadboat.com. 
> > > If that
> > > conclusion entails a remedy, v17 code changes may be part of that
> > > remedy.
> > 
> > This is the first time you've mentioned a code change in version 17. If
> 
> That's right.
> 
> > you have something in mind, please propose it. However, this feature
> > followed the right policies at the time of commit, so there would need
> > to be a strong consensus to accept such a change.
> 
> If I'm counting the votes right, you and Tom have voted that the feature's
> current state is okay, and I and Laurenz have voted that it's not okay.  I
> still hope more people will vote, to avoid dealing with the tie.  Daniel,
> Peter, and Jeremy, you're all listed as reviewers on commit f69319f.  Are you
> willing to vote one way or the other on the question in
> https://postgr.es/m/20240706195129...@rfd.leadboat.com?

The last vote arrived 6 days ago.  So far, we have votes from Jeff, Noah, Tom,
Daniel, and Laurenz.  I'll keep the voting open for another 24 hours from now
or 36 hours after the last vote, whichever comes last.  If that schedule is
too compressed for anyone, do share.




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2024-07-24 Thread Noah Misch
On Tue, Jul 23, 2024 at 06:01:44PM +0900, Michael Paquier wrote:
> Hearing nothing but cicadas as now is their season, I have taken the
> initiative here to address this open item.
> 
> 0001 felt a bit too complicated in slru.h, so I have simplified it and
> kept all the details in slru.c with SlruFileName().
> 
> I have reviewed all the code that uses SLRUs, and spotted three more
> problematic code paths in predicate.c that needed an update like the
> others for some pagenos.  I've added these, and applied 0002.  We
> should be good now.

I'm still seeing need for s/int/int64/ at:

- "pagesegno" variable
- return value of MultiXactIdToOffsetSegment()
- return value of MXOffsetToMemberSegment()
- callers of previous two

Only the first should be a live bug, since multixact isn't electing the higher
pageno ceiling.




Re: [18] Policy on IMMUTABLE functions and Unicode updates

2024-07-24 Thread Noah Misch
On Tue, Jul 23, 2024 at 01:07:49PM -0700, Jeff Davis wrote:
> On Tue, 2024-07-23 at 07:39 -0700, Noah Misch wrote:
> > Short-term, we should remedy the step backward that pg_c_utf8 has taken:
> > https://postgr.es/m/20240718233908.52.nmi...@google.com
> > https://postgr.es/m/486d71991a3f80ec1c47e1bd7931e2ef3627b6b3.ca...@cybertec.at
> 
> Obviously I disagree that we've taken a step backwards.

Yes.

> Can you articulate the principle by which all of the other problems
> with IMMUTABLE are just fine, but updates to Unicode are intolerable,
> and only for PG_C_UTF8?

No, because I don't think all the other problems with IMMUTABLE are just fine.
The two messages linked cover the comparisons I do consider important,
especially the comparison between pg_c_utf8 and packager-frozen ICU.




Re: [18] Policy on IMMUTABLE functions and Unicode updates

2024-07-23 Thread Noah Misch
On Mon, Jul 22, 2024 at 09:34:42AM -0700, Jeff Davis wrote:
> On Mon, 2024-07-22 at 11:14 -0400, Robert Haas wrote:
> > On Mon, Jul 22, 2024 at 10:26 AM Peter Eisentraut  
> > wrote:
> > > I disagree with that.  We should put ourselves into the position to
> > > adopt new Unicode versions without fear.  Similar to updates to
> > > time
> > > zones, snowball, etc.
> > > 
> > > We can't be discussing the merits of the Unicode update every year.
> > > That would be madness.
> > 
> > Yeah, I agree with that 100%.
> 
> It's hard for me to argue; that was my reasoning during development.
> 
> But Noah seems to have a very strong opinion on this matter:
> 
> https://www.postgresql.org/message-id/20240629220857.fb.nmisch%40google.com
> 
> and I thought this thread would be a better opportunity for him to
> express it. Noah?

Long-term, we should handle this like Oracle, SQL Server, and DB2 do:
https://postgr.es/m/CA+fnDAbmn2d5tzZsj-4wmD0jApHTsg_zGWUpteb=omssx5r...@mail.gmail.com

Short-term, we should remedy the step backward that pg_c_utf8 has taken:
https://postgr.es/m/20240718233908.52.nmi...@google.com
https://postgr.es/m/486d71991a3f80ec1c47e1bd7931e2ef3627b6b3.ca...@cybertec.at


$SUBJECT has proposed remedy "take more care with Unicode updates".  If one
wanted to pursue that, it should get more specific, by giving one or both of:

(a) principles for deciding whether a Unicode update is okay
(b) examples of past Unicode release changes and whether PostgreSQL should
adopt a future Unicode version making a similar change

That said, I'm not aware of an (a) or (b) likely to create an attractive
compromise between the "index scan agrees with seqscan after pg_upgrade" goal
(https://postgr.es/m/20240706195129...@rfd.leadboat.com) and the "don't freeze
Unicode data" goal
(https://postgr.es/m/CA+TgmoZRpOFVmQWKEXHdcKj9AFLbXT5ouwtXa58J=3ydlp0...@mail.gmail.com).
The "long-term" above would satisfy both goals.  If it were me, I would
abandon the "more care" proposal.




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-23 Thread Noah Misch
On Mon, Jul 22, 2024 at 12:00:45PM +0300, Nazir Bilal Yavuz wrote:
> On Sat, 20 Jul 2024 at 21:14, Noah Misch  wrote:
> > On a different naming topic, my review missed that field name
> > copy_storage_using_buffer_read_stream_private.last_block doesn't fit how the
> > field is used.  Code uses it like an nblocks.  So let's either rename the
> > field or change the code to use it as a last_block (e.g. initialize it to
> > nblocks-1, not nblocks).
> 
> I preferred renaming it as nblocks, since that is how it was used in
> RelationCopyStorageUsingBuffer() before. Also, I realized that instead
> of setting p.blocknum = 0; initializing blkno as 0 and using
> p.blocknum = blkno makes sense. Because, p.blocknum and blkno should
> always start with the same block number. The relevant patch is
> attached.

I felt the local variable change was not a clear improvement.  It would have
been fine for the original patch to do it in that style, but the style of the
original patch was also fine.  So I've pushed just the struct field rename.




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-20 Thread Noah Misch
On Sat, Jul 20, 2024 at 03:01:31PM +0300, Nazir Bilal Yavuz wrote:
> On Sat, 20 Jul 2024 at 14:27, Noah Misch  wrote:
> >
> > On Thu, Jul 18, 2024 at 02:11:13PM +0300, Nazir Bilal Yavuz wrote:
> > > v4 is attached.
> >
> > Removal of the PinBufferForBlock() comment about the "persistence =
> > RELPERSISTENCE_PERMANENT" fallback started to feel like a loss.  I looked 
> > for
> > a way to re-add a comment about the fallback.
> > https://coverage.postgresql.org/src/backend/storage/buffer/bufmgr.c.gcov.html
> > shows no test coverage of that fallback, and I think the fallback is
> > unreachable.  Hence, I've removed the fallback in a separate commit.  I've
> > pushed that and your three patches.  Thanks.
> 
> Thanks for the separate commit and push!
> 
> With the separate commit (e00c45f685), does it make sense to rename
> the smgr_persistence parameter of the ReadBuffer_common() to
> persistence? Because, ExtendBufferedRelTo() calls ReadBuffer_common()
> with rel's persistence now, not with smgr's persistence.

BMR_REL() doesn't set relpersistence, so bmr.relpersistence is associated with
bmr.smgr and is unset if bmr.rel is set.  That is to say, bmr.relpersistence
is an smgr_persistence.  It could make sense to change ReadBuffer_common() to
take a BufferManagerRelation instead of the three distinct arguments.

On a different naming topic, my review missed that field name
copy_storage_using_buffer_read_stream_private.last_block doesn't fit how the
field is used.  Code uses it like an nblocks.  So let's either rename the
field or change the code to use it as a last_block (e.g. initialize it to
nblocks-1, not nblocks).
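
For concreteness, a rough sketch of the last_block option (the callback shape
is illustrative, based on the struct quoted elsewhere in this thread):

    p.blocknum = 0;
    p.last_block = nblocks - 1;     /* inclusive last block */

    /* ...and the callback stops after emitting last_block: */
    static BlockNumber
    copy_storage_using_buffer_read_stream_next_block(ReadStream *stream,
                                                     void *callback_private_data,
                                                     void *per_buffer_data)
    {
        struct copy_storage_using_buffer_read_stream_private *p = callback_private_data;

        if (p->blocknum <= p->last_block)
            return p->blocknum++;

        return InvalidBlockNumber;
    }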




Re: race condition in pg_class

2024-07-20 Thread Noah Misch
On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> 28.06.2024 08:13, Noah Misch wrote:
> > Pushed.
> 
> A recent buildfarm test failure [1] showed that the
> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> on a slow machine

> But as the test is going to be modified by the inplace110-successors-v8.patch
> and the modified test (with all three latest patches applied) passes
> reliably in the same conditions, maybe this failure doesn't deserve a
> deeper exploration.

Agreed.  Let's just wait for code review of the actual bug fix, not develop a
separate change to stabilize the test.  One flake in three weeks is low enough
to make that okay.

> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-20 Thread Noah Misch
On Thu, Jul 18, 2024 at 02:11:13PM +0300, Nazir Bilal Yavuz wrote:
> v4 is attached.

Removal of the PinBufferForBlock() comment about the "persistence =
RELPERSISTENCE_PERMANENT" fallback started to feel like a loss.  I looked for
a way to re-add a comment about the fallback.
https://coverage.postgresql.org/src/backend/storage/buffer/bufmgr.c.gcov.html
shows no test coverage of that fallback, and I think the fallback is
unreachable.  Hence, I've removed the fallback in a separate commit.  I've
pushed that and your three patches.  Thanks.




Re: Built-in CTYPE provider

2024-07-18 Thread Noah Misch
On Thu, Jul 18, 2024 at 01:03:31PM -0400, Tom Lane wrote:
> This whole discussion seems quite bizarre to me.  In the first
> place, it is certain that Unicode will continue to evolve, and
> I can't believe that we'd just freeze pg_c_utf8 on the current
> definition forever.  Whether the first change happens in v18
> or years later doesn't seem like a particularly critical point.
> 
> In the second place, I cannot understand why pg_c_utf8 is being
> held to a mutability standard that we have never applied to any
> other locale-related functionality --- and indeed could not do
> so, since in most cases that functionality has been buried in
> libraries we don't control.  It seems to me to be already a

With libc and ICU providers, packagers have a way to avoid locale-related
behavior changes.  That's the "mutability standard" I want pg_c_utf8 to join.
pg_c_utf8 is the one provider where packagers can't opt out[1] of annual
pg_upgrade-time index scan breakage on affected expression indexes.

> great step forward that with pg_c_utf8, at least we can guarantee
> that the behavior won't change without us knowing about it.
> Noah's desire to revert the feature makes the mutability situation
> strictly worse, because people will have to continue to rely on
> OS-provided functionality that can change at any time.

I see:
- one step forward:
  "string1 < string2" won't change, forever, regardless of packager choices
- one step backward:
  "string ~ '[[:alpha:]]'" will change at pg_upgrade time, regardless of 
packager choices

I think one's perspective on the relative importance of the step forward and
the step backward depends on the sort of packages one uses today.  Consider a
user of Debian packages with locale!=C, doing Debian upgrades and pg_upgrade.
For that user, pg_c_utf8 gets far less index corruption than an ICU locale.
The step forward is a great step forward _for this user_, and the step
backward is in the noise next to the step forward.

I'm with a different kind of packager.  I don't tolerate index scans returning
wrong answers.  To achieve that, my libc and ICU aren't changing collation
behavior.  I suspect my packages won't offer a newer ICU behavior until
PostgreSQL gets support for multiple ICU library versions per database.  (SQL
Server, DB2 and Oracle already do.  I agree we can't freeze forever.  The
multiple-versions feature gets more valuable each year.)  _For this_ kind of
package, the step forward is a no-op.  The step backward is the sole effect on
this kind of package.

How much does that pair of perspectives explain the contrast between my
"revert" and your "great step forward"?  We may continue to disagree on the
ultimate decision, but I hope I can make my position cease to appear bizarre
to you.

Thanks,
nm


[1] Daniel Verite said packagers could patch src/Makefile.global.in and run
"make -C src/common/unicode update-unicode".  Editing src/Makefile.global.in
is modifying PostgreSQL, not configuring a packager-facing option.




Re: Built-in CTYPE provider

2024-07-18 Thread Noah Misch
On Thu, Jul 18, 2024 at 10:05:34AM +0200, Laurenz Albe wrote:
> On Wed, 2024-07-17 at 15:03 -0700, Noah Misch wrote:
> > If I'm counting the votes right, you and Tom have voted that the feature's
> > current state is okay, and I and Laurenz have voted that it's not okay.
> 
> Maybe I should expand my position.
> 
> I am very much for the built-in CTYPE provider.  When I said that I am against
> changes in major versions, I mean changes that are likely to affect real-life
> usage patterns.  If there are modifications affecting a code point that was
> previously unassigned, it is *theoretically* possible, but very unlikely, that
> someone has stored it in a database.  I would want to deliberate about any 
> change
> affecting such a code point, and if the change seems highly desirable, we can
> consider applying it.
> 
> What I am against is routinely updating the built-in provider to adopt any 
> changes
> that Unicode makes.

Given all the messages on this thread, if the feature remains in PostgreSQL, I
advise you to be ready to tolerate PostgreSQL "routinely updating the built-in
provider to adopt any changes that Unicode makes".  Maybe someone will change
something in v18 so it's not like that, but don't count on it.

Would you like to change your vote to "okay", keep your vote at "not okay", or
change it to an abstention?

> To make a comparison with Tom's argument upthread: we have slightly changed 
> how
> floating point computations work, even though they are IMMUTABLE.  But I'd 
> argue
> that very few people build indexes on the results of floating point arithmetic
> (and those who do are probably doing something wrong), so the risk is 
> acceptable.
> But people index strings all the time.

Agreed.




Re: Built-in CTYPE provider

2024-07-17 Thread Noah Misch
On Wed, Jul 17, 2024 at 08:48:46AM -0700, Jeff Davis wrote:
> On Thu, 2024-07-11 at 05:50 -0700, Noah Misch wrote:
> > > This is still marked as an open item for 17, but you've already
> > > acknowledged[1] that no code changes are necessary in version 17.
> > 
> > Later posts on the thread made that obsolete.  The next step is to
> > settle the
> > question at https://postgr.es/m/20240706195129...@rfd.leadboat.com. 
> > If that
> > conclusion entails a remedy, v17 code changes may be part of that
> > remedy.
> 
> This is the first time you've mentioned a code change in version 17. If

That's right.

> you have something in mind, please propose it. However, this feature
> followed the right policies at the time of commit, so there would need
> to be a strong consensus to accept such a change.

If I'm counting the votes right, you and Tom have voted that the feature's
current state is okay, and I and Laurenz have voted that it's not okay.  I
still hope more people will vote, to avoid dealing with the tie.  Daniel,
Peter, and Jeremy, you're all listed as reviewers on commit f69319f.  Are you
willing to vote one way or the other on the question in
https://postgr.es/m/20240706195129...@rfd.leadboat.com?

A tie would become a decision against the unreleased behavior.

In the event of a decision against the unreleased behavior, reverting the
feature is the remedy that could proceed without further decision making.




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-17 Thread Noah Misch
On Wed, Jul 17, 2024 at 12:22:49PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 16 Jul 2024 at 15:19, Noah Misch  wrote:
> > On Tue, Jul 16, 2024 at 02:11:20PM +0300, Nazir Bilal Yavuz wrote:
> > > On Fri, 12 Jul 2024 at 02:52, Noah Misch  wrote:
> > > > On Tue, Apr 16, 2024 at 02:12:19PM +0300, Nazir Bilal Yavuz wrote:
> > > > > --- a/src/backend/storage/aio/read_stream.c
> > > > > +++ b/src/backend/storage/aio/read_stream.c
> > > > > @@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
> > > > >   {
> > > > >   stream->ios[i].op.rel = rel;
> > > > >   stream->ios[i].op.smgr = RelationGetSmgr(rel);
> > > > > - stream->ios[i].op.smgr_persistence = 0;
> > > > > + stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
> > > >
> > > > Does the following comment in ReadBuffersOperation need an update?
> > > >
> > > > /*
> > > >  * The following members should be set by the caller.  If only 
> > > > smgr is
> > > >  * provided without rel, then smgr_persistence can be set to 
> > > > override the
> > > >  * default assumption of RELPERSISTENCE_PERMANENT.
> > > >  */
> > >
> > > I believe it does not need to be updated but I renamed
> > > 'ReadBuffersOperation.smgr_persistence' as
> > > 'ReadBuffersOperation.persistence'. So, this comment is updated as
> > > well. I think that rename suits better because persistence does not
> > > need to come from smgr, it could come from relation, too. Do you think
> > > it is a good idea? If it is, does it need a separate commit?
> >
> > The rename is good.  I think the comment implies "persistence" is unused 
> > when
> > rel!=NULL.  That implication is true before the patch but false after the
> > patch.
> 
> What makes it false after the patch? I think the logic did not change.
> If there is rel, the value of persistence is obtained from
> 'rel->rd_rel->relpersistence'. If there is no rel, then smgr is used
> to obtain its value.

First, the patch removes the "default assumption of RELPERSISTENCE_PERMANENT".
It's now an assertion failure.

The second point is about "If only smgr is provided without rel".  Before the
patch, the extern functions that take a ReadBuffersOperation argument examine
smgr_persistence if and only if rel==NULL.  That's consistent with the
comment.  After the patch, StartReadBuffersImpl() calling PinBufferForBlock()
uses the field unconditionally.

On that note, does WaitReadBuffers() still have a reason to calculate its
persistence as follows, or should this patch make it "persistence =
operation->persistence"?

persistence = operation->rel
? operation->rel->rd_rel->relpersistence
: RELPERSISTENCE_PERMANENT;

> I think what you said in the counter argument makes sense.

Okay.




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-16 Thread Noah Misch
On Tue, Jul 16, 2024 at 02:11:20PM +0300, Nazir Bilal Yavuz wrote:
> On Fri, 12 Jul 2024 at 02:52, Noah Misch  wrote:
> > On Tue, Apr 16, 2024 at 02:12:19PM +0300, Nazir Bilal Yavuz wrote:
> > > --- a/src/backend/storage/aio/read_stream.c
> > > +++ b/src/backend/storage/aio/read_stream.c
> > > @@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
> > >   {
> > >   stream->ios[i].op.rel = rel;
> > >   stream->ios[i].op.smgr = RelationGetSmgr(rel);
> > > - stream->ios[i].op.smgr_persistence = 0;
> > > + stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
> >
> > Does the following comment in ReadBuffersOperation need an update?
> >
> > /*
> >  * The following members should be set by the caller.  If only smgr is
> >  * provided without rel, then smgr_persistence can be set to override the
> >  * default assumption of RELPERSISTENCE_PERMANENT.
> >  */
> 
> I believe it does not need to be updated but I renamed
> 'ReadBuffersOperation.smgr_persistence' as
> 'ReadBuffersOperation.persistence'. So, this comment is updated as
> well. I think that rename suits better because persistence does not
> need to come from smgr, it could come from relation, too. Do you think
> it is a good idea? If it is, does it need a separate commit?

The rename is good.  I think the comment implies "persistence" is unused when
rel!=NULL.  That implication is true before the patch but false after the
patch.

> > > @@ -4667,19 +4698,31 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
> >
> > >   /* Iterate over each block of the source relation file. */
> > >   for (blkno = 0; blkno < nblocks; blkno++)
> > >   {
> > >   CHECK_FOR_INTERRUPTS();
> > >
> > >   /* Read block from source relation. */
> > > - srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
> > > -    RBM_NORMAL, bstrategy_src,
> > > -    permanent);
> > > + srcBuf = read_stream_next_buffer(src_stream, NULL);
> > >   LockBuffer(srcBuf, BUFFER_LOCK_SHARE);
> >
> > I think this should check for read_stream_next_buffer() returning
> > InvalidBuffer.  pg_prewarm doesn't, but the other callers do, and I think 
> > the
> > other callers are a better model.  LockBuffer() doesn't check the
> > InvalidBuffer case, so let's avoid the style of using a
> > read_stream_next_buffer() return value without checking.
> 
> There is an assert in the LockBuffer which checks for the
> InvalidBuffer. If that is not enough, we may add an if check for
> InvalidBuffer but what should we do in this case? It should not
> happen, so erroring out may be a good idea.

I like this style from read_stream_reset():

    while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
...
}

That is, don't iterate over block numbers.  Drain the stream until empty.  If
the stream returns a number of blocks higher or lower than we expected, we
won't detect that, and that's okay.  It's not a strong preference, so I'm open
to arguments against that from you or others.  A counterargument could be that
read_stream_reset() doesn't know the buffer count, so it has no choice.  The
counterargument could say that callers knowing the block count should use the
pg_prewarm() style, and others should use the read_stream_reset() style.
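
Applied here, the read_stream_reset() style would look roughly like this (a
sketch; the page-copy logic stays as in the patch):

    /* Drain the stream until empty instead of iterating over blkno. */
    while ((srcBuf = read_stream_next_buffer(src_stream, NULL)) != InvalidBuffer)
    {
        CHECK_FOR_INTERRUPTS();

        LockBuffer(srcBuf, BUFFER_LOCK_SHARE);
        /* ... copy the page to the destination fork ... */
        UnlockReleaseBuffer(srcBuf);
    }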




Re: Confine vacuum skip logic to lazy_scan_skip

2024-07-15 Thread Noah Misch
On Mon, Jul 15, 2024 at 03:26:32PM +1200, Thomas Munro wrote:
> On Mon, Jul 8, 2024 at 2:49 AM Noah Misch  wrote:
> > what is the scope of the review you seek?
> 
> The patch "Refactor tidstore.c memory management." could definitely
> use some review.

That's reasonable.  radixtree already forbids mutations concurrent with
iteration, so there's no new concurrency hazard.  One alternative is
per_buffer_data big enough for MaxOffsetNumber, but that might thrash caches
measurably.  That patch is good to go apart from these trivialities:

> - return &(iter->output);
> + return &iter->output;

This cosmetic change is orthogonal to the patch's mission.

> - for (wordnum = 0; wordnum < page->header.nwords; wordnum++)
> + for (int wordnum = 0; wordnum < page->header.nwords; wordnum++)

Likewise.




Re: race condition in pg_class

2024-07-14 Thread Noah Misch
On Thu, Jul 04, 2024 at 03:08:16PM -0700, Noah Misch wrote:
> On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> > 28.06.2024 08:13, Noah Misch wrote:
> > > Pushed. ...
> > 
> > Please look also at another anomaly, I've discovered.
> > 
> > An Assert added with d5f788b41 may be falsified with:
> > CREATE TABLE t(a int PRIMARY KEY);
> > INSERT INTO t VALUES (1);
> > CREATE VIEW v AS SELECT * FROM t;
> > 
> > MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
> >   WHEN MATCHED THEN DO NOTHING
> >   WHEN NOT MATCHED THEN DO NOTHING;
> > 
> > TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: 
> > "nodeModifyTable.c", Line: 2891, PID: 1590670
> 
> Thanks.  When all the MERGE actions are DO NOTHING, view_has_instead_trigger()
> returns true

I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
the remaining patches.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..461d925 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was taken via
 * the relation fast-path (and is not known to have been transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the heap_inplace_update_scan()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by FIXME and Alexander Lakhin.

    Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8b..dbfa2b7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,56 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Locking to write inplace-updated tables
+----------------------------------------
+
+[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.]
+
+If IsInplaceUpdateRelation() returns true for a table, the table is a system
+catalog that receives heap_inplace_update_scan() calls.  Preparing a
+heap_update() of these tables follows additional locking rules, to ensure we
+don't lose the effects of an inplace update.  In particular, consider a moment
+when a backend has fetched the old tuple to modify, not yet having called
+heap_update().  Another backend's inplace update starting then can't conclude
+until the heap_update() places its new tuple in a buffer.  We enforce that
+using locktags as follows.  While DDL code is the main audience, the executor
+follows these rules to make e.g. "MERGE INTO pg_class" safer.  Locking rules
+are per-catalog:
+
+  pg_class heap_inplace_update_scan() callers: before the call, acquire
+  LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
+  (VACUUM), or a mode with strictly more conflicts.  If the update targets a
+  row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
+  on the table.

Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-07-13 Thread Noah Misch
On Fri, Jul 12, 2024 at 04:50:17PM -0700, Jeff Davis wrote:
> On Fri, 2024-07-12 at 16:11 -0700, Noah Misch wrote:
> > Since refresh->relation is a RangeVar, this departs from the standard
> > against
> > repeated name lookups, from CVE-2014-0062 (commit 5f17304).
> 
> Interesting, thank you.
> 
> I did a rough refactor and attached v3. Aside from cleanup issues, is
> this what you had in mind?

> +extern ObjectAddress RefreshMatViewByOid(Oid matviewOid, bool skipData, bool concurrent,
> +                                         const char *queryString, ParamListInfo params,
> +                                         QueryCompletion *qc);
>  

Yes, that's an API design that avoids repeated name lookups.
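
For illustration, a caller could then look roughly like this (a sketch against
the declaration above; the variable capture is hypothetical):

    /* OID captured once, at creation time; no second name lookup. */
    Oid         matviewOid = address.objectId;

    RefreshMatViewByOid(matviewOid, false /* skipData */, false /* concurrent */,
                        pstate->p_sourcetext, NULL, qc);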




Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-07-12 Thread Noah Misch
On Fri, Jul 12, 2024 at 02:50:52PM -0700, Jeff Davis wrote:
> On Thu, 2024-07-11 at 05:52 -0700, Noah Misch wrote:
> > > I could try to refactor it into two statements and execute them
> > > separately, or I could try to rewrite the statement to use a fully-
> > > qualified destination table before execution. Thoughts?
> > 
> > Those sound fine.  Also fine: just adding a comment on why creation
> > namespace
> > considerations led to not doing it there.
> 
> Attached. 0002 separates the CREATE MATERIALIZED VIEW ... WITH DATA
> into (effectively):
> 
>CREATE MATERIALIZED VIEW ... WITH NO DATA;
>REFRESH MATERIALIZED VIEW ...;
> 
> Using refresh also achieves the stated goal more directly: to (mostly)
> ensure that a subsequent REFRESH will succeed.
> 
> Note: the creation itself no longer executes in a security-restricted
> context, but I don't think that's a problem. The only reason it's using
> the security restricted context is so the following REFRESH will
> succeed, right?

Right, that's the only reason.

> @@ -346,13 +339,21 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
>   PopActiveSnapshot();
>   }
>  
> - if (is_matview)
> + /*
> +  * For materialized views, use REFRESH, which locks down
> +  * security-restricted operations and restricts the search_path.
> +  * Otherwise, one could create a materialized view not possible to
> +  * refresh.
> +  */
> + if (do_refresh)
>   {
> - /* Roll back any GUC changes */
> - AtEOXact_GUC(false, save_nestlevel);
> + RefreshMatViewStmt *refresh = makeNode(RefreshMatViewStmt);
>  
> - /* Restore userid and security context */
> - SetUserIdAndSecContext(save_userid, save_sec_context);
> + refresh->relation = into->rel;
> + ExecRefreshMatView(refresh, pstate->p_sourcetext, NULL, qc);
> +
> + if (qc)
> + qc->commandTag = CMDTAG_SELECT;
>   }

Since refresh->relation is a RangeVar, this departs from the standard against
repeated name lookups, from CVE-2014-0062 (commit 5f17304).




Re: race condition when writing pg_control

2024-07-12 Thread Noah Misch
On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
> On Fri, May 17, 2024 at 4:46 PM Thomas Munro  wrote:
> > The specific problem here is that LocalProcessControlFile() runs in
> > every launched child for EXEC_BACKEND builds.  Windows uses
> > EXEC_BACKEND, and Windows' NTFS file system is one of the two file
> > systems known to this list to have the concurrent read/write data
> > mashing problem (the other being ext4).

> First idea I've come up with to avoid all of that: pass a copy of
> the "proto-controlfile", to coin a term for the one read early in
> postmaster startup by LocalProcessControlFile().  As far as I know,
> the only reason we need it is to suck some settings out of it that
> don't change while a cluster is running (mostly can't change after
> initdb, and checksums can only be {en,dis}abled while down).  Right?
> Children can just "import" that sucker instead of calling
> LocalProcessControlFile() to figure out the size of WAL segments yada
> yada, I think?  Later they will attach to the real one in shared
> memory for all future purposes, once normal interlocking is allowed.

I like that strategy, particularly because it recreates what !EXEC_BACKEND
backends inherit from the postmaster.  It might prevent future bugs that would
have been specific to EXEC_BACKEND.

> I dunno.  Draft patch attached.  Better plans welcome.  This passes CI
> on Linux systems afflicted by EXEC_BACKEND, and Windows.  Thoughts?

Looks reasonable.  I didn't check over every detail, given the draft status.




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-11 Thread Noah Misch
On Tue, Apr 16, 2024 at 02:12:19PM +0300, Nazir Bilal Yavuz wrote:
> I am working on using read streams in the CREATE DATABASE command when the
> strategy is wal_log. RelationCopyStorageUsingBuffer() function is used in
> this context. This function reads source buffers then copies them to the

Please rebase.  I applied this to 40126ac for review purposes.

> a. Timings:
> b. strace:
> c. perf:

Thanks for including those details.  That's valuable confirmation.

> Subject: [PATCH v1 1/3] Refactor PinBufferForBlock() to remove if checks about
>  persistence
> 
> There are if checks in PinBufferForBlock() function to set persistence
> of the relation and this function is called for the each block in the
> relation. Instead of that, set persistence of the relation before
> PinBufferForBlock() function.

I tried with the following additional patch to see if PinBufferForBlock() ever
gets invalid smgr_relpersistence:


--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1098,6 +1098,11 @@ PinBufferForBlock(Relation rel,
 
Assert(blockNum != P_NEW);
 
+   if (!(smgr_persistence == RELPERSISTENCE_TEMP ||
+ smgr_persistence == RELPERSISTENCE_PERMANENT ||
+ smgr_persistence == RELPERSISTENCE_UNLOGGED))
+   elog(WARNING, "unexpected relpersistence %d", smgr_persistence);
+
if (smgr_persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;


That still gets relpersistence==0 in various src/test/regress cases.  I think
the intent was to prevent that.  If not, please add a comment about when
relpersistence==0 is still allowed.

> --- a/src/backend/storage/aio/read_stream.c
> +++ b/src/backend/storage/aio/read_stream.c
> @@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
>   {
>   stream->ios[i].op.rel = rel;
>   stream->ios[i].op.smgr = RelationGetSmgr(rel);
> - stream->ios[i].op.smgr_persistence = 0;
> + stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;

Does the following comment in ReadBuffersOperation need an update?

/*
 * The following members should be set by the caller.  If only smgr is
 * provided without rel, then smgr_persistence can be set to override the
 * default assumption of RELPERSISTENCE_PERMANENT.
 */

> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c

> +/*
> + * Helper struct for read stream object used in
> + * RelationCopyStorageUsingBuffer() function.
> + */
> +struct copy_storage_using_buffer_read_stream_private
> +{
> + BlockNumber blocknum;
> + int64   last_block;
> +};

Why is last_block an int64, not a BlockNumber?

> @@ -4667,19 +4698,31 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,

>   /* Iterate over each block of the source relation file. */
>   for (blkno = 0; blkno < nblocks; blkno++)
>   {
>   CHECK_FOR_INTERRUPTS();
>  
>   /* Read block from source relation. */
> - srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
> -    RBM_NORMAL, bstrategy_src,
> -    permanent);
> + srcBuf = read_stream_next_buffer(src_stream, NULL);
>   LockBuffer(srcBuf, BUFFER_LOCK_SHARE);

I think this should check for read_stream_next_buffer() returning
InvalidBuffer.  pg_prewarm doesn't, but the other callers do, and I think the
other callers are a better model.  LockBuffer() doesn't check the
InvalidBuffer case, so let's avoid the style of using a
read_stream_next_buffer() return value without checking.

Thanks,
nm




Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-07-11 Thread Noah Misch
On Tue, Jul 09, 2024 at 05:47:36PM -0700, Jeff Davis wrote:
> On Tue, 2024-07-09 at 15:20 +0900, Michael Paquier wrote:
> > On Sun, Jun 30, 2024 at 03:23:44PM -0700, Noah Misch wrote:

> > Hmm.  Is RestrictSearchPath() something that we should advertise more
> > strongly, thinking here about extensions that call NewGUCNestLevel()?
> > That would be really easy to miss, and it could have bad
> > consequences.
> > I know that this is not something that's published in the release
> > notes, but it looks like something sensible to have, though.
> 
> The pattern also involves SetUserIdAndSecContext(). Perhaps we could
> come up with a wrapper function to better encapsulate the general
> pattern?

Worth a look.  usercontext.c has an existing wrapper for a superuser process
switching to an untrusted user.  It could become the home for another wrapper
targeting MAINTAIN-relevant callers.
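
For illustration, such a wrapper might look like this (the name and struct are
invented for the sketch):

    typedef struct MaintenanceContext
    {
        Oid         save_userid;
        int         save_sec_context;
        int         save_nestlevel;
    } MaintenanceContext;

    static void
    BeginMaintenanceOperation(Oid relowner, MaintenanceContext *ctx)
    {
        GetUserIdAndSecContext(&ctx->save_userid, &ctx->save_sec_context);
        SetUserIdAndSecContext(relowner, ctx->save_sec_context |
                               SECURITY_RESTRICTED_OPERATION);
        ctx->save_nestlevel = NewGUCNestLevel();
        RestrictSearchPath();
    }

    static void
    EndMaintenanceOperation(MaintenanceContext *ctx)
    {
        /* Roll back GUC changes, then restore userid and security context. */
        AtEOXact_GUC(false, ctx->save_nestlevel);
        SetUserIdAndSecContext(ctx->save_userid, ctx->save_sec_context);
    }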

> > > While "not necessary for security", ExecCreateTableAs() should do
> > > it for the
> > > same reason it calls NewGUCNestLevel().
> > 
> > +1.
> 
> Do you have a suggestion about how that should be done?
> 
> It's not trivial, because the both creates the table and populates it
> in ExecutorRun. For table creation, we need to use the original
> search_path, but we need to use the restricted search_path when
> populating it.
> 
> I could try to refactor it into two statements and execute them
> separately, or I could try to rewrite the statement to use a fully-
> qualified destination table before execution. Thoughts?

Those sound fine.  Also fine: just adding a comment on why creation namespace
considerations led to not doing it there.




Re: Built-in CTYPE provider

2024-07-11 Thread Noah Misch
On Tue, Jul 09, 2024 at 04:20:12PM -0700, Jeff Davis wrote:
> On Mon, 2024-07-08 at 18:05 -0700, Noah Misch wrote:
> > I'm thinking about these
> > aggravating factors for $SUBJECT:
> 
> This is still marked as an open item for 17, but you've already
> acknowledged[1] that no code changes are necessary in version 17.

Later posts on the thread made that obsolete.  The next step is to settle the
question at https://postgr.es/m/20240706195129...@rfd.leadboat.com.  If that
conclusion entails a remedy, v17 code changes may be part of that remedy.

> The idea that you're arguing against is "stability within a PG major
> version". There's no new discovery here: it was listed under the
> heading of "Benefits" near the top of my initial proposal[2]

Thanks for that distillation.  More specifically, I'm arguing against the lack
of choice about instability across major versions:

                                  | ICU collations    | pg_c_utf8
----------------------------------|-------------------|----------
Corruption within a major version | packager's choice | no
Corruption at pg_upgrade time     | packager's choice | yes

I am a packager who chooses no-corruption (chooses stability).  As a packager,
the pg_c_utf8 stability within major versions is never a bad thing, but it
does not compensate for instability across major versions.  I don't want a
future in which pg_c_utf8 is the one provider that integrity-demanding
pg_upgrade users should not use.

> and known to all reviewers.

If after https://postgr.es/m/20240706195129...@rfd.leadboat.com and
https://postgr.es/m/20240709010545.8c.nmi...@google.com they think $SUBJECT
should continue as-committed, they should vote that way.  Currently, we have
multiple votes in one direction and multiple votes in the other direction.  If
all three reviewers were to vote in the same direction (assuming no other new
votes), I argue that such a count would render whichever way they vote as the
conclusion.  Does that match your count?

> [1]
> https://www.postgresql.org/message-id/20240701230352.2c.nmi...@google.com
> [2]
> https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com




Re: Built-in CTYPE provider

2024-07-08 Thread Noah Misch
On Sat, Jul 06, 2024 at 04:19:21PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > As a released feature, NORMALIZE() has a different set of remedies to choose
> > from, and I'm not proposing one.  I may have sidetracked this thread by
> > talking about remedies without an agreement that pg_c_utf8 has a problem.  My
> > question for the PostgreSQL maintainers is this:
> 
> >   textregexeq(... COLLATE pg_c_utf8, '[[:alpha:]]') and lower(), despite being
> >   IMMUTABLE, will change behavior in some major releases.  pg_upgrade does not
> >   have a concept of IMMUTABLE functions changing, so index scans will return
> >   wrong query results after upgrade.  Is it okay for v17 to release a
> >   pg_c_utf8 planned to behave that way when upgrading v17 to v18+?
> 
> I do not think it is realistic to define "IMMUTABLE" as meaning that
> the function will never change behavior until the heat death of the
> universe.  As a counterexample, we've not worried about applying
> bug fixes or algorithm improvements that change the behavior of
> "immutable" numeric computations.

True.  There's a continuum from "releases can change any IMMUTABLE function"
to "index integrity always wins, even if a function is as wrong as 1+1=3".
I'm less concerned about the recent "Incorrect results from numeric round"
thread, even though it's proposing to back-patch.  I'm thinking about these
aggravating factors for $SUBJECT:

- $SUBJECT is planning an annual cadence of this kind of change.

- We already have ICU providing collation support for the same functions.
  Unlike $SUBJECT, ICU integration gives packagers control over when to accept
  corruption at pg_upgrade time.

- SQL Server, DB2 and Oracle do their Unicode updates in a non-corrupting way.
  (See Jeremy Schneider's reply concerning DB2 and Oracle.)

- lower() and regexp are more popular in index expressions than
  high-digit-count numeric calculations.

> I'd say a realistic policy is "immutable means we don't intend to
> change it within a major release".  If we do change the behavior,
> either as a bug fix or a major-release improvement, that should
> be release-noted so that people know they have to rebuild dependent
> indexes and matviews.

It sounds like you're very comfortable with $SUBJECT proceeding in its current
form.  Is that right?




Re: Faster "SET search_path"

2024-07-08 Thread Noah Misch
On Mon, Jul 08, 2024 at 04:39:21PM -0700, Jeff Davis wrote:
> On Sun, 2024-06-30 at 15:30 -0700, Noah Misch wrote:
> > You're caching the result of object_aclcheck(NamespaceRelationId,
> > ...), so
> > pg_auth_members changes
> 
> Thank you for the report.
> 
> Question: to check for changes to pg_auth_members, it seems like either
> AUTHMEMROLEMEM or AUTHMEMMEMROLE work, and to use both would be
> redundant. Am I missing something, or should I just pick one
> arbitrarily (or by some convention)?

One suffices.  initialize_acl() is the only other place that invals on this
catalog, so consider it a fine convention to mimic.
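A sketch of mimicking that in namespace.c (the callback name is illustrative;
a complete fix must also drop or revalidate the cached path entries, since
they embed ACL results):

  static void
  InvalidationCallback(Datum arg, int cacheid, uint32 hashvalue)
  {
  	/* force recomputation, bypassing the search path cache */
  	baseSearchPathValid = false;
  }

  /* at backend initialization, alongside the pg_namespace callback */
  if (!IsBootstrapProcessingMode())
  	CacheRegisterSyscacheCallback(AUTHMEMROLEMEM,
  								  InvalidationCallback,
  								  (Datum) 0);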

> >  and pg_database changes also need to invalidate this
> > cache.  (pg_database affects the ACL_CREATE_TEMP case in
> > pg_namespace_aclmask_ext()
> 
> I am having trouble finding an actual problem with ACL_CREATE_TEMP.
> search_path ACL checks are normally bypassed for the temp namespace, so
> it can only happen when the actual temp namespace name (e.g.
> pg_temp_NNN) is explicitly included. In that case, the mask is
> ACL_USAGE, so the two branches in pg_namespace_aclmask_ext() are
> equivalent, right?

I don't know.

> This question is not terribly important for the fix, because
> invalidating for each pg_database change is still necessary as you
> point out




Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-07-08 Thread Noah Misch
On Mon, Jul 08, 2024 at 02:09:21PM -0700, Jacob Champion wrote:
> On Sun, Jun 30, 2024 at 10:48 AM Noah Misch  wrote:
> > That said, it
> > may be more fruitful to arrange for authentication timeout to cut through 
> > PAM
> > etc.
> 
> That seems mostly out of our hands -- the misbehaving modules are free
> to ignore our signals (and do). Is there another way to force the
> issue?

Two ways at least (neither of them cheap):
- Invoke PAM in a subprocess, and SIGKILL that process if needed.
- Modify the module to be interruptible.
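A toy illustration of the first option follows; run_pam_auth() stands in for
the real PAM conversation, and real code would also need to ship the
conversation data across the process boundary:

  #include <signal.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static int
  run_pam_auth(void)
  {
  	return 0;					/* placeholder for pam_authenticate() etc. */
  }

  static int
  auth_with_timeout(int timeout_secs)
  {
  	pid_t		child = fork();
  	int			status;

  	if (child < 0)
  		return -1;
  	if (child == 0)
  		_exit(run_pam_auth());
  	for (int waited = 0; waited < timeout_secs; waited++)
  	{
  		if (waitpid(child, &status, WNOHANG) == child)
  			return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
  		sleep(1);
  	}
  	kill(child, SIGKILL);		/* module ignored our signals; force it */
  	waitpid(child, NULL, 0);
  	return -1;
  }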

> > Hanging connection slots hurt even if they lack an xmin.
> 
> Oh, would releasing the xmin not really move the needle, then?

It still moves the needle.




Re: Confine vacuum skip logic to lazy_scan_skip

2024-07-07 Thread Noah Misch
On Fri, Jun 28, 2024 at 05:36:25PM -0400, Melanie Plageman wrote:
> I've attached a WIP v11 streaming vacuum patch set here that is
> rebased over master (by Thomas), so that I could add a CF entry for
> it. It still has the problem with the extra WAL write and fsync calls
> investigated by Thomas above. Thomas has some work in progress doing
> streaming write-behind to alleviate the issues with the buffer access
> strategy and streaming reads. When he gets a version of that ready to
> share, he will start a new "Streaming Vacuum" thread.

To avoid reviewing the wrong patch, I'm writing to verify the status here.
This is Needs Review in the commitfest.  I think one of these two holds:

1. Needs Review is valid.
2. It's actually Waiting on Author.  You're commissioning a review of the
   future-thread patch, not this one.

If it's (1), given the WIP marking, what is the scope of the review you seek?
I'm guessing performance is out of scope; what else is in or out of scope?




Re: Vectored I/O in bulk_write.c

2024-07-06 Thread Noah Misch
On Tue, Apr 09, 2024 at 04:51:52PM +1200, Thomas Munro wrote:
> Here's a rebase.  I decided against committing this for v17 in the
> end.  There's not much wrong with it AFAIK, except perhaps an
> unprincipled chopping up of writes with large io_combine_limit due to
> simplistic flow control, and I liked the idea of having a decent user
> of smgrwritev() in the tree, and it probably makes CREATE INDEX a bit
> faster, but... I'd like to try something more ambitious that
> streamifies this and also the "main" writeback paths.

I see this in the CF as Needs Review since 2024-03-10, but this 2024-04-09
message sounds like you were abandoning it.  Are you still commissioning a
review of this thread's patches, or is the CF entry obsolete?




Re: Built-in CTYPE provider

2024-07-06 Thread Noah Misch
On Fri, Jul 05, 2024 at 02:38:45PM -0700, Jeff Davis wrote:
> On Thu, 2024-07-04 at 14:26 -0700, Noah Misch wrote:
> > I think you're saying that if some Unicode update changes the results
> > of a
> > STABLE function but does not change the result of any IMMUTABLE
> > function, we
> > may as well import that update.  Is that about right?  If so, I
> > agree.
> 
> If you are proposing that Unicode updates should not be performed if
> they affect the results of any IMMUTABLE function, then that's a new
> policy.
> 
> For instance, the results of NORMALIZE() changed from PG15 to PG16 due
> to commit 1091b48cd7:
> 
>   SELECT NORMALIZE(U&'\+01E030',nfkc)::bytea;
> 
>   Version 15: \xf09e80b0
> 
>   Version 16: \xd0b0

As a released feature, NORMALIZE() has a different set of remedies to choose
from, and I'm not proposing one.  I may have sidetracked this thread by
talking about remedies without an agreement that pg_c_utf8 has a problem.  My
question for the PostgreSQL maintainers is this:

  textregexeq(... COLLATE pg_c_utf8, '[[:alpha:]]') and lower(), despite being
  IMMUTABLE, will change behavior in some major releases.  pg_upgrade does not
  have a concept of IMMUTABLE functions changing, so index scans will return
  wrong query results after upgrade.  Is it okay for v17 to release a
  pg_c_utf8 planned to behave that way when upgrading v17 to v18+?

If the answer is yes, the open item closes.  If the answer is no, determining
the remedy can come next.


Lest concrete details help anyone reading, here are some affected objects:

  CREATE TABLE t (s text COLLATE pg_c_utf8);
  INSERT INTO t VALUES (U&'\+00a7dc'), (U&'\+001dd3');
  CREATE INDEX iexpr ON t ((lower(s)));
  CREATE INDEX ipred ON t (s) WHERE s ~ '[[:alpha:]]';

v17 can simulate the Unicode aspect of a v18 upgrade, like this:

  sed -i 's/^UNICODE_VERSION.*/UNICODE_VERSION = 16.0.0/' src/Makefile.global.in
  # ignore test failures (your ICU likely doesn't have the Unicode 16.0.0 draft)
  make -C src/common/unicode update-unicode
  make
  make install
  pg_ctl restart

Behavior after that:

-- 2 rows w/ seq scan, 0 rows w/ index scan
SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';
SET enable_seqscan = off;
SELECT 1 FROM t WHERE s ~ '[[:alpha:]]';

-- ERROR:  heap tuple (0,1) from table "t" lacks matching index tuple within index "iexpr"
SELECT bt_index_parent_check('iexpr', heapallindexed => true);
-- ERROR:  heap tuple (0,1) from table "t" lacks matching index tuple within index "ipred"
SELECT bt_index_parent_check('ipred', heapallindexed => true);




Re: race condition in pg_class

2024-07-04 Thread Noah Misch
On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> 28.06.2024 08:13, Noah Misch wrote:
> > Pushed. ...
> 
> Please look also at another anomaly, I've discovered.
> 
> An Assert added with d5f788b41 may be falsified with:
> CREATE TABLE t(a int PRIMARY KEY);
> INSERT INTO t VALUES (1);
> CREATE VIEW v AS SELECT * FROM t;
> 
> MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
>   WHEN MATCHED THEN DO NOTHING
>   WHEN NOT MATCHED THEN DO NOTHING;
> 
> TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: "nodeModifyTable.c", Line: 2891, PID: 1590670

Thanks.  When all the MERGE actions are DO NOTHING, view_has_instead_trigger()
returns true, so we use the wholerow code regardless of the view's triggers or
auto update capability.  The behavior is fine, so I'm fixing the new assertion
and comments with new patch inplace087-merge-DO-NOTHING-v8.patch.  The closest
relevant tests processed zero rows, so they narrowly avoided witnessing this
assertion.
Author: Noah Misch 
Commit: Noah Misch 

Don't lose partitioned table reltuples=0 after relhassubclass=f.

ANALYZE sets relhassubclass=f when a partitioned table no longer has
partitions.  An ANALYZE doing that proceeded to apply the inplace update
of pg_class.reltuples to the old pg_class tuple instead of the new
tuple, losing that reltuples=0 change if the ANALYZE committed.
Non-partitioning inheritance trees were unaffected.  Back-patch to v14,
where commit 375aed36ad83f0e021e9bdd3a0034c0c992c66dc introduced
maintenance of partitioned table pg_class.reltuples.

Reviewed by FIXME.  Reported by Alexander Lakhin.

Discussion: https://postgr.es/m/a295b499-dcab-6a99-c06e-01cf60593...@gmail.com

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7d2cd24..c590a2a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -629,7 +629,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
else
relallvisible = 0;
 
-   /* Update pg_class for table relation */
+   /*
+* Update pg_class for table relation.  CCI first, in case acquirefunc
+* updated pg_class.
+*/
+   CommandCounterIncrement();
vac_update_relstats(onerel,
relpages,
totalrows,
@@ -664,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 * Partitioned tables don't have storage, so we don't set any fields
 * in their pg_class entries except for reltuples and relhasindex.
 */
+   CommandCounterIncrement();
vac_update_relstats(onerel, -1, totalrows,
						0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 330fcd8..2eba712 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -83,6 +83,53 @@ INSERT INTO vactst SELECT generate_series(301, 400);
 DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
 ANALYZE vactst;
 COMMIT;
+-- Test ANALYZE setting relhassubclass=f for non-partitioning inheritance
+BEGIN;
+CREATE TABLE past_inh_parent ();
+CREATE TABLE past_inh_child () INHERITS (past_inh_parent);
+INSERT INTO past_inh_child DEFAULT VALUES;
+INSERT INTO past_inh_child DEFAULT VALUES;
+ANALYZE past_inh_parent;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_inh_parent'::regclass;
+ reltuples | relhassubclass 
+-----------+----------------
+         0 | t
+(1 row)
+
+DROP TABLE past_inh_child;
+ANALYZE past_inh_parent;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_inh_parent'::regclass;
+ reltuples | relhassubclass 
+-----------+----------------
+         0 | f
+(1 row)
+
+COMMIT;
+-- Test ANALYZE setting relhassubclass=f for partitioning
+BEGIN;
+CREATE TABLE past_parted (i int) PARTITION BY LIST(i);
+CREATE TABLE past_part PARTITION OF past_parted FOR VALUES IN (1);
+INSERT INTO past_parted VALUES (1),(1);
+ANALYZE past_parted;
+DROP TABLE past_part;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_parted'::regclass;
+ reltuples | relhassubclass 
+-----------+----------------
+         2 | t
+(1 row)
+
+ANALYZE past_parted;
+SELECT reltuples, relhassubclass
+  FROM pg_class WHERE oid = 'past_parted'::regclass;
+ reltuples | relhassubclass 
+-----------+----------------
+         0 | f
+(1 row)
+
+COMMIT;
 VACUUM FULL pg_am;
 VACUUM FULL pg_class;
 VACUUM FUL

Re: Built-in CTYPE provider

2024-07-04 Thread Noah Misch
On Wed, Jul 03, 2024 at 02:19:07PM -0700, Jeff Davis wrote:
> * Unless I made a mistake, the last three releases of Unicode (14.0,
> 15.0, and 15.1) all have the exact same behavior for UPPER() and
> LOWER() -- even for unassigned code points. It would be silly to
> promise to stay with 15.1 and then realize that moving to 16.0 doesn't
> create any actual problem.

I think you're saying that if some Unicode update changes the results of a
STABLE function but does not change the result of any IMMUTABLE function, we
may as well import that update.  Is that about right?  If so, I agree.

In addition to the options I listed earlier (error in pg_upgrade or document
that IMMUTABLE stands) I would be okay with a third option.  Decide here that
we'll not adopt a Unicode update in a way that changes a v17 IMMUTABLE
function result of the new provider.  We don't need to write that in the
documentation, since it's implicit in IMMUTABLE.  Delete the "stable within a
Postgres major version" documentation text.

> * While someone can pin libc+ICU to particular versions, it's
> impossible when using the official packages, and additionally requires
> using something like [1], which just became available last year. I
> don't think it's reasonable to put it forth as a matter-of-fact
> solution.
> 
> * Let's keep some perspective: we've lived for a long time with ALL
> text indexes at serious risk of breakage. In contrast, the concerns you
> are raising now are about certain kinds of expression indexes over data
> containing certain unassigned code points. I am not dismissing that
> concern, but the builtin provider moves us in the right direction and
> let's not lose sight of that.

I see you're trying to help users get less breakage, and that's a good goal.
I agree $SUBJECT eliminates libc+ICU breakage, and libc+ICU breakage has hurt
plenty.  However, you proposed to update Unicode data and give REINDEX as the
solution to breakage this causes.  Unlike libc+ICU breakage, the packager has
no escape from that.  That's a different kind of breakage proposition, and no
new PostgreSQL feature should do that.  It's on a different axis from helping
users avoid libc+ICU breakage, and a feature doesn't get to credit helping on
one axis against a regression on the other axis.  What am I missing here?

> Given that no code changes for v17 are proposed, I suggest that we
> refrain from making any declarations until the next version of Unicode
> is released. If the pattern holds, that will be around September, which
> still leaves time to make reasonable decisions for v18.

Soon enough, a Unicode release will add one character to regexp [[:alpha:]].
PostgreSQL will then need to decide what IMMUTABLE is going to mean.  How does
that get easier in September?

Thanks,
nm

> [1] https://github.com/awslabs/compat-collation-for-glibc




Re: race condition in pg_class

2024-07-03 Thread Noah Misch
On Wed, Jul 03, 2024 at 04:09:54PM -0700, Noah Misch wrote:
> On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:
> > 29.06.2024 05:42, Noah Misch wrote:
> > > Good point, any effort on (2) would be wasted once the fixes get certified.  I
> > > pushed (1).  I'm attaching the rebased fix patches.
> > 
> > Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
> > CREATE TABLE t (i int) PARTITION BY LIST(i);
> > CREATE TABLE p1 (i int);
> > ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
> > ALTER TABLE t DETACH PARTITION p1;
> > ANALYZE t;
> > 
> > triggers unexpected
> > ERROR:  tuple to be updated was already modified by an operation triggered by the current command
> 
> Thanks.  Today, it's okay to issue heap_inplace_update() after heap_update()
> without an intervening CommandCounterIncrement().

Correction: it's not okay today.  If code does that, heap_inplace_update()
mutates a tuple that is going to become invisible at CCI.  The lack of CCI
yields a minor live bug in v14+.  Its consequences seem to be limited to
failing to update reltuples for a partitioned table having zero partitions.

> The patch makes the CCI
> required.  The ANALYZE in your example reaches this with a heap_update to set
> relhassubclass=f.  I've fixed this by just adding a CCI (and adding to the
> tests in vacuum.sql).

That's still the right fix, but I've separated it into its own patch and
expanded the test.  All the non-comment changes between v5 and v6 are now part
of the separate patch.

> The alternative would be to allow inplace updates on TM_SelfModified tuples.
> I can't think of a specific problem with allowing that, but I feel that would
> make system state interactions harder to reason about.  It might be optimal to
> allow that in back branches only, to reduce the chance of releasing a bug like
> the one you found.

Allowing a mutation of a TM_SelfModified tuple is bad, since that tuple is
going to become dead soon.  Mutating its successor could be okay.  Since we'd
expect such code to be unreachable, I'm not keen to carry such code.  For that
scenario, I'd rather keep the error you encountered.  Other opinions?
Author: Noah Misch 
Commit: Noah Misch 

Don't lose partitioned table reltuples=0 after relhassubclass=f.

ANALYZE sets relhassubclass=f when a partitioned table no longer has
partitions.  An ANALYZE doing that proceeded to apply the inplace update
of pg_class.reltuples to the old pg_class tuple instead of the new
tuple, losing that reltuples=0 change if the ANALYZE committed.
Non-partitioning inheritance trees were unaffected.  Back-patch to v14,
where commit 375aed36ad83f0e021e9bdd3a0034c0c992c66dc introduced
maintenance of partitioned table pg_class.reltuples.

Reviewed by FIXME.  Reported by Alexander Lakhin.

Discussion: https://postgr.es/m/a295b499-dcab-6a99-c06e-01cf60593...@gmail.com

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7d2cd24..c590a2a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -629,7 +629,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
else
relallvisible = 0;
 
-   /* Update pg_class for table relation */
+   /*
+* Update pg_class for table relation.  CCI first, in case acquirefunc
+* updated pg_class.
+*/
+   CommandCounterIncrement();
vac_update_relstats(onerel,
relpages,
totalrows,
@@ -664,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 * Partitioned tables don't have storage, so we don't set any fields
 * in their pg_class entries except for reltuples and relhasindex.
 */
+   CommandCounterIncrement();
vac_update_relstats(onerel, -1, totalrows,
						0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 330fcd8..2eba712 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -83,6 +83,53 @@ INSERT INTO vactst SELECT generate_series(301, 400);
 DELETE FROM vactst WHERE i % 5 <> 0; -- delete a few rows inside
 ANALYZE vactst;
 COMMIT;
+-- Test ANALYZE setting relhassubclass=f for non-partitioning inheritance
+BEGIN;
+CREATE TABLE past_inh_parent ();
+CREATE TABLE past_inh_child () INHERITS (pa

Re: race condition in pg_class

2024-07-03 Thread Noah Misch
On Wed, Jul 03, 2024 at 06:00:00AM +0300, Alexander Lakhin wrote:
> 29.06.2024 05:42, Noah Misch wrote:
> > Good point, any effort on (2) would be wasted once the fixes get certified.  I
> > pushed (1).  I'm attaching the rebased fix patches.
> 
> Please look at a new anomaly, introduced by inplace110-successors-v5.patch:
> CREATE TABLE t (i int) PARTITION BY LIST(i);
> CREATE TABLE p1 (i int);
> ALTER TABLE t ATTACH PARTITION p1 FOR VALUES IN (1);
> ALTER TABLE t DETACH PARTITION p1;
> ANALYZE t;
> 
> triggers unexpected
> ERROR:  tuple to be updated was already modified by an operation triggered by the current command

Thanks.  Today, it's okay to issue heap_inplace_update() after heap_update()
without an intervening CommandCounterIncrement().  The patch makes the CCI
required.  The ANALYZE in your example reaches this with a heap_update to set
relhassubclass=f.  I've fixed this by just adding a CCI (and adding to the
tests in vacuum.sql).

The alternative would be to allow inplace updates on TM_SelfModified tuples.
I can't think of a specific problem with allowing that, but I feel that would
make system state interactions harder to reason about.  It might be optimal to
allow that in back branches only, to reduce the chance of releasing a bug like
the one you found.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..461d925 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was taken via
 * the relation fast-path (and is not known to have been transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the heap_inplace_update_scan()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by FIXME.

Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8b..dbfa2b7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,56 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Locking to write inplace-updated tables
+---------------------------------------
+
+[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.]
+
+If IsInplaceUpdateRelation() returns true for a table, the table is a system
+catalog that receives heap_inplace_update_scan() calls.  Preparing a
+heap_update() of these tables follows additional locking rules, to ensure we
+don't lose the effects of an inplace update.  In particular, consider a moment
+when a backend has fetched the old tuple to modify, not yet having called
+heap_update().  Another backend's inplace update starting then can't conclude
+until the heap_update() places its new tuple in a buffer.  We enforce that
+using locktags as follows.  While DDL code is the main audience, the executor
+follows these rules to make e.g. "MERGE INTO pg_class" safer.  Locking rules
+are per-catalog:
+

Re: cannot abort transaction 2737414167, it was already committed

2024-07-03 Thread Noah Misch
On Thu, May 09, 2024 at 05:19:47PM +1200, Thomas Munro wrote:
> On Thu, Dec 28, 2023 at 11:42 AM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > In CommitTransaction() there is a stretch of code beginning s->state =
> > > TRANS_COMMIT and ending s->state = TRANS_DEFAULT, from which we call
> > > out to various subsystems' AtEOXact_XXX() functions.  There is no way
> > > to roll back in that state, so anything that throws ERROR from those
> > > routines is going to get something much like $SUBJECT.  Hmm, we'd know
> > > which exact code path got that EIO from your smoldering core if we'd
> > > put an explicit critical section there (if we're going to PANIC
> > > anyway, it might as well not be from a different stack after
> > > longjmp()...).
> >
> > +1, there's basically no hope of debugging this sort of problem
> > as things stand.
> 
> I was reminded of this thread by Justin's other file system snafu thread.
> 
> Naively defining a critical section to match the extent of the
> TRANS_COMMIT state doesn't work, as a bunch of code under there uses
> palloc().  That reminds me of the nearby RelationTruncate() thread,
> and there is possibly even some overlap, plus more in this case...
> ugh.
> 
> Hmm, AtEOXact_RelationMap() is one of those steps, but lives just
> outside the crypto-critical-section created by TRANS_COMMIT, though
> has its own normal CS for logging.  I wonder, given that "updating the
> map file is effectively commit of the relocation", why wouldn't it
> have a variant of the problem solved by DELAY_CHKPT_START for normal
> commit records, under diabolical scheduling?  It's a stretch, but: You
> log XLOG_RELMAP_UPDATE, a concurrent checkpoint runs with REDO after
> that record, you crash before/during durable_rename(), and then you
> perform crash recovery.

See the CheckPointRelationMap() header comment for how relmapper behaves like
DELAY_CHKPT_START without using that flag.  I think its mechanism suffices.




Re: Relation bulk write facility

2024-07-02 Thread Noah Misch
On Tue, Jul 02, 2024 at 02:42:50PM +0300, Heikki Linnakangas wrote:
> On 02/07/2024 02:24, Noah Misch wrote:
> > On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:

> log_newpage_range() loads the pages to the buffer
> cache and dirties them. That kinds of sucks actually, I wish it didn't need
> to dirty the buffers.

Agreed.

> > > Fortunately, fsync() on a file that's already flushed to disk is pretty
> > > cheap.
> > 
> > Yep.  I'm more concerned about future readers wondering why the function is
> > using LSNs to decide what to do about data that doesn't appear in WAL.  A
> > comment could be another way to fix that, though.
> 
> Agreed, this is all very subtle, and deserves a good comment. What do you
> think of the attached?

Looks good.  Thanks.  pgindent doesn't preserve all your indentation, but it
doesn't make things objectionable, either.




Re: Built-in CTYPE provider

2024-07-02 Thread Noah Misch
On Wed, Jul 03, 2024 at 12:05:09AM +0200, Peter Eisentraut wrote:
> On 02.07.24 18:51, Noah Misch wrote:
> > On Mon, Jul 01, 2024 at 06:19:08PM -0700, Jeff Davis wrote:
> > > On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote:
> > > >    An alternative would be to make pg_upgrade reject
> > > > operating on a cluster that contains use of $SUBJECT.
> > > 
> > > That wouldn't help anyone.
> > 
> > Can you say more about that?  For the last decade at least, I think our
> > standard for new features has been to error rather than allow an operation
> > that creates a known path to wrong query results.  I think that's a helpful
> > standard that we should continue to follow.
> 
> I don't think the builtin locale provider is any different in this respect
> from the other providers:  The locale data might change and there is a
> version mechanism to track that.  We don't prevent pg_upgrade in scenarios
> like that for other providers.

Each packager can choose their dependencies so the v16 providers don't have
the problem.  With the $SUBJECT provider, a packager won't have that option.




Re: collect_corrupt_items_vacuum.patch

2024-07-02 Thread Noah Misch
On Wed, Jul 03, 2024 at 12:31:48AM +0300, Alexander Korotkov wrote:
> On Mon, Jul 1, 2024 at 2:18 AM Noah Misch  wrote:
> > Commit e85662d wrote:
> > > --- a/src/backend/storage/ipc/procarray.c
> > > +++ b/src/backend/storage/ipc/procarray.c
> >
> > > @@ -2740,6 +2741,8 @@ GetRunningTransactionData(void)
> > >*/
> > >   for (index = 0; index < arrayP->numProcs; index++)
> > >   {
> > > + int pgprocno = arrayP->pgprocnos[index];
> > > + PGPROC *proc = &allProcs[pgprocno];
> > >   TransactionId xid;
> > >
> > >   /* Fetch xid just once - see GetNewTransactionId */
> > > @@ -2760,6 +2763,13 @@ GetRunningTransactionData(void)
> > >   if (TransactionIdPrecedes(xid, oldestRunningXid))
> > >   oldestRunningXid = xid;
> > >
> > > + /*
> > > +  * Also, update the oldest running xid within the current database.
> > > +  */
> > > + if (proc->databaseId == MyDatabaseId &&
> > > + TransactionIdPrecedes(xid, oldestRunningXid))
> > > + oldestDatabaseRunningXid = xid;
> >
> > Shouldn't that be s/oldestRunningXid/oldestDatabaseRunningXid/?
> 
> Thank you for catching this.
> 
> > While this isn't a hot path, I likely would test TransactionIdPrecedes()
> > before fetching pgprocno and PGPROC, to reduce wasted cache misses.
> 
> And thanks for suggestion.
> 
> The patchset is attached.  0001 implements
> s/oldestRunningXid/oldestDatabaseRunningXid/.  0002 implements cache
> misses optimization.
> 
> If no objection, I'll backpatch 0001 and apply 0002 to the head.

Looks fine.  I'd drop the comment update as saying the obvious, but keeping it
is okay.




Re: Built-in CTYPE provider

2024-07-02 Thread Noah Misch
On Mon, Jul 01, 2024 at 06:19:08PM -0700, Jeff Davis wrote:
> On Mon, 2024-07-01 at 16:03 -0700, Noah Misch wrote:
> >   An alternative would be to make pg_upgrade reject
> > operating on a cluster that contains use of $SUBJECT.
> 
> That wouldn't help anyone.

Can you say more about that?  For the last decade at least, I think our
standard for new features has been to error rather than allow an operation
that creates a known path to wrong query results.  I think that's a helpful
standard that we should continue to follow.




Re: Relation bulk write facility

2024-07-01 Thread Noah Misch
On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
> On 01/07/2024 23:52, Noah Misch wrote:
> > Commit 8af2565 wrote:
> > > --- /dev/null
> > > +++ b/src/backend/storage/smgr/bulk_write.c
> > 
> > > +/*
> > > + * Finish bulk write operation.
> > > + *
> > > + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
> > > + * the relation if needed.
> > > + */
> > > +void
> > > +smgr_bulk_finish(BulkWriteState *bulkstate)
> > > +{
> > > + /* WAL-log and flush any remaining pages */
> > > + smgr_bulk_flush(bulkstate);
> > > +
> > > + /*
> > > +  * When we wrote out the pages, we passed skipFsync=true to avoid the
> > > +  * overhead of registering all the writes with the checkpointer.  Register
> > > +  * the whole relation now.
> > > +  *
> > > +  * There is one hole in that idea: If a checkpoint occurred while we were
> > > +  * writing the pages, it already missed fsyncing the pages we had written
> > > +  * before the checkpoint started.  A crash later on would replay the WAL
> > > +  * starting from the checkpoint, therefore it wouldn't replay our earlier
> > > +  * WAL records.  So if a checkpoint started after the bulk write, fsync
> > > +  * the files now.
> > > +  */
> > > + if (!SmgrIsTemp(bulkstate->smgr))
> > > + {
> > 
> > Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
> > decision is irrelevant to the !wal case.  Either we don't need fsync at all
> > (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).
> 
> The point of GetRedoRecPtr() is to detect if a checkpoint has started
> concurrently. It works for that purpose whether or not the bulk load is
> WAL-logged. It is not compared with the LSNs of WAL records written by the
> bulk load.

I think the significance of start_RedoRecPtr is it preceding all records
needed to recreate the bulk write.  If start_RedoRecPtr==GetRedoRecPtr() and
we crash after commit, we're indifferent to whether the rel gets synced at a
checkpoint before that crash or rebuilt from WAL after that crash.  If
start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
deleted, so only smgrimmedsync() suffices.  Overall, while it is not compared
with LSNs in WAL records, it's significant only to the extent that such a WAL
record exists.  What am I missing?

> Unlogged tables do need to be fsync'd. The scenario is:
> 
> 1. Bulk load an unlogged table.
> 2. Shut down Postgres cleanly
> 3. Pull power plug from server, and restart.
> 
> We talked about this earlier in the "Unlogged relation copy is not fsync'd"
> thread [1]. I had already forgotten about that; that bug actually still
> exists in back branches, and we should fix it..
> 
> [1] 
> https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi

Ah, that's right.  I agree this code suffices for unlogged.  As a further
optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
always call smgrregistersync().  (For any rel, smgrimmedsync() improves on
smgrregistersync() only if we fail to reach the shutdown checkpoint.  Without
a shutdown checkpoint, unlogged rels get reset anyway.)

> > I don't see any functional problem, but this likely arranges for an
> > unnecessary sync when a checkpoint starts between mdcreate() and
> > here.  (The mdcreate() sync may also be unnecessary, but that's
> > longstanding.)
> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
> fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
> fsyncing it, and in that case we do still need the fsync.

We do not need the fsync in the "WAL-log the relation instead" case; see
https://postgr.es/m/20230921062210.ga110...@rfd.leadboat.com

So maybe like this:

  if (use_wal)                  /* includes init forks */
      current logic;
  else if (unlogged)
      smgrregistersync;
  /* else temp || (permanent && wal_level=minimal): nothing to do */
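Spelled out slightly more (the relpersistence field on BulkWriteState is an
assumption; today the struct only keeps an SMgrRelation, so a real patch would
need to thread the persistence through):

  if (bulkstate->use_wal)
  {
  	/* includes init forks: keep the GetRedoRecPtr()-based logic above */
  }
  else if (bulkstate->relpersistence == RELPERSISTENCE_UNLOGGED)
  {
  	/* no WAL to replay after a crash; hand the sync to the checkpointer */
  	smgrregistersync(bulkstate->smgr, bulkstate->forknum);
  }
  /*
   * else temp (never fsync'd) or permanent under wal_level=minimal
   * (smgrDoPendingSyncs() handles it at commit): nothing to do
   */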

> Fortunately, fsync() on a file that's already flushed to disk is pretty
> cheap.

Yep.  I'm more concerned about future readers wondering why the function is
using LSNs to decide what to do about data that doesn't appear in WAL.  A
comment could be another way to fix that, though.




Re: Built-in CTYPE provider

2024-07-01 Thread Noah Misch
On Mon, Jul 01, 2024 at 12:24:15PM -0700, Jeff Davis wrote:
> On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> > lower(), initcap(), upper(), and regexp_matches() are
> > PROVOLATILE_IMMUTABLE.
> > Until now, we've delegated that responsibility to the user.  The user
> > is
> > supposed to somehow never update libc or ICU in a way that changes
> > outcomes
> > from these functions.
> 
> To me, "delegated" connotes a clear and organized transfer of
> responsibility to the right person to solve it. In that sense, I
> disagree that we've delegated it.

Good point.

> >   Now that postgresql.org is taking that responsibility
> > for builtin C.UTF-8, how should we govern it?  I think the above text
> > and [1]
> > convey that we'll update the Unicode data between major versions,
> > making
> > functions like lower() effectively STABLE.  Is that right?
> 
> Marking them STABLE is not a viable option, that would break a lot of
> valid use cases, e.g. an index on LOWER().

I agree.

> I don't think we need code changes for 17. Some documentation changes
> might be helpful, though. Should we have a note around LOWER()/UPPER()
> that users should REINDEX any dependent indexes when the provider is
> updated?

I agree the v17 code is fine.  Today, a user can (with difficulty) choose
dependency libraries so regexp_matches() is IMMUTABLE, as marked.  I don't
want $SUBJECT to be the ctype that, at some post-v17 version, can't achieve
that with unpatched PostgreSQL.  Let's change the documentation to say this
provider uses a particular snapshot of Unicode data, taken around PostgreSQL
17.  We plan never to change that data, so IMMUTABLE functions can rely on the
data.  If we provide a newer Unicode data set in the future, we'll provide it
in such a way that DDL must elect the new data.  How well would that suit your
vision for this feature?  An alternative would be to make pg_upgrade reject
operating on a cluster that contains use of $SUBJECT.




Re: Relation bulk write facility

2024-07-01 Thread Noah Misch
On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> Committed this. Thanks everyone!

Commit 8af2565 wrote:
> --- /dev/null
> +++ b/src/backend/storage/smgr/bulk_write.c

> +/*
> + * Finish bulk write operation.
> + *
> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
> + * the relation if needed.
> + */
> +void
> +smgr_bulk_finish(BulkWriteState *bulkstate)
> +{
> + /* WAL-log and flush any remaining pages */
> + smgr_bulk_flush(bulkstate);
> +
> + /*
> +  * When we wrote out the pages, we passed skipFsync=true to avoid the
> +  * overhead of registering all the writes with the checkpointer.  Register
> +  * the whole relation now.
> +  *
> +  * There is one hole in that idea: If a checkpoint occurred while we were
> +  * writing the pages, it already missed fsyncing the pages we had written
> +  * before the checkpoint started.  A crash later on would replay the WAL
> +  * starting from the checkpoint, therefore it wouldn't replay our earlier
> +  * WAL records.  So if a checkpoint started after the bulk write, fsync
> +  * the files now.
> +  */
> + if (!SmgrIsTemp(bulkstate->smgr))
> + {

Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
decision is irrelevant to the !wal case.  Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).  I
don't see any functional problem, but this likely arranges for an unnecessary
sync when a checkpoint starts between mdcreate() and here.  (The mdcreate()
sync may also be unnecessary, but that's longstanding.)

> + /*
> +  * Prevent a checkpoint from starting between the GetRedoRecPtr() and
> +  * smgrregistersync() calls.
> +  */
> + Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> + MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> +
> + if (bulkstate->start_RedoRecPtr != GetRedoRecPtr())
> + {
> + /*
> +  * A checkpoint occurred and it didn't know about our writes, so
> +  * fsync() the relation ourselves.
> +  */
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> + smgrimmedsync(bulkstate->smgr, bulkstate->forknum);
> + elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
> + }
> + else
> + {
> + smgrregistersync(bulkstate->smgr, bulkstate->forknum);
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> + }
> + }
> +}

This is an elegant optimization.




Re: speed up a logical replica setup

2024-06-30 Thread Noah Misch
On Sun, Jun 30, 2024 at 09:32:57PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sun, Jun 30, 2024 at 05:01:52PM -0400, Tom Lane wrote:
> >> I'm tempted to lobotomize the new test case on Windows until
> >> we have that resolved.
> 
> > Sounds fine.  The pg_upgrade suite adequately tests appendShellString() and
> > appendConnStrVal() with the larger character repertoire, so it's enough for
> > pg_createsubscriber to test just a space or so.
> 
> Hmph ... according to [1], 545082091 was not enough to fix this.
> I guess that old version of IPC::Run also misbehaves for cases
> involving backslash, single quote, and/or some other special
> characters?

The database name has backslash toward the end, which likely creates a
problematic backslash-double-quote sequence under win32 command line rules.
You can see that in the node log, comparing the CREATE DATABASE log line
(shows dquote at end of dbname) to the FATAL log line (no dquote at end).  I'm
not aware of trouble with characters other than backslash or dquote.

> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-01%2000%3A03%3A05




Re: collect_corrupt_items_vacuum.patch

2024-06-30 Thread Noah Misch
On Tue, Mar 12, 2024 at 02:10:59PM +0200, Alexander Korotkov wrote:
> I'm going to push this if no objections.

Commit e85662d wrote:
> --- a/src/backend/storage/ipc/procarray.c
> +++ b/src/backend/storage/ipc/procarray.c

> @@ -2740,6 +2741,8 @@ GetRunningTransactionData(void)
>*/
>   for (index = 0; index < arrayP->numProcs; index++)
>   {
> + int pgprocno = arrayP->pgprocnos[index];
> + PGPROC *proc = &allProcs[pgprocno];
>   TransactionId xid;
>  
>   /* Fetch xid just once - see GetNewTransactionId */
> @@ -2760,6 +2763,13 @@ GetRunningTransactionData(void)
>   if (TransactionIdPrecedes(xid, oldestRunningXid))
>   oldestRunningXid = xid;
>  
> + /*
> +  * Also, update the oldest running xid within the current database.
> +  */
> + if (proc->databaseId == MyDatabaseId &&
> + TransactionIdPrecedes(xid, oldestRunningXid))
> + oldestDatabaseRunningXid = xid;

Shouldn't that be s/oldestRunningXid/oldestDatabaseRunningXid/?

While this isn't a hot path, I likely would test TransactionIdPrecedes()
before fetching pgprocno and PGPROC, to reduce wasted cache misses.
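Concretely, something like this reordering (a sketch only; variable names
taken from the quoted hunk):

  	/* Fetch PGPROC only if this xid could lower the database minimum. */
  	if (TransactionIdPrecedes(xid, oldestDatabaseRunningXid))
  	{
  		PGPROC	   *proc = &allProcs[arrayP->pgprocnos[index]];

  		if (proc->databaseId == MyDatabaseId)
  			oldestDatabaseRunningXid = xid;
  	}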




Re: Faster "SET search_path"

2024-06-30 Thread Noah Misch
On Tue, Nov 14, 2023 at 08:13:25PM -0800, Jeff Davis wrote:
> On Thu, 2023-10-19 at 19:01 -0700, Jeff Davis wrote:
> > 0003: Cache for recomputeNamespacePath.
> 
> Committed

Commit f26c236 wrote:
> Add cache for recomputeNamespacePath().
> 
> When search_path is changed to something that was previously set, and
> no invalidation happened in between, use the cached list of namespace
> OIDs rather than recomputing them. This avoids syscache lookups and
> ACL checks.

> --- a/src/backend/catalog/namespace.c
> +++ b/src/backend/catalog/namespace.c

> @@ -109,11 +110,13 @@
>   * activeSearchPath is always the actually active path; it points to
>   * to baseSearchPath which is the list derived from namespace_search_path.
>   *
> - * If baseSearchPathValid is false, then baseSearchPath (and other
> - * derived variables) need to be recomputed from namespace_search_path.
> - * We mark it invalid upon an assignment to namespace_search_path or receipt
> - * of a syscache invalidation event for pg_namespace.  The recomputation
> - * is done during the next lookup attempt.
> + * If baseSearchPathValid is false, then baseSearchPath (and other derived
> + * variables) need to be recomputed from namespace_search_path, or retrieved
> + * from the search path cache if there haven't been any syscache
> + * invalidations.  We mark it invalid upon an assignment to
> + * namespace_search_path or receipt of a syscache invalidation event for
> + * pg_namespace or pg_authid.  The recomputation is done during the next

You're caching the result of object_aclcheck(NamespaceRelationId, ...), so
pg_auth_members changes and pg_database changes also need to invalidate this
cache.  (pg_database affects the ACL_CREATE_TEMP case in
pg_namespace_aclmask_ext() and affects ROLE_PG_DATABASE_OWNER membership.)




Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-06-30 Thread Noah Misch
On Mon, Mar 04, 2024 at 07:52:05PM -0800, Jeff Davis wrote:
> Committed.

Commit 2af07e2 wrote:
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -1412,6 +1412,8 @@ brin_summarize_range(PG_FUNCTION_ARGS)
>   SetUserIdAndSecContext(heapRel->rd_rel->relowner,
>  save_sec_context | 
> SECURITY_RESTRICTED_OPERATION);
>   save_nestlevel = NewGUCNestLevel();
> + SetConfigOption("search_path", GUC_SAFE_SEARCH_PATH, PGC_USERSET,
> +                 PGC_S_SESSION);

I've audited NewGUCNestLevel() calls that didn't get this addition.  Among
those, these need the addition:

- Each in ComputeIndexAttrs() -- they arise when the caller is DefineIndex()
- In DefineIndex(), after comment "changed a behavior-affecting GUC"

While "not necessary for security", ExecCreateTableAs() should do it for the
same reason it calls NewGUCNestLevel().




Re: speed up a logical replica setup

2024-06-30 Thread Noah Misch
On Sun, Jun 30, 2024 at 05:01:52PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sun, Jun 30, 2024 at 02:58:00PM -0400, Tom Lane wrote:
> >> So b3f5ccebd promptly blew up on fairywren [1]:
> 
> > It does look consistent with IPC::Run predating v20220807.0, hence lacking 
> > the
> > https://github.com/toddr/IPC-Run/issues/142 fix.  I wondered how this animal
> > was passing 002_pg_upgrade.pl, but it seems the animal has run TAP suites 
> > only
> > occasionally.  Passes in the last week were TAP-free.
> 
> Hmm, drongo just failed in the same way.  (It seems to likewise
> run TAP tests only some of the time, which I don't understand
> because the $Script_Config output looks the same between runs
> with TAP tests and runs without.)

On further reflection, 002_pg_upgrade.pl may not fail with old IPC::Run.  It
may just create a database with an unintended name, losing some test coverage.

> I'm tempted to lobotomize the new test case on Windows until
> we have that resolved.

Sounds fine.  The pg_upgrade suite adequately tests appendShellString() and
appendConnStrVal() with the larger character repertoire, so it's enough for
pg_createsubscriber to test just a space or so.




Re: speed up a logical replica setup

2024-06-30 Thread Noah Misch
On Sun, Jun 30, 2024 at 02:58:00PM -0400, Tom Lane wrote:
> I wrote:
> > In hopes of moving things along as we approach the v18 branch,
> > I went ahead and pushed Kuroda-san's patches (with a bit of
> > further editorialization).
> 
> So b3f5ccebd promptly blew up on fairywren [1]:
> 
> connection error: 'psql: error: unterminated quoted string in connection info 
> string'
> while running 'psql -XAtq -d port=52984 host=C:/tools/nmsys64/tmp/hHg_pngw4z 
> dbname='regression"    
> !"#$%&\\'()*+,-"' -f - -v ON_ERROR_STOP=1' at 
> C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
>  line 2124.
> 
> This shows that the command-line arguments to psql are not getting
> adequately quoted on that platform.  AFAICS, we are relying on
> IPC::Run::run to perform such quoting, and it seems to be
> falling down here.  I wonder what version of IPC::Run fairywren
> is using.

It does look consistent with IPC::Run predating v20220807.0, hence lacking the
https://github.com/toddr/IPC-Run/issues/142 fix.  I wondered how this animal
was passing 002_pg_upgrade.pl, but it seems the animal has run TAP suites only
occasionally.  Passes in the last week were TAP-free.

> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-06-30%2018%3A03%3A06




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-06-30 Thread Noah Misch
On Tue, Mar 12, 2024 at 05:50:48PM +0100, Alvaro Herrera wrote:
> On 2024-Mar-12, Jelte Fennema-Nio wrote:
> > On Tue, 12 Mar 2024 at 10:19, Alvaro Herrera  
> > wrote:
> > > Here's a last one for the cfbot.
> > 
> > Thanks for committing the first 3 patches btw.
> 
> Thanks, I included it.

PGcancelConn *
PQcancelCreate(PGconn *conn)
{
...
oom_error:
	conn->status = CONNECTION_BAD;
	libpq_append_conn_error(cancelConn, "out of memory");
	return (PGcancelConn *) cancelConn;
}

Shouldn't that be s/conn->status/cancelConn->status/?
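That is, the tail would presumably read:

  oom_error:
  	cancelConn->status = CONNECTION_BAD;
  	libpq_append_conn_error(cancelConn, "out of memory");
  	return (PGcancelConn *) cancelConn;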



