Re: New docs chapter on Transaction Management and related changes

2022-11-09 Thread Laurenz Albe
On Wed, 2022-11-09 at 09:16 -0500, Robert Treat wrote:
> On Mon, Nov 7, 2022 at 5:04 PM Laurenz Albe  wrote:
> > Some comments:
> > 
> 
> > > --- a/doc/src/sgml/ref/release_savepoint.sgml
> > > +++ b/doc/src/sgml/ref/release_savepoint.sgml
> > > @@ -34,23 +34,16 @@ RELEASE [ SAVEPOINT ] savepoint_name
> > >    Description
> > > 
> > >    
> > > -   RELEASE SAVEPOINT destroys a savepoint previously defined
> > > -   in the current transaction.
> > > +   RELEASE SAVEPOINT will subcommit the subtransaction
> > > +   established by the named savepoint, if one exists. This will release
> > > +   any resources held by the subtransaction. If there were any
> > > +   subtransactions of the named savepoint, these will also be subcommitted.
> > >    
> > > 
> > >    
> > 
> > "Subtransactions of the named savepoint" is somewhat confusing; how about
> > "subtransactions of the subtransaction established by the named savepoint"?
> > 
> > If that is too long and explicit, perhaps "subtransactions of that 
> > subtransaction".
> 
> Personally, I think these are more confusing.

Perhaps.  I was worried because everywhere else, the wording makes a clear
distinction between a savepoint and the subtransaction created by a savepoint.
But perhaps a little sloppiness is better if it avoids such word cascades.

> > > --- a/doc/src/sgml/ref/rollback.sgml
> > > +++ b/doc/src/sgml/ref/rollback.sgml
> > > @@ -56,11 +56,14 @@ ROLLBACK [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
> > >  AND CHAIN
> > >  
> > >   
> > > -  If AND CHAIN is specified, a new transaction is
> > > +  If AND CHAIN is specified, a new unaborted transaction is
> > >    immediately started with the same transaction characteristics (see
> > >    <xref linkend="sql-set-transaction"/>) as the just finished one.  Otherwise,
> > >    no new transaction is started.
> > 
> > I don't think that is an improvement.  "Unaborted" is an un-word.  A new
> > transaction is always "unaborted", isn't it?
> 
> I thought about this as well when reviewing it, but I do think
> something is needed for the case where you have a transaction which
> has suffered an error and then you issue "rollback and chain"; if you
> just say "a new transaction is immediately started with the same
> transaction characteristics", it might imply to some that the new
> transaction has some kind of carry-over from the previous broken
> transaction... the use of the word unaborted makes it clear that the
> new transaction is 100% functional.

A new transaction is never aborted in my understanding.  Being aborted is not a
characteristic of a transaction, but a state.
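For illustration, a minimal psql sketch of the case in question (hypothetical
table "t"; the chained transaction keeps the characteristics but none of the
aborted state):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT 1/0;                -- error: the transaction is now aborted
    ROLLBACK AND CHAIN;        -- ends the failed transaction, starts a new one
    INSERT INTO t VALUES (1);  -- works normally, still at REPEATABLE READ
    COMMIT;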

> > 
> 
> 
> > > +    The internal transaction ID type xid is 32-bits wide
> > 
> > There should be no hyphen in "32 bits wide", just as in "3 years old".
> 
> Minor aside, we should clean up glossary.sgml as well.

Right, it has this:

 The numerical, unique, sequentially-assigned identifier that each
 transaction receives when it first causes a database modification.
 Frequently abbreviated as xid.
 When stored on disk, xids are only 32-bits wide, so only
 approximately four billion write transaction IDs can be generated;
 to permit the system to run for longer than that,
 epochs are used, also 32 bits wide.

Which reminds me that I should have suggested  rather than
 where I complained about the use of .
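As an aside, the 32-bit epoch and the 32-bit on-disk xid described there
combine into the 64-bit transaction ID roughly like this (a sketch only;
variable names are illustrative):

    uint64 full_xid = ((uint64) epoch << 32) | (uint64) xid;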

> > 
> > +   Up to
> > +    64 open subxids are cached in shared memory for each backend; after
> > +    that point, the overhead increases significantly since we must look
> > +    up subxid entries in pg_subtrans.
> > 
> > Comma before "since".  Perhaps you should mention that this means disk I/O.
> 
> ISTR that you only use a comma before "since" in cases where the
> preceding thought contains a negative.

Not being a native speaker, I'll leave that to those who are; I went by feeling.

> In any case, are you thinking something like this:
> 
> " 64 open subxids are cached in shared memory for each backend; after
>  that point the overhead increases significantly due to additional disk I/O
>  from looking up subxid entries in pg_subtrans."

Yes.
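For context, a rough sketch of the kind of workload that overflows that cache
(hypothetical table "t"; each PL/pgSQL exception block opens a subtransaction,
and each one that writes is assigned its own subxid):

    DO $$
    BEGIN
      FOR i IN 1..100 LOOP
        BEGIN                          -- opens a subtransaction
          INSERT INTO t VALUES (i);    -- assigns a subxid because it writes
        EXCEPTION WHEN OTHERS THEN
          NULL;
        END;
      END LOOP;
    END
    $$;

Past the 64 cached entries, visibility checks have to consult pg_subtrans,
which is where the extra disk I/O mentioned above comes from.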

Yours,
Laurenz Albe




Re: Direct I/O

2022-11-09 Thread John Naylor
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro  wrote:

> Memory alignment patches:
>
> Direct I/O generally needs to be done to/from VM page-aligned
> addresses, but only "standard" 4KB pages, even when larger VM pages
> are in use (if there is an exotic system where that isn't true, it
> won't work).  We need to deal with buffers on the stack, the heap and
> in shmem.  For the stack, see patch 0001.  For the heap and shared
> memory, see patch 0002, but David Rowley is going to propose that part
> separately, as MemoryContext API adjustments are a specialised enough
> topic to deserve another thread; here I include a copy as a
> dependency.  The main direct I/O patch is 0003.

One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
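For reference, the underlying technique is just over-allocation plus rounding
the pointer up; a generic C sketch (not the proposed MemoryContext API, which
is being discussed separately):

    #include <stdint.h>
    #include <stdlib.h>

    #define IO_ALIGN 4096           /* direct I/O wants 4KB-aligned buffers */

    /* Returns a pointer aligned to IO_ALIGN; *raw_out must be passed to free(). */
    static void *
    alloc_io_aligned(size_t size, void **raw_out)
    {
        void       *raw = malloc(size + IO_ALIGN);
        uintptr_t   p;

        if (raw == NULL)
            return NULL;
        p = ((uintptr_t) raw + IO_ALIGN - 1) & ~((uintptr_t) (IO_ALIGN - 1));
        *raw_out = raw;
        return (void *) p;
    }

The cost mentioned above is visible here: asking for an aligned 4kB buffer
ends up consuming noticeably more memory (a 16kB aset chunk in the worst case)
unless the allocator itself understands alignment.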

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Unit tests for SLRU

2022-11-09 Thread Michael Paquier
On Wed, Apr 13, 2022 at 03:51:30PM +0400, Pavel Borisov wrote:
> Only one thing to note. Maybe it would be good not to copy-paste the Assert
> after every call of SimpleLruInit, putting it into a wrapper function
> instead. So the test can call the inner function (without the assert)
> and all other callers use the wrapper. Not sure about naming though.
> Maybe rename the current SimpleLruInit -> SimpleLruInitInner and add a new
> wrapper under the old name (SimpleLruInit).

I have looked at what you have here..

This patch redesigns SimpleLruInit() so that the caller is now in
charge of checking that the SLRU has been created in the context of
the postmaster (aka when initializing shared memory).  While this
should work as long as the shared memory area is correctly sized in
_PG_init() and the area is initialized and attached later, as
autoprewarm.c does (the same applies to LWLockRegisterTranche(),
for example), I am not really convinced that this is something that a
patch aimed at extending test coverage should redesign, especially
for a routine as old as this one.  If you don't know what you are doing,
it could easily lead to problems with external code.  Note that I
don't object to the addition of a new code path or a routine that
would be able to create an SLRU on the fly with fewer restrictions, but
I am not convinced that we should change this behavior (well,
there is a new argument that would force a recompilation).  I am not
sure what the use cases would be for an SLRU created outside
the _PG_init() phase, but perhaps you have more imagination than I do
for such matters ;p

FWIW, I'd like to think that the proper way of doing things for this
test facility is to initialize an SLRU when the module's _PG_init() runs
while shared_preload_libraries is being processed, meaning that you'd
better put this facility in src/test/modules/ with a custom configuration
file that sets shared_preload_libraries, mark it NO_INSTALLCHECK, and
leave SimpleLruInit() untouched.
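Roughly, such a module could look like the skeleton below.  This is only a
sketch, assuming the current shmem_request_hook/shmem_startup_hook APIs; the
SimpleLruInit() call and the sizing arithmetic are deliberately elided since
their argument lists are not shown here:

    #include "postgres.h"
    #include "fmgr.h"
    #include "miscadmin.h"
    #include "storage/ipc.h"
    #include "storage/shmem.h"

    PG_MODULE_MAGIC;

    static shmem_request_hook_type prev_shmem_request_hook = NULL;
    static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

    static void
    test_slru_shmem_request(void)
    {
        if (prev_shmem_request_hook)
            prev_shmem_request_hook();
        /* RequestAddinShmemSpace(<size of the test SLRU>); */
    }

    static void
    test_slru_shmem_startup(void)
    {
        if (prev_shmem_startup_hook)
            prev_shmem_startup_hook();
        /* SimpleLruInit(...) would go here, while still under the postmaster */
    }

    void
    _PG_init(void)
    {
        if (!process_shared_preload_libraries_in_progress)
            ereport(ERROR,
                    (errmsg("test_slru must be loaded via shared_preload_libraries")));

        prev_shmem_request_hook = shmem_request_hook;
        shmem_request_hook = test_slru_shmem_request;
        prev_shmem_startup_hook = shmem_startup_hook;
        shmem_startup_hook = test_slru_shmem_startup;
    }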
--
Michael




Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-09 Thread Yugo NAGATA
On Wed, 09 Nov 2022 11:38:05 -0500
Tom Lane  wrote:

> Peter Eisentraut  writes:
> > This has broken the following use:
> 
> > parse: create temporary table t1 (a int) on commit drop
> > bind
> > execute
> > parse: analyze t1
> > bind
> > execute
> > parse: select * from t1
> > bind
> > execute
> > sync
> 
> > I think the behavior of IsInTransactionBlock() needs to be further 
> > refined to support this.
> 
> Hmm.  Maybe the right way to think about this is "if we have completed an
> EXECUTE, and not yet received a following SYNC, then report that we are in
> a transaction block"?  But I'm not sure if that breaks any other cases.

Or, in that case, another solution could be to regard it as an implicit
transaction when multiple commands are executed in a pipeline, as proposed
in [1], although I once withdrew that proposal so as not to break backward
compatibility.  Attached is the same patch as in [1].

[1] 
https://www.postgresql.org/message-id/20220728105134.d5ce51dd756b3149e9b9c52c%40sraoss.co.jp
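For reference, the quoted parse/bind/execute sequence with a single trailing
Sync corresponds roughly to the libpq sketch below (connection setup and error
handling omitted; only a sketch):

    #include <libpq-fe.h>

    static void
    run_pipeline(PGconn *conn)
    {
        PQenterPipelineMode(conn);
        PQsendQueryParams(conn, "CREATE TEMPORARY TABLE t1 (a int) ON COMMIT DROP",
                          0, NULL, NULL, NULL, NULL, 0);
        PQsendQueryParams(conn, "ANALYZE t1", 0, NULL, NULL, NULL, NULL, 0);
        PQsendQueryParams(conn, "SELECT * FROM t1", 0, NULL, NULL, NULL, NULL, 0);
        PQpipelineSync(conn);                    /* the only Sync message */

        /* each statement's results end with a NULL from PQgetResult() */
        for (int i = 0; i < 3; i++)
        {
            PGresult   *res;

            while ((res = PQgetResult(conn)) != NULL)
                PQclear(res);
        }
        /* finally the sync point itself, reported as PGRES_PIPELINE_SYNC */
        PQclear(PQgetResult(conn));
        PQexitPipelineMode(conn);
    }

Whether such a pipeline should count as an implicit transaction block is
exactly the question discussed above.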

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: [PATCH] Allow usage of archive .backup files as backup_label

2022-11-09 Thread Michael Paquier
On Tue, Oct 18, 2022 at 04:55:46AM +0200, Laurenz Albe wrote:
> I tend to agree with you.  It is easy to break PostgreSQL by manipulating
> or removing "backup_label", and copying a file from the WAL archive and
> renaming it to "backup_label" sounds like a footgun of the first order.
> There is nothing that prevents you from copying the wrong file.
> Such practices should not be encouraged.
> 
> Anybody who knows enough about PostgreSQL to be sure that what they are
> doing is correct should be smart enough to know how to edit the copied file.

A few weeks later, I still have the same thoughts on the matter, so please
note that I have marked this as rejected in the CF app.  If somebody
wants to offer more arguments on this thread, of course please feel
free.
--
Michael




Re: Fix some newly modified tab-complete changes

2022-11-09 Thread Michael Paquier
On Tue, Oct 18, 2022 at 05:17:32PM +1100, Peter Smith wrote:
> I re-tested and confirm that the patch does indeed fix the quirk I'd
> previously reported.
> 
> But, looking at the patch code, I don't know if it is the best way to
> fix the problem or not. Someone with more experience of the
> tab-complete module can judge that.

It seems to me that the patch as proposed has more problems than
that.  I have spotted a few issues at a quick glance; there may be
more.

For example, take this one:
+   else if (TailMatches("GRANT") ||
+TailMatches("REVOKE", "GRANT", "OPTION", "FOR"))
COMPLETE_WITH_QUERY_PLUS(Query_for_list_of_roles,

"REVOKE GRANT OPTION FOR" completes with a list of role names, which
is incorrect.

FWIW, I am not much of a fan of the approach taken by the patch of
duplicating the full list of keywords to append after REVOKE or GRANT,
with the only difference being "GRANT OPTION FOR".  This might be more
readable if unified into a single list, with extra items appended for
GRANT and REVOKE?

Note that REVOKE has an "ADMIN OPTION FOR" clause, for which no
completion is offered.
--
Michael




Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-09 Thread Yugo NAGATA
On Wed, 09 Nov 2022 11:17:29 -0500
Tom Lane  wrote:

> Yugo NAGATA  writes:
> > Tom Lane  wrote:
> >> What do you think of the attached wording?
> 
> > It looks good to me. That describes the expected behaviour exactly.
> 
> Pushed that, then.

Thank you.

> >> I don't think the pipeline angle is of concern to anyone who might be
> >> reading these comments with the aim of understanding what guarantees
> >> they have.  Perhaps there should be more about that in the user-facing
> >> docs, though.
> 
> > I agree with that we don't need to mention pipelining in these comments,
> > and that we need more in the documentation. I attached a doc patch to add
> > a mention of commands that do internal commit to the pipelining section.
> > Also, this adds a reference for the pipelining protocol to the libpq doc.
> 
> Hmm ... I don't really find either of these changes to be improvements.
> The fact that, say, multi-table ANALYZE uses multiple transactions
> seems to me to be a property of that statement, not of the protocol.

Ok. Then, if we want to make users aware that commands using internal commits
could unexpectedly close a transaction in pipeline mode, is the libpq section
the proper place?

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




RE: Ability to reference other extensions by schema in extension scripts

2022-11-09 Thread Regina Obe
> "Regina Obe"  writes:
> > My proposal is this.  If you think it's a good enough idea I can work
> > up a patch for this.
> > Extensions currently are allowed to specify a requires in the control file.
> > I propose to use this information, to allow replacement of phrases
> > @extschema_nameofextension@  as a variable, where nameofextension has
> > to be one of the extensions listed in the requires.
> 
> I have a distinct sense of deja vu here.  I think this idea, or something
> isomorphic to it, was previously discussed with some other syntax details.
> I'm too lazy to go searching the archives right now, but I suggest that
> you try to find that discussion and see if the discussed syntax seems
> better or worse than what you mention.
> 
> I think it might've been along the line of @extschema:nameofextension@,
> which seems like it might be superior because colon isn't a valid
> identifier character so there's less risk of ambiguity.
> 
> 	regards, tom lane
I found the old discussion I recalled having and Stephen had suggested using

@extschema{'postgis'}@

On this thread --
https://www.postgresql.org/message-id/20160425232251.GR10850@tamriel.snowman.net

Is that the one you remember?

Thanks,
Regina





Re: Tests for psql \g and \o

2022-11-09 Thread Michael Paquier
On Tue, Nov 01, 2022 at 12:42:47PM +0100, Daniel Verite wrote:
> It's a follow-up to the discussion at [1]. Since this discussion
> already has a slot in the CF [2] with a committed patch, let's start a
> new separate thread.

+psql_like($node, "SELECT 'one' \\g | cat >$g_file", qr//, "one command \\g");
+my $c1 = slurp_file($g_file);
+like($c1, qr/one/);

Windows may not have an equivalent for "cat", no?  Note that psql's
001_basic.pl has no restriction in place for Windows.  Perhaps you
could use the same trick as basebackup_to_shell, where GZIP is used to
write some arbitrary data.  Anyway, this has some quoting issues,
especially if the file's path has whitespace.  The file is located in
File::Temp::tempdir, but it does not sound like a good idea to rely
on this assumption on portability grounds.
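One way to avoid the shell entirely might be to let \g write the file itself;
a sketch mirroring the call shape quoted above (assuming psql_like() keeps
that signature):

    psql_like($node, "SELECT 'one' \\g $g_file", qr//, "one command \\g");
    my $c1 = slurp_file($g_file);
    like($c1, qr/one/);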
--
Michael




Re: Ability to reference other extensions by schema in extension scripts

2022-11-09 Thread Tom Lane
"Regina Obe"  writes:
> My proposal is this.  If you think it's a good enough idea I can work up a
> patch for this.
> Extensions currently are allowed to specify a requires in the control file.
> I propose to use this information, to allow replacement of phrases
> @extschema_nameofextension@  as a variable, where nameofextension has to be
> one of the extensions listed in the requires.

I have a distinct sense of deja vu here.  I think this idea, or something
isomorphic to it, was previously discussed with some other syntax details.
I'm too lazy to go searching the archives right now, but I suggest that
you try to find that discussion and see if the discussed syntax seems
better or worse than what you mention.

I think it might've been along the line of @extschema:nameofextension@,
which seems like it might be superior because colon isn't a valid
identifier character so there's less risk of ambiguity.

regards, tom lane




Ability to reference other extensions by schema in extension scripts

2022-11-09 Thread Regina Obe
I think I had brought this up a while ago, but I forget what the opinion was
on the matter.

PostGIS has a number of extensions that rely on it.  For the extensions that
are packaged with PostGIS, we force them all into the same schema except for
the postgis_topology and postgis_tiger_geocoder extensions which are already
installed in dedicated schemas.

This makes it impossible for postgis_topology and postgis_tiger_geocoder to
schema qualify their use of postgis.  Other extensions like pgRouting,
h3-pg, mobilitydb have similar issues.

My proposal is this.  If you think it's a good enough idea I can work up a
patch for this.

Extensions currently are allowed to specify a requires in the control file.

I propose to use this information, to allow replacement of phrases

@extschema_nameofextension@  as a variable, where nameofextension has to be
one of the extensions listed in the requires.

The extension plumbing will then use this information to look up the schema
that the current required extensions are installed in, and replace the
variables with the schema of where the dependent extension is installed.
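For illustration, the pieces would look roughly like this (a sketch only; the
exact spelling of the placeholder is of course up for discussion, and the
function body is hypothetical):

    # postgis_topology.control (sketch)
    requires = 'postgis'

    -- postgis_topology--x.y.sql (sketch)
    CREATE FUNCTION check_geom(geom geometry)
    RETURNS geometry
    AS $$ SELECT @extschema_postgis@.ST_MakeValid(geom) $$
    LANGUAGE sql;

At script-execution time, @extschema_postgis@ would be replaced by the schema
in which the required postgis extension is actually installed.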

Does anyone see any issue with this idea?

Thanks,
Regina





Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2022 at 6:10 PM Andres Freund  wrote:
> And thinking about it, it'd be quite bad if the horizon worked that way. You 
> can easily construct a workload where every single xid would "skewer" some 
> chain, never allowing the horizon to be raised.

Your whole scenario is one involving an insert of a tuple by XID 10,
which is then updated by XID 5 -- a lower XID. Obviously that's
possible, but it's relatively rare. I have to imagine that the vast
majority of updates affect tuples inserted by an XID before the XID of
the updater.

My use of the term "skewer" was limited to updates that look like
that. So I don't know what you mean about never allowing the horizon
to be raised.

-- 
Peter Geoghegan




Re: Refactor to introduce pg_strcoll().

2022-11-09 Thread Jeff Davis
Attached are some new patches. I think you were right that the API of
pg_strcoll() was strange before, so I changed it to:

  * pg_strcoll() takes nul-terminated arguments
  * pg_strncoll() takes arguments and lengths

The new pg_strcoll()/pg_strncoll() API in 0001 seems much more reasonable
to support in the long term, potentially with other callers.
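A sketch of the intended call shape (assuming the collation handle is passed
as a trailing pg_locale_t argument, per the description above):

    int cmp;

    cmp = pg_strcoll(str1, str2, mylocale);              /* nul-terminated */
    cmp = pg_strncoll(str1, len1, str2, len2, mylocale); /* explicit lengths */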

Patch 0004 exports:

  * pg_collate_icu() is for ICU and takes arguments and lengths
  * pg_collate_libc() is for libc and takes nul-terminated arguments

for a tiny speedup because varstrfastcmp_locale() has both nul-
terminated arguments and their lengths.

On Fri, 2022-11-04 at 15:06 -0700, Jeff Davis wrote:
> But I think the complex hacks are just the transformation into UTF 16
> and calling of wcscoll(). If that's done from within pg_strcoll()
> (after my patch), then I think it just works, right?

I included a patch (0005) to enable varstrfastcmp_locale on windows
with a server encoding of UTF-8. I don't have a great way of testing
this, but it seems like it should work.

> I was also considering an optimization to use stack allocation for
> short strings when doing the non-UTF8 icu comparison. I am seeing
> some
> benefit there

I also included this optimization in 0003: try to use the stack for
reasonable values; allocate on the heap for larger strings. I think
it's helping a bit.

Patch 0002 helps a lot more: for the case of ICU with a non-UTF8 server
encoding, the speedup is something like 30%. The reason is that
icu_to_uchar() currently calls ucnv_toUChars() twice: once to determine
the precise output buffer size required, and then again once it has the
buffer ready. But it's easy to just size the destination buffer
conservatively, because the maximum space required is enough to hold
twice as many UChars as there are input characters[1], plus the
terminating nul. That means we can call ucnv_toUChars() just once, to do
the actual conversion.
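In code form, the single-call conversion amounts to something like this sketch
(variable names are illustrative, error handling trimmed; the sizing follows
the reasoning above):

    UErrorCode  status = U_ZERO_ERROR;
    int32_t     ulen_max = srclen * 2 + 1;   /* conservative upper bound */
    UChar      *ubuf = palloc(ulen_max * sizeof(UChar));
    int32_t     ulen;

    ulen = ucnv_toUChars(converter, ubuf, ulen_max, src, srclen, &status);
    if (U_FAILURE(status))
        ereport(ERROR,
                (errmsg("could not convert string to UTF-16: %s",
                        u_errorName(status))));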

My perf test was to use a quick hack to disable abbreviated keys (so
that it's doing more comparisons), and then just do a large ORDER BY
... COLLATE on a table with generated text. The text is random lorem
ipsum data, with some characters replaced by multibyte characters to
exercise some more interesting paths. If someone else has some more
sophisticated collation tests I'd be interested to try them.

Regards,
Jeff Davis

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucnv_8h.html#ae1049fcb893783c860fe0f9d4da84939
From 29249d17b5580e87365c008b18c83db5528db37e Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Wed, 9 Nov 2022 08:58:08 -0800
Subject: [PATCH v3 1/5] Introduce pg_strcoll() and pg_strncoll().

Hide the special cases of the platform, collation provider, and
database encoding in pg_locale.c. Simplify varlena.c.
---
 src/backend/utils/adt/pg_locale.c | 263 ++
 src/backend/utils/adt/varlena.c   | 230 +-
 src/include/utils/pg_locale.h |   3 +
 3 files changed, 268 insertions(+), 228 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..94dc64c2d0 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include 
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -1731,6 +1737,263 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * win32_utf8_wcscoll
+ *
+ * Convert UTF8 arguments to wide characters and invoke wcscoll() or
+ * wcscoll_l().
+ */
+#ifdef WIN32
+static int
+win32_utf8_wcscoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+   pg_locale_t locale)
+{
+	char		a1buf[TEXTBUFLEN];
+	char		a2buf[TEXTBUFLEN];
+	char	   *a1p,
+			   *a2p;
+	int			a1len;
+	int			a2len;
+	int			r;
+	int			result;
+
+	if (len1 >= TEXTBUFLEN / 2)
+	{
+		a1len = len1 * 2 + 2;
+		a1p = palloc(a1len);
+	}
+	else
+	{
+		a1len = TEXTBUFLEN;
+		a1p = a1buf;
+	}
+	if (len2 >= TEXTBUFLEN / 2)
+	{
+		a2len = len2 * 2 + 2;
+		a2p = palloc(a2len);
+	}
+	else
+	{
+		a2len = TEXTBUFLEN;
+		a2p = a2buf;
+	}
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+	(errmsg("could not convert string to UTF-16: error code %lu",
+			GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+	(errmsg("could not convert string to UTF-16: error code %lu",
+			GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);

Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2022 at 5:46 PM Andres Freund  wrote:
> > Putting all 3 together: doesn't it seem quite likely that the way that
> > we compute OldestXmin is the factor that prevents "skewering" of an
> > update chain? What else could possibly be preventing corruption here?
> > (Theoretically it might never have been discovered, but that seems
> > pretty hard to believe.)
>
> I don't see how that follows. The existing code is just ok with that.

My remarks about "3 facts we agree on" were not intended to be a
watertight argument. More like: what else could it possibly be that
prevents problems in practice, if not *something* to do with how we
compute OldestXmin?

Leaving aside the specifics of how OldestXmin is computed for a
moment: what alternative explanation is even remotely plausible? There
just aren't that many moving parts involved here. The idea that we can
ever freeze the xmin of a successor tuple/version from an update chain
without also pruning away earlier versions of the same chain is wildly
implausible. It sounds totally contradictory.

> In fact
> we have explicit code trying to exploit this:
>
> /*
>  * If the DEAD tuple is at the end of the chain, the entire chain is
>  * dead and the root line pointer can be marked dead.  Otherwise just
>  * redirect the root to the correct chain member.
>  */
> if (i >= nchain)
> 	heap_prune_record_dead(prstate, rootoffnum);
> else
> 	heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);

I don't see why this code is relevant.

-- 
Peter Geoghegan




Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Andres Freund
Hi,

And thinking about it, it'd be quite bad if the horizon worked that way. You 
can easily construct a workload where every single xid would "skewer" some 
chain, never allowing the horizon to be raised.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Andres Freund
Hi,

On 2022-11-09 17:32:46 -0800, Peter Geoghegan wrote:
> > The xmin horizon is very coarse grained. Just because it is 7 doesn't mean
> > that xid 10 is still running. All it means that one backend or slot has an
> > xmin or xid of 7.
> 
> Of course that's true. But I wasn't talking about the general case --
> I was talking about your "xmin 10, xmax 5 -> xmin 5, xmax invalid"
> update chain case specifically, with its "skewered" OldestXmin of 7.

The sequence below produces such an OldestXmin:

> > s1: acquire xid 5
> > s2: acquire xid 7
> > s3: acquire xid 10
> >
> > s3: insert
> > s3: commit
> > s1: update
> > s1: commit
> >
> > s2: get a new snapshot, xmin 7 (or just hold no snapshot)
> >
> > At this point the xmin horizon is 7. The first tuple's xmin can't be
> > frozen. The second tuple's xmin can be.
> 
> Basically what I'm saying about OldestXmin is that it ought to "work
> transitively", from the updater to the inserter that inserted the
> now-updated tuple. That is, the OldestXmin should either count both
> XIDs that appear in the update chain, or neither XID.

It doesn't work that way. The above sequence shows one case where it doesn't.


> > > I believe you're right that an update chain that looks like this one
> > > is possible. However, I don't think it's possible for
> > > OldestXmin/FreezeLimit to take on a value like that (i.e. a value that
> > > "skewers" the update chain like this, the value 7 from your example).
> > > We ought to be able to rely on an OldestXmin value that can never let
> > > such a situation emerge. Right?
> >
> > I don't see anything that'd guarantee that currently, nor do immediately 
> > see a
> > possible way to get there.
> >
> > What do you think prevents such an OldestXmin?
> 
> ComputeXidHorizons() computes VACUUM's OldestXmin (actually it
> computes h->data_oldest_nonremovable values) by scanning the proc
> array. And counts PGPROC.xmin from each running xact. So ultimately
> the inserter and updater are tied together by that. It's either an
> OldestXmin that includes both, or one that includes neither.

> Here are some facts that I think we both agree on already:
> 
> 1. It is definitely possible to have an update chain like your "xmin
> 10, xmax 5 -> xmin 5, xmax invalid" example.
> 
> 2. It is definitely not possible to "freeze xmax" by setting its value
> to FrozenTransactionId or something similar -- there is simply no code
> path that can do that, and never has been. (The term "freeze xmax" is
> a bit ambiguous, though it usually means set xmax to
> InvalidTransactionId.)
> 
> 3. There is no specific reason to believe that there is a live bug here.

I don't think there's a live bug here. I think the patch isn't dealing
correctly with that issue though.


> Putting all 3 together: doesn't it seem quite likely that the way that
> we compute OldestXmin is the factor that prevents "skewering" of an
> update chain? What else could possibly be preventing corruption here?
> (Theoretically it might never have been discovered, but that seems
> pretty hard to believe.)

I don't see how that follows. The existing code is just ok with that. In fact
we have explicit code trying to exploit this:

/*
 * If the DEAD tuple is at the end of the chain, the entire chain is
 * dead and the root line pointer can be marked dead.  Otherwise just
 * redirect the root to the correct chain member.
 */
if (i >= nchain)
	heap_prune_record_dead(prstate, rootoffnum);
else
	heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);


Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2022 at 4:15 PM Andres Freund  wrote:
> To me this is extending the problem into more areas rather than reducing
> it. I'd have *zero* confidence in any warnings that amcheck issued that
> involved <9.4 special cases.

Maybe you would at first. But then we get to learn what mistake we
made. And then we get to fix the bug, and get to know better next time
around. Next time (or maybe the time after that) you really will have
confidence in amcheck, because it'll have been battle tested at that
point.

For something like this that seems like the best way to go. Either we
support an on-disk format that includes legacy representations, or we
don't.

> I think it doesn't just affect the < 9.4 path, but also makes us implement
> things differently for >= 9.4. And we lose some accuracy due to that.

I don't follow. How so?

> The field we check for FrozenTransactionId in the code I was quoting is the
> xmin of the follower tuple. We follow the chain if either
> cur->xmax == next->xmin or if next->xmin == FrozenTransactionId
>
> What I'm doubting is the FrozenTransactionId path.

AFAICT we shouldn't be treating it as part of the same HOT chain. Only
the first heap-only tuple in a valid HOT chain should have an xmin
that is FrozenTransactionId (and only with an LP_REDIRECT root item, I
think). Otherwise the "prev_xmax==xmin" HOT chain traversal logic used
in places like heap_hot_search_buffer() simply won't work.

> > In my view it simply isn't possible for a valid HOT chain to be in
> > this state in the first place. So by definition it wouldn't be a HOT
> > chain.
>
> We haven't done any visibility checking at this point and my whole point is
> that there's no guarantee that the pointed-to tuple actually belongs to the
> same hot chain, given that we follow as soon as "xmin == FrozenXid". So the
> pointing tuple might be an orphaned tuple.

But an orphaned heap-only tuple shouldn't ever have
xmin==FrozenTransactionId to begin with. The only valid source of
orphaned heap-only tuples is transaction aborts. Aborted XID xmin
fields are never frozen.

> > That would be a form of corruption, which is something that
> > would probably be detected by noticing orphaned heap-only tuples
> > (heap-only tuples not reachable from some root item on the same page,
> > or some other intermediary heap-only tuple reachable from a root
> > item).
>
> You're saying that there's no way that there's a tuple pointing to another
> tuple on the same page, with the pointed-to tuple belonging to a different
> HOT chain?

Define "belongs to a different HOT chain".

You can get orphaned heap-only tuples, obviously. But only due to
transaction abort. Any page with an orphaned heap-only tuple that is
not consistent with it being from an earlier abort is a corrupt heap
page.

> > > Can't we have an update chain that is e.g.
> > > xmin 10, xmax 5 -> xmin 5, xmax invalid
> > >
> > > and a vacuum cutoff of 7? That'd prevent the first tuple from being
> > > removed, but would allow 5 to be frozen.
> >
> > I don't see how that can be possible. That is contradictory, and
> > cannot possibly work, since it supposes a situation where every
> > possible MVCC snapshot sees the update that generated the
> > second/successor tuple as committed, while at the same time also
> > somehow needing the original tuple to stay in place. Surely both
> > things can never be true at the same time.
>
> The xmin horizon is very coarse grained. Just because it is 7 doesn't mean
> that xid 10 is still running. All it means that one backend or slot has an
> xmin or xid of 7.

Of course that's true. But I wasn't talking about the general case --
I was talking about your "xmin 10, xmax 5 -> xmin 5, xmax invalid"
update chain case specifically, with its "skewered" OldestXmin of 7.

> s1: acquire xid 5
> s2: acquire xid 7
> s3: acquire xid 10
>
> s3: insert
> s3: commit
> s1: update
> s1: commit
>
> s2: get a new snapshot, xmin 7 (or just hold no snapshot)
>
> At this point the xmin horizon is 7. The first tuple's xmin can't be
> frozen. The second tuple's xmin can be.

Basically what I'm saying about OldestXmin is that it ought to "work
transitively", from the updater to the inserter that inserted the
now-updated tuple. That is, the OldestXmin should either count both
XIDs that appear in the update chain, or neither XID.

> > I believe you're right that an update chain that looks like this one
> > is possible. However, I don't think it's possible for
> > OldestXmin/FreezeLimit to take on a value like that (i.e. a value that
> > "skewers" the update chain like this, the value 7 from your example).
> > We ought to be able to rely on an OldestXmin value that can never let
> > such a situation emerge. Right?
>
> I don't see anything that'd guarantee that currently, nor do immediately see a
> possible way to get there.
>
> What do you think prevents such an OldestXmin?

ComputeXidHorizons() computes VACUUM's OldestXmin (actually it
computes h->data_oldest_no

Re: Allow file inclusion in pg_hba and pg_ident files

2022-11-09 Thread Michael Paquier
On Wed, Nov 09, 2022 at 12:09:01PM +0800, Julien Rouhaud wrote:
> On Wed, Nov 09, 2022 at 09:51:17AM +0900, Michael Paquier wrote:
>> Julien, please note that this is waiting on author for now.  What do
>> you think about the now-named v18-0001 and the addition of an
>> ErrorContextCallback to provide more information about the list of
>> included files on error?
> 
> Yes, I'm unfortunately fully aware that it's waiting on me.  I've been a bit
> busy this week with $work but I will try to go back to it as soon as I can,
> hopefully this week!

FWIW, I have been playing with the addition of an ErrorContextCallback
in tokenize_auth_file(), and this addition leads to a really nice
result.  With this method, it is possible to know the full chain of
events leading to a failure when tokenizing included files, which is
not available now in the logs when reloading the server.

We could extend it to provide more verbose information by passing more
arguments to tokenize_auth_file(), but I'd like to think that just
knowing the line number and the full path to the file is more than
enough once you know the full chain of events.  0001 and 0002 ought to
be merged together, but I am keeping these separate to show how simple
the addition of the ErrorContextCallback is.
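For anyone following along, the pattern involved is the usual
error_context_stack one; a minimal sketch (names are illustrative, not
necessarily the exact ones in the patch):

    typedef struct tokenize_error_callback_arg
    {
        const char *filename;
        int         linenum;
    } tokenize_error_callback_arg;

    static void
    tokenize_error_callback(void *arg)
    {
        tokenize_error_callback_arg *cbarg = (tokenize_error_callback_arg *) arg;

        errcontext("line %d of configuration file \"%s\"",
                   cbarg->linenum, cbarg->filename);
    }

    /* ... inside tokenize_auth_file() ... */
    ErrorContextCallback errcallback;
    tokenize_error_callback_arg cbarg;

    cbarg.filename = filename;
    cbarg.linenum = 0;
    errcallback.callback = tokenize_error_callback;
    errcallback.arg = &cbarg;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* ... tokenize lines, updating cbarg.linenum as we go ... */

    error_context_stack = errcallback.previous;

Since the callback fires for any error raised while it is on the stack, nested
includes naturally produce one context line per file in the chain.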
--
Michael
From 41ae50e3b5b8e1d9013f337e4574702830dca95a Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 7 Nov 2022 13:35:44 +0900
Subject: [PATCH v19 1/3] Invent open_auth_file() in hba.c, to refactor auth
 file opening

This adds a check on the recursion depth when including auth files,
something that has never been done when processing '@' files for
database and user name lists in pg_hba.conf.
---
 src/include/libpq/hba.h  |   4 +-
 src/backend/libpq/hba.c  | 100 ++-
 src/backend/utils/adt/hbafuncs.c |  18 ++
 3 files changed, 79 insertions(+), 43 deletions(-)

diff --git a/src/include/libpq/hba.h b/src/include/libpq/hba.h
index 7ad227d34a..a84a5f0961 100644
--- a/src/include/libpq/hba.h
+++ b/src/include/libpq/hba.h
@@ -177,7 +177,9 @@ extern int	check_usermap(const char *usermap_name,
 extern HbaLine *parse_hba_line(TokenizedAuthLine *tok_line, int elevel);
 extern IdentLine *parse_ident_line(TokenizedAuthLine *tok_line, int elevel);
 extern bool pg_isblank(const char c);
+extern FILE *open_auth_file(const char *filename, int elevel, int depth,
+			char **err_msg);
 extern MemoryContext tokenize_auth_file(const char *filename, FILE *file,
-		List **tok_lines, int elevel);
+		List **tok_lines, int elevel, int depth);
 
 #endif			/* HBA_H */
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index a9f87ab5bf..d8c0b585e5 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -117,7 +117,8 @@ static const char *const UserAuthName[] =
 
 
 static List *tokenize_inc_file(List *tokens, const char *outer_filename,
-			   const char *inc_filename, int elevel, char **err_msg);
+			   const char *inc_filename, int elevel,
+			   int depth, char **err_msg);
 static bool parse_hba_auth_opt(char *name, char *val, HbaLine *hbaline,
 			   int elevel, char **err_msg);
 static int	regcomp_auth_token(AuthToken *token, char *filename, int line_num,
@@ -414,7 +415,7 @@ regexec_auth_token(const char *match, AuthToken *token, size_t nmatch,
  */
 static List *
 next_field_expand(const char *filename, char **lineptr,
-  int elevel, char **err_msg)
+  int elevel, int depth, char **err_msg)
 {
 	char		buf[MAX_TOKEN];
 	bool		trailing_comma;
@@ -431,7 +432,7 @@ next_field_expand(const char *filename, char **lineptr,
 		/* Is this referencing a file? */
 		if (!initial_quote && buf[0] == '@' && buf[1] != '\0')
 			tokens = tokenize_inc_file(tokens, filename, buf + 1,
-	   elevel, err_msg);
+	   elevel, depth + 1, err_msg);
 		else
 			tokens = lappend(tokens, make_auth_token(buf, initial_quote));
 	} while (trailing_comma && (*err_msg == NULL));
@@ -459,6 +460,7 @@ tokenize_inc_file(List *tokens,
   const char *outer_filename,
   const char *inc_filename,
   int elevel,
+  int depth,
   char **err_msg)
 {
 	char	   *inc_fullname;
@@ -468,24 +470,18 @@ tokenize_inc_file(List *tokens,
 	MemoryContext linecxt;
 
 	inc_fullname = AbsoluteConfigLocation(inc_filename, outer_filename);
+	inc_file = open_auth_file(inc_fullname, elevel, depth, err_msg);
 
-	inc_file = AllocateFile(inc_fullname, "r");
 	if (inc_file == NULL)
 	{
-		int			save_errno = errno;
-
-		ereport(elevel,
-(errcode_for_file_access(),
- errmsg("could not open secondary authentication file \"@%s\" as \"%s\": %m",
-		inc_filename, inc_fullname)));
-		*err_msg = psprintf("could not open secondary authentication file \"@%s\" as \"%s\": %s",
-			inc_filename, inc_fullname, strerror(save_errno));
+		/* error already logged */
 		pfree(inc_fullname);
 		return tokens;
 	}
 
 	/* There is possible recursion here if the file contains @ */
-	line

Re: [PATCH] Support % wildcard in extension upgrade filenames

2022-11-09 Thread LEO HSU
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

PURPOSE:  

This feature is intended to allow projects with many micro versions that do
the same thing to ship far fewer files, by allowing less granular upgrade
paths.
One of the projects that will benefit from this is the PostGIS project.
Currently we ship over 300 extension scripts per version, which are just
symlinks to the same file.
Part of the reason for that is that our changes depend on both the PostgreSQL
version and the PostGIS version, so a simple upgrade path that only considers,
say, PostGIS 3.2.0--3.3.1 is not sufficient.
Also, much of the time our function definitions don't change in micro
releases, yet we need to ship the scripts to allow users to upgrade properly.

This feature will allow us to get rid of all these symlinks or 0-byte files.

I've tested this feature against the PostgreSQL master branch on mingw64 gcc 
8.1.

BASIC FEATURES

1) As stated, this feature only works if the .control file has the line:

wildcard_upgrades = true

2) It does nothing different if the line is missing or set to false.

3) If there is such a line and there is no other path that takes an extension
from its current version to the requested version, then it will use the
wildcard script file (see the sketch just below).
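As a concrete sketch of that layout (file names follow the examples used later
in this review; the default_version line is purely illustrative):

    # wildtest.control
    default_version = '2.2'
    wildcard_upgrades = true

    # upgrade scripts shipped on disk:
    #   wildtest--2.0--2.2.sql   exact path, preferred when it applies
    #   wildtest--%--2.2.sql     wildcard fallback for any other installed version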

TESTING

AUTOMATED TESTS
I have confirmed tests are in place.
However the tests do not run when I do 
make check (from the root source folder)

or when I do

make installcheck-world


To run these tests, I had to

cd src/test/modules/test_extensions
make check

and see the output showing:

== running regression test queries==
test test_extensions  ... ok  186 ms
test test_extdepend   ... ok   97 ms


I confirmed the tests are in test_extensions, which has always existed.
So I assume it is either not being run in installcheck-world because of
something off with my configuration, or it is intentionally not run.

MANUAL TESTS
I also ran some ad-hoc tests of my own to confirm the behavior, checking with
SET client_min_messages='DEBUG1';

To confirm that the wildcard script only gets called when there is no
other path that will take the user to the new version,
I created my own extension,
wildtest.


1) I confirmed it only understands % as a complete wildcard for a version
number.


So this is legal
wildtest--%--2.0.sql


This does nothing
wildtest--2%--2.0.sql


2) I confirmed that if you have a path such as


wildtest--2.0--2.2.sql
wildtest--%--2.2.sql


then the exact match trumps the wildcard. In the above case if I am on 2.0
and going to 2.2, the wildtest--2.0--2.2.sql script is used instead of the
wildtest--% one.


3) It is not possible to downgrade with the wildcard.  For example I had
paths
wildtest--%--2.1.sql  


and I was unable to go from a version 2.2 down to a version 2.1.

DOCUMENTATION

I built the html docs and confirmed that in the section of the docs where it 
defines valid properties of extension files it now includes this line:

wildcard_upgrades (boolean)

This parameter, if set to true (which is not the default), allows ALTER 
EXTENSION to consider a wildcard character % as matching any version of the 
extension. Such wildcard match will only be used when no perfect match is found 
for a version.

The new status of this patch is: Ready for Committer


Re: heavily contended lwlocks with long wait queues scale badly

2022-11-09 Thread Andres Freund
On 2022-11-09 09:38:08 -0800, Andres Freund wrote:
> I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
> to push it to HEAD if I get it done in the next few hours. Bigger issues,
> which I do not expect, should show up before tomorrow afternoon. Smaller
> things could wait till Sunday if necessary.

I didn't get to it in time, so I'll leave it for when I'm back.




Document parameter count limit

2022-11-09 Thread David G. Johnston
Inspired by a recent posting on Slack...

diff --git a/doc/src/sgml/limits.sgml b/doc/src/sgml/limits.sgml
index d5b2b627dd..5d68eef093 100644
--- a/doc/src/sgml/limits.sgml
+++ b/doc/src/sgml/limits.sgml
@@ -97,6 +97,13 @@
 32
 can be increased by recompiling
PostgreSQL

+
+   
+parameters per query
+65,535
+if you are reading this preparatorily, please redesign your query
to use temporary tables or arrays
+   
+

   
  
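For anyone who does hit this limit, the usual workaround looks like the sketch
below (illustrative table and column names): instead of one bind parameter per
value, e.g.

    SELECT * FROM items WHERE id IN ($1, $2, /* ... */ $70000);  -- too many parameters

pass a single array parameter:

    SELECT * FROM items WHERE id = ANY($1::bigint[]);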

David J.


Re: Out-of-date clang/llvm version lists in PGAC_LLVM_SUPPORT

2022-11-09 Thread Andres Freund
Hi,

On 2022-11-08 13:08:45 -0500, Tom Lane wrote:
> I happened to notice that these lists of supported versions haven't
> been updated in a good long time:
> 
>   PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-7 llvm-config-6.0 
> llvm-config-5.0 llvm-config-4.0 llvm-config-3.9)
> 
>   PGAC_PATH_PROGS(CLANG, clang clang-7 clang-6.0 clang-5.0 clang-4.0 
> clang-3.9)
> 
> Given the lack of complaints, it seems likely that nobody is relying
> on these.  Maybe we should just nuke them?

Yea, that's probably a good idea. Pretty clear that that should happen only in
HEAD?


> I may be missing it, but it doesn't look like meson.build has any
> equivalent lists.  So that might be an argument for getting rid
> of the lists here?

The list lives inside meson itself; it has a builtin helper for depending on
llvm. So that's not quite an argument, unfortunately.

Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Andres Freund
Hi,

On 2022-11-09 15:03:39 -0800, Peter Geoghegan wrote:
> > > + /*
> > > +  * Add a line pointer offset to the predecessor 
> > > array if xmax is
> > > +  * matching with xmin of next tuple (reaching via 
> > > its t_ctid).
> > > +  * Prior to PostgreSQL 9.4, we actually changed the 
> > > xmin to
> > > +  * FrozenTransactionId
> >
> > I'm doubtful it's a good idea to try to validate the 9.4 case. The 
> > likelihood
> > of getting that right seems low and I don't see us gaining much by even 
> > trying.
> 
> This is the kind of comment that I'd usually agree with, but I
> disagree in this instance because of special considerations that apply
> to amcheck (or should IMV apply, at least). We're living in a world
> where we have to assume that the pre-9.4 format can occur in the
> field. If we can't get it right in amcheck, what chance do we have
> with other new code that tickles the same areas? I think that we need
> to support obsolescent heapam representations (both
> FrozenTransactionId and xvac) here on general principle.

To me this is extending the problem into more areas rather than reducing
it. I'd have *zero* confidence in any warnings that amcheck issued that
involved <9.4 special cases.

We've previously discussed adding a pg_class column tracking the PG version
that last scanned the whole relation. We really should get to that one of
these decades :(.


> Besides, why not accept some small chance of getting this wrong? The
> worst that can happen is that we'll have a relatively benign bug. If
> we get it wrong then it's a short term problem, but also an
> opportunity to be less wrong in the future -- including in places
> where the consequences of being wrong are much more serious.

I think it doesn't just affect the < 9.4 path, but also makes us implement
things differently for >= 9.4. And we lose some accuracy due to that.


> > I doubt it is correct to enter this path with next_xmin ==
> > FrozenTransactionId. This is following a ctid chain that we normally 
> > wouldn't
> > follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin 
> > condition.
> 
> We should never see FrozenTransactionId in an xmax field (nor should
> it be returned by HeapTupleHeaderGetUpdateXid() under any
> circumstances).

The field we check for FrozenTransactionId in the code I was quoting is the
xmin of the follower tuple. We follow the chain if either
cur->xmax == next->xmin or if next->xmin == FrozenTransactionId

What I'm doubting is the FrozenTransactionId path.


> Anyway, it follows that we cannot expect "next_xmin ==
> FrozenTransactionId", because that would mean that we'd called
> HeapTupleHeaderGetUpdateXid() which returned FrozenTransactionId -- an
> impossibility. (Maybe we should be checking that it really is an
> impossibility by checking the HeapTupleHeaderGetUpdateXid() return
> value, but that should be enough.)

next_xmin is acquired via HeapTupleHeaderGetXmin(next_htup), not
HeapTupleHeaderGetUpdateXid(cur_typ).


> > I don't immediately see what prevents the frozen tuple being from an 
> > entirely
> > different HOT chain than the two tuples pointing to it.
> 
> In my view it simply isn't possible for a valid HOT chain to be in
> this state in the first place. So by definition it wouldn't be a HOT
> chain.

We haven't done any visibility checking at this point and my whole point is
that there's no guarantee that the pointed-to tuple actually belongs to the
same hot chain, given that we follow as soon as "xmin == FrozenXid". So the
pointing tuple might be an orphaned tuple.


> That would be a form of corruption, which is something that
> would probably be detected by noticing orphaned heap-only tuples
> (heap-only tuples not reachable from some root item on the same page,
> or some other intermediary heap-only tuple reachable from a root
> item).

You're saying that there's no way that there's a tuple pointing to another
tuple on the same page, with the pointed-to tuple belonging to a different
HOT chain?

I'm fairly certain that that at least used to be possible, and likely is still
possible. Isn't that pretty much what you'd expect to happen if there are
concurrent aborts leading to abandoned HOT chains?



> > Can't we have an update chain that is e.g.
> > xmin 10, xmax 5 -> xmin 5, xmax invalid
> >
> > and a vacuum cutoff of 7? That'd prevent the first tuple from being removed,
> > but would allow 5 to be frozen.
> 
> I don't see how that can be possible. That is contradictory, and
> cannot possibly work, since it supposes a situation where every
> possible MVCC snapshot sees the update that generated the
> second/successor tuple as committed, while at the same time also
> somehow needing the original tuple to stay in place. Surely both
> things can never be true at the same time.

The xmin horizon is very coarse grained. Just because it is 7 doesn't mean
that xid 10 is still runni

Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2022 at 2:08 PM Andres Freund  wrote:
> To start with: I think this is an extremely helpful and important
> feature. Both for checking production systems and for finding problems during
> development.

+1.

It's painful to get this in, in part because we now have to actually
decide what the rules really are with total precision, including for
all of the tricky edge cases. The exercise of writing this code should
"keep us honest" about whether or not we really know what the
invariants are, which is more than half the battle.

> > + /*
> > +  * Add a line pointer offset to the predecessor array 
> > if xmax is
> > +  * matching with xmin of next tuple (reaching via its 
> > t_ctid).
> > +  * Prior to PostgreSQL 9.4, we actually changed the 
> > xmin to
> > +  * FrozenTransactionId
>
> I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
> of getting that right seems low and I don't see us gaining much by even 
> trying.

This is the kind of comment that I'd usually agree with, but I
disagree in this instance because of special considerations that apply
to amcheck (or should IMV apply, at least). We're living in a world
where we have to assume that the pre-9.4 format can occur in the
field. If we can't get it right in amcheck, what chance do we have
with other new code that tickles the same areas? I think that we need
to support obsolescent heapam representations (both
FrozenTransactionId and xvac) here on general principle.

Besides, why not accept some small chance of getting this wrong? The
worst that can happen is that we'll have a relatively benign bug. If
we get it wrong then it's a short term problem, but also an
opportunity to be less wrong in the future -- including in places
where the consequences of being wrong are much more serious.

> I doubt it is correct to enter this path with next_xmin ==
> FrozenTransactionId. This is following a ctid chain that we normally wouldn't
> follow, because it doesn't satisfy the t_self->xmax == t_ctid->xmin condition.

We should never see FrozenTransactionId in an xmax field (nor should
it be returned by HeapTupleHeaderGetUpdateXid() under any
circumstances). We can "freeze xmax" during VACUUM, but that actually
means setting xmax to InvalidTransactionId (in rare cases it might
mean replacing a Multi with a new Multi). The terminology in this area
is a bit tricky.

Anyway, it follows that we cannot expect "next_xmin ==
FrozenTransactionId", because that would mean that we'd called
HeapTupleHeaderGetUpdateXid() which returned FrozenTransactionId -- an
impossibility. (Maybe we should be checking that it really is an
impossibility by checking the HeapTupleHeaderGetUpdateXid() return
value, but that should be enough.)

> I don't immediately see what prevents the frozen tuple being from an entirely
> different HOT chain than the two tuples pointing to it.

In my view it simply isn't possible for a valid HOT chain to be in
this state in the first place. So by definition it wouldn't be a HOT
chain. That would be a form of corruption, which is something that
would probably be detected by noticing orphaned heap-only tuples
(heap-only tuples not reachable from some root item on the same page,
or some other intermediary heap-only tuple reachable from a root
item).

> Can't we have an update chain that is e.g.
> xmin 10, xmax 5 -> xmin 5, xmax invalid
>
> and a vacuum cutoff of 7? That'd prevent the first tuple from being removed,
> but would allow 5 to be frozen.

I don't see how that can be possible. That is contradictory, and
cannot possibly work, since it supposes a situation where every
possible MVCC snapshot sees the update that generated the
second/successor tuple as committed, while at the same time also
somehow needing the original tuple to stay in place. Surely both
things can never be true at the same time.

I believe you're right that an update chain that looks like this one
is possible. However, I don't think it's possible for
OldestXmin/FreezeLimit to take on a value like that (i.e. a value that
"skewers" the update chain like this, the value 7 from your example).
We ought to be able to rely on an OldestXmin value that can never let
such a situation emerge. Right?

--
Peter Geoghegan




Re: [DOCS] Stats views and functions not in order?

2022-11-09 Thread David G. Johnston
On Mon, Nov 7, 2022 at 5:19 PM Peter Smith  wrote:

> On Mon, Nov 7, 2022 at 5:50 AM Tom Lane  wrote:
> >
> > Peter Smith  writes:
> > > Sorry, I forgot the attachments in the previous post. PSA.
> >
> > I spent a bit of time looking at this.  I agree that a lot of the
> > current ordering choices here look like they were made with the
> > advice of a dartboard, and there's a number of things that are
> > pretty blatantly just sloppy merging (like the out-of-order
> > wait-event items).  However, I'm not a big fan of "alphabetical
> > order at all costs", because that frequently leads to ordering
> > decisions that are not a lot better than random from a semantic
> > standpoint.  For example, I resist the idea that it's sensible
> > to put pg_stat_all_indexes before pg_stat_all_tables.
> > I'm unconvinced that putting pg_stat_sys_tables and
> > pg_stat_user_tables far away from pg_stat_all_tables is great,
> > either.
> >
>
> Thanks for taking the time to look at my patch. The "at all costs"
> approach was not the intention - I was just trying only to apply some
> sane ordering where I did not recognize a reason for the current
> order.
>
> > So ... how do we proceed?
> >
>
> To proceed with the existing patches I need some guidance on exactly
> which of the changes can be considered improvements versus which ones
> are maybe just trading one 'random' order for another.
>
> How about below?
>
> Table 28.1. Dynamic Statistics Views -- Alphabetical order would be a
> small improvement here, right?
>

The present ordering seems mostly OK, though, as with the "Progress"
update below, the bottom six pg_stat_progress_* entries should be
alphabetized; leaving them as a group at the end seems desirable.

Move pg_stat_recovery_prefetch either after subscription or after activity
- the replication/received/subscription stuff all seems like it should be
grouped together, as should the security-related ssl/gssapi views.


> Table 28.2. Collected Statistics Views -- Leave this one unchanged
> (per your comments above).
>

I would suggest moving the 3 pg_statio_*_tables rows between the
pg_stat_*_tables and the pg_stat_xact_*_tables groups.

Everything pertaining to cluster, database, tables, indexes, functions,
slru, and replication slots should likewise shift to the (near) top in the
cluster/database grouping.


> Table 28.12 Wait Events of type LWLock -- Seems a clear case of bad
> merging. Alphabetical order is surely needed here, right?
>

+1 Agreed.

>
> Table 28.34 Additional Statistic Functions -- Alphabetical order would
> be a small improvement here, right?
>

No.  All "reset" items should be grouped at the end like they are.  I don't
see an alternative ordering among them that is clearly superior.  Same for
the first four.


>
> Table 28.35 Per-Backend Statistics Functions --  Alphabetical order
> would be a small improvement here, right?
>
>
This one I would rearrange alphabetically.  Or, at least, I have a
different opinion of what would make a decent order but it doesn't seem all
that clearly better than alphabetical.


> > I'd be inclined to alphabetize by SQL command name, but maybe
> > leave Base Backup to the end since it's not a SQL command.
> >
>
> Yes, I had previously only looked at the content of section 28.2
> because I didn't want to get carried away by changing too much until
> there was some support for doing the first part.
>
> Now PSA a separate patch for fixing section "28.4. Progress Reporting"
> order as suggested.
>
>
This seems like a clear win.

David J.


Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Andres Freund
Hi,

To start with: I think this is an extremely helpful and important
feature. Both for checking production systems and for finding problems during
development.


> From 08fe01f5073c0a850541265494bb4a875bec7d3f Mon Sep 17 00:00:00 2001
> From: Himanshu Upadhyaya 
> Date: Fri, 30 Sep 2022 17:44:56 +0530
> Subject: [PATCH v6] Implement HOT chain validation in verify_heapam()
> 
> Himanshu Upadhyaya, reviewed by Robert Haas, Aleksander Alekseev
> 
> Discussion: 
> https://postgr.es/m/CAPF61jBBR2-iE-EmN_9v0hcQEfyz_17e5Lbb0%2Bu2%3D9ukA9sWmQ%40mail.gmail.com
> ---
>  contrib/amcheck/verify_heapam.c   | 207 ++
>  src/bin/pg_amcheck/t/004_verify_heapam.pl | 192 ++--
>  2 files changed, 388 insertions(+), 11 deletions(-)
> 
> diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
> index c875f3e5a2..007f7b2f37 100644
> --- a/contrib/amcheck/verify_heapam.c
> +++ b/contrib/amcheck/verify_heapam.c
> @@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS)
>   for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
>   {
>   OffsetNumber maxoff;
> + OffsetNumber predecessor[MaxOffsetNumber] = {0};
> + OffsetNumber successor[MaxOffsetNumber] = {0};
> + boollp_valid[MaxOffsetNumber] = {false};
>  
>   CHECK_FOR_INTERRUPTS();
>  
> @@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS)
>   for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
>ctx.offnum = OffsetNumberNext(ctx.offnum))
>   {
> + OffsetNumber nextoffnum;
> +
>   ctx.itemid = PageGetItemId(ctx.page, ctx.offnum);
>  
>   /* Skip over unused/dead line pointers */
> @@ -469,6 +474,13 @@ verify_heapam(PG_FUNCTION_ARGS)
>   report_corruption(&ctx,
> 
> psprintf("line pointer redirection to unused item at offset %u",
>   
>(unsigned) rdoffnum));
> +
> + /*
> +  * make entry in successor array, redirected 
> tuple will be
> +  * validated at the time when we loop over 
> successor array
> +  */
> + successor[ctx.offnum] = rdoffnum;
> + lp_valid[ctx.offnum] = true;
>   continue;
>   }
>  
> @@ -504,9 +516,197 @@ verify_heapam(PG_FUNCTION_ARGS)
>   /* It should be safe to examine the tuple's header, at 
> least */
>   ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, 
> ctx.itemid);
>   ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
> + lp_valid[ctx.offnum] = true;
>  
>   /* Ok, ready to check this next tuple */
>   check_tuple(&ctx);
> +
> + /*
> +  * Add the data to the successor array if next updated 
> tuple is in
> +  * the same page. It will be used later to generate the
> +  * predecessor array.
> +  *
> +  * We need to access the tuple's header to populate the
> +  * predecessor array. However the tuple is not 
> necessarily sanity
> +  * checked yet so delaying construction of predecessor 
> array until
> +  * all tuples are sanity checked.
> +  */
> + nextoffnum = 
> ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid);
> + if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == 
> ctx.blkno &&
> + nextoffnum != ctx.offnum)
> + {
> + successor[ctx.offnum] = nextoffnum;
> + }

I don't really understand this logic - why can't we populate the predecessor
array, if we can construct a successor entry?


> + }
> +
> + /*
> +  * Loop over offset and populate predecessor array from all 
> entries
> +  * that are present in successor array.
> +  */
> + ctx.attnum = -1;
> + for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
> +  ctx.offnum = OffsetNumberNext(ctx.offnum))
> + {
> + ItemId  curr_lp;
> + ItemId  next_lp;
> + HeapTupleHeader curr_htup;
> + HeapTupleHeader next_htup;
> + TransactionId curr_xmax;
> + TransactionId next_xmin;
> +
> + OffsetNumber nextoffnum = succes

Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum

2022-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2022 at 6:29 AM Imseih (AWS), Sami  wrote:
> When a user is running a parallel vacuum and the vacuum is long running
>
> due to many large indexes, it would make sense to check for failsafe earlier.

It makes sense to prefer consistency here, I suppose. The reason why
we're not consistent is because it was easier not to be, which isn't
exactly the best reason (nor the worst).

I don't think that it's obvious that we need to call the failsafe at
any particular frequency. There is probably an argument to be made for
the idea that we're not checking frequently enough (even in the common
serial VACUUM case), just as there is an argument to be made for the
opposite idea. It's not like there is some simple linear relationship
(or any kind of relationship) between the amount of physical work
performed by one VACUUM operation, and the rate of XID consumption by
the system as a whole. And so the details of how we do it have plenty
to do with what we can afford to do.

My gut instinct is that the most important thing is to at least call
lazy_check_wraparound_failsafe() once per index scan. Multiple index
scans are disproportionately involved in VACUUMs that take far longer
than expected, which are presumably the kind of VACUUMs that tend to
be running when the failsafe actually triggers. We don't really expect
the failsafe to trigger, so presumably when it actually does things haven't
been going well for some time. (Index corruption that prevents forward
progress on one particular index is another example.)

That said, one thing that does bother me in this area occurs to me: we
call lazy_check_wraparound_failsafe() from lazy_scan_heap() (before we
get to the index scans that you're talking about) at an interval that
is based on how many heap pages we've either processed (and recorded
as a scanned_pages page) *or* have skipped over using the visibility
map. In other words we use blkno here, when we should really be using
scanned_pages instead:

if (blkno - next_failsafe_block >= FAILSAFE_EVERY_PAGES)
{
lazy_check_wraparound_failsafe(vacrel);
next_failsafe_block = blkno;
}

This code effectively treats pages skipped using the visibility map as
equivalent to pages physically scanned (scanned_pages), even though
skipping requires essentially no work at all. That just isn't logical,
and feels like something worth fixing. The fundamental unit of work in
lazy_scan_heap() is a *scanned* heap page.
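
A rough sketch of that adjustment (illustrative only -- it assumes the
scanned_pages counter that lazy_scan_heap() already maintains in its
LVRelState, plus a renamed next_failsafe_* variable; not a drop-in patch):

    /* trigger the check based on pages actually scanned, not block position */
    if (vacrel->scanned_pages - next_failsafe_scanned_pages >= FAILSAFE_EVERY_PAGES)
    {
        lazy_check_wraparound_failsafe(vacrel);
        next_failsafe_scanned_pages = vacrel->scanned_pages;
    }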

--
Peter Geoghegan




Re: Expand palloc/pg_malloc API

2022-11-09 Thread Tom Lane
Peter Eisentraut  writes:
> On 11.10.22 18:04, Tom Lane wrote:
>> Hmm ... the individual allocators have that info, but mcxt.c doesn't
>> have access to it.  I guess we could invent an additional "method"
>> to return the requested size of a chunk, which is only available in
>> MEMORY_CONTEXT_CHECKING builds, or maybe in !MEMORY_CONTEXT_CHECKING
>> it returns the allocated size instead.

> I'm not sure whether that amount of additional work would be useful 
> relative to the size of this patch.  Is the patch as it stands now 
> making the code less robust than what the code is doing now?

No.  It's slightly annoying that the call sites still have to track
the old size of the allocation, but I guess that they have to have
that information in order to know that they need to repalloc in the
first place.  I agree that this patch does make things easier to
read and a bit less error-prone.

Also, I realized that what I proposed above doesn't really work
anyway for this purpose.  Consider

ptr = palloc(size);
... fill all "size" bytes ...
ptr = repalloc0(ptr, size, newsize);

where the initial size request isn't a power of 2.  If production builds
rely on the initial allocated size not requested size to decide where to
start zeroing, this would work (no uninitialized holes) in a debug build,
but leave some uninitialized bytes in a production build, which is
absolutely horrible.  So I guess we have to rely on the callers to
track their requests.
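
To put concrete (hypothetical) numbers on that: suppose the allocator
rounds a 100-byte request up to a 128-byte chunk, and repalloc0 asked the
allocator for the old size instead of taking it from the caller:

    ptr = palloc(100);          /* chunk is really 128 bytes */
    ... fill bytes 0..99 ...
    ptr = repalloc0(ptr, 200);  /* hypothetical no-oldsize variant */
    /*
     * MEMORY_CONTEXT_CHECKING build: the requested size (100) is known,
     * so bytes 100..199 get zeroed.  Production build: only the 128-byte
     * chunk size is known, so zeroing starts at 128 and bytes 100..127
     * stay uninitialized.
     */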

> In the meantime, here is an updated patch with the argument order 
> swapped and an additional assertion, as previously discussed.

I think it'd be worth expending an actual runtime test in repalloc0,
that is

if (unlikely(oldsize > size))
elog(ERROR, "invalid repalloc0 call: oldsize %zu, new size %zu",
 oldsize, size);

This seems cheap compared to the cost of the repalloc+memset, and the
consequences of not detecting the error seem pretty catastrophic ---
the memset would try to zero almost your whole address space.

No objections beyond that.  I've marked this RFC.

regards, tom lane




Re: real/float example for testlibpq3

2022-11-09 Thread Mark Wong
On Thu, Nov 03, 2022 at 09:55:22AM -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
> > On 01.11.22 09:15, Tom Lane wrote:
> >> Agreed that the libpq manual is not the place for this, but I feel
> >> like it will also be clutter in "Data Types".  Perhaps we should
> >> invent a new appendix or the like?  Somewhere near the wire protocol
> >> docs seems sensible.
> 
> > Would that clutter the protocol docs? ;-)
> 
> I said "near", not "in".  At the time I was thinking "new appendix",
> but I now recall that the wire protocol docs are not an appendix
> but a chapter in the Internals division.  So that doesn't seem like
> quite the right place anyway.
> 
> Perhaps a new chapter under "IV. Client Interfaces" is the right
> place?
> 
> If we wanted to get aggressive, we could move most of the nitpicky details
> about datatype text formatting (e.g., the array quoting rules) there too.
> I'm not set on that, but it'd make datatype.sgml smaller which could
> hardly be a bad thing.
> 
> > I suppose figuring out exactly where to put it and how to mark it up, 
> > etc., in a repeatable fashion is part of the job here.
> 
> Yup.

I'll take a stab at adding a new chapter and share how that looks.

Regards,
Mark

--
Mark Wong
EDB https://enterprisedb.com




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread David Christensen
On Wed, Nov 9, 2022 at 2:08 PM David Christensen
 wrote:
> Justin sez:
> > I was wondering if there's any reason to do "CREATE DATABASE".  The vast
> > majority of TAP tests don't.
> >
> > $ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head
> >1435 safe_psql('postgres',
> > 335 safe_psql(
> >  23 safe_psql($connect_db,
>
> If there was a reason, I don't recall offhand; I will test removing it
> and if things still work will consider it good enough.

Things blew up when I did that; rather than hunt it down, I just left it in. :-)

Enclosed is v7, with the changes suggested thus far.


v7-0001-Teach-pg_waldump-to-extract-FPIs-from-the-WAL-str.patch
Description: Binary data


Re: allow segment size to be set to < 1GiB

2022-11-09 Thread Andres Freund
Hi,

On 2022-11-09 14:44:42 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > A second question: Both autoconf and meson print the segment size as GB 
> > right
> > now. Obviously that'll print out a size of 0 for a segsize < 1GB.
> 
> > The easiest way to would be to just display the number of blocks, but that's
> > not particularly nice.
> 
> Well, it would be fine if you'd written --with-segsize-blocks, wouldn't
> it?  Can we make the printout format depend on which switch was used?

Not sure why I didn't think of that...

Updated patch attached.

I made one autoconf and one meson CI task use a small segment size, just to
ensure it works on both. I'd probably leave it set on one, so we keep the
coverage for cfbot?

Greetings,

Andres Freund
>From 7a76be232dd0587e9c84af4e4481d5a98f94bd22 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 9 Nov 2022 12:01:54 -0800
Subject: [PATCH v2] Add option to specify segment size in blocks

The tests don't have much coverage of segment related code, as we don't create
large enough tables. To make it easier to test these paths, add a new option
specifying the segment size in blocks.

Set the new option to 6 blocks in one of the CI tasks. Smaller numbers
currently fail one of the tests, for understandable reasons.

While at it, fix some segment size related issues in the meson build.

Author: Andres Freund 
Discussion: https://postgr.es/m/20221107171355.c23fzwanfzq2p...@awork3.anarazel.de
---
 meson.build| 24 ++---
 meson_options.txt  |  3 ++
 configure.ac   | 36 ++-
 doc/src/sgml/installation.sgml | 14 
 .cirrus.yml|  2 ++
 configure  | 63 --
 6 files changed, 118 insertions(+), 24 deletions(-)

diff --git a/meson.build b/meson.build
index ce2f223a409..b0fcf82a9c2 100644
--- a/meson.build
+++ b/meson.build
@@ -418,7 +418,19 @@ meson_bin = find_program(meson_binpath, native: true)
 
 cdata.set('USE_ASSERT_CHECKING', get_option('cassert') ? 1 : false)
 
-cdata.set('BLCKSZ', get_option('blocksize').to_int() * 1024, description:
+blocksize = get_option('blocksize').to_int() * 1024
+
+if get_option('segsize_blocks') != 0
+  if get_option('segsize') != 1
+warning('both segsize and segsize_blocks specified, segsize_blocks wins')
+  endif
+
+  segsize = get_option('segsize_blocks')
+else
+  segsize = (get_option('segsize') * 1024 * 1024 * 1024) / blocksize
+endif
+
+cdata.set('BLCKSZ', blocksize, description:
 '''Size of a disk block --- this also limits the size of a tuple. You can set
it bigger if you need bigger tuples (although TOAST should reduce the need
to have large tuples, since fields can be spread across multiple tuples).
@@ -428,7 +440,7 @@ cdata.set('BLCKSZ', get_option('blocksize').to_int() * 1024, description:
Changing BLCKSZ requires an initdb.''')
 
 cdata.set('XLOG_BLCKSZ', get_option('wal_blocksize').to_int() * 1024)
-cdata.set('RELSEG_SIZE', get_option('segsize') * 131072)
+cdata.set('RELSEG_SIZE', segsize)
 cdata.set('DEF_PGPORT', get_option('pgport'))
 cdata.set_quoted('DEF_PGPORT_STR', get_option('pgport').to_string())
 cdata.set_quoted('PG_KRB_SRVNAM', get_option('krb_srvnam'))
@@ -3053,9 +3065,11 @@ if meson.version().version_compare('>=0.57')
 
   summary(
 {
-  'data block size': cdata.get('BLCKSZ'),
-  'WAL block size': cdata.get('XLOG_BLCKSZ') / 1024,
-  'segment size': cdata.get('RELSEG_SIZE') / 131072,
+  'data block size': '@0@ kB'.format(cdata.get('BLCKSZ') / 1024),
+  'WAL block size': '@0@ kB'.format(cdata.get('XLOG_BLCKSZ') / 1024),
+  'segment size': get_option('segsize_blocks') != 0 ?
+'@0@ blocks'.format(cdata.get('RELSEG_SIZE')) :
+'@0@ GB'.format(get_option('segsize')),
 },
 section: 'Data layout',
   )
diff --git a/meson_options.txt b/meson_options.txt
index c7ea57994dc..0d2a34fd154 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -13,6 +13,9 @@ option('wal_blocksize', type : 'combo',
 option('segsize', type : 'integer', value : 1,
   description : '''Segment size, in gigabytes''')
 
+option('segsize_blocks', type : 'integer', value: 0,
+  description : '''Segment size, in blocks''')
+
 
 # Miscellaneous options
 
diff --git a/configure.ac b/configure.ac
index f76b7ee31fc..94542e862cf 100644
--- a/configure.ac
+++ b/configure.ac
@@ -285,15 +285,31 @@ AC_DEFINE_UNQUOTED([BLCKSZ], ${BLCKSZ}, [
 #
 # Relation segment size
 #
-AC_MSG_CHECKING([for segment size])
 PGAC_ARG_REQ(with, segsize, [SEGSIZE], [set table segment size in GB [1]],
  [segsize=$withval],
  [segsize=1])
-# this expression is set up to avoid unnecessary integer overflow
-# blocksize is already guaranteed to be a factor of 1024
-RELSEG_SIZE=`expr '(' 1024 / ${blocksize} ')' '*' ${segsize} '*' 1024`
-test $? -eq 0 || exit 1
-AC_MSG_RESULT([${segsize}GB])
+PGAC_ARG_REQ(with, segsize-blocks, [SEGSIZE_BLOCKS], [set ta

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread David Christensen
> > 6.
> > +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> > Use pg_dir_create_mode instead of hard-coded 0700?
>
> I think I thought of that when I first looked at the patch ...  but, I'm
> not sure, since it says:
>
> src/include/common/file_perm.h-/* Modes for creating directories and files IN 
> THE DATA DIRECTORY */
> src/include/common/file_perm.h:extern PGDLLIMPORT int pg_dir_create_mode;

Looks like it's pretty evenly split in src/bin:

$ git grep -o -E -w '(pg_mkdir_p|mkdir)' '**.c' | sort | uniq -c
  3 initdb/initdb.c:mkdir
  3 initdb/initdb.c:pg_mkdir_p
  1 pg_basebackup/bbstreamer_file.c:mkdir
  2 pg_basebackup/pg_basebackup.c:pg_mkdir_p
  1 pg_dump/pg_backup_directory.c:mkdir
  1 pg_rewind/file_ops.c:mkdir
  4 pg_upgrade/pg_upgrade.c:mkdir

So if that is the preferred approach I'll go ahead and use it.

> I was wondering if there's any reason to do "CREATE DATABASE".  The vast
> majority of TAP tests don't.
>
> $ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head
>1435 safe_psql('postgres',
> 335 safe_psql(
>  23 safe_psql($connect_db,

If there was a reason, I don't recall offhand; I will test removing it
and if things still work will consider it good enough.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread David Christensen
On Wed, Nov 9, 2022 at 6:30 AM Bharath Rupireddy
 wrote:
>
> On Wed, Nov 9, 2022 at 5:08 AM David Christensen
>  wrote:
> >
> > Enclosed is v6, which squashes your refactor and adds the additional
> > recent suggestions.
>
> Thanks for working on this feature. Here are some comments for now. I
> haven't looked at the tests, I will take another look at the code and
> tests after these and all other comments are addressed.
>
> 1. For ease of review, please split the test patch to 0002.

Per the later discussion, it seems like tests for a new feature are fine in
the same patch, yes?

> 2. I'm unable to understand the use-case for --fixup-fpi option.
> pg_waldump is supposed to be just WAL reader, and must not return any
> modified information, with --fixup-fpi option, the patch violates this
> principle i.e. it sets page LSN and returns. Without actually
> replaying the WAL record on the page, how is it correct to just set
> the LSN? How will it be useful? ISTM, we must ignore this option
> unless there's a strong use-case.

How I was envisioning this was for cases like extreme surgery on
corrupted pages: you extract the page from WAL, but with its LSN and
checksum already set, so you could do something like `dd if=fixup-block
of=relation ...`.  In other words, it *simulates* the replay of the
full-page image for cases where, for whatever reason, you can't replay
the intermediate records (say, an incomplete or corrupted WAL stream).
Since it is always a full-page image, it captures what the page snapshot
would be, so you can manually place it wherever needed without replaying
anything.

> 3.
> +if (fork >= 0 && fork <= MAX_FORKNUM)
> +{
> +if (fork)
> +sprintf(forkname, "_%s", forkNames[fork]);
> +else
> +forkname[0] = 0;
> +}
> +else
> +pg_fatal("found invalid fork number: %u", fork);
>
> Isn't the above complex? What's the problem with something like below?
> Why do we need if (fork) - else block?
>
> if (fork >= 0  && fork <= MAX_FORKNUM)
> sprintf(forkname, "_%s", forkNames[fork]);
> else
> pg_fatal("found invalid fork number: %u", fork);

This was to suppress any suffix for the main fork, but yes, it could be
simplified by including the `_` in the suffix name.  I will include such a
change.
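
A minimal sketch of that simplification (illustrative; it assumes the
existing forkname buffer and the forkNames[] array):

    if (fork < 0 || fork > MAX_FORKNUM)
        pg_fatal("found invalid fork number: %u", fork);

    /* the main fork gets no suffix; others get "_fsm", "_vm", "_init" */
    forkname[0] = '\0';
    if (fork != MAIN_FORKNUM)
        snprintf(forkname, sizeof(forkname), "_%s", forkNames[fork]);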

> 3.
> +charpage[BLCKSZ] = {0};
> I think when writing to a file, we need PGAlignedBlock rather than a
> simple char array of bytes, see the description around PGAlignedBlock
> for why it is so.

Easy enough change, and makes sense.

> 4.
> +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Why pg_pwrite(), why not just fwrite()? If fwrite() is used, you can
> avoid fileno() system calls, no? Do you need the file position to
> remain the same after writing, hence pg_pwrite()?

I don't recall the original motivation, TBH.

> 5.
> +if (!RestoreBlockImage(record, block_id, page))
> +continue;
> +
> +/* we have our extracted FPI, let's save it now */
> After extracting the page from the WAL record, do we need to perform a
> checksum on it?

That is there in fixup mode (or should be).  Are you thinking this
should also be set if not in fixup mode?  That defeats the purpose of
the raw page extract, which is to see *exactly* what the WAL stream
has.

> 6.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Use pg_dir_create_mode instead of hard-coded 0700?

Sure.

> 7.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> +fsync(fileno(OPF));
> +fclose(OPF);
> Since you're creating the directory in case it's not available, you
> need to fsync the directory too?

Sure.

> 8.
> +case 'W':
> +case 'X':
> +config.fixup_fpw = (option == 'X');
> +config.save_fpw_path = pg_strdup(optarg);
> +break;
> Just set  config.fixup_fpw = false before the switch block starts,
> like the other variables, and then perhaps doing like below is more
> readable:
> case 'W':
> config.save_fpw_path = pg_strdup(optarg);
> case 'X':
>config.fixup_fpw = true;
>config.save_fpw_path = pg_strdup(optarg);

Like separate option processing, each with its own `break` statement?
Probably a bit more readable/conventional.
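
Something along these lines, for illustration (assumes the existing config
struct; not the final patch):

    case 'W':                   /* raw FPI extraction */
        config.fixup_fpw = false;
        config.save_fpw_path = pg_strdup(optarg);
        break;
    case 'X':                   /* FPI extraction with LSN/checksum fixup */
        config.fixup_fpw = true;
        config.save_fpw_path = pg_strdup(optarg);
        break;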

> 9.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Should we use pg_mkdir_p() instead of mkdir()?

Sure.

Thanks,

David




Re: allow segment size to be set to < 1GiB

2022-11-09 Thread Tom Lane
Andres Freund  writes:
> A second question: Both autoconf and meson print the segment size as GB right
> now. Obviously that'll print out a size of 0 for a segsize < 1GB.

> The easiest way to would be to just display the number of blocks, but that's
> not particularly nice.

Well, it would be fine if you'd written --with-segsize-blocks, wouldn't
it?  Can we make the printout format depend on which switch was used?

regards, tom lane




Re: allow segment size to be set to < 1GiB

2022-11-09 Thread Andres Freund
On 2022-11-08 18:28:08 -0800, Andres Freund wrote:
> On 2022-11-07 21:36:33 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2022-11-07 12:52:25 -0500, Tom Lane wrote:
> > >> How about instead allowing the segment size to be set in pages?
> > 
> > > In addition or instead of --with-segsize/-Dsegsize?
> > 
> > In addition to.  What I meant by "instead" was to replace
> > your proposal of --with-segsize-mb.
> 
> Working on updating the patch.
> 
> One semi-interesting bit is that <= 5 blocks per segment fails, because
> corrupt_page_checksum() doesn't know about segments and
> src/bin/pg_basebackup/t/010_pg_basebackup.pl does

A second question: Both autoconf and meson print the segment size as GB right
now. Obviously that'll print out a size of 0 for a segsize < 1GB.

The easiest way to would be to just display the number of blocks, but that's
not particularly nice. We could show kB, but that ends up being large. Or we
can have some code to adjust the unit, but that seems a bit overkill.

Opinions?




Re: psql: Add command to use extended query protocol

2022-11-09 Thread Daniel Verite
Peter Eisentraut wrote:

> Is there a use case for a global setting?

I assume that we may sometimes want to use the
extended protocol on all queries of a script, like
pgbench does with --protocol=extended.
Outside of psql, it's too complicated to parse a SQL script to
replace the end-of-query semicolons with \gp, whereas
a psql setting solves this effortlessly.
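
As an illustration (the setting name below is made up, just to show the
idea):

    -- today, per query:
    SELECT * FROM t \gp

    -- with a global setting, a whole script could run unchanged:
    \set SEND_MODE extended
    SELECT * FROM t;
    UPDATE t SET x = 1;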


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: refactor ownercheck and aclcheck functions

2022-11-09 Thread Corey Huinker
>
> After considering this again, I decided to brute-force this and get rid
> of all the trivial wrapper functions and also several of the special
> cases.  That way, there is less confusion at the call sites about why
> this or that style is used in a particular case.  Also, it now makes
> sure you can't accidentally use the generic functions when a particular
> one should be used.
>

+1

However, the aclcheck patch isn't applying for me now. That patch modifies
37 files, so it's hard to say just which commit conflicts.


Re: [PATCH] Support % wildcard in extension upgrade filenames

2022-11-09 Thread Sandro Santilli
And now, a version of the patch including documentation and regression tests.
Anything you see I should improve?
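
For reference, the intended usage looks roughly like this (the extension
name is made up; the regression test in the patch shows the real thing):

    # my_ext.control
    default_version = '2.0'
    wildcard_upgrades = true

    # upgrade script usable from any installed version:
    #   my_ext--%--2.0.sql
    # It is only consulted when no exact script such as
    # my_ext--1.3--2.0.sql exists for the requested upgrade path.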

--strk;

On Fri, Nov 04, 2022 at 06:31:58PM +0100, Sandro Santilli wrote:
> On Mon, Oct 31, 2022 at 01:55:05AM -0400, Regina Obe wrote:
> > 
> > Sandro, can you submit an updated version of this patch.
> > I was testing it out and looked good first time.
> 
> Attached updated version of the patch.
> 
> > If everyone is okay with this patch, we'll go ahead and add tests and
> > documentation to go with it.
> 
> Yes please let us know if there's any chance of getting this
> mechanism approved or if we should keep advertising works around
> this limitation (it's not just installing 100s of files but also
> still missing needed upgrade paths when doing so).
> 
> 
> --strk;
>From 4b10297315d2db4365e6f47e375588394b260d0d Mon Sep 17 00:00:00 2001
From: Sandro Santilli 
Date: Wed, 14 Sep 2022 11:10:10 +0200
Subject: [PATCH] Allow wildcard (%) in extension upgrade paths

A wildcard character "%" will be accepted in the
"source" side of the upgrade script and be considered
usable to upgrade any version to the "target" side.

Using wildcards needs to be explicitly requested by
extensions via a "wildcard_upgrades" setting in their
control file.

Includes regression test and documentation.
---
 doc/src/sgml/extend.sgml  | 14 +
 src/backend/commands/extension.c  | 58 +--
 src/test/modules/test_extensions/Makefile |  6 +-
 .../expected/test_extensions.out  | 15 +
 .../test_extensions/sql/test_extensions.sql   |  7 +++
 .../test_ext_wildcard1--%--2.0.sql|  6 ++
 .../test_ext_wildcard1--1.0.sql   |  6 ++
 .../test_ext_wildcard1.control|  4 ++
 8 files changed, 108 insertions(+), 8 deletions(-)
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--%--2.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--1.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1.control

diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index 46e873a166..4012652574 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -807,6 +807,20 @@ RETURNS anycompatible AS ...

   
  
+
+ 
+  wildcard_upgrades (boolean)
+  
+   
+This parameter, if set to true (which is not the
+default), allows ALTER EXTENSION to consider
+a wildcard character % as matching any version of
+the extension. Such wildcard match will only be used when no
+perfect match is found for a version.
+   
+  
+ 
+
 
 
 
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 1a62e5dac5..ea8825fcff 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -86,6 +86,7 @@ typedef struct ExtensionControlFile
 	bool		relocatable;	/* is ALTER EXTENSION SET SCHEMA supported? */
 	bool		superuser;		/* must be superuser to install? */
 	bool		trusted;		/* allow becoming superuser on the fly? */
+	bool		wildcard_upgrades;  /* allow using wildcards in upgrade scripts */
 	int			encoding;		/* encoding of the script file, or -1 */
 	List	   *requires;		/* names of prerequisite extensions */
 } ExtensionControlFile;
@@ -128,6 +129,7 @@ static void ApplyExtensionUpdates(Oid extensionOid,
   bool cascade,
   bool is_create);
 static char *read_whole_file(const char *filename, int *length);
+static bool file_exists(const char *name);
 
 
 /*
@@ -579,6 +581,14 @@ parse_extension_control_file(ExtensionControlFile *control,
 		 errmsg("parameter \"%s\" requires a Boolean value",
 item->name)));
 		}
+		else if (strcmp(item->name, "wildcard_upgrades") == 0)
+		{
+			if (!parse_bool(item->value, &control->wildcard_upgrades))
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg("parameter \"%s\" requires a Boolean value",
+item->name)));
+		}
 		else if (strcmp(item->name, "encoding") == 0)
 		{
 			control->encoding = pg_valid_server_encoding(item->value);
@@ -636,6 +646,7 @@ read_extension_control_file(const char *extname)
 	control->relocatable = false;
 	control->superuser = true;
 	control->trusted = false;
+	control->wildcard_upgrades = false;
 	control->encoding = -1;
 
 	/*
@@ -890,7 +901,15 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 	if (from_version == NULL)
 		elog(DEBUG1, "executing extension script for \"%s\" version '%s'", control->name, version);
 	else
+	{
+		if ( control->wildcard_upgrades && ! file_exists(filename) )
+		{
+			elog(DEBUG1, "extension upgrade script \"%s\" does not exist, will try wildcard", filename);
+			/* if filename does not exist, try wildcard */
+			filename = get_extension_script_filename(control, "%", version);
+		}
 		elog(DEBUG1, "executing extension script for \"%s\" update from version '%s

Re: heavily contended lwlocks with long wait queues scale badly

2022-11-09 Thread Andres Freund
Hi,

On 2022-11-09 15:54:16 +0530, Bharath Rupireddy wrote:
> On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy
>  wrote:
> >
> > > Updated patch attached.
> >
> > Thanks. It looks good to me. However, some minor comments on the v3 patch:
> >
> > 1.
> > -if (MyProc->lwWaiting)
> > +if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
> >  elog(PANIC, "queueing for lock while waiting on another one");
> >
> > Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
> > MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
> >
> > Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
> > MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
> > LW_WS_WAITING?

I don't think that's a good idea - it'll just mean we have to modify more
places if we add another state, without making anything more robust.


> > 2.
> > /* Awaken any waiters I removed from the queue. */
> > proclist_foreach_modify(iter, &wakeup, lwWaitLink)
> > {
> >
> > @@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
> >   * another lock.
> >   */
> >  pg_write_barrier();
> > -waiter->lwWaiting = false;
> > +waiter->lwWaiting = LW_WS_NOT_WAITING;
> >  PGSemaphoreUnlock(waiter->sem);
> >  }
> >
> > /*
> >  * Awaken any waiters I removed from the queue.
> >  */
> > proclist_foreach_modify(iter, &wakeup, lwWaitLink)
> > {
> > PGPROC   *waiter = GetPGProcByNumber(iter.cur);
> >
> > proclist_delete(&wakeup, iter.cur, lwWaitLink);
> > /* check comment in LWLockWakeup() about this barrier */
> > pg_write_barrier();
> > waiter->lwWaiting = LW_WS_NOT_WAITING;
> >
> > Can we add an assertion Assert(waiter->lwWaiting ==
> > LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
> > list and set the LW_WS_NOT_WAITING flag in above loops, having an
> > assertion is better here IMO.

I guess it can't hurt - but it's not really related to the changes in the
patch, no?


> I looked at the v3 patch again today and ran some performance tests.
> The results look impressive as they were earlier. Andres, any plans to
> get this in?

I definitely didn't want to backpatch before this point release.  But it seems
we haven't quite reached an agreement on what to do about backpatching.  It's
probably best to just commit it to HEAD and let the backpatch discussion
happen concurrently.

I'm on a hike, without any connectivity, Thu afternoon - Sun. I think it's OK
to push it to HEAD if I get it done in the next few hours. Bigger issues,
which I do not expect, should show up before tomorrow afternoon. Smaller
things could wait till Sunday if necessary.

Greetings,

Andres Freund




Re: Remove redundant declaration for XidInMVCCSnapshot

2022-11-09 Thread Alvaro Herrera
On 2022-Nov-09, Japin Li wrote:

> Commit b7eda3e0e3 moved XidInMVCCSnapshot into snapmgr.{c,h};
> however, it forgot to remove the declaration of XidInMVCCSnapshot
> from heapam.h.

True.  Pushed, thanks.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
#error "Operator lives in the wrong universe"
  ("Use of cookies in real-time system development", M. Gleixner, M. Mc Guire)




Re: Error for WITH options on partitioned tables

2022-11-09 Thread Tom Lane
Simon Riggs  writes:
> Karina's changes make sense to me, so +1.
> This is a minor patch, so I will set this as Ready For Committer.

Pushed with minor fiddling:

* I concur with Karina's thought that ERRCODE_WRONG_OBJECT_TYPE
is the most on-point errcode for this.  The complaint is specifically
about the table relkind and has nothing to do with the storage
parameter per se.  I also agree that it's not worth trying to use
a different errcode for CREATE vs. ALTER.

* The HINT message wasn't per project style (it should be formatted as
a complete sentence), and I thought using "parameters for" in the
main message but "parameters on" in the hint was oddly inconsistent.

regards, tom lane




Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-09 Thread Peter Eisentraut

On 08.08.22 17:21, Yugo NAGATA wrote:

In fact, the result of IsInTransactionBlock does not make sense at
all in pipeline mode, regardless of the fix.  ANALYZE could commit all
previous commands when pipelining, and this may not be the behaviour
the user expects.

This seems pretty much isomorphic to the fact that CREATE DATABASE
will commit preceding steps in the pipeline.

I am not sure we can treat the CREATE DATABASE case and the ANALYZE case
similarly. First, CREATE DATABASE is one of the commands that cannot be
executed inside a transaction block, but ANALYZE can be. So, users would
not be able to tell from the documentation that ANALYZE in a pipeline
causes a commit. Second, ANALYZE issues a commit internally at an early
stage, not only after it finishes successfully. For example, even if
ANALYZE is failing because a nonexistent column name is specified, it
issues a commit before checking the column name. This makes it harder
to know which statements will be committed and which will not be
committed in a pipeline. Also, as you know, there are other commands
that issue internal commits.


This has broken the following use:

parse: create temporary table t1 (a int) on commit drop
bind
execute
parse: analyze t1
bind
execute
parse: select * from t1
bind
execute
sync

I think the behavior of IsInTransactionBlock() needs to be further 
refined to support this.  If we are worried about external callers, 
maybe we need to provide a separate version.  AFAICT, all the callers of 
IsInTransactionBlock() over time have been in vacuum/analyze-related 
code, so perhaps in master we should just move it there.






Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-09 Thread Tom Lane
Peter Eisentraut  writes:
> This has broken the following use:

> parse: create temporary table t1 (a int) on commit drop
> bind
> execute
> parse: analyze t1
> bind
> execute
> parse: select * from t1
> bind
> execute
> sync

> I think the behavior of IsInTransactionBlock() needs to be further 
> refined to support this.

Hmm.  Maybe the right way to think about this is "if we have completed an
EXECUTE, and not yet received a following SYNC, then report that we are in
a transaction block"?  But I'm not sure if that breaks any other cases.

regards, tom lane




Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-09 Thread Tom Lane
Yugo NAGATA  writes:
> Tom Lane  wrote:
>> What do you think of the attached wording?

> It looks good to me. That describes the expected behaviour exactly.

Pushed that, then.

>> I don't think the pipeline angle is of concern to anyone who might be
>> reading these comments with the aim of understanding what guarantees
>> they have.  Perhaps there should be more about that in the user-facing
>> docs, though.

> I agree with that we don't need to mention pipelining in these comments,
> and that we need more in the documentation. I attached a doc patch to add
> a mention of commands that do internal commit to the pipelining section.
> Also, this adds a reference for the pipelining protocol to the libpq doc.

Hmm ... I don't really find either of these changes to be improvements.
The fact that, say, multi-table ANALYZE uses multiple transactions
seems to me to be a property of that statement, not of the protocol.

regards, tom lane




Re: Add tracking of backend memory allocated to pg_stat_activity

2022-11-09 Thread Andres Freund
Hi,

2022-11-09 08:54:54 -0500, Reid Thompson wrote:
> Thanks for looking at this and for the feedback. Responses inline below.
> > > +void
> > > +pgstat_report_backend_allocated_bytes_decrease(uint64
> > > deallocation)
> > > +{
> > > +   volatile PgBackendStatus *beentry = MyBEEntry;
> > > +
> > > +   /*
> > > +    * Cases may occur where shared memory from a previous postmaster
> > > +    * invocation still exist. These are cleaned up at startup by
> > > +    * dsm_cleanup_using_control_segment. Limit decreasing memory 
> > > allocated to
> > > +    * zero in case no corresponding prior increase exists or 
> > > decrease has
> > > +    * already been accounted for.
> > > +    */
> > 
> > I don't really follow - postmaster won't ever have a backend status
> > array, so how would they be tracked here?
> 
> On startup, a check is made for leftover dsm control segments in the
> DataDir. It appears possible that in certain situations on startup we
> may find and destroy stale segments and thus decrement the allocation
> variable. 
> 
> I based this off of:
> /ipc/dsm.c
> 
> dsm_postmaster_startup:
>  150 dsm_postmaster_startup(PGShmemHeader *shim)
>  {
> ...
>  281 }

I don't think we should account for memory allocations done in postmaster in
this patch. They'll otherwise be counted again in each of the forked
backends. As this cleanup happens during postmaster startup, we'll have to
make sure accounting is reset during backend startup.

Greetings,

Andres Freund




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread Alvaro Herrera
On 2022-Nov-09, Justin Pryzby wrote:

> On Wed, Nov 09, 2022 at 06:00:40PM +0530, Bharath Rupireddy wrote:

> > 1. For ease of review, please split the test patch to 0002.
> 
> This is just my opinion, but .. why ?  Since it's easy to
> filter/skip/display a file, I don't think it's usually useful to have
> separate patches for tests or docs.

I concur with Justin.  When a patch is a bugfix and a test is added that
verifies it, I like to keep the test in a separate commit (for submit
purposes and in my personal repo -- not for the official push!) so that
I can git-checkout to just the test and make sure it fails ahead of
pushing the fix commit.  But for a new feature, there's no reason to do
so.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Call lazy_check_wraparound_failsafe earlier for parallel vacuum

2022-11-09 Thread Imseih (AWS), Sami
While looking through the vacuum code, I noticed that,
unlike non-parallel vacuum, parallel vacuum only gets
a failsafe check after an entire index cycle completes.

In vacuumlazy.c, lazy_check_wraparound_failsafe is checked
after every index completes, while in a parallel vacuum it is checked
only after an entire index cycle has completed.

    if (!ParallelVacuumIsActive(vacrel))
    {
        for (int idx = 0; idx < vacrel->nindexes; idx++)
        {
            Relation    indrel = vacrel->indrels[idx];
            IndexBulkDeleteResult *istat = vacrel->indstats[idx];

            vacrel->indstats[idx] =
                lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
                                      vacrel);

            /*
             * Done vacuuming an index. Increment the indexes completed
             */
            pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
                                         idx + 1);

            if (lazy_check_wraparound_failsafe(vacrel))
            {
                /* Wraparound emergency -- end current index scan */
                allindexes = false;
                break;
            }
        }
    }
    else
    {
        /* Outsource everything to parallel variant */
        parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, vacrel->old_live_tuples,
                                            vacrel->num_index_scans);

        /*
         * Do a postcheck to consider applying wraparound failsafe now.  Note
         * that parallel VACUUM only gets the precheck and this postcheck.
         */
        if (lazy_check_wraparound_failsafe(vacrel))
            allindexes = false;
    }

When a user is running a parallel vacuum and the vacuum is long running
due to many large indexes, it would make sense to check for failsafe earlier.

Also, checking after every index for parallel vacuum will provide the same
failsafe behavior for both parallel and non-parallel vacuums.

To make this work, it is possible to call lazy_check_wraparound_failsafe
inside parallel_vacuum_process_unsafe_indexes and
parallel_vacuum_process_safe_indexes of vacuumparallel.c


Regards,

Sami Imseih
Amazon Web Services (AWS)



Re: Add tracking of backend memory allocated to pg_stat_activity

2022-11-09 Thread Reid Thompson
Hi Melanie,
Thank you for looking at this and for the feedback. Responses inline
below.

On Mon, 2022-11-07 at 16:17 -0500, Melanie Plageman wrote:
> On Fri, Nov 04, 2022 at 08:56:13AM -0400, Reid Thompson wrote:
> > From a8de5d29c0c6f10962181926a49ad4fec1e52bd1 Mon Sep 17 00:00:00
> > 2001
> > From: Reid Thompson 
> > Date: Thu, 11 Aug 2022 12:01:25 -0400
> > Subject: [PATCH] Add tracking of backend memory allocated to 
> > pg_stat_activity
> > +
> 
> It doesn't seem like you need the backend_ prefix in the view since
> that is implied by it being in pg_stat_activity.

I will remove the prefix.

> For the wording on the description, I find "memory allocated to this
> backend" a bit confusing. Perhaps you could reword it to make clear
> you mean that the number represents the balance of allocations by this
> backend. Something like:
> 
> Memory currently allocated to this backend in bytes. This is the
> balance of bytes allocated and freed by this backend.
> I would also link to the system administration function
> pg_size_pretty() so users know how to easily convert the value.

Thanks, I'll make these changes

> If you end up removing shared memory as Andres suggests in [1], I
> would link pg_shmem_allocations view here and point out that shared memory
> allocations can be viewed there instead (and why).
> 
> You could instead add dynamic shared memory allocation to the
> pg_shmem_allocations view as suggested as follow-on work by the
> commit which introduced it, ed10f32e3.
> 
> > +/* 
> > + * pgstat_report_backend_allocated_bytes_increase() -
> > + *
> > + * Called to report increase in memory allocated for this backend
> > + * 
> > + */
> 
> It seems like you could combine the
> pgstat_report_backend_allocated_bytes_decrease/increase() by either
> using a signed integer to represent the allocation/deallocation or passing in
> a "direction" that is just a positive or negative multiplier enum.
> 
> Especially if you don't use the write barriers, I think you could
> simplify the logic in the two functions.
> 
> If you do combine them, you might shorten the name to
> pgstat_report_backend_allocation() or pgstat_report_allocation().

Agreed. This seems a cleaner, simpler way to go.  I'll add it to the
TODO list.
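
A rough sketch of the combined version (the signed-delta flavour; the name
and the clamping behaviour are illustrative, not final):

    static inline void
    pgstat_report_allocation(int64 delta)
    {
        volatile PgBackendStatus *beentry = MyBEEntry;

        if (!beentry || !pgstat_track_activities)
        {
            /* accounted locally, migrated to pgstats at initialization */
            backend_allocated_bytes += delta;
            return;
        }

        /* clamp at zero rather than letting the counter wrap */
        if (delta < 0 && (uint64) (-delta) > beentry->backend_allocated_bytes)
            beentry->backend_allocated_bytes = 0;
        else
            beentry->backend_allocated_bytes += delta;
    }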

> > +   /*
> > +    * Do not allow backend_allocated_bytes to go below zero.
> > ereport if we
> > +    * would have. There's no need for a lock around the read
> > here as it's
> > +    * being referenced from the same backend which means that
> > there shouldn't
> > +    * be concurrent writes. We want to generate an ereport in
> > these cases.
> > +    */
> > +   if (deallocation > beentry->backend_allocated_bytes)
> > +   {
> > +   ereport(LOG, errmsg("decrease reduces reported
> > backend memory allocated below zero; setting reported to 0"));
> > +
> 
> I also think it would be nice to include the deallocation amount and
> backend_allocated_bytes amount in the log message.
> It also might be nice to start the message with something more clear
> than "decrease".
> For example, I would find this clear as a user:
> 
> Backend [backend_type or pid] deallocated [deallocation number] bytes,
> [backend_allocated_bytes - deallocation number] more than this backend
> has reported allocating.

Sounds good, I'll implement these changes

> > diff --git a/src/backend/utils/adt/pgstatfuncs.c
> > b/src/backend/utils/adt/pgstatfuncs.c
> > index 96bffc0f2a..b6d135ad2f 100644
> > --- a/src/backend/utils/adt/pgstatfuncs.c
> > +++ b/src/backend/utils/adt/pgstatfuncs.c
> > @@ -553,7 +553,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
> >  Datum
> >  pg_stat_get_activity(PG_FUNCTION_ARGS)
> >  {
> > -#define PG_STAT_GET_ACTIVITY_COLS  30
> > +#define PG_STAT_GET_ACTIVITY_COLS  31
> > int num_backends =
> >  
> > +   values[30] = 
> > UInt64GetDatum(beentry->backend_allocated_bytes);
> 
> Though not the fault of this patch, it is becoming very difficult to
> keep the columns straight in pg_stat_get_activity(). Perhaps you
> could add a separate commit to add an enum for the columns so the function
> is easier to understand.
> 
> > diff --git a/src/include/utils/backend_status.h
> > b/src/include/utils/backend_status.h
> > index b582b46e9f..75d87e8308 100644
> > --- a/src/include/utils/backend_status.h
> > +++ b/src/include/utils/backend_status.h
> > @@ -169,6 +169,9 @@ typedef struct PgBackendStatus
> >  
> > /* query identifier, optionally computed using
> > post_parse_analyze_hook */
> > uint64  st_query_id;
> > +
> > +   /* Current memory allocated to this backend */
> > +   uint64  backend_allocated_bytes;
> >  } PgBackendStatus;
> 
> I don't think you need the backend_ prefix here since it is in
> PgBackendStatus.

Agreed again, I'll remove the prefix.

> > @@ -313,7 +316,9 @@ extern const char
> > *pgstat_get_backend_curren

Update some more ObjectType switch statements to not have default

2022-11-09 Thread Peter Eisentraut

This arose during the review of another patch.

We often omit the default case of a switch statement to allow the 
compiler to complain if an enum case has been missed.  I found a few 
where that wasn't done yet, but it would make sense and would have found 
an omission in another patch.

From 37a03b420ce52f02119a0085f3c5fe3b44fb Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 9 Nov 2022 14:52:33 +0100
Subject: [PATCH] Update some more ObjectType switch statements to not have
 default

This allows the compiler to complain if a case has been missed.  In
these instances, the tradeoff of having to list a few unneeded cases
to silence the compiler seems better than the risk of actually missing
one.
---
 src/backend/catalog/objectaddress.c | 26 +++
 src/backend/commands/dropcmds.c | 39 +++--
 2 files changed, 53 insertions(+), 12 deletions(-)

diff --git a/src/backend/catalog/objectaddress.c 
b/src/backend/catalog/objectaddress.c
index c7de7232b890..f36581d106b2 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -960,7 +960,7 @@ ObjectAddress
 get_object_address(ObjectType objtype, Node *object,
   Relation *relp, LOCKMODE lockmode, bool 
missing_ok)
 {
-   ObjectAddress address;
+   ObjectAddress address = {InvalidOid, InvalidOid, 0};
ObjectAddress old_address = {InvalidOid, InvalidOid, 0};
Relationrelation = NULL;
uint64  inval_count;
@@ -991,6 +991,7 @@ get_object_address(ObjectType objtype, Node *object,

   &relation, lockmode,

   missing_ok);
break;
+   case OBJECT_ATTRIBUTE:
case OBJECT_COLUMN:
address =
get_object_address_attribute(objtype, 
castNode(List, object),
@@ -1163,14 +1164,12 @@ get_object_address(ObjectType objtype, Node *object,

 missing_ok);
address.objectSubId = 0;
break;
-   default:
-   elog(ERROR, "unrecognized object type: %d", 
(int) objtype);
-   /* placate compiler, in case it thinks elog 
might return */
-   address.classId = InvalidOid;
-   address.objectId = InvalidOid;
-   address.objectSubId = 0;
+   /* no default, to let compiler warn about 
missing case */
}
 
+   if (!address.classId)
+   elog(ERROR, "unrecognized object type: %d", (int) 
objtype);
+
/*
 * If we could not find the supplied object, return without 
locking.
 */
@@ -2635,9 +2634,16 @@ check_object_ownership(Oid roleid, ObjectType objtype, 
ObjectAddress address,
aclcheck_error(ACLCHECK_NOT_OWNER, objtype,
   
NameListToString(castNode(List, object)));
break;
-   default:
-   elog(ERROR, "unrecognized object type: %d",
-(int) objtype);
+   case OBJECT_AMOP:
+   case OBJECT_AMPROC:
+   case OBJECT_DEFAULT:
+   case OBJECT_DEFACL:
+   case OBJECT_PUBLICATION_NAMESPACE:
+   case OBJECT_PUBLICATION_REL:
+   case OBJECT_USER_MAPPING:
+   /* These are currently not supported or don't make 
sense here. */
+   elog(ERROR, "unsupported object type: %d", (int) 
objtype);
+   break;
}
 }
 
diff --git a/src/backend/commands/dropcmds.c b/src/backend/commands/dropcmds.c
index 26157eb4e3f5..8bac58a9941a 100644
--- a/src/backend/commands/dropcmds.c
+++ b/src/backend/commands/dropcmds.c
@@ -480,10 +480,45 @@ does_not_exist_skipping(ObjectType objtype, Node *object)
msg = gettext_noop("publication \"%s\" does not exist, 
skipping");
name = strVal(object);
break;
-   default:
-   elog(ERROR, "unrecognized object type: %d", (int) 
objtype);
+
+   case OBJECT_COLUMN:
+   case OBJECT_DATABASE:
+   case OBJECT_FOREIGN_TABLE:
+   case OBJECT_INDEX:
+   case OBJECT_MATVIEW:
+   case OBJECT_ROLE:
+   case OBJECT_SEQUENCE:
+   case OBJECT_SUBSCRIPT

Re: New docs chapter on Transaction Management and related changes

2022-11-09 Thread Robert Treat
On Mon, Nov 7, 2022 at 5:04 PM Laurenz Albe  wrote:
>
> On Sat, 2022-11-05 at 10:08 +, Simon Riggs wrote:
> > Agreed; new compilation patch attached, including mine and then
> > Robert's suggested rewordings.
>
> Thanks.  There is clearly a lot of useful information in this.
>
> Some comments:
>

> > --- a/doc/src/sgml/ref/release_savepoint.sgml
> > +++ b/doc/src/sgml/ref/release_savepoint.sgml
> > @@ -34,23 +34,16 @@ RELEASE [ SAVEPOINT ] 
> > savepoint_name
> >Description
> >
> >
> > -   RELEASE SAVEPOINT destroys a savepoint previously 
> > defined
> > -   in the current transaction.
> > +   RELEASE SAVEPOINT will subcommit the subtransaction
> > +   established by the named savepoint, if one exists. This will release
> > +   any resources held by the subtransaction. If there were any
> > +   subtransactions of the named savepoint, these will also be subcommitted.
> >
> >
> >
>
> "Subtransactions of the named savepoint" is somewhat confusing; how about
> "subtransactions of the subtransaction established by the named savepoint"?
>
> If that is too long and explicit, perhaps "subtransactions of that 
> subtransaction".
>

Personally, I think these are more confusing.

> > --- a/doc/src/sgml/ref/rollback.sgml
> > +++ b/doc/src/sgml/ref/rollback.sgml
> > @@ -56,11 +56,14 @@ ROLLBACK [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
> >  AND CHAIN
> >  
> >   
> > -  If AND CHAIN is specified, a new transaction is
> > +  If AND CHAIN is specified, a new unaborted 
> > transaction is
> >immediately started with the same transaction characteristics (see 
> >  >linkend="sql-set-transaction"/>) as the just finished one.  
> > Otherwise,
> >no new transaction is started.
>
> I don't think that is an improvement.  "Unaborted" is an un-word.  A new 
> transaction
> is always "unaborted", isn't it?
>

I thought about this as well when reviewing it, but I do think
something is needed for the case where you have a transaction which
has suffered an error and then you issue "rollback and chain"; if you
just say "a new transaction is immediately started with the same
transaction characteristics" it might imply to some the new
transaction has some kind of carry over of the previous broken
transaction... the use of the word unaborted makes it clear that the
new transaction is 100% functional.

> > --- a/doc/src/sgml/wal.sgml
> > +++ b/doc/src/sgml/wal.sgml
> > @@ -909,4 +910,36 @@
> > seem to be a problem in practice.
> >
> >   
> > +
> > + 
> > +
> > +  Two-Phase Transactions
> > +
> > +  
> > +   PostgreSQL supports a two-phase commit (2PC)
> [...]
> > +   pg_twophase directory. Currently-prepared
> > +   transactions can be inspected using  > +   
> > linkend="view-pg-prepared-xacts">pg_prepared_xacts.
> > +  
> > + 
> > +
> >  
>
> I don't like "currently-prepared".  How about:
> "Transaction that are currently prepared can be inspected..."
>

This seems to align with other usage, so +1

> This is clearly interesting information, but I don't think the WAL chapter is 
> the right
> place for this.  "pg_twophase" is already mentioned in "storage.sgml", and 
> details about
> when exactly a prepared transaction is persisted may exceed the details level 
> needed by
> the end user.
>
> I'd look for that information in the reference page for PREPARE TRANSACTION; 
> perhaps
> that would be a better place.  Or, even better, the new "xact.sgml" chapter.
>
> > --- /dev/null
> > +++ b/doc/src/sgml/xact.sgml
>
> +  Transaction Management
>
> +   The word transaction is often abbreviated as "xact".
>
> Should use  here.
>
> > +   Transactions and Identifiers
>
> > +   
> > +Once a transaction writes to the database, it is assigned a
> > +non-virtual TransactionId (or xid),
> > +e.g., 278394. Xids are assigned sequentially
> > +using a global counter used by all databases within the
> > +PostgreSQL cluster. This property is used by
> > +the transaction system to order transactions by their first database
> > +write, i.e., lower-numbered xids started writing before higher-numbered
> > +xids.  Of course, transactions might start in a different order.
> > +   
>
> "This property"?  How about:
> "Because transaction IDs are assigned sequentially, the transaction system can
> use them to order transactions by their first database write"
>

+1

> I would want some additional information here: why does the transaction 
> system have
> to order transactions by their first database write?
>
> "Of course, transactions might start in a different order."
>
> Now that confuses me.  Are you saying that BEGIN could be in a different order
> than the first database write?  Perhaps like this:
>
> "Note that the order in which transactions perform their first database write
> might be different from the order in which the transactions started."
>

+1

> > +The internal transaction ID type xid is 32-bits wide
>
> There should be no hyphen in "32 bi

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread Justin Pryzby
On Wed, Nov 09, 2022 at 06:00:40PM +0530, Bharath Rupireddy wrote:
> On Wed, Nov 9, 2022 at 5:08 AM David Christensen 
>  wrote:
> >
> > Enclosed is v6, which squashes your refactor and adds the additional
> > recent suggestions.
> 
> Thanks for working on this feature. Here are some comments for now. I
> haven't looked at the tests, I will take another look at the code and
> tests after these and all other comments are addressed.
> 
> 1. For ease of review, please split the test patch to 0002.

This is just my opinion, but .. why ?  Since it's easy to
filter/skip/display a file, I don't think it's usually useful to have
separate patches for tests or docs.

> 6.
> +if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
> Use pg_dir_create_mode instead of hard-coded 0700?

I think I thought of that when I first looked at the patch ...  but, I'm
not sure, since it says:

src/include/common/file_perm.h-/* Modes for creating directories and files IN 
THE DATA DIRECTORY */
src/include/common/file_perm.h:extern PGDLLIMPORT int pg_dir_create_mode;

I was wondering if there's any reason to do "CREATE DATABASE".  The vast
majority of TAP tests don't.

$ git grep -ho 'safe_psql[^ ]*' '*pl' |sort |uniq -c |sort -nr |head
   1435 safe_psql('postgres',
335 safe_psql(
 23 safe_psql($connect_db,

-- 
Justin




Re: HOT chain validation in verify_heapam()

2022-11-09 Thread Aleksander Alekseev
Hi Himanshu,

> Test cases are now part of this v6 patch.

I believe the patch is in pretty good shape now. I'm going to change
its status to "Ready for Committer" soon unless there are going to be
any objections.

-- 
Best regards,
Aleksander Alekseev




Re: Add tracking of backend memory allocated to pg_stat_activity

2022-11-09 Thread Reid Thompson
Hi Andres,
Thanks for looking at this and for the feedback. Responses inline below.

On Fri, 2022-11-04 at 19:41 -0700, Andres Freund wrote:
> Hi,
> 
> On 2022-11-04 08:56:13 -0400, Reid Thompson wrote:
> 
> I'm not convinced that counting DSM values this way is quite right.
> There are a few uses of DSMs that track shared resources, with the biggest
> likely being the stats for relations etc.  I suspect that tracking that via
> backend memory usage will be quite confusing, because fairly random backends 
> that
> had to grow the shared state end up being blamed with the memory usage in
> perpituity - and then suddenly that memory usage will vanish when that 
> backend exits,
> despite the memory continuing to exist.

Ok, I'll make an attempt to identify these allocations and manage them
elsewhere.

> 
> 
> > @@ -734,6 +747,7 @@ AllocSetAlloc(MemoryContext context, Size size)
> > return NULL;
> >  
> > context->mem_allocated += blksize;
> > +   pgstat_report_backend_allocated_bytes_increase(blksize);
> >  
> > block->aset = set;
> > block->freeptr = block->endptr = ((char *) block) + blksize;
> > @@ -944,6 +958,7 @@ AllocSetAlloc(MemoryContext context, Size size)
> > return NULL;
> >  
> > context->mem_allocated += blksize;
> > +   pgstat_report_backend_allocated_bytes_increase(blksize);
> >  
> > block->aset = set;
> > block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
> > @@ -1043,6 +1058,7 @@ AllocSetFree(void *pointer)
> > block->next->prev = block->prev;
> >  
> > set->header.mem_allocated -= block->endptr - ((char *) block);
> > +   pgstat_report_backend_allocated_bytes_decrease(block->endptr - ((char *) block));
> >  
> >  #ifdef CLOBBER_FREED_MEMORY
> > wipe_mem(block, block->freeptr - ((char *) block));
> 
> I suspect this will be noticeable cost-wise. Even though these paths aren't the
> hottest memory allocation paths, by nature of going down into malloc, adding
> an external function call that then does a bunch of branching etc. seems
> likely to add up to some.  See below for how I think we can deal with
> that...
> 
> 
> > +
> > +/* 
> > + * pgstat_report_backend_allocated_bytes_increase() -
> > + *
> > + * Called to report increase in memory allocated for this backend
> > + * 
> > + */
> > +void
> > +pgstat_report_backend_allocated_bytes_increase(uint64 allocation)
> > +{
> > +   volatile PgBackendStatus *beentry = MyBEEntry;
> > +
> > +   if (!beentry || !pgstat_track_activities)
> > +   {
> > +   /*
> > +    * Account for memory before pgstats is initialized. This 
> > will be
> > +    * migrated to pgstats on initialization.
> > +    */
> > +   backend_allocated_bytes += allocation;
> > +
> > +   return;
> > +   }
> > +
> > +   /*
> > +    * Update my status entry, following the protocol of bumping
> > +    * st_changecount before and after.  We use a volatile pointer here to
> > +    * ensure the compiler doesn't try to get cute.
> > +    */
> > +   PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
> > +   beentry->backend_allocated_bytes += allocation;
> > +   PGSTAT_END_WRITE_ACTIVITY(beentry);
> > +}
> 
> This is quite a few branches, including write/read barriers.
> 
> It doesn't really make sense to use the PGSTAT_BEGIN_WRITE_ACTIVITY() pattern
> here - you're just updating a single value, there's nothing to be gained by
> it. The point of PGSTAT_BEGIN_*ACTIVITY() stuff is to allow readers to get a
> consistent view of multiple values - but there aren't multiple values
> here!

I'll remove the barriers - initially I copied how prior functions were
coded as my template ala
pgstat_report_query_id, pgstat_report_xact_timestamp.

> 
> To avoid the overhead of checking (!beentry || !pgstat_track_activities) and
> the external function call, I think you'd be best off copying the trickery I
> introduced for pgstat_report_wait_start(), in 225a22b19ed.
> 
> I.e. make pgstat_report_backend_allocated_bytes_increase() a static inline
> function that unconditionally updates something like
> *my_backend_allocated_memory.  To deal with the case of (!beentry ||
> !pgstat_track_activities), that variable initially points to some backend
> local state and is set to the shared state in pgstat_bestart().
> 
> This additionally has the nice benefit that you can track memory usage from
> before pgstat_bestart(), it'll be in the local variable.

OK, I think I can mimic the code you reference.
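
For what it's worth, a minimal sketch of that shape (the names below are
invented for illustration; only backend_allocated_bytes and the reporting
function name come from the patch under discussion):

    /* backend-local fallback, used until pgstat_bestart() has run */
    static uint64 local_backend_allocated_bytes = 0;

    /* points at the local fallback, later at the shared entry */
    static uint64 *my_backend_allocated_bytes = &local_backend_allocated_bytes;

    static inline void
    pgstat_report_backend_allocated_bytes_increase(uint64 allocation)
    {
        *my_backend_allocated_bytes += allocation;  /* branch-free fast path */
    }

    /* sketch of the hand-over, to be called from pgstat_bestart() */
    static void
    pgstat_switch_allocation_accounting(PgBackendStatus *beentry)
    {
        /* migrate what was accounted before the shared entry existed */
        beentry->backend_allocated_bytes = local_backend_allocated_bytes;
        my_backend_allocated_bytes = &beentry->backend_allocated_bytes;
    }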

> 
> > +void
> > +pgstat_report_backend_allocated_bytes_decrease(uint64 deallocation)
> > +{
> > +   volatile PgBackendStatus *beentry = MyBEEntry;
> > +
> > +   /*
> > +    * Cases may occur where shared memory from a previo

Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-09 Thread Bharath Rupireddy
On Wed, Nov 9, 2022 at 5:08 AM David Christensen
 wrote:
>
> Enclosed is v6, which squashes your refactor and adds the additional
> recent suggestions.

Thanks for working on this feature. Here are some comments for now. I
haven't looked at the tests, I will take another look at the code and
tests after these and all other comments are addressed.

1. For ease of review, please split the test patch to 0002.

2. I'm unable to understand the use-case for --fixup-fpi option.
pg_waldump is supposed to be just a WAL reader and must not return any
modified information; with the --fixup-fpi option, the patch violates this
principle, i.e. it sets the page LSN and returns it. Without actually
replaying the WAL record on the page, how is it correct to just set
the LSN? How will it be useful? ISTM, we must ignore this option
unless there's a strong use-case.
3.
+    if (fork >= 0 && fork <= MAX_FORKNUM)
+    {
+        if (fork)
+            sprintf(forkname, "_%s", forkNames[fork]);
+        else
+            forkname[0] = 0;
+    }
+    else
+        pg_fatal("found invalid fork number: %u", fork);

Isn't the above complex? What's the problem with something like below?
Why do we need if (fork) - else block?

if (fork >= 0 && fork <= MAX_FORKNUM)
    sprintf(forkname, "_%s", forkNames[fork]);
else
    pg_fatal("found invalid fork number: %u", fork);

3.
+    char        page[BLCKSZ] = {0};
I think when writing to a file, we need PGAlignedBlock rather than a
simple char array of bytes; see the description around PGAlignedBlock
for why it is so (a minimal sketch follows after these comments).

4.
+if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
Why pg_pwrite(), why not just fwrite()? If fwrite() is used, you can
avoid fileno() system calls, no? Do you need the file position to
remain the same after writing, hence pg_pwrite()?

5.
+if (!RestoreBlockImage(record, block_id, page))
+continue;
+
+/* we have our extracted FPI, let's save it now */
After extracting the page from the WAL record, do we need to perform a
checksum on it?

6.
+if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
Use pg_dir_create_mode instead of hard-coded 0700?

7.
+if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
+fsync(fileno(OPF));
+fclose(OPF);
Since you're creating the directory in case it's not available, you
need to fsync the directory too?

8.
+            case 'W':
+            case 'X':
+                config.fixup_fpw = (option == 'X');
+                config.save_fpw_path = pg_strdup(optarg);
+                break;
Just set config.fixup_fpw = false before the switch block starts,
like the other variables, and then perhaps something like the below is more
readable:
case 'W':
    config.save_fpw_path = pg_strdup(optarg);
    break;
case 'X':
    config.fixup_fpw = true;
    config.save_fpw_path = pg_strdup(optarg);
    break;

9.
+if (dir_status == 0 && mkdir(config.save_fpw_path, 0700) < 0)
Should we use pg_mkdir_p() instead of mkdir()?
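
As mentioned in comment 3 (the PGAlignedBlock one), a minimal sketch of what
that could look like; the surrounding names (record, block_id) are taken from
the quoted patch context, the rest is illustrative:

    PGAlignedBlock buf;     /* union whose "data" member is suitably aligned */

    memset(buf.data, 0, BLCKSZ);
    if (RestoreBlockImage(record, block_id, buf.data))
    {
        /* write out buf.data (BLCKSZ bytes), e.g. with fwrite() or pg_pwrite() */
    }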

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: libpq error message refactoring

2022-11-09 Thread Alvaro Herrera
Hello

I gave this series a quick look.  Overall it seems a good idea, since
the issue of newlines-or-not is quite bothersome for the libpq
translations.

> +/*
> + * Append a formatted string to the given buffer, after translation.  A
> + * newline is automatically appended; the format should not end with a
> + * newline.
> + */

I find the "after translation" bit unclear -- does it mean that the
caller should have already translated, or is it the other way around?  I
would say "Append a formatted string to the given buffer, after
translating it", which (to me) conveys more clearly that translation
occurs here.


> + /* Loop in case we have to retry after enlarging the buffer. */
> + do
> + {
> + errno = save_errno;
> + va_start(args, fmt);
> + done = appendPQExpBufferVA(errorMessage, libpq_gettext(fmt), args);

I wonder if it makes sense to do libpq_gettext() just once, instead of
redoing it on each iteration.
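
Something like this, perhaps (a sketch following the variable names in the
quoted hunk):

    /* translate the format string once, then reuse it across retries */
    const char *fmt_translated = libpq_gettext(fmt);

    do
    {
        errno = save_errno;
        va_start(args, fmt);
        done = appendPQExpBufferVA(errorMessage, fmt_translated, args);
        va_end(args);
    } while (!done);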

> +void
> +libpq_append_conn_error(PGconn *conn, const char *fmt, ...)

These two routines are essentially identical.  While we could argue
about sharing an underlying implementation, I think it's okay the way
you have it, because the overheard of sharing it would make that
pointless, given how short they are.


> +extern void libpq_append_error(PQExpBuffer errorMessage, const char *fmt, ...) pg_attribute_printf(2, 3);
> +extern void libpq_append_conn_error(PGconn *conn, const char *fmt, ...) pg_attribute_printf(2, 3);

pg_attribute_printf marker present -- check.

> -GETTEXT_TRIGGERS = libpq_gettext pqInternalNotice:2
> -GETTEXT_FLAGS= libpq_gettext:1:pass-c-format pqInternalNotice:2:c-format
> +GETTEXT_TRIGGERS = libpq_append_conn_error:2 \
> +   libpq_append_error:2 \
> +   libpq_gettext pqInternalNotice:2
> +GETTEXT_FLAGS= libpq_append_conn_error:2:c-format \
> +   libpq_append_error:2:c-format \
> +   libpq_gettext:1:pass-c-format pqInternalNotice:2:c-format

Looks good.

> --- a/src/interfaces/libpq/pqexpbuffer.h
> +++ b/src/interfaces/libpq/pqexpbuffer.h

> +/*
> + * appendPQExpBufferVA
> + * Shared guts of printfPQExpBuffer/appendPQExpBuffer.
> + * Attempt to format data and append it to str.  Returns true if done
> + * (either successful or hard failure), false if need to retry.
> + *
> + * Caution: callers must be sure to preserve their entry-time errno
> + * when looping, in case the fmt contains "%m".
> + */
> +extern bool appendPQExpBufferVA(PQExpBuffer str, const char *fmt, va_list args) pg_attribute_printf(2, 0);

As an API user, I don't care that this is shared guts for something
else, I just care about what it does.  I think deleting that line is a
sufficient fix.

> - appendPQExpBufferStr(&conn->errorMessage,
> -  libpq_gettext("malformed SCRAM message (empty message)\n"));
> + libpq_append_conn_error(conn, "malformed SCRAM message (empty message)");

Overall, this type of change looks positive.  I didn't review all these
changes too closely other than the first couple of dozens, as there are
way too many; I suppose you did these with some Emacs macros or something?

> @@ -420,7 +418,8 @@ pqsecure_raw_write(PGconn *conn, const void *ptr, size_t len)
>   snprintf(msgbuf, sizeof(msgbuf),
>    libpq_gettext("server closed the connection unexpectedly\n"
>  "\tThis probably means the server terminated abnormally\n"
> -  "\tbefore or while processing the request.\n"));
> +  "\tbefore or while processing the request."));
> + strlcat(msgbuf, "\n", sizeof(msgbuf));
>   conn->write_err_msg = strdup(msgbuf);
>   /* Now claim the write succeeded */
>   n = len;

In these two places, we're writing the error message manually to a
separate variable, so the extra \n is necessary.  It looks a bit odd to
do it with strlcat() after the fact, but AFAICT it's necessary as it
keeps the \n out of the translation catalog, which is good.  This is
nonobvious, so perhaps add a comment about it.
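
For instance (the wording of the comment is just a suggestion):

    /* Append the newline separately so it stays out of the translation catalog. */
    strlcat(msgbuf, "\n", sizeof(msgbuf));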

Thanks

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Before you were born your parents weren't as boring as they are now. They
got that way paying your bills, cleaning up your room and listening to you
tell them how idealistic you are."  -- Charles J. Sykes' advice to teenagers




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-09 Thread Amit Kapila
On Mon, Nov 7, 2022 at 1:46 PM Peter Smith  wrote:
>
> Here are my review comments for v42-0001
...
...
>
> 8.
>
> + /*
> + * Resend the pending message to parallel apply worker to cleanup the
> + * queue. Note that parallel apply worker will just ignore this message
> + * as it has already handled this message while applying spooled
> + * messages.
> + */
> + result = shm_mq_send(winfo->mq_handle, strlen(winfo->pending_msg),
> + winfo->pending_msg, false, true);
>
> If I understand this logic it seems a bit hacky. From the comment, it
> seems you are resending a message that you know/expect to be ignored
> simply to make it disappear. (??). Isn't there some other way to clear
> the pending message without requiring a bogus send?
>

IIUC, this handling is required for the case when we are not able to
send a message to parallel apply worker and switch to serialize mode
(write remaining data to file). Basically, it is possible that the
message is only partially sent and there is no way to clean the queue. I
feel we can directly free the worker in this case even if there is a
space in the worker pool. The other idea could be that we detach from
shm_mq and then invent a way to re-attach it after we try to reuse the
same worker.

-- 
With Regards,
Amit Kapila.




Remove redundant declaration for XidInMVCCSnapshot

2022-11-09 Thread Japin Li

Commit b7eda3e0e3 moved XidInMVCCSnapshot into snapmgr.{c,h};
however, it forgot to remove the declaration of XidInMVCCSnapshot
from heapam.h.

Attached removes the redundant declaration in heapam.h.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

>From 16f4bd33d7886c872319393dfebb5dd570b750ab Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Wed, 9 Nov 2022 18:40:11 +0800
Subject: [PATCH v1 1/1] Remove redundant declaration for XidInMVCCSnapshot

---
 src/include/access/heapam.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9dab35551e..5b07875740 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -213,7 +213,6 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
  uint16 infomask, TransactionId xid);
 extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
-extern bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
   struct GlobalVisState *vistest);
 
-- 
2.25.1



Re: Reviving lost replication slots

2022-11-09 Thread Amit Kapila
On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
 wrote:
>
> Is the intent of setting restart_lsn to InvalidXLogRecPtr to
> disallow reviving the slot?
>

I think the intent is to compute the correct value for
replicationSlotMinLSN: we use restart_lsn for it, and using the
invalidated slot's restart_lsn value for that doesn't make sense.
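
For context, a rough sketch of that computation (the real code lives in slot.c
and holds ReplicationSlotControlLock while scanning; details here are
illustrative): an invalidated slot, whose restart_lsn has been cleared to
InvalidXLogRecPtr, simply does not contribute to the minimum LSN that protects
WAL from removal.

    XLogRecPtr  min_required = InvalidXLogRecPtr;

    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogRecPtr  restart_lsn;

        if (!s->in_use)
            continue;

        SpinLockAcquire(&s->mutex);
        restart_lsn = s->data.restart_lsn;
        SpinLockRelease(&s->mutex);

        if (restart_lsn != InvalidXLogRecPtr &&
            (min_required == InvalidXLogRecPtr || restart_lsn < min_required))
            min_required = restart_lsn;
    }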

-- 
With Regards,
Amit Kapila.




Re: Small TAP improvements

2022-11-09 Thread Andrew Dunstan


On 2022-11-06 Su 08:51, Álvaro Herrera wrote:
> On 2022-Jun-14, Andrew Dunstan wrote:
>
>> OK, here's a more principled couple of patches. For config_data, if you
>> give multiple options it gives you back the list of values. If you don't
>> specify any, in scalar context it just gives you back all of pg_config's
>> output, but in array context it gives you a map, so you should be able
>> to say things like:
>>
>> my %node_config = $node->config_data;
> Hi, it looks to me like these were forgotten?
>

Yeah, will get to it this week.


Thanks for the reminder.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Reviving lost replication slots

2022-11-09 Thread Bharath Rupireddy
On Wed, Nov 9, 2022 at 3:53 PM Amit Kapila  wrote:
>
> On Wed, Nov 9, 2022 at 3:00 PM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi
> >  wrote:
> > >
> > > I don't think walsenders fetching segment from archive is totally
> > > stupid. With that feature, we can use fast and expensive but small
> > > storage for pg_wal, while avoiding replication from dying even in
> > > emergency.
> >
> > It seems like a useful feature to have at least as an option and it
> > saves a lot of work - failovers, expensive rebuilds of
> > standbys/subscribers, manual interventions etc.
> >
> > If you're saying that even the walsenders serving logical replication
> > subscribers would go fetch from the archive location for the removed
> > WAL files, it mandates enabling archiving on the subscribers.
> >
>
> Why archiving on subscribers is required? Won't it be sufficient if
> that is enabled on the publisher where we have walsender?

Ugh. A typo. I meant it mandates enabling archiving on publishers.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: heavily contended lwlocks with long wait queues scale badly

2022-11-09 Thread Bharath Rupireddy
On Tue, Nov 1, 2022 at 12:46 PM Bharath Rupireddy
 wrote:
>
> > Updated patch attached.
>
> Thanks. It looks good to me. However, some minor comments on the v3 patch:
>
> 1.
> -if (MyProc->lwWaiting)
> +if (MyProc->lwWaiting != LW_WS_NOT_WAITING)
>  elog(PANIC, "queueing for lock while waiting on another one");
>
> Can the above condition be MyProc->lwWaiting == LW_WS_WAITING ||
> MyProc->lwWaiting == LW_WS_PENDING_WAKEUP for better readability?
>
> Or add an assertion Assert(MyProc->lwWaiting != LW_WS_WAITING &&
> MyProc->lwWaiting != LW_WS_PENDING_WAKEUP); before setting
> LW_WS_WAITING?
>
> 2.
> /* Awaken any waiters I removed from the queue. */
> proclist_foreach_modify(iter, &wakeup, lwWaitLink)
> {
>
> @@ -1044,7 +1052,7 @@ LWLockWakeup(LWLock *lock)
>   * another lock.
>   */
>  pg_write_barrier();
> -waiter->lwWaiting = false;
> +waiter->lwWaiting = LW_WS_NOT_WAITING;
>  PGSemaphoreUnlock(waiter->sem);
>  }
>
> /*
>  * Awaken any waiters I removed from the queue.
>  */
> proclist_foreach_modify(iter, &wakeup, lwWaitLink)
> {
> PGPROC   *waiter = GetPGProcByNumber(iter.cur);
>
> proclist_delete(&wakeup, iter.cur, lwWaitLink);
> /* check comment in LWLockWakeup() about this barrier */
> pg_write_barrier();
> waiter->lwWaiting = LW_WS_NOT_WAITING;
>
> Can we add an assertion Assert(waiter->lwWaiting ==
> LW_WS_PENDING_WAKEUP) in the above two places? We prepare the wakeup
> list and set the LW_WS_NOT_WAITING flag in above loops, having an
> assertion is better here IMO.
>
> Below are test results with v3 patch. +1 for back-patching it.
>
> Clients  HEAD    PATCHED
> 1        34142   34289
> 2        72760   69720
> 4        136300  131848
> 8        210809  210192
> 16       240718  242744
> 32       297587  297354
> 64       341939  343036
> 128      383615  383801
> 256      342094  337680
> 512      263194  288629
> 768      145526  261553
> 1024     107267  241811
> 2048     35716   188389
> 4096     12415   120300
>
> Clients  PG15    PATCHED
> 1        34503   34078
> 2        73708   72054
> 4        139415  133321
> 8        212396  211390
> 16       242227  242584
> 32       303441  309288
> 64       362680  339211
> 128      378645  344291
> 256      340016  344291
> 512      290044  293337
> 768      140277  264618
> 1024     96191   247636
> 2048     35158   181488
> 4096     12164   118610

I looked at the v3 patch again today and ran some performance tests.
The results look impressive as they were earlier. Andres, any plans to
get this in?

pgbench with SELECT txid_current();:
Clients  HEAD    PATCHED
1        34613   33611
2        72634   70546
4        137885  136911
8        216470  216076
16       242535  245392
32       299952  304740
64       329788  347401
128      378296  386873
256      344939  343832
512      292196  295839
768      144212  260102
1024     101525  250263
2048     35594   185878
4096     11842   104227

pgbench with insert into pgbench_accounts table:
Clients  HEAD    PATCHED
1        1660    1600
2        1848    1746
4        3547    3395
8        7330    6754
16       13103   13613
32       26011   26372
64       52331   52594
128      93313   95526
256      127373  126182
512      126712  127857
768      116765  119227
1024     111464  112499
2048     58838   92756
4096     26066   60543
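
For reference, the assertions suggested in the quoted comments would sit
roughly like this (an illustrative sketch against the v3 naming, not part of
the patch):

    /* in LWLockQueueSelf(), before marking ourselves as waiting */
    Assert(MyProc->lwWaiting != LW_WS_WAITING &&
           MyProc->lwWaiting != LW_WS_PENDING_WAKEUP);
    MyProc->lwWaiting = LW_WS_WAITING;

    /* when waking a waiter that was previously marked for wakeup */
    Assert(waiter->lwWaiting == LW_WS_PENDING_WAKEUP);
    pg_write_barrier();
    waiter->lwWaiting = LW_WS_NOT_WAITING;
    PGSemaphoreUnlock(waiter->sem);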

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Reviving lost replication slots

2022-11-09 Thread Amit Kapila
On Wed, Nov 9, 2022 at 3:00 PM Bharath Rupireddy
 wrote:
>
> On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi
>  wrote:
> >
> > I don't think walsenders fetching segment from archive is totally
> > stupid. With that feature, we can use fast and expensive but small
> > storage for pg_wal, while avoiding replication from dying even in
> > emergency.
>
> It seems like a useful feature to have at least as an option and it
> saves a lot of work - failovers, expensive rebuilds of
> standbys/subscribers, manual interventions etc.
>
> > If you're saying that even the walsenders serving logical replication
> subscribers would go fetch from the archive location for the removed
> WAL files, it mandates enabling archiving on the subscribers.
>

Why archiving on subscribers is required? Won't it be sufficient if
that is enabled on the publisher where we have walsender?

-- 
With Regards,
Amit Kapila.




Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-09 Thread Yugo NAGATA
On Sun, 06 Nov 2022 12:54:17 -0500
Tom Lane  wrote:

> Yugo NAGATA  writes:
> >> The attached patch tries to add comments explaining it on the functions.
> 
> > I forward it to the hackers list because the patch is to fix comments.
> 
> What do you think of the attached wording?

It looks good to me. That describes the expected behaviour exactly.

> I don't think the pipeline angle is of concern to anyone who might be
> reading these comments with the aim of understanding what guarantees
> they have.  Perhaps there should be more about that in the user-facing
> docs, though.

I agree with that we don't need to mention pipelining in these comments,
and that we need more in the documentation. I attached a doc patch to add
a mention of commands that do internal commit to the pipelining section.
Also, this adds a reference for the pipelining protocol to the libpq doc.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 3c9bd3d673..727a024e60 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -5058,7 +5058,8 @@ int PQflush(PGconn *conn);
While the pipeline API was introduced in
PostgreSQL 14, it is a client-side feature
which doesn't require special server support and works on any server
-   that supports the v3 extended query protocol.
+   that supports the v3 extended query protocol; see .
+
   
 
   
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 5fdd429e05..2edd42d7e9 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1095,6 +1095,9 @@ SELCT 1/0;
 that cannot be executed inside a transaction block.  If one of
 these is executed in a pipeline, it will, upon success, force an
 immediate commit to preserve database consistency.
+In addition, executing certain commands (such as
+VACUUM ANALYZE) that start/commit transactions
+internally can also cause an immediate commit even if the command fails.
 A Sync immediately following one of these has no effect except to
 respond with ReadyForQuery.



Re: Reviving lost replication slots

2022-11-09 Thread Bharath Rupireddy
On Wed, Nov 9, 2022 at 2:02 PM Kyotaro Horiguchi
 wrote:
>
> I don't think walsenders fetching segment from archive is totally
> stupid. With that feature, we can use fast and expensive but small
> storage for pg_wal, while avoiding replication from dying even in
> emergency.

It seems like a useful feature to have at least as an option and it
saves a lot of work - failovers, expensive rebuilds of
standbys/subscribers, manual interventions etc.

If you're saying that even the walsenders serving logical replication
subscribers would go fetch from the archive location for the removed
WAL files, it mandates enabling archiving on the subscribers. And we
know that the archiving is not cheap and has its own advantages and
disadvantages, so the feature may or may not help.
If you're saying that only the walsenders serving streaming
replication standbys would go fetch from the archive location for the
removed WAL files, it's easy to implement, however it is not a
complete feature and doesn't solve the problem for logical
replication.
With the feature, it'll be something like 'you, as primary/publisher,
archive the WAL files and when you don't have them, you'll restore
them'; it may not sound elegant, but it can solve the lost
replication slots problem.
And the cost of restoring WAL files from the archive might further
slow down the replication, thus increasing the replication lag.
And one needs to think about how many such WAL files are restored and
kept, whether they'll be kept in pg_wal or some other directory, and
how disk-full conditions, fetching too old or too many WAL files for
replication slots lagging behind, removal of unnecessary WAL files,
etc. will be handled.

I'm not sure about other implications at this point of time.

Perhaps, implementing this feature as a core/external extension by
introducing segment_open() or other necessary hooks might be worth it.

If implemented in some way, I think the scope of replication slot
invalidation/max_slot_wal_keep_size feature gets reduced or it can be
removed completely, no?

> However, supposing that WalSndSegmentOpen() fetches segments from
> archive as the fallback and that succeeds, the slot can survive
> missing WAL in pg_wal in the first place. So this patch doesn't seem
> to be needed for the purpose.

That is a simple solution one can think of and provide for streaming
replication standbys, however, is it worth implementing it in the core
as explained above?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: thinko in basic_archive.c

2022-11-09 Thread Bharath Rupireddy
On Tue, Nov 8, 2022 at 3:18 AM Nathan Bossart  wrote:
>
> The call to snprintf() should take care of adding a terminating null byte
> in the right place.

Ah, my bad. MemSet is avoided in the v5 patch by setting only the first byte.

> > So, IIUC, your point here is what if the copy_file fails to create the
> > temp file when it already exists. With the name collision being a rare
> > scenario, given the pid and timestamp variables, I'm not sure if
> > copy_file can ever fail because the temp file already exists (with
> > errno EEXIST). However, if we want to be extra-cautious, checking if
> > temp file exists with file_exists() before calling copy_file() might
> > help avoid such cases. If we don't want to have extra system call (via
> > file_exists()) to check the temp file existence, we can think of
> > sending a flag to copy_file(src, dst, &is_dst_file_created) and use
> > is_dst_file_created in the shutdown callback. Thoughts?
>
> Presently, if copy_file() encounters a pre-existing file, it should ERROR,
> which will be caught in the sigsetjmp() block in basic_archive_file().  The
> shutdown callback shouldn't run in this scenario.

Determining the "file already exists" error/EEXIST case from a bunch
of other errors in copy_file() is tricky. However, I quickly hacked up
copy_file() by adding an elevel parameter; please see the attached
v5-0001.

> I think this cleanup logic should run in both the shutdown callback and the
> sigsetjmp() block, but it should only take action (i.e., deleting the
> leftover temporary file) if the ERROR or shutdown occurs after creating the
> file in copy_file() and before renaming the temporary file to its final
> destination.
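
A minimal sketch of that idea (names invented here, not taken from the
attached patch): remember the temporary file once it has been created, have
both the shutdown callback and the sigsetjmp() error path call the same
cleanup routine, and clear the flag as soon as durable_rename() has succeeded.

    static char temp_file_path[MAXPGPATH];
    static bool temp_file_pending = false;

    /* shared by the shutdown callback and the sigsetjmp() error path */
    static void
    basic_archive_cleanup_temp_file(void)
    {
        if (temp_file_pending)
        {
            (void) unlink(temp_file_path);
            temp_file_pending = false;
        }
    }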

Please see attached v5 patch set.

If the direction seems okay, I'll add a CF entry.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From cd805d6d26e66dcfc51ca6632a0147827a5b9544 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 9 Nov 2022 08:29:19 +
Subject: [PATCH v5] Refactor copy_file() to remove leftover temp files in
 basic_archive

---
 contrib/basic_archive/basic_archive.c | 23 ++-
 src/backend/storage/file/copydir.c| 56 ---
 src/backend/storage/file/reinit.c |  2 +-
 src/include/storage/copydir.h |  2 +-
 4 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 9f221816bb..1b9f48eeed 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -275,14 +275,33 @@ basic_archive_file_internal(const char *file, const char *path)
 	 * Copy the file to its temporary destination.  Note that this will fail
 	 * if temp already exists.
 	 */
-	copy_file(unconstify(char *, path), temp);
+	if (copy_file(unconstify(char *, path), temp, LOG) != 0)
+	{
+		/* Remove the leftover temporary file. */
+		if (errno != EEXIST)
+			unlink(temp);
+
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not copy file \"%s\" to temporary file \"%s\": %m",
+		path, temp)));
+	}
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
 	 * Note that this will overwrite any existing file, but this is only
 	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename(temp, destination, ERROR);
+	if (durable_rename(temp, destination, LOG) != 0)
+	{
+		/* Remove the leftover temporary file. */
+		unlink(temp);
+
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not rename temporary file \"%s\" to \"%s\": %m",
+		temp, destination)));
+	}
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 0cea4cbc89..ee20c83949 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -71,7 +71,7 @@ copydir(char *fromdir, char *todir, bool recurse)
 copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
-			copy_file(fromfile, tofile);
+			copy_file(fromfile, tofile, ERROR);
 	}
 	FreeDir(xldir);
 
@@ -113,8 +113,8 @@ copydir(char *fromdir, char *todir, bool recurse)
 /*
  * copy one file
  */
-void
-copy_file(char *fromfile, char *tofile)
+int
+copy_file(char *fromfile, char *tofile, int elevel)
 {
 	char	   *buffer;
 	int			srcfd;
@@ -138,28 +138,35 @@ copy_file(char *fromfile, char *tofile)
 #define FLUSH_DISTANCE (1024 * 1024)
 #endif
 
-	/* Use palloc to ensure we get a maxaligned buffer */
-	buffer = palloc(COPY_BUF_SIZE);
-
 	/*
 	 * Open the files
 	 */
 	srcfd = OpenTransientFile(fromfile, O_RDONLY | PG_BINARY);
 	if (srcfd < 0)
-		ereport(ERROR,
+	{
+		ereport(elevel,
 (errcode_for_file_access(),
  errmsg("could not open file \"%s\": %m", fromfile)));
+		return -1;
+	}
 
 	dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG

Re: Reviving lost replication slots

2022-11-09 Thread Kyotaro Horiguchi
I don't think walsenders fetching segment from archive is totally
stupid. With that feature, we can use fast and expensive but small
storage for pg_wal, while avoiding replication from dying even in
emergency.

At Tue, 8 Nov 2022 19:39:58 -0800, sirisha chamarthi 
 wrote in 
> > If it's a streaming replication slot, the standby will anyway jump to
> > archive mode ignoring the replication slot and the slot will never be
> > usable again unless somebody creates a new replication slot and
> > provides it to the standby for reuse.
> > If it's a logical replication slot, the subscriber will start to
> > diverge from the publisher and the slot will have to be revived
> > manually i.e. created again.
> >
> 
> Physical slots can be revived with standby downloading the WAL from the
> archive directly. This patch is helpful for the logical slots.

However, supposing that WalSndSegmentOpen() fetches segments from
archive as the fallback and that succeeds, the slot can survive
missing WAL in pg_wal in the first place. So this patch doesn't seem
to be needed for the purpose.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: security_context_t marked as deprecated in libselinux 3.1

2022-11-09 Thread Alvaro Herrera
On 2022-Nov-09, Michael Paquier wrote:

> I got to look at that, now that the minor releases have been tagged,
> and the change has no conflicts down to 9.3.  9.2 needed a slight
> tweak, though, but it seemed fine as well.  (Tested the build on all
> branches.)  So done all the way down, sticking with the new no-warning
> policy for all the buildable branches.

Thank you :-)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: New docs chapter on Transaction Management and related changes

2022-11-09 Thread Laurenz Albe
On Mon, 2022-11-07 at 22:43 -0500, Bruce Momjian wrote:
> > I thought some more about the patch, and I don't like the title
> > "Transaction Management" for the new chapter.  I'd expect some more
> > from a chapter titled "Internals" / "Transaction Management".
> > 
> > In reality, the new chapter is about transaction IDs.  So perhaps the
> > name should reflect that, so that it does not mislead the reader.
> 
> I renamed it to "Transaction Processing" since we also cover locking and
> subtransactions.  How is that?

It is better.  Did you take my suggestions from [1] into account in your
latest cumulative patch in [2]?  Otherwise, it will be difficult to
integrate both.

Yours,
Laurenz Albe

 [1]: 
https://postgr.es/m/3603e6e85544daa5300c7106c31bc52673711cd0.camel%40cybertec.at
 [2]: https://postgr.es/m/Y2nP04/3BHQOviVB%40momjian.us