Re: Insertion Sort Improvements

2022-09-28 Thread Benjamin Coutu
> "Looks like"?

I cannot find the reference, but I read a while back that a well-known 
company from Redwood uses it for their in-memory columnar storage. That might 
have just been a rumor or might have been research only - not sure. It does not 
really matter anyway.

> SortTuples are currently 24 bytes, and supported vector registers are 16 
> bytes, so not sure how you think that would work.

The thought was to logically group multiple sort tuples together and then 
create a vectorized version of that group containing just the primitive-type 
sort key plus a small index/offset into the group, used later to swap the 
corresponding full sort tuple referenced by that index/offset. The sorting 
network would allow us to do branchless, register-based sorting for a 
particular sort run. I guess this idea is moot, given ...

> More broadly, the more invasive a change is, the more risk is involved, and 
> the more effort to test and evaluate. If you're serious about trying to 
> improve insertion sort performance, the simple idea we discussed at the start 
> of the thread is a much more modest step that has a good chance of justifying 
> the time put into it. That is not to say it's easy, however, because testing 
> is a non-trivial amount of work.

I absolutely agree. Let's concentrate on improving things incrementally.
Please excuse my wondering whether, given that you have contributed some of 
the recent vectorization work and have obviously experimented a lot with the 
sort code, you might already have tried something along those lines or 
researched the subject - it is definitely a very interesting topic.

Cheers, Ben




Re: [PATCH] Add peer authentication TAP test

2022-09-28 Thread Drouvot, Bertrand

Hi,

On 9/28/22 7:52 AM, Michael Paquier wrote:

On Fri, Aug 26, 2022 at 10:43:43AM +0200, Drouvot, Bertrand wrote:

During the work in [1] we created a new TAP test to test the SYSTEM_USER
behavior with peer authentication.

It turns out that there is currently no TAP test for peer
authentication, so we think (thanks to Michael for the suggestion [2]) that
it's better to split the work in [1] between the "pure" SYSTEM_USER-related
work and the "pure" peer authentication TAP test work.

That's the reason for this new thread; please find attached a patch to add a
new TAP test for peer authentication.


+# Get the session_user to define the user name map test.
+my $session_user =
+  $node->safe_psql('postgres', 'select session_user');
[...]
+# Define a user name map.
+$node->append_conf('pg_ident.conf', qq{mypeermap $session_user testmap$session_user});
+
+# Set pg_hba.conf with the peer authentication and the user name map.
+reset_pg_hba($node, 'peer map=mypeermap');

A map consists of a "MAPNAME SYSTEM_USER PG_USER".  Why does this test
use a Postgres role (from session_user) as the system user for the
peer map?


Thanks for looking at it!

Initially, selecting the session_user with a "local" connection and no 
user provided during the connection is the way I came up with to retrieve 
the "SYSTEM_USER" to be used later on in the map.


Maybe the variable name should be system_user instead, or should we use 
another way to get the "SYSTEM_USER" to be used in the map?


Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Add peer authentication TAP test

2022-09-28 Thread Michael Paquier
On Wed, Sep 28, 2022 at 09:12:57AM +0200, Drouvot, Bertrand wrote:
> Maybe the variable name should be system_user instead, or should we use
> another way to get the "SYSTEM_USER" to be used in the map?

Hmm, indeed.  It would be more reliable to rely on the contents
returned by getpeereid()/getpwuid() after one successful peer
connection, then use it in the map.  I was wondering whether using
stuff like getpwuid() in the perl script itself would be better, but
it sounds like less of a headache in terms of portability to just rely on
authn_id via SYSTEM_USER to generate the contents of the correct map.
--
Michael


signature.asc
Description: PGP signature


Re: [small patch] Change datatype of ParallelMessagePending from "volatile bool" to "volatile sig_atomic_t"

2022-09-28 Thread Michael Paquier
On Wed, Sep 28, 2022 at 04:47:09AM +, kuroda.hay...@fujitsu.com wrote:
> PSA fix patch. Note that PromptInterruptContext.enabled was also fixed
> because it is substituted from sigint_interrupt_enabled.

Good point.  Thanks for the patch, this looks consistent!
--
Michael


signature.asc
Description: PGP signature


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-28 Thread John Naylor
On Wed, Sep 28, 2022 at 1:18 PM John Naylor 
wrote:
> [stuff about size classes]

I kind of buried the lede here on one thing: If we only have 4 kinds
regardless of the number of size classes, we can use 2 bits of the pointer
for dispatch, which would only require 4-byte alignment. That should make
that technique more portable.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: pg_rewind WAL segments deletion pitfall

2022-09-28 Thread Polina Bungina
On Tue, Sep 27, 2022 at 9:50 AM Kyotaro Horiguchi 
wrote:

> Regarding the patch, pg_rewind starts reading segments from the
> divergence point back to the nearest checkpoint, then moves forward
> during rewinding. So, the fact that SimpleXLogPageRead has read a
> segment suggests that the segment is required during the next startup.
> So I don't think we need to move around the keepWalSeg flag.  All
> files that are wanted while rewinding should be preserved
> unconditionally.
>

I am probably not getting this right, but as far as I can see SimpleXLogPageRead
is called from at most 3 places during a pg_rewind run:
1. From readOneRecord, to determine the end-of-WAL on the target by reading
the last shutdown checkpoint record/minRecoveryPoint on it
2. From findLastCheckpoint, to find the last common checkpoint (here it
indeed reads all the segments that are required during the startup, hence
the keepWalSeg flag set to true)
3. From extractPageMap, to extract all the pages modified after the fork
(here we read not only the segments that should be kept but also ones
beyond, up to the target's end record; it doesn't seem we should
unconditionally preserve them all).
Am I missing something?



> +   /*
> +* Some entries (WAL segments) already have an action
> assigned
> +* (see SimpleXLogPageRead()).
> +*/
> +   if (entry->action == FILE_ACTION_NONE)
> +   continue;
> entry->action = decide_file_action(entry);

> It might be more reasonable to call decide_file_action() when action
> is UNDECIDED.

Agree, will change this part.

Regards,
Polina Bungina


Re: A doubt about a newly added errdetail

2022-09-28 Thread Amit Kapila
On Wed, Sep 28, 2022 at 11:30 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 27 Sep 2022 12:19:35 +0200, Alvaro Herrera  
> wrote in
> > Yeah, since you're changing another word in that line, it's ok to move
> > the parameter line off-string.  (If you were only changing the parameter
> > to %s and there was no message duplication, I would reject the patch as
> > useless.)
>
> I'm fine with that. By the way, related to the area, I found the
> following error messages.
>
> >errmsg("publication \"%s\" is defined as FOR ALL TABLES",
> >   NameStr(pubform->pubname)),
> >errdetail("Schemas cannot be added to or dropped from FOR ALL TABLES 
> > publications.")));
>
> It looks to me that the errmsg and errdetail are reversed. Isn't the 
> following order common?
>
> >errmsg("schemas cannot be added to or dropped from publication 
> > \"%s\".",
> >   NameStr(pubform->pubname)),
> >errdetail("The publication is defined as FOR ALL TABLES.")));
>

This one seems to be matching with the below existing message:
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("publication \"%s\" is defined as FOR ALL TABLES",
NameStr(pubform->pubname)),
errdetail("Tables cannot be added to or dropped from FOR ALL TABLES
publications.")));


-- 
With Regards,
Amit Kapila.




Re: In-place persistence change of a relation

2022-09-28 Thread Kyotaro Horiguchi
Just rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 8c0d5bd7b519149059d1b2b86a93ffe509e09b9b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 19 Jul 2022 13:23:01 +0900
Subject: [PATCH v24 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large
amount of file I/O.  This patch makes the command run without the heap
rewrite.  In addition, SET LOGGED while wal_level > minimal emits WAL
using XLOG_FPI instead of a massive number of HEAP_INSERT records,
which should be smaller.

This also allows for the cleanup of files left behind by a crash of
the transaction that created them.
---
 src/backend/access/rmgrdesc/smgrdesc.c|  49 ++
 src/backend/access/transam/README |  10 +
 src/backend/access/transam/xact.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  18 +
 src/backend/backup/basebackup.c   |   9 +-
 src/backend/catalog/storage.c | 561 +-
 src/backend/commands/tablecmds.c  | 267 --
 src/backend/storage/buffer/bufmgr.c   |  85 
 src/backend/storage/file/fd.c |   4 +-
 src/backend/storage/file/reinit.c | 313 
 src/backend/storage/smgr/md.c |  95 +++-
 src/backend/storage/smgr/smgr.c   |  32 ++
 src/backend/storage/sync/sync.c   |  21 +-
 src/bin/pg_rewind/parsexlog.c |  22 +
 src/bin/pg_rewind/pg_rewind.c |   1 -
 src/common/relpath.c  |  47 +-
 src/include/catalog/storage.h |   3 +
 src/include/catalog/storage_xlog.h|  43 +-
 src/include/common/relpath.h  |   9 +-
 src/include/storage/bufmgr.h  |   2 +
 src/include/storage/fd.h  |   1 +
 src/include/storage/md.h  |   8 +-
 src/include/storage/reinit.h  |   8 +-
 src/include/storage/smgr.h|  17 +
 src/tools/pgindent/typedefs.list  |   7 +
 25 files changed, 1471 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rlocator.dbOid,
+		   xlrec->rlocator.spcOid,
+		   xlrec->rlocator.relNumber,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action = "";
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 91c2578f7a..37527e16ca 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,16 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+
+The Smgr MARK files
+
+
+An smgr mark file is an empty file that is created when a new relation
+storage file is created to signal that the storage file needs to be
+cleaned up at recovery time.  In contrast to the four actions above,
+failure to remove smgr mark files will lead to data loss, in which
+case the server will shut down.
+
 
 Skipping WAL for New RelFileLocator
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2bb975943c..03e4bcec34 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2213,6 +

Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Dilip Kumar
On Wed, Sep 28, 2022 at 9:40 AM Dilip Kumar  wrote:
>
> On Wed, Sep 28, 2022 at 2:59 AM Tom Lane  wrote:
> >
> > ... also, lapwing's not too happy [1].  The alter_table test
> > expects this to yield zero rows, but it doesn't:
>
> Looking at the regression diff shown below, it seems that we are
> able to get the relfilenode from the Oid using
> pg_relation_filenode(oid), but the reverse mapping
> pg_filenode_relation(reltablespace, relfilenode) returned NULL.
>

It was a silly mistake: I used the F_OIDEQ function instead of
F_INT8EQ. This was correct in the 0003 patch, where we removed the
tablespace from the key, but it got missed in this one :(

I have locally reproduced this in a 32 bit machine consistently and
the attached patch is fixing the issue for me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 50c5695843c7ab1b6945d4e7237da3812085c85a Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Wed, 28 Sep 2022 13:39:45 +0530
Subject: [PATCH] Fix silly mistake, use F_INT8EQ function for relfilenode
 search

---
 src/backend/utils/cache/relfilenumbermap.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/src/backend/utils/cache/relfilenumbermap.c b/src/backend/utils/cache/relfilenumbermap.c
index 2e0acf9..11427ba 100644
--- a/src/backend/utils/cache/relfilenumbermap.c
+++ b/src/backend/utils/cache/relfilenumbermap.c
@@ -88,7 +88,6 @@ static void
 InitializeRelfilenumberMap(void)
 {
 	HASHCTL		ctl;
-	int			i;
 
 	/* Make sure we've initialized CacheMemoryContext. */
 	if (CacheMemoryContext == NULL)
@@ -97,17 +96,20 @@ InitializeRelfilenumberMap(void)
 	/* build skey */
 	MemSet(&relfilenumber_skey, 0, sizeof(relfilenumber_skey));
 
-	for (i = 0; i < 2; i++)
-	{
-		fmgr_info_cxt(F_OIDEQ,
-	  &relfilenumber_skey[i].sk_func,
-	  CacheMemoryContext);
-		relfilenumber_skey[i].sk_strategy = BTEqualStrategyNumber;
-		relfilenumber_skey[i].sk_subtype = InvalidOid;
-		relfilenumber_skey[i].sk_collation = InvalidOid;
-	}
-
+	fmgr_info_cxt(F_OIDEQ,
+  &relfilenumber_skey[0].sk_func,
+  CacheMemoryContext);
+	relfilenumber_skey[0].sk_strategy = BTEqualStrategyNumber;
+	relfilenumber_skey[0].sk_subtype = InvalidOid;
+	relfilenumber_skey[0].sk_collation = InvalidOid;
 	relfilenumber_skey[0].sk_attno = Anum_pg_class_reltablespace;
+
+	fmgr_info_cxt(F_INT8EQ,
+  &relfilenumber_skey[1].sk_func,
+  CacheMemoryContext);
+	relfilenumber_skey[1].sk_strategy = BTEqualStrategyNumber;
+	relfilenumber_skey[1].sk_subtype = InvalidOid;
+	relfilenumber_skey[1].sk_collation = InvalidOid;
 	relfilenumber_skey[1].sk_attno = Anum_pg_class_relfilenode;
 
 	/*
-- 
1.8.3.1



RE: Data is copied twice when specifying both child and parent table in publication

2022-09-28 Thread wangw.f...@fujitsu.com
On Tue, Sep 27, 2022 at 4:45 PM Peter Smith  wrote:
> Here are my review comments for the HEAD_v11-0001 patch:

Thanks for your comments.

> ==
> 
> 1. General - Another related bug?
> 
> In [1] Hou-san wrote:
> 
> For another case you mentioned (via_root used when publishing child)
> CREATE PUBLICATION pub1 for TABLE parent;
> CREATE PUBLICATION pub2 for TABLE child with (publish_via_partition_root);
> CREATE SUBSCRIPTION sub connect xxx PUBLICATION pub1,pub2;
> 
> The expected behavior is only the child table is published, all the changes
> should be replicated using the child table's identity. We should do table sync
> only for child tables and is same as the current behavior on HEAD. So, I think
> there is no bug in this case.
> 
> ~
> 
> That behaviour seems different to my understanding because the pgdocs
> says when the via_root param is true the 'child' table would be using
> the 'parent' identity:
> 
> [2] publish_via_partition_root determines whether changes in a
> partitioned table (or on its partitions) contained in the publication
> will be published using the identity and schema of the partitioned
> table rather than that of the individual partitions that are actually
> changed.
> 
> ~
> 
> So is this another bug (slightly different from the current one being
> patched), or is it just some different special behaviour? If it's
> another bug then you need to know that ASAP because I think you may
> want to fix both of them at the same time which might impact how this
> 2x data copy patch should be implemented.
> 
> Or perhaps just the pgdocs need more notes about special
> cases/combinations like this?
> 
> ==
> 
> 2. General - documentation?
> 
> For this current patch, IIUC it was decided that it is a bug because
> the data gets duplicated, and then some sensible rule was decided that
> this patch should use to address it (e.g. publishing a child combined
> with publishing a parent via_root will just ignore the child's
> publication...).
> 
> So my question is - is this (new/fixed) behaviour something that a
> user will be able to figure out themselves from the existing
> documentation, or does this patch now need its own special notes in
> the documentation?

IMO this behaviour doesn't look like a bug.
I think the behaviour of multiple publications with the parameter
publish_via_partition_root could be added to the documentation later in a
separate patch.

> ==
> 
> 3. src/backend/catalog/pg_publication.c - pg_get_publication_tables
> 
> + foreach(lc, pub_elem_tables)
> + {
> + Oid *result = (Oid *) malloc(sizeof(Oid) * 2);
> +
> + result[0] = lfirst_oid(lc);
> + result[1] = pub_elem->oid;
> + table_infos = lappend(table_infos, result);
> + }
> 
> 3a.
> It looks like each element in the table_infos list is a malloced obj
> of 2x Oids (Oid of table, Oid of pub). IMO better to call this element
> 'table_info' instead of the meaningless 'result'
> 
> ~
> 
> 3b.
> Actually, I think it would be better if this function defines a little
> 2-element structure {Oid relid, Oid pubid} to use instead of this
> array (which requires knowledge that [0] means relid and [1] means
> pubid).
> 
> ~~~
> 
> 4.
> 
> + foreach(lc, table_infos)
> + {
> + Oid *table_info_tmp = (Oid *) lfirst(lc);
> +
> + if (!list_member_oid(tables, table_info_tmp[0]))
> + table_infos = foreach_delete_current(table_infos, lc);
>   }
> I think the '_tmp' suffix is not helpful here - IMO having another
> relid variable would make this more self-explanatory.
> 
> Or better yet adopt the suggestion o f #3b and have a little struct
> with self-explanatory member names.

Improved as suggested.

> =
> 
> 5. src/backend/commands/subscriptioncmds.c - fetch_table_list
> 
> + if (server_version >= 16)
> + appendStringInfo(&cmd, "SELECT DISTINCT N.nspname, C.relname,\n"
> 
> Since there is an else statement block, I think this would be more
> readable if there was a statement block here too. YMMV
> 
> SUGGESTION
> if (server_version >= 16)
> {
> appendStringInfo(&cmd, "SELECT DISTINCT N.nspname, C.relname,\n"
> ...
> }

Improved as suggested.

> ~~~
> 
> 6.
> 
> + /*
> + * Get the list of tables from publisher, the partition table whose
> + * ancestor is also in this list will be ignored, otherwise the initial
> + * data in the partition table would be replicated twice.
> + */
> 
> 6a.
> "from publisher, the partition" -> "from the publisher. The partition"
> 
> ~
> 
> 6b.
> This looks like a common comment that also applied to the "if" part,
> so it seems more appropriate to move it to where I indicated below.
> Perhaps the whole comment needs a bit of massaging after you move
> it...
> 
> + /*
> + * Get namespace, relname and column list (if supported) of the tables
> + * belonging to the specified publications.
> + *
> + * HERE <
> + *
> + * From version 16, the parameter of the function pg_get_publication_tables
> + * can be an array of publications. The partition table whose ancestor is
> + * also published in this 

Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2022-09-28 Thread Maxim Orlov
Hi!

I think this patch was marked as "Waiting on Author" by mistake, since the
recent changes did not involve any significant code changes and the CF bot
is now happy again.

I'm going to move it to RfC - may I? If not, please tell me why.

-- 
Best regards,
Maxim Orlov.


Re: making relfilenodes 56 bits

2022-09-28 Thread Dilip Kumar
On Wed, Sep 28, 2022 at 9:23 AM Thomas Munro  wrote:
>
> Hi Dilip,
>
> I am very happy to see these commits.  Here's some belated review for
> the tombstone-removal patch.
>
> > v7-0004-Don-t-delay-removing-Tombstone-file-until-next.patch
>
> More things you can remove:
>
>  * sync_unlinkfiletag in struct SyncOps
>  * the macro UNLINKS_PER_ABSORB
>  * global variable pendingUnlinks
>
> This comment after the question mark is obsolete:
>
> * XXX should we CHECK_FOR_INTERRUPTS in this loop?
> Escaping with an
> * error in the case of SYNC_UNLINK_REQUEST would leave the
> * no-longer-used file still present on disk, which
> would be bad, so
> * I'm inclined to assume that the checkpointer will
> always empty the
> * queue soon.
>
> (I think if the answer to the question is now yes, then we should
> replace the stupid sleep with a condition variable sleep, but there's
> another thread about that somewhere).
>
> In a couple of places in dbcommands.c you could now make this change:
>
> /*
> -* Force a checkpoint before starting the copy. This will
> force all dirty
> -* buffers, including those of unlogged tables, out to disk, to ensure
> -* source database is up-to-date on disk for the copy.
> -* FlushDatabaseBuffers() would suffice for that, but we also want to
> -* process any pending unlink requests. Otherwise, if a checkpoint
> -* happened while we're copying files, a file might be deleted just 
> when
> -* we're about to copy it, causing the lstat() call in copydir() to 
> fail
> -* with ENOENT.
> +* Force all dirty buffers, including those of unlogged tables, out to
> +* disk, to ensure source database is up-to-date on disk for the copy.
>  */
> -   RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
> - CHECKPOINT_WAIT |
> CHECKPOINT_FLUSH_ALL);
> +   FlushDatabaseBuffers(src_dboid);
>
> More obsolete comments you could change:
>
>  * If we were copying database at block levels then drop pages for the
>  * destination database that are in the shared buffer cache.  And tell
> -->  * checkpointer to forget any pending fsync and unlink
> requests for files
>
> --> * Tell checkpointer to forget any pending fsync and unlink requests 
> for
> * files in the database; else the fsyncs will fail at next
> checkpoint, or
> * worse, it will delete file
>
> In tablespace.c I think you could now make this change:
>
> if (!destroy_tablespace_directories(tablespaceoid, false))
> {
> -   /*
> -* Not all files deleted?  However, there can be
> lingering empty files
> -* in the directories, left behind by for example DROP
> TABLE, that
> -* have been scheduled for deletion at next checkpoint
> (see comments
> -* in mdunlink() for details).  We could just delete
> them immediately,
> -* but we can't tell them apart from important data
> files that we
> -* mustn't delete.  So instead, we force a checkpoint
> which will clean
> -* out any lingering files, and try again.
> -*/
> -   RequestCheckpoint(CHECKPOINT_IMMEDIATE |
> CHECKPOINT_FORCE | CHECKPOINT_WAIT);
> -
> +#ifdef WIN32
> /*
>  * On Windows, an unlinked file persists in the
> directory listing
>  * until no process retains an open handle for the
> file.  The DDL
> @@ -523,6 +513,7 @@ DropTableSpace(DropTableSpaceStmt *stmt)
>
> /* And now try again. */
> if (!destroy_tablespace_directories(tablespaceoid, false))
> +#endif
> {
> /* Still not empty, the files must be important then 
> */
> ereport(ERROR,

Thanks, Thomas, these all look fine to me.  So far we have committed
the patch to make relfilenode 56 bits wide.  The Tombstone file
removal patch is still pending to be committed, so when I will rebase
that patch I will accommodate all these comments in that patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2022-09-28 Thread Aleksander Alekseev
Hi hackers,

> I think this patch was marked as "Waiting on Author" by mistake, since the
> recent changes did not involve any significant code changes and the CF bot
> is now happy again.
>
> I'm going to move it to RfC - may I? If not, please tell me why.

I restored the "Ready for Committer" state. I don't think it's a good
practice to change the state every time the patch has a slight
conflict or something. This is not helpful at all. Such things happen
quite regularly and typically are fixed in a couple of days.

-- 
Best regards,
Aleksander Alekseev




Re: A doubt about a newly added errdetail

2022-09-28 Thread Alvaro Herrera
On 2022-Sep-28, Amit Kapila wrote:

> On Wed, Sep 28, 2022 at 11:30 AM Kyotaro Horiguchi
>  wrote:

> > It looks to me that the errmsg and errdetail are reversed. Isn't the 
> > following order common?
> >
> > >errmsg("schemas cannot be added to or dropped from publication 
> > > \"%s\".",
> > >   NameStr(pubform->pubname)),
> > >errdetail("The publication is defined as FOR ALL TABLES.")));
> >
> 
> This one seems to be matching with the below existing message:
> ereport(ERROR,
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> errmsg("publication \"%s\" is defined as FOR ALL TABLES",
> NameStr(pubform->pubname)),
> errdetail("Tables cannot be added to or dropped from FOR ALL TABLES
> publications.")));

Well, that suggests we should change both together.  I do agree that
they look suspicious; they should be more similar to this other one, I
think:

ereport(ERROR,
errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("cannot add schema to publication \"%s\"",
   stmt->pubname),
errdetail("Schemas cannot be added if any tables that 
specify a column list are already part of the publication."));

The errcodes appear not to agree with each other, also.  Maybe that
needs some more thought as well.  I don't think INVALID_PARAMETER_VALUE
is the right thing here, and I'm not sure about
OBJECT_NOT_IN_PREREQUISITE_STATE either.


FWIW, the latter is a whole category which is not defined by the SQL
standard, so I recall Tom got it from DB2.  DB2 chose to subdivide in a
lot of different cases, see
https://www.ibm.com/docs/en/db2/9.7?topic=messages-sqlstate#rsttmsg__code55
for a (current?) table.  Maybe we should define some additional 55Pxx
values -- say 55PR1 INCOMPATIBLE PUBLICATION DEFINITION (R for
"replication"-related matters; the P is what we chose for the
Postgres-specific subcategory).

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"People get annoyed when you try to debug them."  (Larry Wall)




Re: pg_rewind WAL segments deletion pitfall

2022-09-28 Thread Kyotaro Horiguchi
At Wed, 28 Sep 2022 10:09:05 +0200, Polina Bungina  wrote in 
> On Tue, Sep 27, 2022 at 9:50 AM Kyotaro Horiguchi 
> wrote:
> 
> > Regarding the patch, pg_rewind starts reading segments from the
> > divergence point back to the nearest checkpoint, then moves forward
> > during rewinding. So, the fact that SimpleXLogPageRead has read a
> > segment suggests that the segment is required during the next startup.
> > So I don't think we need to move around the keepWalSeg flag.  All
> > files that are wanted while rewinding should be preserved
> > unconditionally.
> >
> 
> I am probably not getting this right, but as far as I can see SimpleXLogPageRead
> is called from at most 3 places during a pg_rewind run:
> 1. From readOneRecord, to determine the end-of-WAL on the target by reading
> the last shutdown checkpoint record/minRecoveryPoint on it
> 2. From findLastCheckpoint, to find the last common checkpoint (here it
> indeed reads all the segments that are required during the startup, hence
> the keepWalSeg flag set to true)
> 3. From extractPageMap, to extract all the pages modified after the fork
> (here we read not only the segments that should be kept but also ones
> beyond, up to the target's end record; it doesn't seem we should
> unconditionally preserve them all).
> Am I missing something?

No, you're right. I have to admit that I was confused at the time X(,
sorry for that.  Those extra files are, I believe, harmless, but of
course it's preferable to avoid them, so the keepWalSeg flag is useful.

So the latest version becomes very similar to v1, in that both have the
keepWalSeg flag. The difference is the need for the file name hash.  I
still think it's better if we can do without the additional file
hash.  If we move the bare code that v2 added to SimpleXLogPageRead()
into a function named "preserve_file(char *filepath)", the code
becomes easier to read.

+   if (private->keepWalSeg)
+   {
+   /* the caller told us to preserve this file for future 
use */
+   snprintf(xlogfpath, MAXPGPATH, XLOGDIR "/%s", 
xlogfname);
+   preserve_file(xlogfpath);
+   }

Instead, I think we should add a comment here:

@@ -192,6 +195,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, 
int tliIndex,
 
 >  private.tliIndex = tliIndex;
 >  private.restoreCommand = restoreCommand;
++  /*
++   * WAL files read while searching for the last checkpoint are required
++   * by the next startup recovery of the target cluster.
++   */
 >+ private.keepWalSeg = true;

What do you think about the above?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A doubt about a newly added errdetail

2022-09-28 Thread Kyotaro Horiguchi
At Wed, 28 Sep 2022 13:47:25 +0530, Amit Kapila  wrote 
in 
> On Wed, Sep 28, 2022 at 11:30 AM Kyotaro Horiguchi
>  wrote:
> > I'm fine with that. By the way, related to the area, I found the
> > following error messages.
> >
> > >errmsg("publication \"%s\" is defined as FOR ALL TABLES",
> > >   NameStr(pubform->pubname)),
> > >errdetail("Schemas cannot be added to or dropped from FOR ALL 
> > > TABLES publications.")));
> >
> > It looks to me that the errmsg and errdetail are reversed. Isn't the 
> > following order common?
> >
> > >errmsg("schemas cannot be added to or dropped from publication 
> > > \"%s\".",
> > >   NameStr(pubform->pubname)),
> > >errdetail("The publication is defined as FOR ALL TABLES.")));
> >
> 
> This one seems to be matching with the below existing message:
> ereport(ERROR,
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> errmsg("publication \"%s\" is defined as FOR ALL TABLES",
> NameStr(pubform->pubname)),
> errdetail("Tables cannot be added to or dropped from FOR ALL TABLES
> publications.")));

Yeah, so I meant that I'd like to propose changing them both.  I just
wanted to ask people whether that proposal is reasonable or not.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Avoid memory leaks during base backups

2022-09-28 Thread Bharath Rupireddy
On Wed, Sep 28, 2022 at 10:09 AM Bharath Rupireddy
 wrote:
>
> Here's what I think - for backup functions, we
> can have the new memory context child of TopMemoryContext and for
> perform_base_backup(), we can have the memory context child of
> CurrentMemoryContext. With PG_TRY()-PG_FINALLY()-PG_END_TRY(), we can
> delete those memory contexts upon ERRORs. This approach works for us
> since backup-related code doesn't have any FATALs.
>
> Thoughts?

I'm attaching the v2 patch designed as described above. Please review it.

I've added an entry in CF - https://commitfest.postgresql.org/40/3915/

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 36607bb46a1328a8b4d63169eb1739cf3e6769fd Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 28 Sep 2022 09:04:00 +
Subject: [PATCH v2] Avoid memory leaks during backups

Postgres can currently leak memory if a failure occurs during base
backup in do_pg_backup_start(), do_pg_backup_stop() or
perform_base_backup(). The palloc'd memory, such as backup_state or
tablespace_map in xlogfuncs.c or basebackup.c, or tablespaceinfo in
perform_base_backup(), is never freed, which may eventually cause
memory bloat on the server. To reproduce the issue, run pg_basebackup
with a --label name longer than 1024 characters and observe the
server's memory with the watch command; the usage keeps going up.

It looks like the memory leak issue has been there for quite some
time.

To solve the above problem, leverage PG_TRY()-PG_FINALLY()-PG_END_TRY()
mechanism and memory context. The design of the patch is as follows:

For backup functions or on-line backups, a new memory context is
created as a child of TopMemoryContext in pg_backup_start() which
gets deleted upon ERROR or at the end of pg_backup_stop().

For perform_base_backup() or base backups, a new memory context is
created as a child of CurrentMemoryContext which gets deleted upon
ERROR or at the end of perform_base_backup().
---
 src/backend/access/transam/xlogfuncs.c | 114 ++---
 src/backend/backup/basebackup.c|  41 -
 src/backend/utils/error/elog.c |  17 
 src/include/utils/elog.h   |   1 +
 4 files changed, 141 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index a801a94fe8..6c13da91bd 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -45,6 +45,9 @@
 static BackupState *backup_state = NULL;
 static StringInfo tablespace_map = NULL;
 
+/* A workspace for on-line backup. */
+static MemoryContext backupcontext = NULL;
+
 /*
  * pg_backup_start: set up for taking an on-line backup dump
  *
@@ -71,33 +74,52 @@ pg_backup_start(PG_FUNCTION_ARGS)
  errmsg("a backup is already in progress in this session")));
 
 	/*
-	 * backup_state and tablespace_map need to be long-lived as they are used
-	 * in pg_backup_stop().
+	 * Start the on-line backup, but make sure we clean up the allocated memory
+	 * even if an error occurs.
 	 */
-	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+	PG_TRY();
+	{
+		/*
+		 * backup_state and tablespace_map need to be long-lived as they are
+		 * used in pg_backup_stop(). Create a special memory context as a
+		 * direct child of TopMemoryContext so that the memory allocated is
+		 * cleaned in case of errors.
+		 */
+		if (backupcontext == NULL)
+		{
+			backupcontext = AllocSetContextCreate(TopMemoryContext,
+  "on-line backup context",
+  ALLOCSET_DEFAULT_SIZES);
+		}
 
-	/* Allocate backup state or reset it, if it comes from a previous run */
-	if (backup_state == NULL)
+		oldcontext = MemoryContextSwitchTo(backupcontext);
+
+		Assert(backup_state == NULL);
+		Assert(tablespace_map == NULL);
 		backup_state = (BackupState *) palloc0(sizeof(BackupState));
-	else
-		MemSet(backup_state, 0, sizeof(BackupState));
+		tablespace_map = makeStringInfo();
 
-	/*
-	 * tablespace_map may have been created in a previous backup, so take this
-	 * occasion to clean it.
-	 */
-	if (tablespace_map != NULL)
+		register_persistent_abort_backup_handler();
+		do_pg_backup_start(backupidstr, fast, NULL, backup_state, tablespace_map);
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+	PG_FINALLY();
 	{
-		pfree(tablespace_map->data);
-		pfree(tablespace_map);
-		tablespace_map = NULL;
+		/*
+		 * Delete on-line backup memory context only when an ERROR occurred,
+		 * because the memory allocated in backupcontext is used in
+		 * pg_backup_stop() for non-ERROR cases.
+		 */
+		if (backupcontext != NULL && geterrlevel() == ERROR)
+		{
+			MemoryContextDelete(backupcontext);
+			backupcontext = NULL;
+			backup_state = NULL;
+			tablespace_map = NULL;
+		}
 	}
-
-	tablespace_map = makeStringInfo();
-	MemoryContextSwitchTo(oldcontext);
-
-	register_persistent_abort_backup_handler();
-	do_pg_backup_start(backupidstr, fast, NULL, backup_stat

Re: A doubt about a newly added errdetail

2022-09-28 Thread Kyotaro Horiguchi
At Wed, 28 Sep 2022 10:46:41 +0200, Alvaro Herrera  
wrote in 
> On 2022-Sep-28, Amit Kapila wrote:
> 
> > On Wed, Sep 28, 2022 at 11:30 AM Kyotaro Horiguchi
> >  wrote:
> 
> > > It looks to me that the errmsg and errdetail are reversed. Isn't the
> > > following order more common?
> > >
> > > >errmsg("schemas cannot be added to or dropped from publication 
> > > > \"%s\".",
> > > >   NameStr(pubform->pubname)),
> > > >errdetail("The publication is defined as FOR ALL TABLES.")));
> > >
> > 
> > This one seems to be matching with the below existing message:
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > errmsg("publication \"%s\" is defined as FOR ALL TABLES",
> > NameStr(pubform->pubname)),
> > errdetail("Tables cannot be added to or dropped from FOR ALL TABLES
> > publications.")));
> 
> Well, that suggests we should change both together.  I do agree that
> they look suspicious; they should be more similar to this other one, I
> think:

Ah, yes, and thanks.

> ereport(ERROR,
> errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> errmsg("cannot add schema to publication \"%s\"",
>stmt->pubname),
> errdetail("Schemas cannot be added if any tables that 
> specify a column list are already part of the publication."));
> 
> The errcodes appear not to agree with each other, also.  Maybe that
> needs some more thought as well.  I don't think INVALID_PARAMETER_VALUE
> is the right thing here, and I'm not sure about
> OBJECT_NOT_IN_PREREQUISITE_STATE either.

The latter seems to fit better than the current one. That being said,
if we change the SQLSTATE for existing errors, that may confuse
existing users.

> FWIW, the latter is a whole category which is not defined by the SQL
> standard, so I recall Tom got it from DB2.  DB2 chose to subdivide in a
> lot of different cases, see
> https://www.ibm.com/docs/en/db2/9.7?topic=messages-sqlstate#rsttmsg__code55
> for a (current?) table.  Maybe we should define some additional 55Pxx
> values -- say 55PR1 INCOMPATIBLE PUBLICATION DEFINITION (R for
> "replication"-related matters; the P is what we chose for the
> Postgres-specific subcategory).

I generally agree with this. But do we have enough time to fully
consider that?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Obsolete comment in ExecInsert()

2022-09-28 Thread Etsuro Fujita
Hi,

While reviewing the "Fast COPY FROM based on batch insert" patch, I
noticed this comment introduced in commit b663a4136:

/*
 * If a certain number of tuples have already been accumulated, or
 * a tuple has come for a different relation than that for the
 * accumulated tuples, perform the batch insert
 */
if (resultRelInfo->ri_NumSlots == resultRelInfo->ri_BatchSize)
{
ExecBatchInsert(mtstate, resultRelInfo,
resultRelInfo->ri_Slots,
resultRelInfo->ri_PlanSlots,
resultRelInfo->ri_NumSlots,
estate, canSetTag);
resultRelInfo->ri_NumSlots = 0;
}

I think the "or a tuple has come for a different relation than that
for the accumulated tuples" part in the comment is a leftover from an
earlier version of the patch [1].  As the code shows, we do not handle
that case anymore, so I think we should remove that part from the
comment.  Attached is a patch for that.

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/TYAPR01MB2990ECD1C68EA694DD0667E4FEE90%40TYAPR01MB2990.jpnprd01.prod.outlook.com


Remove-obsolete-comment.patch
Description: Binary data


Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Thomas Munro
On Wed, Sep 28, 2022 at 9:26 PM Dilip Kumar  wrote:
> It was a silly mistake, I used the F_OIDEQ function instead of
> F_INT8EQ. Although this was correct on the 0003 patch where we have
> removed the tablespace from key, but got missed in this :(
>
> I have locally reproduced this in a 32 bit machine consistently and
> the attached patch is fixing the issue for me.

I tested this with an armhf (32 bit) toolchain, and it passes
check-world, and was failing before.

Robert's patch isn't needed on this system. I didn't look into this
subject for long but it seems that SIGBUS on misaligned access (as
typically seen on eg SPARC) requires a 32 bit Linux/ARM kernel, but I
was testing with 32 bit processes and a 64 bit kernel.  Apparently 32
bit Linux/ARM has a control /proc/cpu/alignment to select behaviour
(options include emulation, SIGBUS) but 64 bit kernels don't have it
and are happy with misaligned access.




Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-28 Thread Martin Kalcher

Am 26.09.22 um 22:16 schrieb Tom Lane:


With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose.  I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.



Thanks for your thoughts, Tom. I have a couple of questions. Should we 
introduce a new seed function for the new PRNG state, used by 
array_shuffle() and array_sample()? What would be a good name? Or should 
those functions use pg_global_prng_state? Is it safe to assume that 
pg_global_prng_state is seeded?


Martin




Re: SYSTEM_USER reserved word implementation

2022-09-28 Thread Drouvot, Bertrand

Hi,

On 9/28/22 5:28 AM, Michael Paquier wrote:

On Tue, Sep 27, 2022 at 03:38:49PM -0700, Jacob Champion wrote:

On 9/26/22 06:29, Drouvot, Bertrand wrote:
Since there are only internal clients to the API, I'd argue this makes
more sense as an Assert(authn_id != NULL), but I don't think it's a
dealbreaker.


Using an assert() looks like a good idea from here.  If this is called
with a NULL authn, this could reflect a problem in the authentication
logic.



Agreed, thanks for pointing that out.


As far the assertion failure mentioned by Michael when moving the
SVFOP_SYSTEM_USER from NAMEOID to TEXTOID: V4 is assuming that it is
safe to force the collation to C_COLLATION_OID for SQLValueFunction
having a TEXT type, but I would be happy to also hear your thoughts
about it.


Unfortunately I don't have much to add here; I don't know enough about
the underlying problems.


I have been looking at that, and after putting my hands on that this
comes down to the facility introduced in 40c24bf.  So, I think that
we'd better use COERCE_SQL_SYNTAX so as there is no need to worry
about the shortcuts this patch is trying to use with the collation
setup.


Nice!


 And there are a few tests for get_func_sql_syntax() in
create_view.sql.  Note that this makes the patch slightly shorter, and
simpler.



Agree that it does look simpler that way and that making use of 
COERCE_SQL_SYNTAX does look like a better approach. Nice catch!




The docs still mentioned "name", and not "text".



Oops, thanks for pointing that out.

I had a look at v5 and it does look good to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Dilip Kumar
On Wed, Sep 28, 2022 at 11:57 AM Tom Lane  wrote:
>
> Dilip Kumar  writes:
> > Btw, I think the reason for the bus error on wrasse is the same as
> > what is creating failure on longfin[1], I mean this unaligned access
> > is causing Bus error during startup, IMHO.
>
> Maybe, but there's not a lot of evidence for that.  wrasse got
> through the test_decoding check where longfin, tamandua, kestrel,
> and now skink are failing.  It's evidently not the same issue
> that the 32-bit animals are choking on, either.  Looks like yet
> a third bug to me.

I think the reason is that "longfin" is configured with the
-fsanitize=alignment option so it will report the failure for any
unaligned access.  Whereas "wrasse" actually generates the "Bus error"
due to architecture.  So the difference is that with
-fsanitize=alignment, it will always complain for any unaligned access
but all unaligned access will not end up in the "Bus error", and I
think that could be the reason "wrasse" is not failing in the test
decoding.

Yeah, but this is just a theory about why they are failing in
different places; we still do not have the evidence/call stack to
prove it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: DROP OWNED BY is broken on master branch.

2022-09-28 Thread Rushabh Lathia
On Tue, Sep 27, 2022 at 7:34 PM Robert Haas  wrote:

> On Tue, Sep 27, 2022 at 2:53 AM Rushabh Lathia 
> wrote:
> > Yes, I was also thinking to avoid the duplicate logic but couldn't find
> > a way.  I did some quick testing with the patch, and the reported test
> > is working fine.  But "make check" is failing with a few failures.
>
> Oh, woops. There was a dumb mistake in that version -- it was testing
> sdepForm->dbid == SHARED_DEPENDENCY_OWNER, which is nonsense, instead
> of sdepForm->dbid == MyDatabaseId. Here's a fixed version.
>

This seems to fix the issue and in further testing I didn't find anything
else.

Thanks,


> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>


-- 
Rushabh Lathia


Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Robert Haas
On Wed, Sep 28, 2022 at 1:48 AM Thomas Munro  wrote:
> FTR CI reported that cpluspluscheck failure and more[1], so perhaps we
> just need to get clearer agreement on the status of CI, ie a policy
> that CI had better be passing before you get to the next stage.  It's
> still pretty new...

Yeah, I suppose I have to get in the habit of looking at CI before
committing anything. It's sort of annoying to me, though. Here's a
list of the follow-up fixes I've so far committed:

1. headerscheck
2. typos
3. pg_buffercache's meson.build
4. compiler warning
5. alignment problem
6. F_INTEQ/F_OIDEQ problem

CI caught (1), (3), and (4). The buildfarm caught (1), (5), and (6).
The number of buildfarm failures that I would have avoided by checking
CI is less than the number of extra things I had to fix to keep CI
happy, and the serious problems were caught by the buildfarm, not by
CI. It's not even clear to me how I was supposed to know that every
future Makefile change is going to require adjusting a meson.build
file as well. It's not like that was mentioned in the commit message
for the meson build system, which also has no README anywhere in the
source tree. I found the wiki page by looking up the commit and
finding the URL in the commit message, but, you know, that wiki page
ALSO doesn't mention the need to now update meson.build files going
forward. So I guess the way you're supposed to know that you need to
update meson.build is by looking at CI, but CI is also the only
reason it's necessary to care about meson.build in the first place. I
feel like CI has not really made it any easier to not break the
buildfarm -- it's just provided a second buildfarm that you can break
independently of the first one.

And like the existing buildfarm, it's severely under-documented.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make ON_ERROR_STOP stop on shell script failure

2022-09-28 Thread torikoshia

On 2022-09-20 15:15, bt22nakamorit wrote:


I edited the documentation for ON_ERROR_STOP.
Any other suggestions?


Thanks for the patch!


   if (result == 127 || result == -1)
   {
   pg_log_error("\\!: failed");
   return false;
   }
   else if (result != 0) {
   pg_log_error("command failed");
   return false;


Since it would be hard to understand the cause of a failure from these 
two messages, it might be better to make the messages more specific.


The former comes from a failure to create the child process or to 
execute the command in it, and the latter occurs when process creation 
and execution succeeded but the return code was not 0, doesn't it?



I also feel it'd be natural for the latter message to begin with 
"\\!" as well, since both messages concern \!.


What do you think?

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Robert Haas
On Wed, Sep 28, 2022 at 1:15 AM Tom Lane  wrote:
> Dilip Kumar  writes:
> > wrasse is also failing with a bus error,
>
> Yeah.  At this point I think it's time to call for this patch
> to get reverted.  It should get tested *off line* on some
> non-Intel, non-64-bit, alignment-picky architectures before
> the rest of us have to deal with it any more.

I don't really understand how you expect me or Dilip to do this. Is
every PostgreSQL hacker supposed to have a bunch of antiquated servers
in their basement so that they can test this stuff? I don't think I
have had easy access to non-Intel, non-64-bit, alignment-picky
hardware in probably 25 years, unless my old Raspberry Pi counts.

I admit that I should have checked the CI results before pushing this
commit, but as you say yourself, that missed a bunch of stuff, and I'd
say it was the important stuff. Unless and until CI is able to check
all the same configurations that the buildfarm can check, it's not
going to be possible to get test results on some of these platforms
except by checking the code in and seeing what happens. If I revert
this, I'm just going to be sitting here not knowing where any of the
problems are and having no way to find them.

Maybe I'm missing something here. Apart from visual inspection of the
code and missing fewer mistakes than I did, how would you have avoided
these problems in one of your commits?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-28 Thread Fabien COELHO




With our current PRNG infrastructure it doesn't cost much to have
a separate PRNG for every purpose.  I don't object to having
array_shuffle() and array_sample() share one PRNG, but I don't
think it should go much further than that.


Thanks for your thoughts, Tom. I have a couple of questions. Should we 
introduce a new seed function for the new PRNG state, used by array_shuffle() 
and array_sample()? What would be a good name? Or should those functions use 
pg_global_prng_state? Is it safe to assume that pg_global_prng_state is 
seeded?


I'd suggest using the existing global state. The global state should be 
seeded at process start, AFAIR.


--
Fabien.




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Robert Haas
On Tue, Sep 27, 2022 at 5:29 PM Tom Lane  wrote:
> ... also, lapwing's not too happy [1].  The alter_table test
> expects this to yield zero rows, but it doesn't:
>
>  SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
>  WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
>
> I've reproduced that symptom in a 32-bit FreeBSD VM building with clang,
> so I suspect that it'll occur on any 32-bit build.  mamba is a couple
> hours away from offering a confirmatory data point, though.
>
> (BTW, is that test case sane at all?  I'm bemused by the symmetrical
> NOT NULL tests on a fundamentally not-symmetrical left join; what
> are those supposed to accomplish?  Also, the fact that it doesn't
> deign to show any fields from "c" is sure making it hard to tell
> what's wrong.)

This was added by:

commit f3fdd257a430ff581090740570af9f266bb893e3
Author: Noah Misch 
Date:   Fri Jun 13 19:57:59 2014 -0400

Harden pg_filenode_relation test against concurrent DROP TABLE.

Per buildfarm member prairiedog.  Back-patch to 9.4, where the test was
introduced.

Reviewed by Tom Lane.

There seems to be a comment in that commit which explains the intent
of those funny-looking NULL tests.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Robert Haas
On Tue, Sep 27, 2022 at 5:50 PM Tom Lane  wrote:
> Maybe it wouldn't have any great impact.  I don't know, but I don't
> think it's incumbent on me to measure that.  You or the patch author
> should have had a handle on that question *before* committing.

I agree. I should have gone through and checked that every place where
RelFileLocator got embedded in some larger struct, there was no
problem with making it bigger and increasing the alignment
requirement. I'll go back and do that as soon as the immediate
problems are fixed. This case would have stood out as something
needing attention.

Some of the cases are pretty subtle, though. tamandua is still unhappy
even after pushing that fix, because xl_xact_relfilelocators embeds a
RelFileLocator which now requires 8-byte alignment, and
ParseCommitRecord has an undocumented assumption that none of the
things embedded in a commit record require more than 4-byte alignment.
Really, if it's critical for a struct to never require more than
4-byte alignment, there ought to be a comment about that *on the
struct itself*. Having a comment on a function that does something
with that struct is probably not really good enough, and we don't even
have that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




PostgreSQL 15 RC1 release announcement draft

2022-09-28 Thread Jonathan S. Katz

Hi,

Attached is a draft of the PostgreSQL 15 RC 1 release announcement. 
Please provide feedback no later than 2022-09-29 0:00 AoE.


Thanks,

Jonathan
The PostgreSQL Global Development Group announces that the first release
candidate of PostgreSQL 15 is now available for download. As a release
candidate, PostgreSQL 15 RC 1 will be mostly identical to the initial release of
PostgreSQL 15, though some more fixes may be applied prior to the general
availability of PostgreSQL 15.

The planned date for the general availability of PostgreSQL 15 is
October 13, 2022. Please see the "Release Schedule" section for more details.

Upgrading to PostgreSQL 15 RC 1
---

To upgrade to PostgreSQL 15 RC 1 from earlier versions of PostgreSQL, you will
need to use a major version upgrade strategy, e.g. `pg_upgrade` or
`pg_dump` / `pg_restore`. For more information, please visit the documentation
section on 
[upgrading](https://www.postgresql.org/docs/15/static/upgrading.html):

[https://www.postgresql.org/docs/15/static/upgrading.html](https://www.postgresql.org/docs/15/static/upgrading.html)

Changes Since 15 Beta 4
---

Several bug fixes were applied for PostgreSQL 15 during the Beta 4 period. These
include:

* The syntax for publishing all tables in a schema using logical replication
is changed to `CREATE PUBLICATION ... FOR TABLES IN SCHEMA ...`.
* Logical replication publications can now publish tables that are within a
schema if both schema and table are specified.
* Disallow publishing a schema if a table with column-lists is specified.
* Fix issue with `pg_publication_tables` view where it would display dropped
columns.
* Disallow creating a new database with an unsupported ICU locale.
* Fix behavior issue with the `--gzip` option for `pg_basebackup` where the
contents of the WAL directory were not compressed.
* Fix issue with recovery prefetching when `maintenance_io_concurrency` was set
to a low value (e.g. `0`).
* Log errors when replaying WAL files when `wal_compression` is specified as
either `lz4` or `zstd` but the server does not support it.
* Move several files generated by `pg_upgrade` to an internal subdirectory.
* Clear the display from the `ps` command when the recovery process finishes.

For a detailed list of fixes, please visit the
[open 
items](https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items#resolved_before_15rc1)
page.

Release Schedule
----------------

This is the first release candidate for PostgreSQL 15. Unless an issue is
discovered that warrants a delay or to produce an additional release candidate,
PostgreSQL 15 should be made generally available on October 13, 2022.

For further information please see the
[Beta Testing](https://www.postgresql.org/developer/beta/) page.

Links
-

* [Download](https://www.postgresql.org/download/)
* [Beta Testing Information](https://www.postgresql.org/developer/beta/)
* [PostgreSQL 15 RC 1 Release 
Notes](https://www.postgresql.org/docs/15/static/release-15.html)
* [PostgreSQL 15 Open 
Issues](https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items)
* [Submit a Bug](https://www.postgresql.org/account/submitbug/)
* [Follow @postgresql on Twitter](https://www.twitter.com/postgresql)


OpenPGP_signature
Description: OpenPGP digital signature


Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Robert Haas
On Wed, Sep 28, 2022 at 9:16 AM Robert Haas  wrote:
> I agree. I should have gone through and checked that every place where
> RelFileLocator got embedded in some larger struct, there was no
> problem with making it bigger and increasing the alignment
> requirement. I'll go back and do that as soon as the immediate
> problems are fixed. This case would have stood out as something
> needing attention.

On second thought, I'm going to revert the whole thing. There's a
bigger mess here than can be cleaned up on the fly. The
alignment-related mess in ParseCommitRecord is maybe something for
which I could just hack a quick fix, but what I've also just now
realized is that this makes a huge number of WAL records larger by 4
bytes, since most WAL records will contain a block reference. I don't
know whether that's OK or not, but I do know that it hasn't been
thought about, and after commit is not the time to begin experimenting
with such things.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Summary function for pg_buffercache

2022-09-28 Thread Melih Mutlu
Hi all,

The patch needed a rebase due to recent changes on pg_buffercache.
You can find the updated version attached.

Best,
Melih


v12-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data


Re: [PATCH] Introduce array_shuffle() and array_sample()

2022-09-28 Thread Tom Lane
Fabien COELHO  writes:
>> Thanks for your thoughts, Tom. I have a couple of questions. Should we 
>> introduce a new seed function for the new PRNG state, used by 
>> array_shuffle() 
>> and array_sample()? What would be a good name? Or should those functions use 
>> pg_global_prng_state? Is it safe to assume that pg_global_prng_state is 
>> seeded?

> I'd suggest to use the existing global state. The global state should be 
> seeded at the process start, AFAIKR.

It is seeded at process start, yes.  If you don't feel a need for
user control over the sequence used by these functions, then using
pg_global_prng_state is fine.  (Basically the point to be made
here is that we need to keep a tight rein on what can be affected
by setseed().)

regards, tom lane




Re: Summary function for pg_buffercache

2022-09-28 Thread Zhang Mingli

Regards,
Zhang Mingli
On Sep 28, 2022, 21:50 +0800, Melih Mutlu , wrote:
> Hi all,
>
> The patch needed a rebase due to recent changes on pg_buffercache.
> You can find the updated version attached.
>
> Best,
> Melih
>
>
```
+
+   if (buffers_used != 0)
+ usagecount_avg = usagecount_avg / buffers_used;
+
+   memset(nulls, 0, sizeof(nulls));
+   values[0] = Int32GetDatum(buffers_used);
+   values[1] = Int32GetDatum(buffers_unused);
+   values[2] = Int32GetDatum(buffers_dirty);
+   values[3] = Int32GetDatum(buffers_pinned);
+
+   if (buffers_used != 0)
+   {
+ usagecount_avg = usagecount_avg / buffers_used;
+ values[4] = Float4GetDatum(usagecount_avg);
+   }
+   else
+   {
+ nulls[4] = true;
+   }
```

Why compute usagecount_avg twice?


Re: Summary function for pg_buffercache

2022-09-28 Thread Melih Mutlu
On Wed, Sep 28, 2022 at 17:31, Zhang Mingli 
wrote:

> Why compute usagecount_avg twice?
>

I should have removed the first one, but I think I missed it.
Nice catch.

Attached an updated version.

Thanks,
Melih


v13-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data


Re: Obsolete comment in ExecInsert()

2022-09-28 Thread Tom Lane
Etsuro Fujita  writes:
> I think the “or a tuple has come for a different relation than that
> for the accumulated tuples" part in the comment is a leftover from an
> earlier version of the patch [1].  As the code shows, we do not handle
> that case anymore, so I think we should remove that part from the
> comment.  Attached is a patch for that.

+1, but what remains still seems awkwardly worded.  How about something
like "When we've reached the desired batch size, perform the insertion"?

regards, tom lane




Move backup-related code to xlogbackup.c/.h

2022-09-28 Thread Bharath Rupireddy
Hi,

xlog.c currently has ~9000 LOC, out of which ~700 LOC is backup
related, making the file really unmanageable. The commit
7d708093b7400327658a30d1aa1d5e284d37622c added new files
xlogbackup.c/.h for hosting all backup related code eventually. I
propose to move all the backup-related code from xlog.c and xlogfuncs.c
to xlogbackup.c/.h. In doing so, I had to add a few Get/Set functions
for XLogCtl variables so that xlogbackup.c can use them.

I'm attaching a patch set where 0001 and 0002 move backup code from
xlogfuncs.c and xlog.c to xlogbackup.c/.h respectively. The advantage
is that all the core's backup code is in one single file making it
more readable and manageable while reducing the xlog.c's file size.

Thoughts?

Thanks Michael Paquier for suggesting to have new files for backup related code.

[1] 
https://www.postgresql.org/message-id/CALj2ACX0wjo%2B49hbUmvc_zT1zwdqFOQyhorN0Ox-Rk6v97Nejw%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 3ccaf1ad08e00229106be0cca3180585e4eb9c6f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 28 Sep 2022 12:12:59 +
Subject: [PATCH v1] Move backup-related code from xlogfuncs.c to xlogbackup.c

---
 src/backend/access/transam/xlogbackup.c | 132 
 src/backend/access/transam/xlogfuncs.c  | 131 ---
 2 files changed, 132 insertions(+), 131 deletions(-)

diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 90b5273b02..073678d84f 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -16,6 +16,16 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogbackup.h"
+#include "funcapi.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/pg_lsn.h"
+
+/*
+ * Backup-related variables.
+ */
+static BackupState *backup_state = NULL;
+static StringInfo tablespace_map = NULL;
 
 /*
  * Build contents for backup_label or backup history file.
@@ -82,3 +92,125 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 
 	return data;
 }
+
+/*
+ * pg_backup_start: set up for taking an on-line backup dump
+ *
+ * Essentially what this does is to create the contents required for the
+ * backup_label file and the tablespace map.
+ *
+ * Permission checking for this function is managed through the normal
+ * GRANT system.
+ */
+Datum
+pg_backup_start(PG_FUNCTION_ARGS)
+{
+	text	   *backupid = PG_GETARG_TEXT_PP(0);
+	bool		fast = PG_GETARG_BOOL(1);
+	char	   *backupidstr;
+	SessionBackupState status = get_backup_status();
+	MemoryContext oldcontext;
+
+	backupidstr = text_to_cstring(backupid);
+
+	if (status == SESSION_BACKUP_RUNNING)
+		ereport(ERROR,
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("a backup is already in progress in this session")));
+
+	/*
+	 * backup_state and tablespace_map need to be long-lived as they are used
+	 * in pg_backup_stop().
+	 */
+	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+	/* Allocate backup state or reset it, if it comes from a previous run */
+	if (backup_state == NULL)
+		backup_state = (BackupState *) palloc0(sizeof(BackupState));
+	else
+		MemSet(backup_state, 0, sizeof(BackupState));
+
+	/*
+	 * tablespace_map may have been created in a previous backup, so take this
+	 * occasion to clean it.
+	 */
+	if (tablespace_map != NULL)
+	{
+		pfree(tablespace_map->data);
+		pfree(tablespace_map);
+		tablespace_map = NULL;
+	}
+
+	tablespace_map = makeStringInfo();
+	MemoryContextSwitchTo(oldcontext);
+
+	register_persistent_abort_backup_handler();
+	do_pg_backup_start(backupidstr, fast, NULL, backup_state, tablespace_map);
+
+	PG_RETURN_LSN(backup_state->startpoint);
+}
+
+/*
+ * pg_backup_stop: finish taking an on-line backup.
+ *
+ * The first parameter (variable 'waitforarchive'), which is optional,
+ * allows the user to choose if they want to wait for the WAL to be archived
+ * or if we should just return as soon as the WAL record is written.
+ *
+ * This function stops an in-progress backup, creates backup_label contents and
+ * it returns the backup stop LSN, backup_label and tablespace_map contents.
+ *
+ * The backup_label contains the user-supplied label string (typically this
+ * would be used to tell where the backup dump will be stored), the starting
+ * time, starting WAL location for the dump and so on.  It is the caller's
+ * responsibility to write the backup_label and tablespace_map files in the
+ * data folder that will be restored from this backup.
+ *
+ * Permission checking for this function is managed through the normal
+ * GRANT system.
+ */
+Datum
+pg_backup_stop(PG_FUNCTION_ARGS)
+{
+#define PG_BACKUP_STOP_V2_COLS 3
+	TupleDesc	tupdesc;
+	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
+	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
+	bool		waitforarchive = PG_GETARG_BOOL(0);
+	char	   *backup_labe

Re: DROP OWNED BY is broken on master branch.

2022-09-28 Thread Robert Haas
On Wed, Sep 28, 2022 at 8:21 AM Rushabh Lathia  wrote:
> On Tue, Sep 27, 2022 at 7:34 PM Robert Haas  wrote:
>> On Tue, Sep 27, 2022 at 2:53 AM Rushabh Lathia  
>> wrote:
>> > Yes, I was also thinking to avoid the duplicate logic but couldn't find
>> > a way.  I did some quick testing with the patch, and the reported test is
>> > working fine.  But "make check" is failing with a few failures.
>>
>> Oh, woops. There was a dumb mistake in that version -- it was testing
>> sdepForm->dbid == SHARED_DEPENDENCY_OWNER, which is nonsense, instead
>> of sdepForm->dbid == MyDatabaseId. Here's a fixed version.
>
> This seems to fix the issue and in further testing I didn't find anything 
> else.

OK, committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Summary function for pg_buffercache

2022-09-28 Thread Zhang Mingli
Hi,

On Sep 28, 2022, 22:41 +0800, Melih Mutlu , wrote:
>
>
> Zhang Mingli wrote on Wed, 28 Sep 2022 at 17:31:
> > Why compute usagecount_avg twice?
>
> I should have removed the first one, but I think I missed it.
> Nice catch.
>
> Attached an updated version.
>
> Thanks,
> Melih
>
Hmm, I just applied the v13 patch but it failed.

Part of errors:
```
error: patch failed: contrib/pg_buffercache/pg_buffercache.control:1
error: contrib/pg_buffercache/pg_buffercache.control: patch does not apply
Checking patch contrib/pg_buffercache/pg_buffercache_pages.c...
error: while searching for:
 */
PG_FUNCTION_INFO_V1(pg_buffercache_pages);
PG_FUNCTION_INFO_V1(pg_buffercache_pages_v1_4);
```

Rebase on master and then apply our changes again?

Regards,
Zhang Mingli
>


Re: Allow foreign keys to reference a superset of unique columns

2022-09-28 Thread Kaiting Chen
> For one example of where the semantics get fuzzy, it's not
> very clear how the extra-baggage columns ought to participate in
> CASCADE updates.  Currently, if we have
>CREATE TABLE foo (a integer PRIMARY KEY, b integer);
> then an update that changes only foo.b doesn't need to update
> referencing tables, and I think we even have optimizations that
> assume that if no unique-key columns are touched then RI checks
> need not be made.

Regarding optimizations that skip RI checks on the PK side of the relationship,
I believe the relevant code is here (in trigger.c):

  if (TRIGGER_FIRED_BY_UPDATE(event) || TRIGGER_FIRED_BY_DELETE(event))
  {
      ...

      switch (RI_FKey_trigger_type(trigger->tgfoid))
      {
          ...

          case RI_TRIGGER_PK:
              ...

              /* Update or delete on trigger's PK table */
              if (!RI_FKey_pk_upd_check_required(trigger, rel, oldslot, newslot))
              {
                  /* skip queuing this event */
                  continue;
              }

              ...

And the checks done in RI_FKey_pk_upd_check_required() are:

  const RI_ConstraintInfo *riinfo;

  riinfo = ri_FetchConstraintInfo(trigger, pk_rel, true);

  /*
   * If any old key value is NULL, the row could not have been referenced by
   * an FK row, so no check is needed.
   */
  if (ri_NullCheck(RelationGetDescr(pk_rel), oldslot, riinfo, true) !=
      RI_KEYS_NONE_NULL)
      return false;

  /* If all old and new key values are equal, no check is needed */
  if (newslot && ri_KeysEqual(pk_rel, oldslot, newslot, riinfo, true))
      return false;

  /* Else we need to fire the trigger. */
  return true;

The columns inspected by both ri_NullCheck() and ri_KeysEqual() are based on
riinfo->pk_attnums:

  if (rel_is_pk)
      attnums = riinfo->pk_attnums;
  else
      attnums = riinfo->fk_attnums;

The check in ri_NullCheck() is, expectedly, a straightforward NULL check:

  for (int i = 0; i < riinfo->nkeys; i++)
  {
      if (slot_attisnull(slot, attnums[i]))
          nonenull = false;
      else
          allnull = false;
  }

The check in ri_KeysEqual() is a bytewise comparison:

  /* XXX: could be worthwhile to fetch all necessary attrs at once */
  for (int i = 0; i < riinfo->nkeys; i++)
  {
      Datum oldvalue;
      Datum newvalue;
      bool isnull;

      /*
       * Get one attribute's oldvalue. If it is NULL - they're not equal.
       */
      oldvalue = slot_getattr(oldslot, attnums[i], &isnull);
      if (isnull)
          return false;

      /*
       * Get one attribute's newvalue. If it is NULL - they're not equal.
       */
      newvalue = slot_getattr(newslot, attnums[i], &isnull);
      if (isnull)
          return false;

      if (rel_is_pk)
      {
          /*
           * If we are looking at the PK table, then do a bytewise
           * comparison.  We must propagate PK changes if the value is
           * changed to one that "looks" different but would compare as
           * equal using the equality operator.  This only makes a
           * difference for ON UPDATE CASCADE, but for consistency we
           * treat all changes to the PK the same.
           */
          Form_pg_attribute att =
              TupleDescAttr(oldslot->tts_tupleDescriptor, attnums[i] - 1);

          if (!datum_image_eq(oldvalue, newvalue, att->attbyval, att->attlen))
              return false;
      }
      else
      {
          /*
           * For the FK table, compare with the appropriate equality
           * operator.  Changes that compare equal will still satisfy
           * the constraint after the update.
           */
          if (!ri_AttributesEqual(riinfo->ff_eq_oprs[i],
                                  RIAttType(rel, attnums[i]),
                                  oldvalue, newvalue))
              return false;
      }
  }

It seems like neither optimization actually requires the presence of the unique
index. And, as my proposed patch would leave both riinfo->nkeys and
riinfo->pk_attnums exactly the same as before, I don't believe that it should
have any impact on these optimizations.


Re: Avoid memory leaks during base backups

2022-09-28 Thread Robert Haas
On Wed, Sep 28, 2022 at 5:30 AM Bharath Rupireddy
 wrote:
> I'm attaching the v2 patch designed as described above. Please review it.
>
> I've added an entry in CF - https://commitfest.postgresql.org/40/3915/

This looks odd to me. In the case of a regular backend, the
sigsetjmp() handler in src/backend/tcop/postgres.c is responsible for
cleaning up memory. It calls AbortCurrentTransaction() which will call
CleanupTransaction() which will call AtCleanup_Memory() which will
block away TopTransactionContext. I think we ought to do something
analogous here, and we almost do already. Some walsender commands are
going to be SQL commands and some aren't. For those that aren't, the
same block calls WalSndErrorCleanup() which does similar kinds of
cleanup, including in some situations calling WalSndResourceCleanup()
which cleans up the resource owner in cases where we have a resource
owner without a transaction. I feel like we ought to be trying to tie
the cleanup into WalSndErrorCleanup() or WalSndResourceCleanup() based
on having the memory context that we ought to be blowing away stored
in a global variable, rather than using a try/catch block.

Like, maybe there's a function EstablishWalSenderMemoryContext() that
commands can call before allocating memory that shouldn't survive an
error. And it's deleted after each command if it exists, or if an
error occurs then WalSndErrorCleanup() deletes it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Summary function for pg_buffercache

2022-09-28 Thread Melih Mutlu
Hi,

Seems like the commit a448e49bcbe40fb72e1ed85af910dd216d45bad8 reverts the
changes on pg_buffercache.

> Why compute usagecount_avg twice?

Then, I'm going back to v11 + the fix for this.

Thanks,
Melih


v14-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data


Re: Enables to call Unregister*XactCallback() in Call*XactCallback()

2022-09-28 Thread Tom Lane
Nathan Bossart  writes:
> On Mon, Sep 26, 2022 at 06:05:34PM -0400, Tom Lane wrote:
>> Yeah.  Whether it's efficient or not, seems like it should *work*.
>> I'm a bit inclined to call this a bug-fix and backpatch it.

> LGTM.  I have no opinion on back-patching.

I had second thoughts about back-patching: doing so would encourage
extensions to rely on this working in pre-v16 branches, which they'd
better not since they might be in a not-up-to-date installation.

We could still squeeze this into v15 without creating such a hazard,
but post-rc1 doesn't seem like a good time for inessential tweaks.

Hence, pushed to HEAD only.

regards, tom lane




Warning about using pg_stat_reset() and pg_stat_reset_shared()

2022-09-28 Thread Bruce Momjian
We have discussed the problems caused by the use of pg_stat_reset() and 
pg_stat_reset_shared(), specifically the removal of information needed
by autovacuum.  I don't see these risks documented anywhere.  Should we
do that?  Are there other risks?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Warning about using pg_stat_reset() and pg_stat_reset_shared()

2022-09-28 Thread Robert Haas
On Wed, Sep 28, 2022 at 11:45 AM Bruce Momjian  wrote:
> We have discussed the problems caused by the use of pg_stat_reset() and
> pg_stat_reset_shared(), specifically the removal of information needed
> by autovacuum.  I don't see these risks documented anywhere.  Should we
> do that?

+1.

> Are there other risks?

I don't know.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Önder Kalacı
Hi hackers,

With PG 15 (rc1 or beta4), I'm observing an interesting memory pattern. I
have not seen a similar discussion on the mailing list. If I missed that,
please refer me there. The problem that I'm going to explain does not
happen on PG 13/14.

It seems like there is a memory leak(?) with $title. Still, I'm not sure
what is going on, and thought it'd be useful to share at least my initial
investigation.

After running the query and waiting a few minutes (see steps to repro
below), use pg_log_backend_memory_contexts() to get the contexts of the
backend executing the command. See that it goes beyond 100GB. And depending
on vm.overcommit_memory, you get an OOM error or OOM crash eventually.

```
2022-09-28 17:33:38.155 CEST [32224] LOG:  level: 2; PortalContext: 1024 total in 1 blocks; 592 free (0 chunks); 432 used:
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 3; ExecutorState: *114923929600* total in 13710 blocks; 7783264 free (3 chunks); 114916146336 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 4; TupleSort main: 8192 total in 1 blocks; 3928 free (0 chunks); 4264 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 5; TupleSort sort: 295096 total in 8 blocks; 256952 free (67 chunks); 38144 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 6; Caller tuples: 8192 total in 1 blocks (0 chunks); 7992 free (0 chunks); 200 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 4; TupleSort main: 8192 total in 1 blocks; 3928 free (0 chunks); 4264 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 5; TupleSort sort: 4309736 total in 18 blocks; 263864 free (59 chunks); 4045872 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 6; Caller tuples: 8192 total in 1 blocks (0 chunks); 7992 free (0 chunks); 200 used
...
2022-09-28 17:33:38.160 CEST [32224] LOG:  Grand total: *114930446784* bytes in 13972 blocks; 8802248 free (275 chunks); 114921644536 used
```

I observed this with a merge join involving a table and set returning
function. To simulate the problem with two tables, I have the following
steps:

```
CREATE TABLE t1 (a text);
CREATE TABLE t2 (a text);

-- make the text a little large by adding 1000
INSERT INTO t1 SELECT (1000+i%1000)::text FROM
generate_series(0,1000) i;

-- make the text a little large by adding 1000
INSERT INTO t2 SELECT (1000+i%1)::text FROM
generate_series(0,1000) i;

-- to simplify the explain plan, not strictly necessary
SET max_parallel_workers_per_gather TO 0;

-- these two are necessary so that the problem is triggered
-- these are helpful to use Merge join and avoid materialization
SET enable_hashjoin TO false;
SET enable_material TO false;

-- the join is on a TEXT column
-- when the join is on INT column with a similar setup, I do not observe
this problem
SELECT count(*) FROM t1 JOIN t2 USING (a);
```


The explain output for the query looks like the following:
```
explain SELECT count(*) FROM t1 JOIN t2 USING (a);
                                 QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=177735283.36..177735283.37 rows=1 width=8)
   ->  Merge Join  (cost=2556923.81..152703372.24 rows=10012764448 width=0)
         Merge Cond: (t1.a = t2.a)
         ->  Sort  (cost=1658556.19..1683556.63 rows=1175 width=13)
               Sort Key: t1.a
               ->  Seq Scan on t1  (cost=0.00..154056.75 rows=1175 width=13)
         ->  Sort  (cost=1658507.28..1683506.93 rows=861 width=13)
               Sort Key: t2.a
               ->  Seq Scan on t2  (cost=0.00..154053.61 rows=861 width=13)
(9 rows)
```

In the end, my investigation mostly got me to the following palloc(), where
we seem to allocate memory over and over again as memory grows:
```
(gdb) bt
#0  __GI___libc_malloc (bytes=bytes@entry=8388608) at malloc.c:3038
#1  0x5589f3c55444 in AllocSetAlloc (context=0x5589f4896300, size=14) at aset.c:920
#2  0x5589f3c5d763 in palloc (size=size@entry=14) at mcxt.c:1082
#3  0x5589f3b1f553 in datumCopy (value=94051002161216, typByVal=typByVal@entry=false, typLen=) at datum.c:162
#4  0x5589f3c6ed0b in tuplesort_getdatum (state=state@entry=0x5589f49274e0, forward=forward@entry=true, val=0x5589f48d7860, isNull=0x5589f48d7868, abbrev=abbrev@entry=0x0) at tuplesort.c:2675
#5  0x5589f3947925 in ExecSort (pstate=0x5589f48d0a38) at nodeSort.c:200
#6  0x5589f393d74c in ExecProcNode (node=0x5589f48d0a38) at ../../../src/include/executor/executor.h:259
#7  ExecMergeJoin (pstate=0x5589f4896cc8) at nodeMergejoin.c:871
#8  0x5589f391fbc8 in ExecProcNode (node=0x5589f4896cc8) at ../../../src/include/executor/executor.h:259
#9  fetch_input_tuple (aggstate=

Re: [PATCH] Log details for client certificate failures

2022-09-28 Thread Jacob Champion
On Tue, Sep 27, 2022 at 6:14 PM Masahiko Sawada  wrote:
> No. Since cluster_name is PGC_POSTMASTER, we leak a little postmaster
> memory only once when starting up. application_name is PGC_USERSET but
> since we normally allocate memory in PortalMemoryContext we eventually
> can free it.

Oh, I see; thank you for the correction. And even if someone put an
application_name into their postgresql.conf, and then changed it a
bunch of times, we'd free the leaked memory from the config_cxt that's
created in ProcessConfigFile().

Is there a reason we don't provide a similar temporary context during
InitializeGUCOptions()? Naively it seems like that would suppress any
future one-time leaks, and maybe cut down on some Valgrind noise. Then
again, maybe there's just not that much demand for pallocs during GUC
hooks.

Thanks,
--Jacob




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2022 at 6:48 AM Robert Haas  wrote:
> On second thought, I'm going to revert the whole thing. There's a
> bigger mess here than can be cleaned up on the fly. The
> alignment-related mess in ParseCommitRecord is maybe something for
> which I could just hack a quick fix, but what I've also just now
> realized is that this makes a huge number of WAL records larger by 4
> bytes, since most WAL records will contain a block reference.

It would be useful if there were generic tests that caught issues like
this. There are various subtle effects related to how struct layout
can impact WAL record size that might easily be missed. It's not like
there are a huge number of truly critical WAL records to have tests
for.

The example that comes to mind is the XLOG_BTREE_INSERT_POST record
type, which is used for B-Tree index tuple inserts with a posting list
split. There is only an extra 2 bytes of payload for these record
types compared to conventional XLOG_BTREE_INSERT_LEAF records, but we
nevertheless tend to see a final record size that is consistently a
full 8 bytes larger in many important cases, despite not needing to
store the IndexTuple with alignment padding. I believe that this is a
consequence of the record header itself needing to be MAXALIGN()'d.

Another important factor in this scenario is the general tendency for
index tuple sizes to leave the final XLOG_BTREE_INSERT_LEAF record
size at 64 bytes. It wouldn't have been okay if the deduplication work
made that size jump up to 72 bytes for many kinds of indexes across
the board, even when there was no accompanying posting list split
(i.e. the vast majority of the time). Maybe it would have been okay if
nbtree leaf page insert records were naturally rare, but that isn't
the case at all, obviously.

That's why we have two different record types here in the first place.
Earlier versions of the deduplication patch just added an OffsetNumber
field to XLOG_BTREE_INSERT_LEAF which could be set to
InvalidOffsetNumber, resulting in a surprisingly large amount of waste
in terms of WAL size, because of the presence of 3 different factors.
We don't bother doing this with the split records, which can also have
accompanying posting list splits, because it makes hardly any
difference at all (split records are much rarer than any kind of leaf
insert record, and are far larger when considered individually).

-- 
Peter Geoghegan




A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Ranier Vilela
>CREATE TABLE t1 (a text);
>CREATE TABLE t2 (a text);

>-- make the text a little large by adding 1000
>INSERT INTO t1 SELECT (1000+i%1000)::text FROM
>generate_series(0,1000) i;

>-- make the text a little large by adding 1000
>INSERT INTO t2 SELECT (1000+i%1)::text FROM
>generate_series(0,1000) i;

>-- to simplify the explain plan, not strictly necessary
>SET max_parallel_workers_per_gather TO 0;

>-- these two are necessary so that the problem is triggered
>-- these are helpful to use Merge join and avoid materialization
>SET enable_hashjoin TO false;
>SET enable_material TO false;

>-- the join is on a TEXT column
>-- when the join is on INT column with a similar setup, I do not observe
>this problem
>SELECT count(*) FROM t1 JOIN t2 USING (a);
>```

>The explain output for the query like the following:
>```
>explain SELECT count(*) FROM t1 JOIN t2 USING (a);

I run your test here with a fix attached.

Can you retake your test with the patch attached?
regards,

Ranier Vilela


fix_possible_memory_leak_merge_join_sort.patch
Description: Binary data


Refactor UnpinBuffer()

2022-09-28 Thread Aleksander Alekseev
Hi hackers,

The proposed patch removes the redundant `fixOwner` argument.

"""
The fixOwner bool argument ended up always being true, so it doesn't do much
anymore. Removing it doesn't necessarily affect the performance a lot, but at
least improves the readability. The procedure is static thus the extension
authors are not going to be upset.
"""

-- 
Best regards,
Aleksander Alekseev


v1-0001-Refactor-UnpinBuffer.patch
Description: Binary data


Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Önder Kalacı
Hi,

Thanks for replying so quickly!

> I run your test here with a fix attached.
>
> Can you retake your test with the patch attached?

Unfortunately, with the patch, I still see the memory usage increase and
get the OOMs.

Thanks,
Onder KALACI


Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Ranier Vilela
On Wed, Sep 28, 2022 at 14:24, Önder Kalacı wrote:

> Hi,
>
> Thanks for replying so quickly!
>
>> I run your test here with a fix attached.
>>
>> Can you retake your test with the patch attached?
>
> Unfortunately, with the patch, I still see the memory usage increase and
> get the OOMs.
Thanks for sharing the result.

regards,
Ranier Vilela


Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Tom Lane
Önder Kalacı  writes:
> With PG 15 (rc1 or beta4), I'm observing an interesting memory pattern.

Yup, that is a leak.  valgrind'ing it blames this call chain:

==00:00:16:12.228 4011013== 790,404,056 bytes in 60,800,312 blocks are definitely lost in loss record 1,108 of 1,108
==00:00:16:12.228 4011013==at 0x9A5104: palloc (mcxt.c:1170)
==00:00:16:12.228 4011013==by 0x89F8D9: datumCopy (datum.c:175)
==00:00:16:12.228 4011013==by 0x9B5BEE: tuplesort_getdatum (tuplesortvariants.c:882)
==00:00:16:12.228 4011013==by 0x6FA8B3: ExecSort (nodeSort.c:200)
==00:00:16:12.228 4011013==by 0x6F1E87: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013==by 0x6F1E87: ExecMergeJoin (nodeMergejoin.c:871)
==00:00:16:12.228 4011013==by 0x6D7800: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013==by 0x6D7800: fetch_input_tuple (nodeAgg.c:562)
==00:00:16:12.228 4011013==by 0x6DAE2E: agg_retrieve_direct (nodeAgg.c:2454)
==00:00:16:12.228 4011013==by 0x6DAE2E: ExecAgg (nodeAgg.c:2174)
==00:00:16:12.228 4011013==by 0x6C6122: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013==by 0x6C6122: ExecutePlan (execMain.c:1636)

and bisecting fingers this commit as the guilty party:

commit 91e9e89dccdfdf4216953d3d8f5515dcdef177fb
Author: David Rowley 
Date:   Thu Jul 22 14:03:19 2021 +1200

Make nodeSort.c use Datum sorts for single column sorts

Looks like that forgot that tuplesort_getdatum()'s result has to
be freed by the caller.

regards, tom lane




Re: Adding CommandID to heap xlog records

2022-09-28 Thread Bruce Momjian
On Thu, Sep 22, 2022 at 11:12:32PM +0200, Matthias van de Meent wrote:
> On Thu, 8 Sept 2022 at 23:24, Tom Lane  wrote:
> >
> > Matthias van de Meent  writes:
> > > Please find attached a patch that adds the CommandId of the inserting
> > > transaction to heap (batch)insert, update and delete records. It is
> > > based on the changes we made in the fork we maintain for Neon.
> >
> > This seems like a very significant cost increment with returns
> > to only a minuscule number of users.  We certainly cannot consider
> > it unless you provide some evidence that that impression is wrong.
> 
> Attached a proposed set of patches to reduce overhead of the inital patch.

This might be obvious to some, but the patch got a lot larger.  :-(

---

>  contrib/pg_walinspect/pg_walinspect.c|  4 +-
>  src/backend/access/brin/brin_pageops.c   | 16 +++---
>  src/backend/access/brin/brin_xlog.c  |  8 +--
>  src/backend/access/gin/ginxlog.c |  6 +--
>  src/backend/access/gist/gistxlog.c   |  6 +--
>  src/backend/access/hash/hash_xlog.c  |  6 +--
>  src/backend/access/heap/heapam.c | 40 +++
>  src/backend/access/nbtree/nbtinsert.c| 18 +++
>  src/backend/access/nbtree/nbtpage.c  |  8 +--
>  src/backend/access/nbtree/nbtxlog.c  | 10 ++--
>  src/backend/access/rmgrdesc/brindesc.c   | 20 
>  src/backend/access/rmgrdesc/clogdesc.c   | 10 ++--
>  src/backend/access/rmgrdesc/committsdesc.c   | 10 ++--
>  src/backend/access/rmgrdesc/dbasedesc.c  | 12 ++---
>  src/backend/access/rmgrdesc/genericdesc.c|  2 +-
>  src/backend/access/rmgrdesc/gindesc.c|  8 +--
>  src/backend/access/rmgrdesc/gistdesc.c   |  8 +--
>  src/backend/access/rmgrdesc/hashdesc.c   |  8 +--
>  src/backend/access/rmgrdesc/heapdesc.c   | 46 -
>  src/backend/access/rmgrdesc/logicalmsgdesc.c |  8 +--
>  src/backend/access/rmgrdesc/mxactdesc.c  | 14 ++---
>  src/backend/access/rmgrdesc/nbtdesc.c|  8 +--
>  src/backend/access/rmgrdesc/relmapdesc.c |  8 +--
>  src/backend/access/rmgrdesc/replorigindesc.c |  8 +--
>  src/backend/access/rmgrdesc/seqdesc.c|  8 +--
>  src/backend/access/rmgrdesc/smgrdesc.c   | 10 ++--
>  src/backend/access/rmgrdesc/spgdesc.c|  8 +--
>  src/backend/access/rmgrdesc/standbydesc.c| 12 ++---
>  src/backend/access/rmgrdesc/tblspcdesc.c | 10 ++--
>  src/backend/access/rmgrdesc/xactdesc.c   | 34 ++--
>  src/backend/access/rmgrdesc/xlogdesc.c   | 28 +-
>  src/backend/access/spgist/spgxlog.c  |  6 +--
>  src/backend/access/transam/clog.c|  8 +--
>  src/backend/access/transam/commit_ts.c   |  8 +--
>  src/backend/access/transam/multixact.c   | 48 -
>  src/backend/access/transam/twophase.c|  2 +-
>  src/backend/access/transam/xact.c| 36 +++--
>  src/backend/access/transam/xlog.c| 34 ++--
>  src/backend/access/transam/xloginsert.c  | 31 ---
>  src/backend/access/transam/xlogprefetcher.c  |  2 +-
>  src/backend/access/transam/xlogreader.c  |  2 +-
>  src/backend/access/transam/xlogrecovery.c| 54 ++--
>  src/backend/access/transam/xlogstats.c   |  2 +-
>  src/backend/catalog/storage.c| 15 +++---
>  src/backend/commands/dbcommands.c| 30 ++-
>  src/backend/commands/sequence.c  |  6 +--
>  src/backend/commands/tablespace.c|  8 +--
>  src/backend/postmaster/autovacuum.c  |  4 +-
>  src/backend/replication/logical/decode.c | 38 +++---
>  src/backend/replication/logical/message.c|  6 +--
>  src/backend/replication/logical/origin.c |  6 +--
>  src/backend/storage/ipc/standby.c| 10 ++--
>  src/backend/utils/cache/relmapper.c  |  6 +--
>  src/bin/pg_resetwal/pg_resetwal.c|  2 +-
>  src/bin/pg_rewind/parsexlog.c| 10 ++--
>  src/bin/pg_waldump/pg_waldump.c  |  6 +--
>  src/include/access/brin_xlog.h   |  2 +-
>  src/include/access/clog.h|  2 +-
>  src/include/access/ginxlog.h |  2 +-
>  src/include/access/gistxlog.h|  2 +-
>  src/include/access/hash_xlog.h   |  2 +-
>  src/include/access/heapam_xlog.h |  4 +-
>  src/include/access/multixact.h   |  3 +-
>  src/include/access/nbtxlog.h |  2 +-
>  src/include/access/spgxlog.h |  2 +-
>  src/include/access/xact.h|  6 +--
>  src/include/access/xlog.h|  2 +-
>  src/include/access/xloginsert.h  |  3 +-
>  src/include/access/xlogreader.h  |  1 +
>  src/include/access/xlogrecord.h  | 11 +---
>  src/include/access/xlogstats.h   |  2 +-
>  src/i

Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Tom Lane
I wrote:
> and bisecting fingers this commit as the guilty party:

> commit 91e9e89dccdfdf4216953d3d8f5515dcdef177fb
> Author: David Rowley 
> Date:   Thu Jul 22 14:03:19 2021 +1200

> Make nodeSort.c use Datum sorts for single column sorts

After looking at that for a little while, I wonder if we shouldn't
fix this by restricting the Datum-sort path to be used only with
pass-by-value data types.  That'd require only a minor addition
to the new logic in ExecInitSort.

The alternative of inserting a pfree of the old value would complicate
the code nontrivially, I think, and really it would necessitate a
complete performance re-test.  I'm wondering if the claimed speedup
for pass-by-ref types wasn't fictional and based on skipping the
required pfrees.  Besides, if you think this code is hot enough that
you don't want to add a test-and-branch per tuple (a claim I also
doubt, BTW) then you probably don't want to add such overhead into
the pass-by-value case where the speedup is clear.

regards, tom lane




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-28 Thread Nathan Bossart
On Tue, Sep 20, 2022 at 09:31:26PM -0700, Nathan Bossart wrote:
> I bet a more pressing concern is the calls to aclmask() since checking
> privileges is probably done more frequently than updating them.  That
> appears to use a linear search, too, so maybe sorting the aclitem arrays is
> actually worth exploring.  I still doubt there will be much noticeable
> impact from expanding AclMode outside of the most extreme cases.

I've been testing aclmask() with long aclitem arrays (2,000 entries is
close to the limit for pg_class entries), and I haven't found any
significant impact from bumping AclMode to 64 bits.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-28 Thread Stephen Frost
Greetings,

On Wed, Sep 28, 2022 at 14:50 Nathan Bossart 
wrote:

> On Tue, Sep 20, 2022 at 09:31:26PM -0700, Nathan Bossart wrote:
> > I bet a more pressing concern is the calls to aclmask() since checking
> > privileges is probably done more frequently than updating them.  That
> > appears to use a linear search, too, so maybe sorting the aclitem arrays
> is
> > actually worth exploring.  I still doubt there will be much noticeable
> > impact from expanding AclMode outside of the most extreme cases.
>
> I've been testing aclmask() with long aclitem arrays (2,000 entries is
> close to the limit for pg_class entries), and I haven't found any
> significant impact from bumping AclMode to 64 bits.


The max is the same regardless of the size..?  Considering the size is
capped since pg_class doesn’t (and isn’t likely to..) have a toast table,
that seems unlikely, so I’m asking for clarification on that. We may be
able to get consensus that the difference isn’t material since no one is
likely to have such long lists, but we should at least be aware.

Thanks,

Stephen

>


Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Alvaro Herrera
On 2022-Sep-28, Peter Geoghegan wrote:

> It would be useful if there were generic tests that caught issues like
> this. There are various subtle effects related to how struct layout
> can impact WAL record size that might easily be missed. It's not like
> there are a huge number of truly critical WAL records to have tests
> for.

What do you think would constitute a test here?

Say: insert N records to a heapam table with one index of each kind
(under controlled conditions: no checkpoint, no autovacuum, no FPIs),
then measure the total number of bytes used by WAL records of each rmgr.
Have a baseline and see how that changes over time.
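
As a rough sketch of such a measurement, run in psql (assuming pg_walinspect's
pg_get_wal_stats(), available in PG15; a real harness would run this under the
controlled conditions above and diff the numbers against a stored baseline):

```sql
-- Measure per-rmgr WAL volume for a controlled workload.
CREATE EXTENSION IF NOT EXISTS pg_walinspect;

SELECT pg_current_wal_lsn() AS start_lsn \gset
CREATE TABLE walsize_test (a int PRIMARY KEY, b text);
INSERT INTO walsize_test
    SELECT i, md5(i::text) FROM generate_series(1, 10000) i;
SELECT pg_current_wal_lsn() AS end_lsn \gset

SELECT "resource_manager/record_type" AS rmgr, count, record_size
FROM pg_get_wal_stats(:'start_lsn', :'end_lsn')
WHERE count > 0
ORDER BY record_size DESC;
```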

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Alvaro Herrera
On 2022-Sep-28, Robert Haas wrote:

> The number of buildfarm failures that I would have avoided by checking
> CI is less than the number of extra things I had to fix to keep CI
> happy, and the serious problems were caught by the buildfarm, not by
> CI. [...] So I guess the way you're supposed to know that you need to
> update meson.build is by looking at CI, but CI is also the only
> reason it's necessary to care about meson.build in the first place. I
> feel like CI has not really made it any easier to not break the
> buildfarm -- it's just provided a second buildfarm that you can break
> independently of the first one.

I have an additional, unrelated complaint about CI, which is that we
don't have anything for past branches.  I have a partial hack(*), but
I wish we had something we could readily use.

(*) I just backpatched the commit that added the .cirrus.yml file, plus
some later fixes to it, and I keep that as a separate branch which I
merge with whatever other changes I want to test.  I then push that to
github, and ignore the windows results when looking at cirrus-ci.com.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
   (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Thomas Munro
On Thu, Sep 29, 2022 at 1:27 AM Robert Haas  wrote:
> ... Here's a
> list of the follow-up fixes I've so far committed:
>
> 1. headerscheck
> 2. typos
> 3. pg_buffercache's meson.build
> 4. compiler warning
> 5. alignment problem
> 6. F_INTEQ/F_OIDEQ problem
>
> CI caught (1), (3), and (4). The buildfarm caught (1), (5), and (6).

I think at least some of 5 and all of 6 would be covered by sanitizer
flags and a 32 bit test respectively, and I think we should add those.
We're feeling our way here, working out what's worth including at
non-zero cost for each thing we could check.  In other cases you and I
have fought with, it's been Windows problems (mingw differences, or
file handle semantics), which are frustrating to us all, but I see
Meson as part of the solution to that: uniform testing on Windows
(whereas the crusty perl would not run all the tests), and CI support
for mingw is in the pipeline.

> ... I
> feel like CI has not really made it in any easier to not break the
> buildfarm -- it's just provided a second buildfarm that you can break
> independently of the first one.

I don't agree with this.  The build farm clearly has more ways to
break than CI, because it has more CPUs, compilers, operating systems,
combinations of configure options and rolls of the timing dice, but CI
now catches a lot and, importantly, *before* it reaches the 'farm and
everyone starts shouting a lot of stuff at you that you already knew,
because it's impacting their work.  Unless you don't look, and then
it's just something that breaks with the build farm, and then you
break CI on master for everyone else too and more people shout.

I'm not personally aware of any significant project that isn't using
CI, and although we're late to the party I happen to think that ours
is getting pretty good considering the complexities.  And it's going
to keep improving.




Re: meson vs mingw, plpython, readline and other fun

2022-09-28 Thread Andres Freund
Hi,

On 2022-09-27 19:27:24 -0700, Andres Freund wrote:
> Stared too long at the screen to figure all of this out. Food next. I'll clean
> the patches up later tonight or tomorrow morning.

Attached:

0001 - meson: windows: Normalize slashes in prefix
0002 - meson: pg_regress: Define a HOST_TUPLE sufficient to make resultmap work
0003 - meson: mingw: Allow multiple definitions
0004 - meson: Implement getopt logic from autoconf
0005 - mingw: Define PGDLLEXPORT as __declspec (dllexport) as done for msvc
0006 - meson: mingw: Add -Wl,--disable-auto-import, enable when linking with readline

0005 is the one that I'd most like review for. The rest just affect meson, so
I'm planning to push them fairly soon - review would nevertheless be nice.

Greetings,

Andres Freund
>From 3bd663d454c05d38b629a9807b89bcb332ccae1a Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 27 Sep 2022 11:55:00 -0700
Subject: [PATCH v1 1/6] meson: windows: Normalize slashes in prefix

This fixes a build issue on windows, when the prefix is set to a path with
forward slashes. Windows python defaults to a path with a backslash, but mingw
ucrt python defaults to a forward slash. This in turn lead to a wrong PATH set
during tests, causing tests to fail.

Reported-by: Melih Mutlu 
Discussion: http://postgr.es/m/20220928022724.erzuk5v4ai4b5...@awork3.anarazel.de
---
 meson.build | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/meson.build b/meson.build
index 38b2c3aae2e..02c086c04e7 100644
--- a/meson.build
+++ b/meson.build
@@ -2733,15 +2733,15 @@ endif
 
 prefix = get_option('prefix')
 
-test_prefix = prefix
+test_prefix = fs.as_posix(prefix)
 
 if fs.is_absolute(get_option('prefix'))
   if host_system == 'windows'
-if prefix.split(':\\').length() == 1
+if prefix.split(':/').length() == 1
   # just a drive
   test_prefix = ''
 else
-  test_prefix = prefix.split(':\\')[1]
+  test_prefix = prefix.split(':/')[1]
 endif
   else
 assert(prefix.startswith('/'))
-- 
2.37.3.542.gdd3f6c4cae

>From 386bf996d7e0e19a3aa3ee2239b7baf35b2ea3a2 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 27 Sep 2022 12:01:35 -0700
Subject: [PATCH v1 2/6] meson: pg_regress: Define a HOST_TUPLE sufficient to
 make resultmap work

This doesn't end up with a triple that's exactly the same as config.guess -
it'd be hard to achieve that and it doesn't seem required. We can't rely on
config.guess as we don't necessarily have a /bin/sh on windows, e.g., when
building on windows with msvc.

This isn't perfect, e.g., clang works on windows as well.  But I suspect we'd
need a bunch of other changes to make clang on windows work, and we haven't
supported it historically.

Discussion: http://postgr.es/m/20220928022724.erzuk5v4ai4b5...@awork3.anarazel.de
---
 src/test/regress/meson.build | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/test/regress/meson.build b/src/test/regress/meson.build
index fd8ee995b79..03de591b0c7 100644
--- a/src/test/regress/meson.build
+++ b/src/test/regress/meson.build
@@ -6,7 +6,16 @@ regress_sources = pg_regress_c + files(
   'pg_regress_main.c'
 )
 
-pg_regress_cflags = ['-DHOST_TUPLE="frak"', '-DSHELLPROG="/bin/sh"']
+# Need make up something roughly like x86_64-pc-mingw64. resultmap matches on
+# patterns like ".*-.*-mingw.*". We probably can do better, but for now just
+# replace 'gcc' with 'mingw' on windows.
+host_tuple_cc = cc.get_id()
+if host_system == 'windows' and host_tuple_cc == 'gcc'
+  host_tuple_cc = 'mingw'
+endif
+host_tuple = '@0@-@1@-@2@'.format(host_cpu, host_system, host_tuple_cc)
+
+pg_regress_cflags = ['-DHOST_TUPLE="@0@"'.format(host_tuple), '-DSHELLPROG="/bin/sh"']
 
 pg_regress = executable('pg_regress',
   regress_sources,
-- 
2.37.3.542.gdd3f6c4cae

>From 69ffcb9017bceab1a431a0053b517e0a55d47e00 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 28 Sep 2022 09:48:09 -0700
Subject: [PATCH v1 3/6] meson: mingw: Allow multiple definitions

I didn't carry this forward from the win32 template. It's not needed anymore
for the reason stated therein, but it turns out to be required to
e.g. override getopt. Possibly a better solution exists, but that's for later.

Discussion: http://postgr.es/m/20220928022724.erzuk5v4ai4b5...@awork3.anarazel.de
---
 meson.build | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/meson.build b/meson.build
index 02c086c04e7..cd410319f3f 100644
--- a/meson.build
+++ b/meson.build
@@ -277,6 +277,8 @@ elif host_system == 'windows'
 # ldflags += '/nxcompat' # generated by msbuild, should have it for ninja?
   else
 ldflags += '-Wl,--stack,@0@'.format(cdata.get('WIN32_STACK_RLIMIT'))
+# Need to allow multiple definitions, we e.g. want to override getopt.
+ldflags += '-Wl,--allow-multiple-definition'
   endif
 
   os_deps += cc.find_library('ws2_32', required: true)
-- 
2.37.3.542.gdd3f6c4cae

>From 5babfb6dfe9189ce89145d6e19f9144894c77794 Mon Sep 17 00:00:00 2001
From: Andre

Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread David Rowley
Thanks for investigating this and finding the guilty commit.

On Thu, 29 Sept 2022 at 07:34, Tom Lane  wrote:
> After looking at that for a little while, I wonder if we shouldn't
> fix this by restricting the Datum-sort path to be used only with
> pass-by-value data types.  That'd require only a minor addition
> to the new logic in ExecInitSort.

I'm also wondering if that's the best fix given the timing of this discovery.

> The alternative of inserting a pfree of the old value would complicate
> the code nontrivially, I think, and really it would necessitate a
> complete performance re-test.  I'm wondering if the claimed speedup
> for pass-by-ref types wasn't fictional and based on skipping the
> required pfrees.  Besides, if you think this code is hot enough that
> you don't want to add a test-and-branch per tuple (a claim I also
> doubt, BTW) then you probably don't want to add such overhead into
> the pass-by-value case where the speedup is clear.

I'm wondering if the best way to fix it if doing it that way would be
to invent tuplesort_getdatum_nocopy() which would be the same as
tuplesort_getdatum() except it wouldn't do the datumCopy for byref
types.  It looks like tuplesort_gettupleslot() when copy==false just
directly stores the MinimalTuple that's in stup.tuple and shouldFree
is set to false.

Going by [1], it looks like I saw gains in test 6, which was a byref
Datum. Skipping the datumCopy() I imagine could only make the gains
slightly higher on that.  That puts me a bit more on the fence about
the best fix for PG15.

I've attached a patch to restrict the optimisation to byval types in
the meantime.

David

[1] 
https://www.postgresql.org/message-id/CAApHDvrWV%3Dv0qKsC9_BHqhCn9TusrNvCaZDz77StCO--fmgbKA%40mail.gmail.com
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 3c28d60c3e..740ad37717 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -220,6 +220,7 @@ SortState *
 ExecInitSort(Sort *node, EState *estate, int eflags)
 {
SortState  *sortstate;
+   TupleDesc   outerTupDesc;
 
SO1_printf("ExecInitSort: %s\n",
   "initializing sort node");
@@ -274,11 +275,13 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
ExecInitResultTupleSlotTL(&sortstate->ss.ps, &TTSOpsMinimalTuple);
sortstate->ss.ps.ps_ProjInfo = NULL;
 
+   outerTupDesc = ExecGetResultType(outerPlanState(sortstate));
+
/*
-* We perform a Datum sort when we're sorting just a single column,
+* We perform a Datum sort when we're sorting just a single byval column,
 * otherwise we perform a tuple sort.
 */
-   if (ExecGetResultType(outerPlanState(sortstate))->natts == 1)
+   if (outerTupDesc->natts == 1 && outerTupDesc->attrs[0].attbyval)
sortstate->datumSort = true;
else
sortstate->datumSort = false;


Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2022 at 12:32 PM Thomas Munro  wrote:
> I don't agree with this.  The build farm clearly has more ways to
> break than CI, because it has more CPUs, compilers, operating systems,
> combinations of configure options and rolls of the timing dice, but CI
> now catches a lot and, importantly, *before* it reaches the 'farm and
> everyone starts shouting a lot of stuff at you that you already knew,
> because it's impacting their work.

Right. I really can't imagine how CI could be seen as anything
less than a very significant improvement. It wasn't that long ago that
commits doing certain kinds of work that used OS facilities would
routinely break Windows in some completely predictable way. Just
breaking every single Windows buildfarm animal was almost a routine
occurrence. It was normal. Remember that?

Of course it is also true that anything that breaks the buildfarm
today will be disproportionately difficult and subtle. You really do
have 2 buildfarms to break -- it's just that one of those buildfarms
can be broken and fixed without it bothering anybody else, which is
typically enough to prevent breaking the real buildfarm. But only if
you actually check both!

-- 
Peter Geoghegan




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Tom Lane
David Rowley  writes:
> I'm wondering if the best way to fix it if doing it that way would be
> to invent tuplesort_getdatum_nocopy() which would be the same as
> tuplesort_getdatum() except it wouldn't do the datumCopy for byref
> types.

Yeah, perhaps.  We'd need a clear spec on how long the Datum could
be presumed good --- probably till the next tuplesort_getdatum_nocopy
call, but that'd need to be checked --- and then check if that is
satisfactory for nodeSort's purposes.

If we had such a thing, I wonder if any of the other existing
tuplesort_getdatum callers would be happier with that.  nodeAgg for
one is tediously freeing the result, but could we drop that logic?
(I hasten to add that I'm not proposing we touch that for v15.)

regards, tom lane




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Tom Lane
Robert Haas  writes:
> Yeah, I suppose I have to get in the habit of looking at CI before
> committing anything. It's sort of annoying to me, though. Here's a
> list of the follow-up fixes I've so far committed:

> 1. headerscheck
> 2. typos
> 3. pg_buffercache's meson.build
> 4. compiler warning
> 5. alignment problem
> 6. F_INTEQ/F_OIDEQ problem

> CI caught (1), (3), and (4). The buildfarm caught (1), (5), and (6).
> The number of buildfarm failures that I would have avoided by checking
> CI is less than the number of extra things I had to fix to keep CI
> happy, and the serious problems were caught by the buildfarm, not by
> CI.

That seems like an unfounded complaint.  You would have had to fix
(3) and (4) in any case, on some time schedule or other.  I agree
that it'd be good if CI did some 32-bit testing so it could have
caught (5) and (6), but that's being worked on.

> So I guess the way you're supposed to know that you need to
> update meson.build is by looking at CI, but CI is also the only
> reason it's necessary to care about meson.build in the first place.

Not so.  People are already using meson in preference to the makefiles
for some things, I believe.  And we're expecting that meson will
supplant the MSVC scripts pretty soon and the makefiles eventually.

> And like the existing buildfarm, it's severely under-documented.

That complaint I agree with.  A wiki page is a pretty poor substitute
for in-tree docs.

regards, tom lane




Re: predefined role(s) for VACUUM and ANALYZE

2022-09-28 Thread Nathan Bossart
On Wed, Sep 28, 2022 at 03:09:46PM -0400, Stephen Frost wrote:
> On Wed, Sep 28, 2022 at 14:50 Nathan Bossart 
> wrote:
>> I've been testing aclmask() with long aclitem arrays (2,000 entries is
>> close to the limit for pg_class entries), and I haven't found any
>> significant impact from bumping AclMode to 64 bits.
> 
> The max is the same regardless of the size..?  Considering the size is
> capped since pg_class doesn’t (and isn’t likely to..) have a toast table,
> that seems unlikely, so I’m asking for clarification on that. We may be
> able to get consensus that the difference isn’t material since no one is
> likely to have such long lists, but we should at least be aware.

While pg_class doesn't have a TOAST table, that column is marked as
"extended," so I believe it is still compressed, and the maximum aclitem
array length for pg_class.relacl would depend on how well the array
compresses.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2022 at 12:20 PM Alvaro Herrera  wrote:
> What do you think would constitute a test here?

I would start with something simple. Focus on the record types that we
know are the most common. It's very skewed towards heap and nbtree
record types, plus some transaction rmgr types.

> Say: insert N records to a heapam table with one index of each kind
> (under controlled conditions: no checkpoint, no autovacuum, no FPIs),
> then measure the total number of bytes used by WAL records of each rmgr.
> Have a baseline and see how that changes over time.

There are multiple flavors of alignment involved here, which makes it
tricky. For example, in index AMs the lp_len field from each line
pointer is always MAXALIGN()'d. It is only aligned as required for the
underlying heap tuple attributes in the case of heap tuples, though.
Of course you also have alignment considerations for the record itself
-- buffer data can usually be stored without being aligned at all. But
you can still have an impact from WAL header alignment, especially for
record types that tend to be relatively small -- like nbtree index
tuple inserts on leaf pages.

I think that the most interesting variation is among boundary cases
for those records that affect a variable number of page items. These
record types may be impacted by alignment considerations in subtle
though important ways. Things like PRUNE records often don't have that
many items. So having coverage of the overhead of every variation of a
small PRUNE record could be important as a way of catching regressions
that would otherwise be hard to catch.

Continuing with that example, we could probably cover every possible
permutation of PRUNE records that affect 5 or so items. Let's say that
we have a regression where PRUNE records that happen to have 3 items
that must all be set LP_DEAD increase in size by one MAXALIGN()
quantum. This will probably make a huge difference in many workloads,
but it's difficult to spot after the fact when it only affects those
records that happen to have a number of items that happen to fall in
some narrow but critical range. It might not affect PRUNE records
with (say) 5 items at all. So if we're looking at the macro picture
with (say) pgbench and pg_waldump we'll tend to miss the regression
right now; it'll be obscured by the fact that the regression only
affects a minority of all PRUNE records, for whatever reason.

This is just a made up example, so the specifics might be off
significantly -- I'd have to work on it to be sure. Hopefully the
example still gets the general idea across.
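To make the quantum effect concrete, here is the MAXALIGN() arithmetic (macros copied in the style of c.h for a typical 8-byte-alignment platform; the record body sizes below are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Round LEN up to the next multiple of 8, as PostgreSQL's MAXALIGN()
 * does on typical platforms (see c.h). */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & \
     ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

/* Hypothetical sizes: a 3-byte regression is free for a 41-byte record
 * body (41 -> 48 and 44 -> 48) but costs a full 8-byte quantum for a
 * 46-byte body (46 -> 48 vs. 49 -> 56).  That is why such a regression
 * can hide in aggregate pg_waldump-style numbers while hurting only
 * records whose size falls in a narrow range. */
static size_t
wal_bytes_consumed(size_t body_len)
{
    return (size_t) MAXALIGN(body_len);
}
```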

--
Peter Geoghegan




Re: more descriptive message for process termination due to max_slot_wal_keep_size

2022-09-28 Thread Tom Lane
Kyotaro Horiguchi  writes:
> Okay. the points you brought up above are sufficient grounds for not
> doing so.  Now they are in the following format.

> LOG: terminating process 16034 to release replication slot "rep1"
> because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes

This seems to me to be a pretty blatant violation of our first message
style guideline [1]:

The primary message should be short, factual, and avoid reference to
implementation details such as specific function names. “Short” means
“should fit on one line under normal conditions”. Use a detail message
if needed to keep the primary message short ...

I think you should leave the primary message alone and add a DETAIL,
as the first version of the patch did.

The existing "invalidating slot" message is already in violation
of this guideline, so splitting off a DETAIL from that seems
indicated as well.

regards, tom lane

[1] https://www.postgresql.org/docs/devel/error-style-guide.html




Re: more descriptive message for process termination due to max_slot_wal_keep_size

2022-09-28 Thread Tom Lane
... oh, one other point is that using %ld to print an int64 is entirely
not portable, as indeed the cfbot is complaining about.

I think our best practice on that is to put %lld in the format string
and explicitly cast the corresponding argument to "long long".
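For illustration, a minimal sketch of that pattern (the helper name here is made up):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper showing the recommended pattern: %lld in the
 * format string plus an explicit cast of the argument to long long.
 * Plain %ld misbehaves wherever long is 32 bits (e.g. 64-bit Windows),
 * while long long is at least 64 bits everywhere. */
static void
report_excess(char *buf, size_t buflen, int64_t excess_bytes)
{
    snprintf(buf, buflen,
             "exceeds the limit by %lld bytes", (long long) excess_bytes);
}
```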

regards, tom lane




problems with making relfilenodes 56-bits

2022-09-28 Thread Robert Haas
OK, so the recent commit and revert of the 56-bit relfilenode patch
revealed a few issues that IMHO need design-level input. Let me try to
surface those here, starting a new thread to separate this discussion
from the clutter:

1. Commit Record Alignment. ParseCommitRecord() and ParseAbortRecord()
are dependent on every subsidiary structure that can be added to a
commit or abort record requiring exactly 4-byte alignment. IMHO, this
seems awfully fragile, even leaving the 56-bit patch aside. Prepare
records seem to have a much saner scheme: they've also got a bunch of
different things that can be stuck onto the main record, but they
maxalign each top-level thing that they stick in there. So
ParsePrepareRecord() doesn't have to make any icky alignment
assumptions the way ParseCommitRecord() and ParseAbortRecord() do.
Unfortunately, that scheme doesn't work as well for commit records,
because the very first top-level thing only needs 2 bytes. We're
currently using 4, and it would obviously be nicer to cut that down to
2 than to have it go up to 8. We could try to rejigger things around
somehow to avoid needing that 2-byte quantity in there as a separate
toplevel item, but I'm not quite sure how to do that, or we could just
copy everything to ensure alignment, but that seems kind of expensive.

If we don't decide to do either of those things, we should at least
better document, and preferably enforce via asserts, the requirement
that these structs be exactly 4-byte aligned, so that nobody else
makes the same mistake in the future.
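As a sketch of the enforce-via-asserts option, C11's _Static_assert can pin down both alignment and size at compile time (the struct below is a stand-in, not the real xl_xact_* layout from access/xact.h):

```c
#include <stddef.h>

/* Stand-in for one of the subsidiary commit-record structs; the real
 * definitions live in src/include/access/xact.h. */
typedef struct xl_xact_subxacts_stub
{
    int nsubxacts;          /* number of subtransaction XIDs to follow */
} xl_xact_subxacts_stub;

/* If someone later adds a member that bumps the struct's alignment
 * requirement past 4 bytes, or pads its size off a 4-byte multiple,
 * the build fails here instead of ParseCommitRecord() silently reading
 * misaligned data at runtime. */
_Static_assert(_Alignof(xl_xact_subxacts_stub) == 4,
               "commit record subsidiary structs must require exactly 4-byte alignment");
_Static_assert(sizeof(xl_xact_subxacts_stub) % 4 == 0,
               "commit record subsidiary structs must be a multiple of 4 bytes");
```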

2. WAL Size. Block references in the WAL are by RelFileLocator, so if
you make RelFileLocators bigger, WAL gets bigger. We'd have to test
the exact impact of this, but it seems a bit scary: if you have a WAL
stream with few FPIs doing DML on a narrow table, probably most
records will contain 1 block reference (and occasionally more, but I
guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
block reference feels like it might add up to something significant. I
don't really see any way around this, either: if you make relfilenode
values wider, they take up more space. Perhaps there's a way to claw
that back elsewhere, or we could do something really crazy like switch
to variable-width representations of integer quantities in WAL
records, but there doesn't seem to be any simple way forward other
than, you know, deciding that we're willing to pay the cost of the
additional WAL volume.

3. Sinval Message Size. Sinval messages are 16 bytes right now.
They'll have to grow to 20 bytes if we do this. There's even less room
for bit-squeezing here than there is for the WAL stuff. I'm skeptical
that this really matters, but Tom seems concerned.

4. Other Uses of RelFileLocator. There are a bunch of structs I
haven't looked into yet that also embed RelFileLocator, which may have
their own issues with alignment, padding, and/or size: ginxlogSplit,
ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse,
xl_heap_new_cid, xl_btree_reuse_page, LogicalRewriteMappingData,
xl_smgr_truncate, xl_seq_rec, ReorderBufferChange, FileTag. I think a
bunch of these are things that get written into WAL, but at least some
of them seem like they probably don't get written into WAL enough to
matter. Needs more investigation, though.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Refactor UnpinBuffer()

2022-09-28 Thread Nathan Bossart
On Wed, Sep 28, 2022 at 08:14:23PM +0300, Aleksander Alekseev wrote:
> +   ResourceOwnerForgetBuffer(CurrentResourceOwner, b);
> +
> /* not moving as we're likely deleting it soon anyway */
> ref = GetPrivateRefCountEntry(b, false);
> Assert(ref != NULL);
> -
> -   if (fixOwner)
> -   ResourceOwnerForgetBuffer(CurrentResourceOwner, b);

Is it safe to move the call to ResourceOwnerForgetBuffer() to before the
call to GetPrivateRefCountEntry()?  From my quick skim of the code, it
seems like it should be safe, but I thought I'd ask the question.
Otherwise, LGTM.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: making relfilenodes 56 bits

2022-09-28 Thread Thomas Munro
On Wed, Sep 28, 2022 at 9:40 PM Dilip Kumar  wrote:
> Thanks, Thomas, these all look fine to me.  So far we have committed
> the patch to make relfilenode 56 bits wide.  The Tombstone file
> removal patch is still pending to be committed, so when I will rebase
> that patch I will accommodate all these comments in that patch.

I noticed that your new unlinking algorithm goes like this:

stat("x")
stat("x.1")
stat("x.2")
stat("x.3") -> ENOENT /* now we know how many segments there are */
truncate("x.2")
unlink("x.2")
truncate("x.1")
unlink("x.1")
truncate("x")
unlink("x")

Could you say what problem this solves, and, guessing that it's just
that you want the 0 file to be "in the way" until the other files are
gone (at least while we're running; who knows what'll be left if you
power-cycle), could you do it like this instead?

truncate("x")
truncate("x.1")
truncate("x.2")
truncate("x.3") -> ENOENT /* now we know how many segments there are */
unlink("x.2")
unlink("x.1")
unlink("x")
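In code form, the reordered sequence might look roughly like this (function name and error handling are invented for the sketch; the real logic lives in the smgr/md layer):

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of the proposed ordering: truncate segments in ascending order
 * until ENOENT reveals the segment count, then unlink from the highest
 * segment down, removing the base file last so it stays "in the way"
 * until the higher-numbered segments are gone. */
static int
drop_segments(const char *base)
{
    char path[1024];
    int  nsegs;

    for (nsegs = 0;; nsegs++)
    {
        if (nsegs == 0)
            snprintf(path, sizeof(path), "%s", base);
        else
            snprintf(path, sizeof(path), "%s.%d", base, nsegs);

        if (truncate(path, 0) < 0)
        {
            if (errno != ENOENT)
                return -1;  /* real code would report the error */
            break;          /* now we know how many segments there are */
        }
    }

    while (--nsegs >= 0)
    {
        if (nsegs == 0)
            snprintf(path, sizeof(path), "%s", base);
        else
            snprintf(path, sizeof(path), "%s.%d", base, nsegs);
        unlink(path);
    }
    return 0;
}
```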




Re: identifying the backend that owns a temporary schema

2022-09-28 Thread Tom Lane
Nathan Bossart  writes:
> Thanks for the suggestion.  I used it in v4 of the patch.

I reviewed this and made some changes, some cosmetic some less so.

Notably, I was bemused that of the four calls of
pgstat_fetch_stat_local_beentry, three tested for a NULL result even
though they cannot get one, while the fourth (pg_stat_get_backend_idset)
*is* at hazard of a NULL result but lacked a check.  I changed
pg_stat_get_backend_idset so that it too cannot get a NULL, and deleted
the dead code from the other callers.

A point that still bothers me a bit about pg_stat_get_backend_idset is
that it could miss or duplicate some backend IDs if the user calls
pg_stat_clear_snapshot() partway through the SRF's run, and we reload
a different set of backend entries than we had before.  I added a comment
about that, with an argument why it's not worth working harder, but
is the argument convincing?  If not, what should we do?

Also, I realized that the functions we're changing here are mostly
not exercised in the current regression tests :-(.  So I added a
small test case.

I think this is probably committable if you agree with my changes.

regards, tom lane

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1d9509a2f6..342b20ebeb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5485,20 +5485,23 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
the pg_stat_activity view, returns a set of records
containing all the available information about each backend process.
Sometimes it may be more convenient to obtain just a subset of this
-   information.  In such cases, an older set of per-backend statistics
+   information.  In such cases, another set of per-backend statistics
access functions can be used; these are shown in .
-   These access functions use a backend ID number, which ranges from one
-   to the number of currently active backends.
+   These access functions use the session's backend ID number, which is a
+   small positive integer that is distinct from the backend ID of any
+   concurrent session, although a session's ID can be recycled as soon as
+   it exits.  The backend ID is used, among other things, to identify the
+   session's temporary schema if it has one.
The function pg_stat_get_backend_idset provides a
-   convenient way to generate one row for each active backend for
+   convenient way to list all the active backends' ID numbers for
invoking these functions.  For example, to show the PIDs and
current queries of all backends:
 
 
-SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
-   pg_stat_get_backend_activity(s.backendid) AS query
-FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s;
+SELECT pg_stat_get_backend_pid(backendid) AS pid,
+   pg_stat_get_backend_activity(backendid) AS query
+FROM pg_stat_get_backend_idset() AS backendid;
 
   
 
@@ -5526,8 +5529,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 setof integer


-Returns the set of currently active backend ID numbers (from 1 to the
-number of active backends).
+Returns the set of currently active backend ID numbers.

   
 
diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index c7ed1e6d7a..22c79e156b 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -846,6 +846,12 @@ pgstat_read_current_status(void)
 		/* Only valid entries get included into the local array */
 		if (localentry->backendStatus.st_procpid > 0)
 		{
+			/*
+			 * The array index is exactly the BackendId of the source backend.
+			 * Note that this means the localBackendStatusTable is in order by
+			 * backend_id.  pgstat_fetch_stat_beentry() depends on that.
+			 */
+			localentry->backend_id = i;
 			BackendIdGetTransactionIds(i,
 	   &localentry->backend_xid,
 	   &localentry->backend_xmin);
@@ -1045,26 +1051,57 @@ pgstat_get_my_query_id(void)
 	return MyBEEntry->st_query_id;
 }
 
+/* --
+ * cmp_lbestatus
+ *
+ *	Comparison function for bsearch() on an array of LocalPgBackendStatus.
+ *	The backend_id field is used to compare the arguments.
+ * --
+ */
+static int
+cmp_lbestatus(const void *a, const void *b)
+{
+	const LocalPgBackendStatus *lbestatus1 = (const LocalPgBackendStatus *) a;
+	const LocalPgBackendStatus *lbestatus2 = (const LocalPgBackendStatus *) b;
+
+	return lbestatus1->backend_id - lbestatus2->backend_id;
+}
 
 /* --
  * pgstat_fetch_stat_beentry() -
  *
  *	Support function for the SQL-callable pgstat* functions. Returns
- *	our local copy of the current-activity entry for one backend.
+ *	our local copy of the current-activity entry for one backend,
+ *	or NULL if the given beid doesn't identify any known session.
+ *
+ *	The beid argument is the BackendId of the desired

Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread David Rowley
On Thu, 29 Sept 2022 at 08:57, Tom Lane  wrote:
>
> David Rowley  writes:
> > I'm wondering if the best way to fix it if doing it that way would be
> > to invent tuplesort_getdatum_nocopy() which would be the same as
> > tuplesort_getdatum() except it wouldn't do the datumCopy for byref
> > types.
>
> Yeah, perhaps.  We'd need a clear spec on how long the Datum could
> be presumed good --- probably till the next tuplesort_getdatum_nocopy
> call, but that'd need to be checked --- and then check if that is
> satisfactory for nodeSort's purposes.

Yeah, I think the same rules around scope apply as
tuplesort_gettupleslot() with copy==false.  We could do it by adding a
copy flag to the existing function, but I'd rather not add the
branching to that function. It's probably just better to duplicate it
and adjust.

> If we had such a thing, I wonder if any of the other existing
> tuplesort_getdatum callers would be happier with that.  nodeAgg for
> one is tediously freeing the result, but could we drop that logic?

Looking at process_ordered_aggregate_single(), it's likely more
efficient to use the nocopy version and just perform a datumCopy()
when we need to store the oldVal.  At least, that would be more
efficient when many values are being skipped due to being the same as
the last one.

I've just pushed the disable byref Datums patch I posted earlier. I
only made a small adjustment to make use of the TupleDescAttr() macro.
Önder, thank you for the report.

David




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2022 at 12:57 PM Tom Lane  wrote:
> David Rowley  writes:
> > I'm wondering if the best way to fix it if doing it that way would be
> > to invent tuplesort_getdatum_nocopy() which would be the same as
> > tuplesort_getdatum() except it wouldn't do the datumCopy for byref
> > types.
>
> Yeah, perhaps.  We'd need a clear spec on how long the Datum could
> be presumed good --- probably till the next tuplesort_getdatum_nocopy
> call, but that'd need to be checked --- and then check if that is
> satisfactory for nodeSort's purposes.

I am reminded of the discussion that led to bugfix commit c2d4eb1b
some years back.

As the commit message of that old bugfix notes, tuplesort_getdatum()
and tuplesort_gettupleslot() are "the odd ones out" among "get tuple"
routines (i.e. routines that get a tuple from a tuplesort by calling
tuplesort_gettuple_common()). We used to sometimes do that with
tuplesort_getindextuple() and possibly other such routines, but the
need for that capability was eliminated on the caller side around the
same time as the bugfix went in.

-- 
Peter Geoghegan




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2022 at 4:00 PM Peter Geoghegan  wrote:
> I am reminded of the discussion that led to bugfix commit c2d4eb1b
> some years back.

Also potentially relevant: the 2017 commit fa117ee4 anticipated adding
a "copy" argument to tuplesort_getdatum() (the same commit added such
a "copy" argument to tuplesort_gettupleslot()). I see that that still
hasn't happened to tuplesort_getdatum() all these years later. Might
be a good idea to do it in the next year or two, though.

If David is interested in pursuing this now then I certainly won't object.

-- 
Peter Geoghegan




Re: problems with making relfilenodes 56-bits

2022-09-28 Thread Tom Lane
Robert Haas  writes:
> 3. Sinval Message Size. Sinval messages are 16 bytes right now.
> They'll have to grow to 20 bytes if we do this. There's even less room
> for bit-squeezing here than there is for the WAL stuff. I'm skeptical
> that this really matters, but Tom seems concerned.

As far as that goes, I'm entirely prepared to accept a conclusion
that the benefits of widening relfilenodes justify whatever space
or speed penalties may exist there.  However, we cannot honestly
make that conclusion if we haven't measured said penalties.
The same goes for the other issues you raise here.

regards, tom lane




Re: problems with making relfilenodes 56-bits

2022-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2022 at 4:08 PM Tom Lane  wrote:
> As far as that goes, I'm entirely prepared to accept a conclusion
> that the benefits of widening relfilenodes justify whatever space
> or speed penalties may exist there.  However, we cannot honestly
> make that conclusion if we haven't measured said penalties.
> The same goes for the other issues you raise here.

I generally agree, but the devil is in the details.

I tend to agree with Robert that many individual WAL record types just
don't appear frequently enough to matter (it also helps that even the
per-record space overhead with wider 56-bit relfilenodes isn't so
bad). Just offhand I'd say that ginxlogSplit, ginxlogDeletePage,
ginxlogUpdateMeta, gistxlogPageReuse and xl_btree_reuse_page are
likely to be in this category (though it would be nice to see some
numbers for those).

I'm much less sure about the other record types. Any WAL records with
a variable number of relfilenode entries seem like they might be more
of a problem. But I'm not ready to accept that that cannot be
ameliorated in some way. Just for example, it wouldn't be impossible to
do some kind of varbyte encoding for some record types. How often will
a cluster actually need billions of relfilenodes? It has to
work, but maybe it can be suboptimal from a space overhead
perspective.

I'm not saying that we need to do anything fancy just yet. I'm
just saying that there definitely *are* options. Maybe it's not really
necessary to come up with something like a varbyte encoding, and maybe
the complexity it imposes just won't be worth it -- I really have no
opinion on that just yet.

--
Peter Geoghegan




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Michael Paquier
On Thu, Sep 29, 2022 at 11:58:17AM +1300, David Rowley wrote:
> I've just pushed the disable byref Datums patch I posted earlier. I
> only made a small adjustment to make use of the TupleDescAttr() macro.
> Önder, thank you for the report.

Wouldn't it be better to have 3a58176 reflect the non-optimization
path in the EXPLAIN output of a new regression test if none of the
existing tests are able to show any difference?
--
Michael


signature.asc
Description: PGP signature


Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Tom Lane
Michael Paquier  writes:
> Wouldn't it be better to have 3a58176 reflect the non-optimization
> path in the EXPLAIN output of a new regression test if none of the
> existing tests are able to show any difference?

This decision is not visible in EXPLAIN in any case.

regards, tom lane




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread David Rowley
On Thu, 29 Sept 2022 at 12:30, Michael Paquier  wrote:
>
> On Thu, Sep 29, 2022 at 11:58:17AM +1300, David Rowley wrote:
> > I've just pushed the disable byref Datums patch I posted earlier. I
> > only made a small adjustment to make use of the TupleDescAttr() macro.
> > Önder, thank you for the report.
>
> Wouldn't it be better to have 3a58176 reflect the non-optimization
> path in the EXPLAIN output of a new regression test if none of the
> existing tests are able to show any difference?

There's nothing in EXPLAIN that shows that this optimization occurs.
Or, are you proposing that you think there should be something? and
for 15??

David




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Michael Paquier
On Wed, Sep 28, 2022 at 07:35:07PM -0400, Tom Lane wrote:
> Michael Paquier  writes:
>> Wouldn't it be better to have 3a58176 reflect the non-optimization
>> path in the EXPLAIN output of a new regression test if none of the
>> existing tests are able to show any difference?
> 
> This decision is not visible in EXPLAIN in any case.

Okay, thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Refactor UnpinBuffer()

2022-09-28 Thread Zhang Mingli
Hi,

On Sep 29, 2022, 05:08 +0800, Nathan Bossart , wrote:
> On Wed, Sep 28, 2022 at 08:14:23PM +0300, Aleksander Alekseev wrote:
> > + ResourceOwnerForgetBuffer(CurrentResourceOwner, b);
> > +
> > /* not moving as we're likely deleting it soon anyway */
> > ref = GetPrivateRefCountEntry(b, false);
> > Assert(ref != NULL);
> > -
> > - if (fixOwner)
> > - ResourceOwnerForgetBuffer(CurrentResourceOwner, b);
+1, Good catch.
>
> Is it safe to move the call to ResourceOwnerForgetBuffer() to before the
> call to GetPrivateRefCountEntry()? From my quick skim of the code, it
> seems like it should be safe, but I thought I'd ask the question.
Same question; I had a look, and it doesn't seem to matter.

Regards,
Zhang Mingli


Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread David Rowley
On Thu, 29 Sept 2022 at 12:07, Peter Geoghegan  wrote:
> Also potentially relevant: the 2017 commit fa117ee4 anticipated adding
> a "copy" argument to tuplesort_getdatum() (the same commit added such
> a "copy" argument to tuplesort_gettupleslot()). I see that that still
> hasn't happened to tuplesort_getdatum() all these years later. Might
> be a good idea to do it in the next year or two, though.
>
> If David is interested in pursuing this now then I certainly won't object.

Just while this is fresh in my head, I wrote some code to make this
happen.  My preference would be not to add the "copy" param to the
existing function and instead just add a new function to prevent
additional branching.

The attached puts back the datum sort in nodeSort.c for byref types
and adjusts process_ordered_aggregate_single() to make use of this
function.

I did a quick benchmark to see if this help DISTINCT aggregate any:

create table t1 (a varchar(32) not null, b varchar(32) not null);
insert into t1 select md5((x%10)::text),md5((x%10)::text) from
generate_Series(1,100)x;
vacuum freeze t1;
create index on t1(a);

With a work_mem of 256MBs I get:

query = select max(distinct a), max(distinct b) from t1;

Master:
latency average = 313.197 ms

Patched:
latency average = 304.335 ms

So not a very impressive speedup there (about 3%)

Some excerpts from perf top show:

Master:
   1.40%  postgres  [.] palloc
   1.13%  postgres  [.] tuplesort_getdatum
   0.77%  postgres  [.] datumCopy

Patched:
   0.91%  postgres  [.] tuplesort_getdatum_nocopy
   0.65%  postgres  [.] palloc

I stared for a while at the mode_final() function and thought maybe we
could use the nocopy variant there. I just didn't quite pluck up the
motivation to write any code to see if it could be made faster.

David
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index fe74e49814..b5f63b5a2b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -878,8 +878,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 * pfree them when they are no longer needed.
 	 */
 
-	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-							  true, newVal, isNull, &newAbbrevVal))
+	while (tuplesort_getdatum_nocopy(pertrans->sortstates[aggstate->current_set],
+									 true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -900,24 +900,33 @@ process_ordered_aggregate_single(AggState *aggstate,
 										   pertrans->aggCollation,
 										   oldVal, *newVal)
 		{
-			/* equal to prior, so forget this one */
-			if (!pertrans->inputtypeByVal && !*isNull)
-				pfree(DatumGetPointer(*newVal));
+			MemoryContextSwitchTo(oldContext);
+			continue;
 		}
 		else
 		{
 			advance_transition_function(aggstate, pertrans, pergroupstate);
-			/* forget the old value, if any */
-			if (!oldIsNull && !pertrans->inputtypeByVal)
-				pfree(DatumGetPointer(oldVal));
-			/* and remember the new one for subsequent equality checks */
-			oldVal = *newVal;
+
+			MemoryContextSwitchTo(oldContext);
+
+			/*
+			 * Forget the old value, if any, and remember the new one for
+			 * subsequent equality checks.
+			 */
+			if (!pertrans->inputtypeByVal)
+			{
+				if (!oldIsNull)
+					pfree(DatumGetPointer(oldVal));
+				if (!*isNull)
+					oldVal = datumCopy(*newVal, pertrans->inputtypeByVal,
+									   pertrans->inputtypeLen);
+			}
+			else
+				oldVal = *newVal;
 			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
-
-		MemoryContextSwitchTo(oldContext);
 	}
 
 	if (!oldIsNull && !pertrans->inputtypeByVal)
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 37ad35704b..f418be30a1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSo

Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2022 at 6:13 PM David Rowley  wrote:
> Master:
> latency average = 313.197 ms
>
> Patched:
> latency average = 304.335 ms
>
> So not a very impressive speedup there (about 3%)

Worth a try, at least. Having a more consistent interface is valuable
in itself too.

-- 
Peter Geoghegan




[patch] Adding an assertion to report too long hash table name

2022-09-28 Thread Xiaoran Wang
Hi,

The max size for a shmem hash table name is SHMEM_INDEX_KEYSIZE - 1,
but when the caller uses a longer hash table name, it doesn't report any
error; it just silently uses the first SHMEM_INDEX_KEYSIZE - 1 chars as
the hash table name.

I created several shmem hash tables of the same size whose names shared
a prefix longer than SHMEM_INDEX_KEYSIZE - 1 chars. Only one hash table
was actually created, yet I thought all of those hash tables had been
created successfully.

I know this is a corner case, but it's difficult to figure out when you
run into it, so I added an assertion to prevent it.


Thanks,
Xiaoran


Adding-an-assertion-to-report-too-long-hash-table-name.patch
Description:  Adding-an-assertion-to-report-too-long-hash-table-name.patch


Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Andres Freund
Hi,

On 2022-09-28 16:07:13 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > And like the existing buildfarm, it's severely under-documented.
> 
> That complaint I agree with.  A wiki page is a pretty poor substitute
> for in-tree docs.

I assume we're talking about CI?

What would you like to see documented? There is some content in
src/tools/ci/README and the wiki page links to that too. Should we lift it
into the sgml docs?

If we're talking about meson - there's a pending documentation commit. It'd be
good to get some review for it!

Greetings,

Andres Freund




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Tom Lane
Andres Freund  writes:
> On 2022-09-28 16:07:13 -0400, Tom Lane wrote:
>> Robert Haas  writes:
>>> And like the existing buildfarm, it's severely under-documented.

>> That complaint I agree with.  A wiki page is a pretty poor substitute
>> for in-tree docs.

> I assume we're talking about CI?

I was thinking of meson when I wrote that ... but re-reading it,
I think Robert meant CI.

regards, tom lane




Re: Make ON_ERROR_STOP stop on shell script failure

2022-09-28 Thread bt22nakamorit

On 2022-09-28 21:49, torikoshia wrote:

   if (result == 127 || result == -1)
   {
       pg_log_error("\\!: failed");
       return false;
   }
   else if (result != 0)
   {
       pg_log_error("command failed");
       return false;
   }


Since it would be hard to understand the cause of failures from these
two messages, it might be better to clarify the causes in the messages.

The former comes from a failure to create the child process or to
execute the command in it, and the latter occurs when child process
creation and execution succeeded but the return code is not 0, doesn't it?


I also felt it'd be natural for the latter message to also begin with
"\\!" since both messages concern \!.

What do you think?


Thank you for the feedback!
I agree that the messages should be more clear.
\\!: command was not executed
\\!: command failed
Would these two messages be enough to describe the two cases?

Tatsu




Re: identifying the backend that owns a temporary schema

2022-09-28 Thread Nathan Bossart
On Wed, Sep 28, 2022 at 06:56:20PM -0400, Tom Lane wrote:
> I reviewed this and made some changes, some cosmetic some less so.

Thanks for the detailed review.

> A point that still bothers me a bit about pg_stat_get_backend_idset is
> that it could miss or duplicate some backend IDs if the user calls
> pg_stat_clear_snapshot() partway through the SRF's run, and we reload
> a different set of backend entries than we had before.  I added a comment
> about that, with an argument why it's not worth working harder, but
> is the argument convincing?  If not, what should we do?

Isn't this an existing problem?  Granted, it'd manifest differently with
this patch, but ISTM we could already end up with too many or too few
backend IDs if there's a refresh partway through.  I don't know if there's
an easy way to avoid mismatches altogether besides perhaps ERROR-ing if
there's a concurrent refresh.

> - if (beid < 1 || beid > localNumBackends)
> - return NULL;

The reason I'd kept this in was because I was worried about overflow in the
comparator function.  Upon further inspection, I don't think there's
actually any way that will produce incorrect results.  And I'm not sure we
should worry about such cases too much, anyway.

Overall, LGTM.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Fix some newly modified tab-complete changes

2022-09-28 Thread shiy.f...@fujitsu.com
On Wed, Sep 28, 2022 1:49 PM Kyotaro Horiguchi  wrote:
> 
> At Wed, 28 Sep 2022 14:14:01 +1000, Peter Smith
>  wrote in
> > On Tue, Sep 27, 2022 at 8:28 PM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > > Hi hackers,
> > >
> > > I saw a problem when using tab-complete for "GRANT", "TABLES IN
> SCHEMA" should
> > > be "ALL TABLES IN SCHEMA" in the following case.
> > >
> > > postgres=# grant all on
> > > ALL FUNCTIONS IN SCHEMA   DATABASE  FUNCTION
> PARAMETER SCHEMATABLESPACE
> > > ALL PROCEDURES IN SCHEMA  DOMAINinformation_schema.
> PROCEDURE SEQUENCE  tbl
> > > ALL ROUTINES IN SCHEMAFOREIGN DATA WRAPPER  LANGUAGE
> public.   TABLE TYPE
> > > ALL SEQUENCES IN SCHEMA   FOREIGN SERVERLARGE OBJECT
> ROUTINE   TABLES IN SCHEMA
> > >
> > > I found that it is related to the recent commit 790bf615dd, and maybe it's
> > > better to fix it. I also noticed that some comments should be modified
> according
> > > to this new syntax. Attach a patch to fix them.
> > >
> >
> > Thanks for the patch! Below are my review comments.
> >
> > The patch looks good to me but I did find some other tab-completion
> > anomalies. IIUC these are unrelated to your work, but since I found
> > them while testing your patch I am reporting them here.
> 
> Looks fine to me, too.
> 

Thanks for reviewing it.

> > Perhaps you want to fix them in the same patch, or just raise them
> > again separately?
> >
> > ==
> >
> > 1. tab complete for CREATE PUBLICATION
> >
> > I don't think this is any new bug, but I found that it is possible to do 
> > this...
> >
> > test_pub=# create publication p for ALL TABLES IN SCHEMA 
> > information_schema  pg_catalog  pg_toastpublic
> >
> > or, even this...
> >
> > test_pub=# create publication p for XXX TABLES IN SCHEMA 
> > information_schema  pg_catalog  pg_toastpublic
> 
> Completion is responding to "IN SCHEMA" in these cases.  However, I
> don't reach this state only by completion because it doesn't suggest
> "IN SCHEMA" after "TABLES" nor "ALL TABLES".  I don't see a reason to
> change that behavior unless that fix doesn't cause any additional
> complexity.
> 

+1

> > ==
> >
> > 2. tab complete for GRANT
> >
> > test_pub=# grant 
> > ALLEXECUTE
> > pg_execute_server_program  pg_read_server_files   postgres
> >   TRIGGER
> > ALTER SYSTEM   GRANT  pg_monitor
> >   pg_signal_backend  REFERENCES
> > TRUNCATE
> > CONNECTINSERT pg_read_all_data
> >   pg_stat_scan_tablesSELECT UPDATE
> > CREATE pg_checkpoint
> > pg_read_all_settings   pg_write_all_data  SET
> >   USAGE
> > DELETE pg_database_owner
> > pg_read_all_stats  pg_write_server_files  TEMPORARY
> >
> > 2a.
> > grant "GRANT" ??
> 
> Yeah, for the moment I thought that might be a kind of admin option but
> there's no such a privilege. REVOKE gets the same suggestion.
> 

Maybe that's for "REVOKE GRANT OPTION FOR". But it is used by both GRANT and
REVOKE. I think it's a separate problem, I have tried to fix it in the attached
0002 patch.

> > 2b.
> > grant "TEMPORARY" but not "TEMP" ??
> 
> TEMP is an alternative spelling so that's fine.
> 

Agreed.

> 
> I found the following suggestion.
> 
> CREATE PUBLICATION p FOR TABLES  -> ["IN SCHEMA", "WITH ("]
> 
> I believe "WITH (" doesn't come there.
> 

Fixed.

Attach the updated patch.

Regards,
Shi yu


v2-0001-Fix-some-newly-modified-tab-complete-changes-and-.patch
Description:  v2-0001-Fix-some-newly-modified-tab-complete-changes-and-.patch


v2-0002-Fix-tab-completion-for-GRANT-REVOKE.patch
Description: v2-0002-Fix-tab-completion-for-GRANT-REVOKE.patch


Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Andres Freund
Hi,

On 2022-09-28 22:14:11 -0400, Tom Lane wrote:
> I was thinking of meson when I wrote that ... but re-reading it,
> I think Robert meant CI.

FWIW, I had planned to put the "translation table" between autoconf and meson
into the docs, but Peter E. argued that the wiki is better for that. Happy to
change course on that aspect if there's agreement on it.

Greetings,

Andres Freund




Re: Make ON_ERROR_STOP stop on shell script failure

2022-09-28 Thread Kyotaro Horiguchi
At Thu, 29 Sep 2022 11:29:40 +0900, bt22nakamorit 
 wrote in 
> On 2022-09-28 21:49, torikoshia wrote:
> >>if (result == 127 || result == -1)
> >>{
> >>pg_log_error("\\!: failed");
> >>return false;
> >>}
> >>else if (result != 0) {
> >>pg_log_error("command failed");
> >>return false;
> > Since it would be hard to understand the cause of failures from these
> > two messages, it might be better to clarify them in the messages.
> > The former comes from failures of child process creation or execution
> > on it and the latter occurs when child process creation and execution
> > succeeded but the return code is not 0, doesn't it?
> > I also felt it'd be natural that the latter message also begins with
> > "\\!" since both message concerns with \!.
> > How do you think?
> 
> Thank you for the feedback!
> I agree that the messages should be more clear.
> \\!: command was not executed
> \\!: command failed
> Would these two messages be enough to describe the two cases?

FWIW, I would spell these as something like this:

> \\!: command execution failure: %m
> \\!: command returned failure status: %d

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: longfin and tamandua aren't too happy but I'm not sure why

2022-09-28 Thread Andres Freund
Hi,

On 2022-09-28 21:22:26 +0200, Alvaro Herrera wrote:
> I have an additional, unrelated complaint about CI, which is that we
> don't have anything for past branches.  I have a partial hack(*), but
> I wish we had something we could readily use.
> 
> (*) I just backpatched the commit that added the .cirrus.yml file, plus
> some later fixes to it, and I keep that as a separate branch which I
> merge with whatever other changes I want to test.  I then push that to
> github, and ignore the windows results when looking at cirrus-ci.com.

I'd not be against backpatching the ci stuff if there were sufficient demand
for it. It'd probably be a decent bit of initial work, but after that it
shouldn't be too bad.

Greetings,

Andres Freund




Re: Eliminating SPI from RI triggers - take 2

2022-09-28 Thread Amit Langote
On Thu, Aug 4, 2022 at 1:05 PM Amit Langote  wrote:
> On Wed, Jul 13, 2022 at 8:59 PM Amit Langote  wrote:
> > That bit came in to make DETACH CONCURRENTLY produce sane answers for
> > RI queries in some cases.
> >
> > I guess my comment should really have said something like:
> >
> > HACK: find_inheritance_children_extended() has a hack that assumes
> > that the queries originating in this module push the latest snapshot
> > in transaction-snapshot mode.
>
> Posting a new version with this bit fixed; cfbot complained that 0002
> needed a rebase over 3592e0ff98.
>
> I will try to come up with a patch to enhance the PartitionDirectory
> interface to allow passing the snapshot to use when scanning
> pg_inherits explicitly, so we won't need the above "hack".

Sorry about the delay.

So I came up with such a patch that is attached as 0003.

The main problem I want to fix with it is the need for RI_FKey_check()
to "force"-push the latest snapshot that the PartitionDesc code wants
to use to correctly include or omit a detach-pending partition from
the view of that function's RI query.  Scribbling on ActiveSnapshot
that way means that *all* scans involved in the execution of that
query now see a snapshot that they shouldn't likely be seeing; a bug
resulting from this has been demonstrated in a test case added by the
commit 00cb86e75d.

The fix is to make RI_FKey_check(), or really its RI_Plan's execution
function ri_LookupKeyInPkRel() added by patch 0002, pass the latest
snapshot explicitly as a parameter of PartitionDirectoryLookup(),
which passes it down to the PartitionDesc code.  No need to manipulate
ActiveSnapshot.  The actual fix is in patch 0004, which I extracted
out of 0002 to keep the latter a mere refactoring patch without any
semantic changes (though a bit more on that below).  BTW, I don't know
of a way to back-patch a fix like this for the bug, because there is
no way other than ActiveSnapshot to pass the desired snapshot to the
PartitionDesc code if the only way we get to that code is by executing
an SQL query plan.

0003 moves the relevant logic out of
find_inheritance_children_extended() into its callers, namely the
logic of deciding which snapshot to use to determine whether a
detach-pending partition should indeed be omitted from a caller's
consideration, based on the visibility of the corresponding
pg_inherits row under that snapshot; the function itself currently
just uses ActiveSnapshot.
Given the problems with using ActiveSnapshot mentioned above, I think
it is better to make the callers decide the snapshot and pass it using
a parameter named omit_detached_snapshot.  Only PartitionDesc code
actually cares about sending anything but the parent query's
ActiveSnapshot, so the PartitionDesc and PartitionDirectory interface
has been changed to add the same omit_detached_snapshot parameter.
find_inheritance_children(), the other caller used in many sites that
look at a table's partitions, defaults to using ActiveSnapshot, which
does not seem problematic.  Furthermore, only RI_FKey_check() needs to
pass anything other than ActiveSnapshot, so other users of
PartitionDesc, like user queries, still default to using the
ActiveSnapshot, which doesn't have any known problems either.

0001 and 0002 are mostly unchanged in this version, except I took out
the visibility bug-fix from 0002 into 0004 described above, which
looks better using the interface added by 0003 anyway.  I need to
address the main concern that it's still hard to be sure that the
patch in its current form doesn't break any user-level semantics of
these RI check triggers and other concerns about the implementation
that Robert expressed in [1].

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoaiTNj4DgQy42OT9JmTTP1NWcMV%2Bke0i%3D%2Ba7%3DVgnzqGXw%40mail.gmail.com


v4-0002-Avoid-using-an-SQL-query-for-some-RI-checks.patch
Description: Binary data


v4-0004-Teach-ri_LookupKeyInPkRel-to-pass-omit_detached_s.patch
Description: Binary data


v4-0003-Make-omit_detached-logic-independent-of-ActiveSna.patch
Description: Binary data


v4-0001-Avoid-using-SPI-in-RI-trigger-functions.patch
Description: Binary data


Re: Make ON_ERROR_STOP stop on shell script failure

2022-09-28 Thread Kyotaro Horiguchi
At Thu, 29 Sep 2022 12:35:04 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> > Thank you for the feedback!
> > I agree that the messages should be more clear.
> > \\!: command was not executed
> > \\!: command failed
> > Would these two messages be enough to describe the two cases?
> 
> FWIW, I would spell these as something like this:
> 
> > \\!: command execution failure: %m

The following might be more compliant with our policy.

> \\!: failed to execute command \"%s\": %m


> > \\!: command returned failure status: %d

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Log details for client certificate failures

2022-09-28 Thread Masahiko Sawada
On Thu, Sep 29, 2022 at 1:43 AM Jacob Champion  wrote:
>
> On Tue, Sep 27, 2022 at 6:14 PM Masahiko Sawada  wrote:
> > No. Since cluster_name is PGC_POSTMATER, we leak a little postmaster
> > memory only once when starting up. application_name is PGC_USERSET but
> > since we normally allocate memory in PortalMemoryContext we eventually
> > can free it.
>
> Oh, I see; thank you for the correction. And even if someone put an
> application_name into their postgresql.conf, and then changed it a
> bunch of times, we'd free the leaked memory from the config_cxt that's
> created in ProcessConfigFile().

Right.

>
> Is there a reason we don't provide a similar temporary context during
> InitializeGUCOptions()? Naively it seems like that would suppress any
> future one-time leaks, and maybe cut down on some Valgrind noise. Then
> again, maybe there's just not that much demand for pallocs during GUC
> hooks.

While this seems a future-proof idea, I wonder if it might be overkill
since we don't need to worry about accumulation of leaked memory in
this case. Given that check_cluster_name is the only case where we
found a small memory leak, I think it's adequate to fix it.

Fixing this issue suppresses Valgrind's complaint, but since the boot
value of cluster_name is "", the memory leak we can avoid is only
1 byte.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

2022-09-28 Thread David Rowley
On Thu, 29 Sept 2022 at 14:32, Peter Geoghegan  wrote:
>
> On Wed, Sep 28, 2022 at 6:13 PM David Rowley  wrote:
> > Master:
> > latency average = 313.197 ms
> >
> > Patched:
> > latency average = 304.335 ms
> >
> > So not a very impressive speedup there (about 3%)
>
> Worth a try, at least. Having a more consistent interface is valuable
> in itself too.

Just testing the datum sort in nodeSort.c with the same table as
before but using the query:

select b from t1 order by b offset 100;

Master:
latency average = 344.763 ms

Patched:
latency average = 268.374 ms

about 28% faster.

I'll take this to another thread and put it in the next CF

David




Have nodeSort.c use datum sorts single-value byref types

2022-09-28 Thread David Rowley
We originally did this in 91e9e89dc, but a memory leak was discovered
as I neglected to pfree the datum which is freshly allocated in
tuplesort_getdatum. Because that was discovered late in the PG15
cycle, we opted to just disable the datum sort optimisation for byref
types in 3a5817695.

As was mentioned in [1], it looks like we could really use a version
of tuplesort_getdatum which does not palloc a new Datum.  nodeSort.c,
when calling tuplesort_gettupleslot, passes copy==false, so it would
make sense if the datum sort variation didn't do any copying either.

In the attached patch, I've added a function named
tuplesort_getdatum_nocopy() which is the same as tuplesort_getdatum()
only without the datumCopy(). I opted for the new function rather than
a new parameter in the existing function just to reduce branching and
additional needless overhead.

I also looked at the tuplesort_getdatum() call inside
process_ordered_aggregate_single() and made a few changes there so we
don't needlessly perform a datumCopy() when we skip a Datum due to
finding it the same as the previous Datum in a DISTINCT aggregate
situation.

I was also looking at mode_final(). Perhaps that could do with the
same treatment; I just didn't quite pluck up the

A quick performance test with:

create table t1 (a varchar(32) not null, b varchar(32) not null);
insert into t1 select md5((x%10)::text),md5((x%10)::text) from
generate_Series(1,100)x;
vacuum freeze t1;
create index on t1(a);

Yields a small speedup for the DISTINCT aggregate case.

work_mem = 256MB
query = select max(distinct a), max(distinct b) from t1;

Master:
latency average = 313.197 ms

Patched:
latency average = 304.335 ms (about 3% faster)

The Datum sort in nodeSort.c is more impressive.

query = select b from t1 order by b offset 100;

Master:
latency average = 344.763 ms

Patched:
latency average = 268.374 ms (about 28% faster)

I'll add this to the November CF

David

[1] 
https://www.postgresql.org/message-id/CAApHDvqS6wC5U==k9Hd26E4EQXH3QR67-T4=q1rq36ngvjf...@mail.gmail.com
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index fe74e49814..b5f63b5a2b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -878,8 +878,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 * pfree them when they are no longer needed.
 	 */
 
-	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-							  true, newVal, isNull, &newAbbrevVal))
+	while (tuplesort_getdatum_nocopy(pertrans->sortstates[aggstate->current_set],
+									 true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working context for evaluation of the equality
@@ -900,24 +900,33 @@ process_ordered_aggregate_single(AggState *aggstate,
 										   pertrans->aggCollation,
 										   oldVal, *newVal)
 		{
-			/* equal to prior, so forget this one */
-			if (!pertrans->inputtypeByVal && !*isNull)
-				pfree(DatumGetPointer(*newVal));
+			MemoryContextSwitchTo(oldContext);
+			continue;
 		}
 		else
 		{
 			advance_transition_function(aggstate, pertrans, pergroupstate);
-			/* forget the old value, if any */
-			if (!oldIsNull && !pertrans->inputtypeByVal)
-				pfree(DatumGetPointer(oldVal));
-			/* and remember the new one for subsequent equality checks */
-			oldVal = *newVal;
+
+			MemoryContextSwitchTo(oldContext);
+
+			/*
+			 * Forget the old value, if any, and remember the new one for
+			 * subsequent equality checks.
+			 */
+			if (!pertrans->inputtypeByVal)
+			{
+				if (!oldIsNull)
+					pfree(DatumGetPointer(oldVal));
+				if (!*isNull)
+					oldVal = datumCopy(*newVal, pertrans->inputtypeByVal,
+									   pertrans->inputtypeLen);
+			}
+			else
+				oldVal = *newVal;
 			oldAbbrevVal = newAbbrevVal;
 			oldIsNull = *isNull;
 			haveOldVal = true;
 		}
-
-		MemoryContextSwitchTo(oldContext);
 	}
 
 
