Re: Code checks for App Devs, using new options for transaction behavior
On 27-10-2022 at 18:35, Simon Riggs wrote:
> On Thu, 27 Oct 2022 at 12:09, Simon Riggs wrote:
> > Comments please
>
> Update from patch tester results.

> [001_psql_parse_only.v1.patch]
> [002_nested_xacts.v7.patch]
> [003_rollback_on_commit.v1.patch]
> [004_add_params_to_sample.v1.patch]

Patch 002 has (2x): 'transction', which should be 'transaction'.

Also in patch 002: 'at any level will be abort' should be 'at any level will abort'.

I also dislike the 'we' in 'Once we reach the top-level transaction,'. That seems a bit too much like the 'we developers working together to make a database server system', which is of course used often and usefully on this mailing list and in the code itself. But I think user-facing docs should be careful with that team-building 'we'. I remember well how it confused me, many years ago.

Better, IMHO: 'Once the top-level transaction is reached,'

Thanks,
Erik Rijkers
psql: Add command to use extended query protocol
This adds a new psql command \gp that works like \g (or semicolon) but uses the extended query protocol. Parameters can also be passed, like

    SELECT $1, $2 \gp 'foo' 'bar'

I have two main purposes for this:

One, for transparent column encryption [0], we need a way to pass protocol-level parameters. The present patch in the [0] thread uses a command \gencr, but based on feedback and further thinking, a general-purpose command seems better.

Two, for testing the extended query protocol from psql. For example, for the dynamic result sets patch [1], I have several ad-hoc libpq test programs lying around, which would be cumbersome to integrate into the patch. With psql support like proposed here, it would be very easy to integrate a few equivalent tests.

Perhaps this would also be useful for general psql scripting.

[0]: https://commitfest.postgresql.org/40/3718/
[1]: https://commitfest.postgresql.org/40/2911/

From 3f4bf4a68c2edd57c7bf4c4935bad50ea0f528b7 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Fri, 28 Oct 2022 08:29:46 +0200
Subject: [PATCH] psql: Add command to use extended query protocol

This adds a new psql command \gp that works like \g (or semicolon) but
uses the extended query protocol. Parameters can also be passed, like

    SELECT $1, $2 \gp 'foo' 'bar'

This may be useful for psql scripting, but one of the main purposes is
also to be able to test various aspects of the extended query protocol
from psql and to write tests more easily.
---
 doc/src/sgml/ref/psql-ref.sgml     | 27 +
 src/bin/psql/command.c             | 39 ++
 src/bin/psql/common.c              | 15 +++-
 src/bin/psql/help.c                |  1 +
 src/bin/psql/settings.h            |  3 +++
 src/bin/psql/tab-complete.c        |  2 +-
 src/test/regress/expected/psql.out | 31
 src/test/regress/sql/psql.sql      | 14 +++
 8 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 9494f28063ad..51b33fd3b80c 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2323,6 +2323,33 @@ Meta-Commands
+
+        \gp [ parameter ] ...
+
+
+
+        Sends the current query buffer to the server for execution, as with
+        \g, with the specified parameters passed for any
+        parameter placeholders ($1 etc.).
+
+
+
+        Example:
+
+INSERT INTO tbl1 VALUES ($1, $2) \gp 'first value' 'second value'
+
+
+
+
+        This command uses the extended query protocol (see ), unlike \g,
+        which uses the simple query protocol.  So this command can be useful
+        to test the extended query protocol from psql.
+
+
+
+
+
  \gset [ prefix ]

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index ab613dd49e0a..0e760eda1f3e 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -101,6 +101,7 @@ static backslashResult exec_command_gdesc(PsqlScanState scan_state, bool active_
 static backslashResult exec_command_getenv(PsqlScanState scan_state, bool active_branch,
                                            const char *cmd);
 static backslashResult exec_command_gexec(PsqlScanState scan_state, bool active_branch);
+static backslashResult exec_command_gp(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_gset(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_help(PsqlScanState scan_state, bool active_branch);
 static backslashResult exec_command_html(PsqlScanState scan_state, bool active_branch);
@@ -354,6 +355,8 @@ exec_command(const char *cmd,
         status = exec_command_getenv(scan_state, active_branch, cmd);
     else if (strcmp(cmd, "gexec") == 0)
         status = exec_command_gexec(scan_state, active_branch);
+    else if (strcmp(cmd, "gp") == 0)
+        status = exec_command_gp(scan_state, active_branch);
     else if (strcmp(cmd, "gset") == 0)
         status = exec_command_gset(scan_state, active_branch);
     else if (strcmp(cmd, "h") == 0 || strcmp(cmd, "help") == 0)
@@ -1546,6 +1549,42 @@ exec_command_gexec(PsqlScanState scan_state, bool active_branch)
     return status;
 }

+/*
+ * \gp -- send the current query with parameters
+ */
+static backslashResult
+exec_command_gp(PsqlScanState scan_state, bool active_branch)
+{
+    backslashResult status = PSQL_CMD_SKIP_LINE;
+
+    if (active_branch)
+    {
+        char       *opt;
+        int         nparams = 0;
+        int         nalloc = 0;
+
+        pset.gp_params = NULL;
+
+        while ((opt = psql_scan_slash_option(scan_state, OT_NORMAL, NULL, false)))
+        {
+
Re: Support logical replication of DDLs
Here are some review comments for patch v32-0001.

This is a WIP - I have not yet looked at the largest file of this patch (src/backend/commands/ddl_deparse.c)

==
Commit Message

1. The list of the supported statements should be in alphabetical order to make it easier to read

~~~

2. The "Notes" are obviously notes, so the text does not need to say "Note that..." etc again

"(Note #1) Note that some..." -> "(Note #1) Some..."
"(Note #2) Note that, for..." -> "(Note #2) For..."
"(Note #4) Note that, for..." -> "(Note #4) For..."

~~~

3. For "Note #3", use uppercase for the SQL keywords in the example.

~~~

4. For "Note #4": "We created" -> "we created"

==
src/backend/catalog/aclchk.c

5. ExecuteGrantStmt

@@ -385,7 +385,11 @@ ExecuteGrantStmt(GrantStmt *stmt)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("grantor must be current user")));
+
+		istmt.grantor_uid = grantor;
 	}
+	else
+		istmt.grantor_uid = InvalidOid;

This code can be simpler by just declaring the 'grantor' variable at function scope, then assigning the istmt.grantor_uid along with the other grantor assignments.

SUGGESTION
Oid grantor = InvalidOid;
...
istmt.grantor_uid = grantor;
istmt.is_grant = stmt->is_grant;
istmt.objtype = stmt->objtype;

==
src/backend/commands/collationcmds.c

6. DefineCollation

+	/* make from existing collationid available to callers */
+	if (from_collid && OidIsValid(collid))
+		ObjectAddressSet(*from_collid,
+						 CollationRelationId,
+						 collid);

6a. Maybe the param can be made 'from_existing_colid', then the above code comment can be made more readable?

~

6b. Seems some unnecessary wrapping here

==
src/backend/commands/ddl_deparse.c

WIP - I will try to post some review comments on this file next week

==
src/backend/commands/ddl_json.c

7. convSpecifier

typedef enum
{
	SpecTypename,
	SpecOperatorname,
	SpecDottedName,
	SpecString,
	SpecNumber,
	SpecStringLiteral,
	SpecIdentifier,
	SpecRole
} convSpecifier;

Inconsistent case. Some of these say "name" and some say "Name"

~~~

8. Forward declarations

char *ddl_deparse_json_to_string(char *jsonb);

Is this needed here? I thought this was already declared extern in ddl_deparse.h.

~~~

9. find_string_in_jsonbcontainer

The function comment says "If it's of a type other than jbvString, an error is raised.", but I do not see this check in the function code.

~~~

10. expand_fmt_recursive

/*
 * Recursive helper for pg_event_trigger_expand_command
 *
 * Find the "fmt" element in the given container, and expand it into the
 * provided StringInfo.
 */

10a. I am not sure if the mention of "pg_event_trigger_expand_command" is stale or is not relevant anymore, because that caller is not in this module.

~

10b. The first sentence is missing a period.

~~~

11.

value = findJsonbValueFromContainer(container, JB_FOBJECT, &key);

Should this be checking if value is NULL?

~~~

12. expand_jsonval_dottedname

 * Expand a json value as a dot-separated-name. The value must be of type
 * object and may contain elements "schemaname" (optional), "objname"
 * (mandatory), "attrname" (optional). Double quotes are added to each element
 * as necessary, and dot separators where needed.

The comment says "The value must be of type object" but I don't see any check/assert for that in the code.

~~~

13. expand_jsonval_typename

In other code (e.g. expand_jsonval_dottedname) there are lots of pfree(str), so why not similar here? e.g. Shouldn't the end of the function have something like shown below:

pfree(schema);
pfree(typename);
pfree(typmodstr);

~~~

14. expand_jsonval_operator

The function comment is missing a period.

~~~

15. expand_jsonval_string

/*
 * Expand a JSON value as a string. The value must be of type string or of
 * type object. In the latter case, it must contain a "fmt" element which will
 * be recursively expanded; also, if the object contains an element "present"
 * and it is set to false, the expansion is the empty string.

15a. Although the comment says "The value must be of type string or of type object", the code is checking for jbvString and jbvBinary (??)

~

15b.

else
	return false;

Is it OK to just return false, or should this in fact be throwing an error if the wrong type?

~~~

16. expand_jsonval_strlit

/* Easy case: if there are no ' and no \, just use a single quote */
if (strchr(str, '\'') == NULL &&
	strchr(str, '\\') == NULL)

That could be simplified as:

if (strpbrk(str, "'\\") == NULL)

~~~

17. expand_jsonval_number

strdatum = DatumGetCString(DirectFunctionCall1(numeric_out,
				NumericGetDatum(jsonval->val.numeric)));
appendStringInfoString(buf, strdatum);

Shouldn't this function do pfree(strdatum) at the end?

~~~

18. expand_jsonval_role

/*
 * Expand a JSON value as a role name. If the is_public element is set to
 * true, PUBLIC is expanded (no quotes); otherwise, expand the given role name,
 * quoting as an identifie
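Review comment 16 above suggests replacing two strchr() scans with a single strpbrk() call. As a quick sanity check of that equivalence, here is a sketch in Python rather than C (the helper names are made up for illustration); the two predicates agree on every input:

```python
# Characters that force the harder quoting path in expand_jsonval_strlit().
QUOTE_OR_BACKSLASH = "'\\"

def has_special_strchr_style(s: str) -> bool:
    # Mirrors: strchr(str, '\'') != NULL || strchr(str, '\\') != NULL
    return "'" in s or "\\" in s

def has_special_strpbrk_style(s: str) -> bool:
    # Mirrors: strpbrk(str, "'\\") != NULL -- one scan for any char of the set
    return any(c in QUOTE_OR_BACKSLASH for c in s)
```

Both stop at the first match, so the strpbrk() form saves a second pass over strings that contain neither character.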
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Oct 28, 2022 at 7:39 AM Michael Paquier wrote:
>
> On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> > The block sizes don't need to match, do they? As long as the block is properly
> > aligned, we can change the iov_len of the final iov to match whatever the size
> > is being passed in, no?
>
> Hmm.  Based on what Bharath has written upthread, it does not seem to
> matter if the size of the aligned block changes, either:
> https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com
>
> I am honestly not sure whether it is a good idea to make file_utils.c
> depend on one of the compile-time page sizes in this routine, be it
> the page size or the WAL page size, as pg_write_zeros() would be used
> for some rather low-level operations.  But we could as well just use a
> locally-defined structure with a buffer at 4kB or 8kB and call it a
> day?

+1. Please see the attached v8 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 88d67c6e3e92a62ac4c783a137759b2f3795cb21 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Fri, 28 Oct 2022 04:44:41 +
Subject: [PATCH v8] Use pg_pwritev_with_retry() instead of write() in
 walmethods.c

Use pg_pwritev_with_retry() while prepadding a WAL segment instead
of write() in walmethods.c dir_open_for_write() to avoid partial
writes. As the pg_pwritev_with_retry() function uses pwritev, we can
avoid explicit lseek() on non-Windows platforms to seek to the
beginning of the WAL segment. It looks like on Windows, we need an
explicit lseek() call here despite using the pwrite() implementation
from win32pwrite.c. Otherwise an error occurs.

These changes are in line with how core postgres initializes the WAL
segment in XLogFileInitInternal().
Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart, Michael Paquier
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com
---
 src/backend/access/transam/xlog.c  | 33 ++-
 src/bin/pg_basebackup/walmethods.c | 26 +
 src/common/file_utils.c            | 92 ++
 src/include/common/file_utils.h    |  3 +
 4 files changed, 114 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f10effe3a..15dd5ce834 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				bool *added, char *path)
 {
 	char		tmppath[MAXPGPATH];
-	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
@@ -2965,14 +2964,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));

-	memset(zbuffer.data, 0, XLOG_BLCKSZ);
-
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
 	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
+		ssize_t		rc;

 		/*
 		 * Zero-fill the file. With this setting, we do this the hard way to
@@ -2983,29 +2979,10 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * indirect blocks are down on disk. Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
+		rc = pg_pwrite_zeros(fd, wal_segment_size);

-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
-		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
-		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
-		{
-			int			iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
-
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
-			{
-				save_errno = errno;
-				break;
-			}
-
-			i += iovcnt;
-		}
+		if (rc < 0)
+			save_errno = errno;
 	}
 	else
 	{
@@ -3014,7 +2991,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 		 * enough.
 		 */
 		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
+		if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1)
 		{
 			/* if write didn't set errno, assume no disk space */
 			save_errno = errno ? errno : ENOSPC;

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index bc2e83d02b..18bad52c96 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -220,22 +220,24 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	/* Do pre-padding on non-compressed files */
 	if (pad_to_size && wwmethod->compression_algorithm == PG_COMPRESSION_NONE)
 	{
-		PGAlignedXLogBlock zerobuf;
-		int			bytes;
+		ssize_t		rc;
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
On Fri, Oct 28, 2022 at 10:43 AM Tom Lane wrote:
>
> David Rowley writes:
> > On Fri, 28 Oct 2022 at 16:51, Amul Sul wrote:
> >> If we
> >> are too sure that the output usually comes in the same order then the
> >> ORDER BY clause that exists in other tests seems useless. I am a bit
> >> confused & what could be a possible bug?
>
> > You can't claim that if this test shouldn't get an ORDER BY that all
> > tests shouldn't have an ORDER BY. That's just crazy. What if the test
> > is doing something like testing sort?!
>
> The general policy is that we'll add ORDER BY when a test is demonstrated
> to have unstable output order for identifiable environmental reasons
> (e.g. locale dependency) or timing reasons (e.g. background autovacuum
> sometimes changing statistics).  But the key word there is "identifiable".
> Without some evidence as to what's causing this, it remains possible
> that it's a code bug not the fault of the test case.
>
> regress.sgml explains the policy further:
>
>     You might wonder why we don't order all the regression test queries
>     explicitly to get rid of this issue once and for all.  The reason is
>     that that would make the regression tests less useful, not more, since
>     they'd tend to exercise query plan types that produce ordered results
>     to the exclusion of those that don't.

Understood. Thanks for the clarification.

Regards,
Amul
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
On Fri, Oct 28, 2022 at 10:28 AM David Rowley wrote:
>
> On Fri, 28 Oct 2022 at 16:51, Amul Sul wrote:
> >
> > On Thu, Oct 27, 2022 at 6:54 PM Tom Lane wrote:
> > > Please be specific about the circumstances in which the output is
> > > unstable for you.  With zero information to go on, it seems about as
> > > likely that this change is masking a bug as that it's a good idea.
> >
> > At the first glance, I thought the patch is pretty much obvious, and
> > we usually add an ORDER BY clause to ensure stable output.
>
> Unfortunately, you'll need to do better than that. We're not in the
> business of accepting patches with zero justification for why they're
> required. If you're not willing to do the analysis on why the order
> changes sometimes, why should we accept your patch?

Unfortunately, the test is not failing for me; otherwise, I would have done that analysis. When I saw the patch for the first time, I didn't suspect anything spurious, due to my misconception that we usually add the ORDER BY clause to select queries just to be sure.

> If you can't find the problem then you should modify insert.sql to
> EXPLAIN the problem query to see if the plan has changed between the
> passing and failing run. The only thing that comes to mind about why
> this test might produce rows in a different order would be if a
> parallel Append was sorting the subpaths by cost (See
> create_append_path's call to list_sort) and the costs were for some
> reason coming out differently sometimes. It's hard to imagine why this
> query would be parallelised though. If you show us the EXPLAIN from a
> passing and failing run, it might help us see the problem.

Understood.

> > If we
> > are too sure that the output usually comes in the same order then the
> > ORDER BY clause that exists in other tests seems useless. I am a bit
> > confused & what could be a possible bug?
>
> You can't claim that if this test shouldn't get an ORDER BY that all
> tests shouldn't have an ORDER BY. That's just crazy. What if the test
> is doing something like testing sort?!

I understand that sorted output doesn't need further sorting. I was just referring to the simple SELECT queries that do not have any sorting.

Thanks & Regards,
Amul
Re: Adding doubly linked list type which stores the number of items in the list
Thank you for having a look at this.

On Thu, 27 Oct 2022 at 19:32, Bharath Rupireddy wrote:
> Some comments on the patch:
> 1. I think it's better to just return dlist_is_empty(&head->dlist) &&
> (head->count == 0); from dclist_is_empty() and remove the assert for
> better readability and safety against count being zero.

I don't think that's a good change. For 1) it adds unnecessary overhead due to the redundant checks, and 2) it removes the Assert which is our early warning that the dclist's count is getting out of sync somewhere.

> 2. Missing dlist_is_memberof() in dclist_delete_from()?

I put that in dlist_delete_from(), which is called from dclist_delete_from().

> 3. Just thinking if we need to move dlist_is_memberof() check from
> dclist_* functions to dlist_* functions, because they also need such
> insurance against callers passing spurious nodes.

I think the affected functions there would be: dlist_move_head(), dlist_move_tail(), dlist_has_next(), dlist_has_prev(), dlist_next_node() and dlist_prev_node(). I believe if we did that then it's effectively an API change. The comments only claim that it's undefined if node is not a member of the list. It does not say 'node' *must* be part of the list. Now, perhaps doing this would just make it more likely that we'd find bugs in our code and extension authors would find bugs in their code, but it does move the bar. dlist_move_head and dlist_move_tail look like they'd work perfectly well to remove an item from one list and put it on the head or tail of some completely different list. Should we really be changing that in a patch that is meant to just add the dclist type?

> 4. More opportunities to use dclist_* in below places, no?
> dlist_push_tail(&src->mappings, &pmap->node);
> src->num_mappings++;
>
> dlist_push_head(&MXactCache, &entry->node);
> if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)

Thanks for finding those. I've adjusted them both to use dclists.

> 5. dlist_is_memberof() - do we need this at all? We trust the callers
> of dlist_* today that the passed in node belongs to the list, no?

Hmm, this seems to contradict your #3? If you look at something like dlist_move_head(), if someone calls that and passes a 'node' that does not belong to 'head', then the result is that we delete 'node' from whichever dlist it's on and push it onto 'head'. Nothing bad happens there. If we do the same on a dclist then the count gets out of sync. That's bad, as it could lead to assert failures and bugs.

> 6. If we decide to have dlist_is_memberof() and the asserts around it,
> can we design it on the similar lines as dlist_check() to avoid many
> #ifdef ILIST_DEBUG #endif blocks spreading around the code?

OK, that likely is a better idea. I've done this in the attached by way of dlist_member_check().

> 7. Do we need Assert(head->count > 0); in more places like
> dclist_delete_from()?

I guess it does no harm. I've added some additional ones in the attached.

> 8. Don't we need dlist_container(), dlist_head_element(),
> dlist_tail_element() for dclist_*? Even though we might not use them
> immediately, just for the sake of completeness of the dclist data
> structure.

OK. I think I'd left those out because dclist_container() would just be the same as dlist_container(), but that's not the case for the other two, so I've added all 3.

One additional change is that I also ended up removing the use of dclist that I had in the previous patch for ReorderBufferTXN.subtxns. Looking more closely at the code in ReorderBufferAssignChild():

	/*
	 * We already saw this transaction, but initially added it to the
	 * list of top-level txns. Now that we know it's not top-level,
	 * remove it from there.
	 */
	dlist_delete(&subtxn->node);

The problem is that since ReorderBufferTXN is used for both transactions and sub-transactions, it's not easy to determine if the ReorderBufferTXN.node is part of the ReorderBuffer.toplevel_by_lsn dlist or the ReorderBufferTXN.subtxns.
It seems safer just to leave this one alone.

David

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index b01b39b008..a34e9b352d 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -196,8 +196,7 @@ typedef struct RewriteMappingFile
 	TransactionId xid;			/* xid that might need to see the row */
 	int			vfd;			/* fd of mappings file */
 	off_t		off;			/* how far have we written yet */
-	uint32		num_mappings;	/* number of in-memory mappings */
-	dlist_head	mappings;		/* list of in-memory mappings */
+	dclist_head mappings;		/* list of in-memory mappings */
 	char		path[MAXPGPATH];	/* path, for error messages */
 } RewriteMappingFile;
@@ -864,9 +863,10 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 		Oid			dboid;
 		uint32		len;
 		int			written;
+		uint32		num_mappings = dclist_count(&src->mappings);

 		/* this file hasn't got any new mappings */
-		if (src->num_mappings == 0)
+		if (num
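For readers unfamiliar with the proposed type, the invariant the Asserts protect (a plain dlist plus a count that must stay in sync, with a membership check on delete) can be sketched in Python; the names loosely mirror the dclist_* API and are illustrative only:

```python
class DCList:
    """A circular doubly linked list with a sentinel head that also tracks
    its element count, mirroring the dclist idea: dlist semantics plus a
    count kept in sync by every mutating operation."""

    class Node:
        __slots__ = ("prev", "next", "owner", "value")
        def __init__(self, value=None):
            self.prev = self.next = None
            self.owner = None          # debug aid, like dlist_member_check()
            self.value = value

    def __init__(self):
        self.head = DCList.Node()      # sentinel node
        self.head.prev = self.head.next = self.head
        self.count = 0                 # what dclist adds over a plain dlist

    def is_empty(self) -> bool:
        # Mirrors dclist_is_empty(): the assert is the early warning that
        # the count has drifted out of sync with the links.
        empty = self.head.next is self.head
        assert empty == (self.count == 0)
        return empty

    def push_tail(self, node: "DCList.Node") -> None:
        node.prev, node.next = self.head.prev, self.head
        self.head.prev.next = node
        self.head.prev = node
        node.owner = self
        self.count += 1

    def delete(self, node: "DCList.Node") -> None:
        assert node.owner is self      # dlist_member_check() analogue
        assert self.count > 0
        node.prev.next = node.next
        node.next.prev = node.prev
        node.owner = None
        self.count -= 1
```

Deleting a node that belongs to a different list trips the membership assert here, which is exactly the failure mode discussed for dclist: on a plain dlist nothing visibly breaks, but on a counted list the counts of both lists silently go wrong.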
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
David Rowley writes:
> On Fri, 28 Oct 2022 at 16:51, Amul Sul wrote:
>> If we
>> are too sure that the output usually comes in the same order then the
>> ORDER BY clause that exists in other tests seems useless. I am a bit
>> confused & what could be a possible bug?

> You can't claim that if this test shouldn't get an ORDER BY that all
> tests shouldn't have an ORDER BY. That's just crazy. What if the test
> is doing something like testing sort?!

The general policy is that we'll add ORDER BY when a test is demonstrated
to have unstable output order for identifiable environmental reasons
(e.g. locale dependency) or timing reasons (e.g. background autovacuum
sometimes changing statistics).  But the key word there is "identifiable".
Without some evidence as to what's causing this, it remains possible
that it's a code bug not the fault of the test case.

regress.sgml explains the policy further:

    You might wonder why we don't order all the regression test queries
    explicitly to get rid of this issue once and for all.  The reason is
    that that would make the regression tests less useful, not more, since
    they'd tend to exercise query plan types that produce ordered results
    to the exclusion of those that don't.

			regards, tom lane
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
On Fri, 28 Oct 2022 at 16:51, Amul Sul wrote:
>
> On Thu, Oct 27, 2022 at 6:54 PM Tom Lane wrote:
> > Please be specific about the circumstances in which the output is
> > unstable for you.  With zero information to go on, it seems about as
> > likely that this change is masking a bug as that it's a good idea.
>
> At the first glance, I thought the patch is pretty much obvious, and
> we usually add an ORDER BY clause to ensure stable output.

Unfortunately, you'll need to do better than that. We're not in the
business of accepting patches with zero justification for why they're
required. If you're not willing to do the analysis on why the order
changes sometimes, why should we accept your patch?

If you can't find the problem then you should modify insert.sql to
EXPLAIN the problem query to see if the plan has changed between the
passing and failing run. The only thing that comes to mind about why
this test might produce rows in a different order would be if a
parallel Append was sorting the subpaths by cost (see
create_append_path's call to list_sort) and the costs were for some
reason coming out differently sometimes. It's hard to imagine why this
query would be parallelised though. If you show us the EXPLAIN from a
passing and failing run, it might help us see the problem.

> If we
> are too sure that the output usually comes in the same order then the
> ORDER BY clause that exists in other tests seems useless. I am a bit
> confused & what could be a possible bug?

You can't claim that if this test shouldn't get an ORDER BY that all
tests shouldn't have an ORDER BY. That's just crazy. What if the test
is doing something like testing sort?!

David
Latches vs lwlock contention
Hi,

We usually want to release lwlocks, and definitely spinlocks, before calling SetLatch(), to avoid putting a system call into the locked region so that we minimise the time held. There are a few places where we don't do that, possibly because it's not just a simple latch to hold a pointer to but rather a set of them that needs to be collected from some data structure, and we don't have infrastructure to help with that. There are also cases where we semi-reliably create lock contention, because the backends that wake up immediately try to acquire the very same lock.

One example is heavyweight lock wakeups. If you run BEGIN; LOCK TABLE t; ... and then N other sessions wait in SELECT * FROM t;, and then you run ... COMMIT;, you'll see the first session wake all the others while it still holds the partition lock itself. They'll all wake up and begin to re-acquire the same partition lock in exclusive mode, immediately go back to sleep on *that* wait list, and then wake each other up one at a time in a chain. We could avoid the first double-bounce by not setting the latches until after we've released the partition lock. We could avoid the rest of them by not re-acquiring the partition lock at all, which ... if I'm reading right ... shouldn't actually be necessary in modern PostgreSQL? Or if there is another reason to re-acquire then maybe the comment should be updated. Presumably no one really does that repeatedly while there is a long queue of non-conflicting waiters, so I'm not claiming it's a major improvement, but it's at least a micro-optimisation.

There are some other simpler mechanical changes including synchronous replication, SERIALIZABLE DEFERRABLE and condition variables (this one inspired by Yura Sokolov's patches[1]). Actually I'm not at all sure about the CV implementation; I feel like a more ambitious change is needed to make our CVs perform. See attached sketch patches.
I guess the main thing that may not be good enough is the use of a fixed sized latch buffer. Memory allocation in don't-throw-here environments like the guts of lock code might be an issue, which is why it just gives up and flushes when full; maybe it should try to allocate and fall back to flushing only if that fails. These sketch patches aren't proposals, just observations in need of more study.

[1] https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

From b64d5782e2c3a2e34274a3bf9df4449afaee94dc Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 26 Oct 2022 15:51:45 +1300
Subject: [PATCH 1/8] Provide SetLatches() for batched deferred latches.

If we have a way to buffer a set of wakeup targets and process them at a
later time, we can:

* move SetLatch() system calls out from under LWLocks, so that locks can
  be released faster; this is especially interesting in cases where the
  target backends will immediately try to acquire the same lock, or
  generally when the lock is heavily contended

* possibly gain some micro-optimization from issuing only two memory
  barriers for the whole batch of latches, not two for each latch to be
  set

* provide the opportunity for potential future latch implementation
  mechanisms to deliver wakeups in a single system call

Individual users of this facility will follow in separate patches.
---
 src/backend/storage/ipc/latch.c  | 187 ++-
 src/include/storage/latch.h      |  13 +++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 123 insertions(+), 78 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index eb3a569aae..71fdc388c8 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -576,105 +576,136 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 }

 /*
- * Sets a latch and wakes up anyone waiting on it.
- *
- * This is cheap if the latch is already set, otherwise not so much.
- *
- * NB: when calling this in a signal handler, be sure to save and restore
- * errno around it.  (That's standard practice in most signal handlers, of
- * course, but we used to omit it in handlers that only set a flag.)
- *
- * NB: this function is called from critical sections and signal handlers so
- * throwing an error is not a good idea.
+ * Set multiple latches at the same time.
+ * Note: modifies input array.
  */
-void
-SetLatch(Latch *latch)
+static void
+SetLatchV(Latch **latches, int nlatches)
 {
-#ifndef WIN32
-	pid_t		owner_pid;
-#else
-	HANDLE		handle;
-#endif
-
-	/*
-	 * The memory barrier has to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
+	/* Flush any other changes out to main memory just once. */
 	pg_memory_barrier();

-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
+	/* Keep only latches that are not already set, and set them. */
+	for (int i = 0; i < nlatches; ++i)
+	{
+		Latch
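The batching idea in the sketch patch (record wakeup targets while the lock is held, issue the actual wakeups afterwards, and fall back to flushing early only if the fixed-size buffer fills up) can be illustrated with Python threading events standing in for latches; this sketches the control flow only, not the real latch machinery:

```python
import threading

MAX_BATCHED = 4    # fixed-size buffer; the patch flushes when it fills up

class LatchBatch:
    """Collect wakeup targets now, deliver the wakeups later, so the
    'syscall' (Event.set here) can happen outside the locked region."""
    def __init__(self):
        self.pending = []

    def add(self, event: threading.Event) -> None:
        if event.is_set():
            return                     # quick exit if already set
        self.pending.append(event)
        if len(self.pending) >= MAX_BATCHED:
            self.flush()               # buffer full: give up and flush now

    def flush(self) -> None:
        for ev in self.pending:
            ev.set()                   # the deferred wakeup
        self.pending.clear()

# Usage: only record targets inside the critical section; the final
# flush (the bulk of the wakeups) happens after the lock is released.
lock = threading.Lock()
waiters = [threading.Event() for _ in range(6)]
batch = LatchBatch()
with lock:
    for ev in waiters:
        batch.add(ev)
batch.flush()
```

With six waiters and a buffer of four, one flush happens while the lock is still held (the full-buffer fallback the mail worries about) and the remaining two wakeups are deferred until after release, which is the case the patch is optimizing for.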
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
On Thu, Oct 27, 2022 at 6:54 PM Tom Lane wrote:
>
> Nishant Sharma writes:
> > We would like to share a proposal of a patch, where we have added order by
> > clause in two select statements in src/test/regress/sql/insert.sql file and
> > respective changes in src/test/regress/expected/insert.out output file.
>
> > This would help in generating output in consistent sequence, as sometimes
> > we have observed change in sequence in output.
>
> Please be specific about the circumstances in which the output is
> unstable for you.  With zero information to go on, it seems about as
> likely that this change is masking a bug as that it's a good idea.

At the first glance, I thought the patch is pretty much obvious, and we usually add an ORDER BY clause to ensure stable output. If we are too sure that the output usually comes in the same order then the ORDER BY clause that exists in other tests seems useless. I am a bit confused & what could be a possible bug?

I have tested on my CentOS and macOS machines; the insert.sql test is giving stable output, and I didn't find failures in subsequent runs either, but I am not sure if that is enough evidence to skip the ORDER BY clause.

Regards,
Amul
Re: Allow file inclusion in pg_hba and pg_ident files
On Fri, Oct 28, 2022 at 10:24:23AM +0900, Michael Paquier wrote:
> On Thu, Oct 27, 2022 at 12:26:25PM +0800, Julien Rouhaud wrote:
>
> I am still not completely sure what's the best way to do things here,
> so let's do the following: let's keep the patch the way you think is
> better for now (I may change my opinion on that but I'll hack on that by
> myself anyway). Using what you have as a base, could you split the
> test and reduce it to its simplest form to ease its review? It would
> let us stress the buildfarm with a first version of the test and see how
> it goes from there, and it is useful by itself IMO as HEAD has zero
> coverage for this area.

To be honest, I'd rather not. It's excessively annoying to work on those tests (I spent multiple days trying to make them as clean and readable as possible), and splitting them to only test the current infrastructure will take substantial effort. But more importantly, the next commit that adds tests for file inclusion would then be totally unmaintainable and unreadable, which IMO is even worse. If changes are requested on the simplified test, the result will probably be either the current file being overwritten or a new one written from scratch, and I'm not volunteering to do that.
Re: pg_recvlogical prints bogus error when interrupted
On Fri, Oct 28, 2022 at 4:41 AM Andres Freund wrote:
>
> On 2022-10-24 08:15:11 +0530, Bharath Rupireddy wrote:
> >
> > +	/* When we get SIGINT/SIGTERM, we exit */
> > +	if (ready_to_exit)
> > +	{
> > +		/*
> > +		 * Try informing the server about our exit, but don't wait around
> > +		 * or retry on failure.
> > +		 */
> > +		(void) PQputCopyEnd(conn, NULL);
> > +		(void) PQflush(conn);
> > +		time_to_abort = ready_to_exit;
>
> This doesn't strike me as great - because ready_to_exit isn't checked in
> the loop around StreamLogicalLog(), we'll reconnect if something else causes
> StreamLogicalLog() to return.

Fixed.

> Why do we need both time_to_abort and ready_to_exit?

The intention of ready_to_exit is to be able to distinguish between SIGINT/SIGTERM and aborting when endpos is reached, so that the necessary code is skipped or executed and the proper logs are printed.

> Perhaps worth noting that time_to_abort is still a sig_atomic_t, but isn't
> modified in a signal handler, which seems a bit unnecessarily confusing.

time_to_abort is just a static variable, no?

+static bool		time_to_abort = false;
+static volatile sig_atomic_t ready_to_exit = false;

Please see the attached v3 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v3-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patch
Description: Binary data
Re: GUC values - recommended way to declare the C variables?
On Fri, Oct 28, 2022 at 11:48:13AM +0900, Michael Paquier wrote: > Actually, pg_iovec.h uses WIN32 without any previous header declared, > but win32.h tells a different story as of ed9b3606, where we would > define WIN32 if it does not exist yet. Seeing all the places where pg_status.h is included, that should be fine, so please just ignore this part. -- Michael signature.asc Description: PGP signature
Re: GUC values - recommended way to declare the C variables?
On Thu, Oct 27, 2022 at 07:00:26PM +1100, Peter Smith wrote:
> The GUC defaults of guc_tables.c, and the modified GUC C var
> declarations now share the same common #define'd value (instead of
> cut/paste preprocessor code).

Thanks. I have not looked at the checkup logic yet, but the central declarations seem rather sane, and I have a few comments about the latter.

+#ifdef WIN32
+#define DEFAULT_UPDATE_PROCESS_TITLE false
+#else
+#define DEFAULT_UPDATE_PROCESS_TITLE true
+#endif

This is the kind of thing I would document with a comment, say "Disabled on Windows as the performance overhead can be significant". Actually, pg_iovec.h uses WIN32 without any previous header declared, but win32.h tells a different story as of ed9b3606, where we would define WIN32 if it does not exist yet. Could that impact the default depending on the environment used? I am wondering whether the top of win32.h could be removed these days.

+#ifdef USE_PREFETCH
+#define DEFAULT_EFFECTIVE_IO_CONCURRENCY 1
+#define DEFAULT_MAINTENANCE_IO_CONCURRENCY 10
+#else
+#define DEFAULT_EFFECTIVE_IO_CONCURRENCY 0
+#define DEFAULT_MAINTENANCE_IO_CONCURRENCY 0
+#endif

These don't make sense without prefetching available. Perhaps that's obvious enough when reading the code, but I would still add a small note.
--
Michael

signature.asc
Description: PGP signature
Merging LatchWaitSet and FeBeWaitSet
Hi,

Currently all backends have LatchWaitSet (latch.c), and most also have FeBeWaitSet (pqcomm.c). It's not the end of the world, but it's a little bit wasteful in terms of kernel resources to have two epoll/kqueue descriptors per backend. I wonder if we should consider merging them into a single BackendWaitSet.

The reason they exist is that callers of WaitLatch() might be woken by the kernel just because data appears on the FeBe socket. One idea is that we could assume socket readiness events are rare enough at WaitLatch() sites that it's enough to disable them lazily if they are reported. The FeBe code already adjusts as required. For example, if you're waiting for a heavyweight lock or condition variable while executing a query, and pipelined-query or COPY data arrives, you'll spuriously wake up, but only once, and not again until you eventually reach an FeBe read, all queued socket data is drained, and more data arrives.

Sketch patch attached. Just an idea, not putting it into the commitfest yet.

(Besides the wasted kernel resources, I also speculate that things get pretty confusing if you try to switch to completion-based APIs for more efficient socket I/O on various OSes, depending on how you implement latches. I have some handwavy theories about various schemes to achieve that on Linux, Windows and FreeBSD, with various different problems relating to the existence of two kernel objects. Which is a bit more fuel for my early suspicion that postgres_fdw, which currently creates and destroys WES, should eventually also use BackendWaitSet, which should be dynamically resizing. But that's for another time.)

From e3c4bb6245d0329f53d5545f487c996da1948ade Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 27 Oct 2022 16:37:28 +1300
Subject: [PATCH] Merge FeBeWaitSet and LatchWaitSet.

Instead of using two epoll/kqueue kernel objects and descriptors in every backend process, create a new common one called BackendWaitSet.
In backends that are connected to a FeBe socket, WaitLatch() might occasionally report a socket readiness event if data arrives, but that's not expected to happen much and can be lazily suppressed, and reenabled at the next FeBe wait. --- src/backend/libpq/be-secure.c | 10 ++-- src/backend/libpq/pqcomm.c | 21 +++- src/backend/replication/walsender.c | 4 +- src/backend/storage/ipc/latch.c | 74 +++-- src/backend/utils/init/miscinit.c | 12 ++--- src/include/libpq/libpq.h | 6 --- src/include/storage/latch.h | 9 +++- 7 files changed, 77 insertions(+), 59 deletions(-) diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c index e3e54713e8..431abd58ce 100644 --- a/src/backend/libpq/be-secure.c +++ b/src/backend/libpq/be-secure.c @@ -179,9 +179,10 @@ retry: Assert(waitfor); - ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, waitfor, NULL); + ModifyWaitEvent(BackendWaitSet, BackendWaitSetLatchPos, WL_LATCH_SET, MyLatch); + ModifyWaitEvent(BackendWaitSet, BackendWaitSetSocketPos, waitfor, NULL); - WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1, + WaitEventSetWait(BackendWaitSet, -1 /* no timeout */ , &event, 1, WAIT_EVENT_CLIENT_READ); /* @@ -291,9 +292,10 @@ retry: Assert(waitfor); - ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, waitfor, NULL); + ModifyWaitEvent(BackendWaitSet, BackendWaitSetLatchPos, WL_LATCH_SET, MyLatch); + ModifyWaitEvent(BackendWaitSet, BackendWaitSetSocketPos, waitfor, NULL); - WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1, + WaitEventSetWait(BackendWaitSet, -1 /* no timeout */ , &event, 1, WAIT_EVENT_CLIENT_WRITE); /* See comments in secure_read. 
*/ diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index ce56ab1d41..1762e3a6ba 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -161,8 +161,6 @@ static const PQcommMethods PqCommSocketMethods = { const PQcommMethods *PqCommMethods = &PqCommSocketMethods; -WaitEventSet *FeBeWaitSet; - /* * pq_init - initialize libpq at backend startup @@ -172,7 +170,6 @@ void pq_init(void) { int socket_pos PG_USED_FOR_ASSERTS_ONLY; - int latch_pos PG_USED_FOR_ASSERTS_ONLY; /* initialize state variables */ PqSendBufferSize = PQ_SEND_BUFFER_SIZE; @@ -200,20 +197,14 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents); - socket_pos = AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, + socket_pos = AddWaitEventToSet(BackendWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); - latch_pos = AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, PGINVALID_SOCKET, - MyLatch, NULL); - AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, - NULL, NULL); /* * The event positions match the order
Re: Make mesage at end-of-recovery less scary.
rebased >From 67ce65038ae6a7d5b023b7472df9f9ca9835d5f5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Thu, 7 Jul 2022 11:51:45 +0900 Subject: [PATCH] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 135 +- src/backend/access/transam/xlogrecovery.c | 95 +++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 + 6 files changed, 299 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 93f667b2544..f891a629443 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -48,6 +48,8 @@ static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -558,6 +561,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking) /* reset error state */ 
state->errormsg_buf[0] = '\0'; decoded = NULL; + state->EndOfWAL = false; state->abortedRecPtr = InvalidXLogRecPtr; state->missingContrecPtr = InvalidXLogRecPtr; @@ -640,25 +644,21 @@ restart: Assert(pageHeaderSize <= readOff); /* - * Read the record length. + * Validate the record header. * - * NB: Even though we use an XLogRecord pointer here, the whole record - * header might not fit on this page. xl_tot_len is the first field of the - * struct, so it must be on this page (the records are MAXALIGNed), but we - * cannot access any other fields until we've verified that we got the - * whole header. - */ - record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; - - /* - * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header after reading it from the next page. The xl_tot_len + * Even though we use an XLogRecord pointer here, the whole record header + * might not fit on this page. If the whole record header is on this page, + * validate it immediately. Even otherwise xl_tot_len must be on this page + * (it is the first field of MAXALIGNed records), but we still cannot + * access any further fields until we've verified that we got the whole + * header, so do just a basic sanity check on record length, and validate + * the rest of the header after reading it from the next page. The length * check is necessary here to ensure that we enter the "Need to reassemble * record" code path below; otherwise we might fail to apply * ValidXLogRecordHeader at all. 
*/ + record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); + if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) { if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record, @@ -668,18 +668,14 @@ restart: } else { - /* XXX: more validation should be done here */ - if (total_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, total_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) goto err; - } + gotheader = false; } + total_len = record->xl_tot_len; + /* * Find space to decode this record. Don't allow oversized allocation if * the caller requested nonblocking. Otherwise, we *have* to try to @@ -1106,16 +1102,47 @@ XLogReaderInvalReadState(XLogReaderState *state) } /* - * Validate an XLOG record header. + * Validate record length of an XLOG record header. * - * This is just a convenience subroutine to avoid duplicated code in - * XLogReadRecord. It's not intended for use from anywhere else. + * This is substantially a part of ValidXLogRecordHeader. But XL
Re: Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE
On Fri, Oct 28, 2022 at 12:08 AM vignesh C wrote:
>
> Hi,
>
> Tab completion for the ALTER FUNCTION/PROCEDURE/ROUTINE actions was
> missing; this patch adds the tab completion for the same.
>
> Regards,
> Vignesh

Hi, I applied your patch and did some tests. Is it okay not to cover the SET and RESET commands (e.g., ALTER FUNCTION)?

---
Regards,
DongWook Lee.
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Oct 27, 2022 at 03:58:25PM -0700, Andres Freund wrote:
> The block sizes don't need to match, do they? As long as the block is properly
> aligned, we can change the iov_len of the final iov to match whatever the size
> is being passed in, no?

Hmm. Based on what Bharath has written upthread, it does not seem to matter if the size of the aligned block changes, either:
https://www.postgresql.org/message-id/calj2acuccjr7kbkqwosqmqh1zgedyj7hh5ef+dohcv7+kon...@mail.gmail.com

I am honestly not sure whether it is a good idea to make file_utils.c depend on one of the compile-time page sizes in this routine, be it the data page size or the WAL page size, as pg_write_zeros() would be used for some rather low-level operations. But we could just as well use a locally-defined structure with a buffer of 4kB or 8kB and call it a day?
--
Michael

signature.asc
Description: PGP signature
Re: Allow file inclusion in pg_hba and pg_ident files
On Thu, Oct 27, 2022 at 12:26:25PM +0800, Julien Rouhaud wrote:
> On Thu, Oct 27, 2022 at 12:08:31PM +0900, Michael Paquier wrote:
>>
>> Putting things afresh, there are two different things here (sorry I
>> need to see that typed ;p):
>> 1) How do we want to check reliably the loading of the HBA and ident
>> files on errors?
>
> I guess you meant the failure to load HBA / ident files containing invalid
> data?

Yeah.

>> Hmm. And what if we just gave up on the checks for error patterns in
>> pg_hba_file_rules?

One part I was thinking about when typing this yesterday is that an EXEC_BACKEND build should work on non-WIN32 in TAP even if pg_ident.conf cannot be loaded, but I forgot entirely about the part where we need a user mapping for the SSPI authentication on WIN32, as set by pg_regress.

> We discussed this problem in the past (1), and my understanding was that
> detecting a -DEXEC_BACKEND/Win32 build and skipping those tests in that case
> would be an acceptable solution to make sure there's at least some coverage.
> The proposed patch adds such an approach, making sure that the failure is due
> to an invalid HBA file. If you changed your mind I can remove that part, but
> again I'd like to be sure of what you exactly want before starting to rewrite
> stuff.

I am still not completely sure what's the best way to do things here, so let's do the following: let's keep the patch the way you think is better for now (I may change my opinion on that but I'll hack on that by myself anyway). Using what you have as a base, could you split the test and reduce it to its simplest form to ease its review? It would let us stress the buildfarm with a first version of the test and see how it goes from there, and it is useful by itself IMO as HEAD has zero coverage for this area.

> I'm not sure what you mean here. The patch does check for all the errors,
> looking at LOG lines and CONTEXT lines, but to make the regexp easier it
> doesn't try to make sure that each CONTEXT line is immediately following the
> expected LOG line.

Hmm. Perhaps we'd better make sure that the LOG/CONTEXT link is checked? The CONTEXT line includes the line number inside a generic sentence, while the LOG line provides all the details of the error happening.

> That's why the errors are divided in 2 steps: a first step with a single error
> using some inclusion, so we can validate that the CONTEXT line is entirely
> correct (wrt. line number and such), and then every possible error pattern,
> where we assume that the CONTEXT lines are still following their LOG entries if
> they're found. It also has the knowledge of which errors add a CONTEXT line
> and which don't. And that's done twice, for HBA and ident.

Okay, so you do check the relationship between both, after all.

> The part 3 is just concatenating everything back, for HBA and ident. So
> long-term maintenance shouldn't get any harder, as there won't be any need for
> more steps. We can just keep appending stuff in the 2nd step and all the
> tests should run as expected.

Hmm. Okay.
--
Michael

signature.asc
Description: PGP signature
Re: Perform streaming logical transactions by background workers and parallel apply
On Thu, Oct 27, 2022 at 11:34 AM shiy.f...@fujitsu.com wrote: > > On Wed, Oct 26, 2022 7:19 PM Amit Kapila wrote: > > > > On Tue, Oct 25, 2022 at 8:38 AM Masahiko Sawada > > wrote: > > > > > > On Fri, Oct 21, 2022 at 6:32 PM houzj.f...@fujitsu.com > > > wrote: > > > > > > I've started to review this patch. I tested v40-0001 patch and have > > > one question: > > > > > > IIUC even when most of the changes in the transaction are filtered out > > > in pgoutput (eg., by relation filter or row filter), the walsender > > > sends STREAM_START. This means that the subscriber could end up > > > launching parallel apply workers also for almost empty (and streamed) > > > transactions. For example, I created three subscriptions each of which > > > subscribes to a different table. When I loaded a large amount of data > > > into one table, all three (leader) apply workers received START_STREAM > > > and launched their parallel apply workers. > > > > > > > The apply workers will be launched just the first time then we > > maintain a pool so that we don't need to restart them. > > > > > However, two of them > > > finished without applying any data. I think this behaviour looks > > > problematic since it wastes workers and rather decreases the apply > > > performance if the changes are not large. Is it worth considering a > > > way to delay launching a parallel apply worker until we find out the > > > amount of changes is actually large? > > > > > > > I think even if changes are less there may not be much difference > > because we have observed that the performance improvement comes from > > not writing to file. > > > > > For example, the leader worker > > > writes the streamed changes to files as usual and launches a parallel > > > worker if the amount of changes exceeds a threshold or the leader > > > receives the second segment. After that, the leader worker switches to > > > send the streamed changes to parallel workers via shm_mq instead of > > > files. 
> > >
>
> > I think writing to file won't be a good idea as that can hamper the
> > performance benefit in some cases, and I'm not sure if it is worth it.
>
> I tried to test some cases where only a small part of the transaction, or an
> empty transaction, is sent to the subscriber, to see if using streaming
> parallel brings any performance degradation.
>
> The test was performed ten times, and the average was taken.
> The results are as follows. The details and the script of the test are
> attached.
>
> 10% of rows are sent
> --
> HEAD     24.4595
> patched  18.4545
>
> 5% of rows are sent
> --
> HEAD     21.244
> patched  17.9655
>
> 0% of rows are sent
> --
> HEAD     18.0605
> patched  17.893
>
> It shows that when only 5% or 10% of rows are sent to the subscriber, using
> parallel apply takes less time than HEAD, and even if all rows are filtered
> there's no performance degradation.

Thank you for the testing! I think this performance improvement comes from both applying changes in parallel to receiving them and avoiding writing a file. I'm happy to know there is a benefit also for small streaming transactions. I've also measured the overhead of processing streaming empty transactions and confirmed that it is negligible.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Support logical replication of DDLs
On Fri, Oct 28, 2022 at 11:20 AM Zheng Li wrote: > > > 1. It might be useful to add this thread to the commitfest, if only so > > the cfbot can discover the latest patch set and alert about any rebase > > problems. > > There is already a commitfest entry for the thread that I added back in March: > https://commitfest.postgresql.org/40/3595/ Sorry, I missed that earlier because I searched only by authors, and some were missing. Now I saw it has just been updated - thanks. -- Kind Regards, Peter Smith. Fujitsu Australia.
Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)
On Fri, Dec 13, 2019 at 03:03:47PM +1300, Thomas Munro wrote: > > Actually, I tried using pg_ls_tmpdir(), but it unconditionally masks > > non-regular files and thus shared filesets. Maybe that's worth > > discussion on a new thread ? > > > > src/backend/utils/adt/genfile.c > > /* Ignore anything but regular files */ > > if (!S_ISREG(attrib.st_mode)) > > continue; > > +1, that's worth fixing. @cfbot: rebased on eddc128be. >From f641f63a1ff3efd90a3831927c758636cc82d080 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Mon, 16 Mar 2020 14:12:55 -0500 Subject: [PATCH v37 01/11] Document historic behavior of links to directories.. Backpatch to 9.5: pg_stat_file --- doc/src/sgml/func.sgml | 4 1 file changed, 4 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 6e0425cb3dc..d958c3e74ac 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -27700,6 +27700,10 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); platforms only), file creation time stamp (Windows only), and a flag indicating if it is a directory. + + If filename is a link, this function returns information about the file + or directory the link refers to. + This function is restricted to superusers by default, but other users can be granted EXECUTE to run the function. 
-- 2.25.1 >From 3f28fa01fa8e01b633e5c2a24ba4deed4abe6053 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 17 Mar 2020 13:16:24 -0500 Subject: [PATCH v37 02/11] Add tests before changing pg_ls_* --- src/test/regress/expected/misc_functions.out | 59 src/test/regress/sql/misc_functions.sql | 15 + 2 files changed, 74 insertions(+) diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out index 88bb696ded8..77d285ecc85 100644 --- a/src/test/regress/expected/misc_functions.out +++ b/src/test/regress/expected/misc_functions.out @@ -480,6 +480,65 @@ select count(*) > 0 from t (1 row) +select * from (select pg_ls_dir('.', false, true) as name) as ls where ls.name='.'; -- include_dot_dirs=true + name +-- + . +(1 row) + +select * from (select pg_ls_dir('.', false, false) as name) as ls where ls.name='.'; -- include_dot_dirs=false + name +-- +(0 rows) + +select pg_ls_dir('does not exist', true, false); -- ok with missingok=true + pg_ls_dir +--- +(0 rows) + +select pg_ls_dir('does not exist'); -- fails with missingok=false +ERROR: could not open directory "does not exist": No such file or directory +-- Check that expected columns are present +select * from pg_ls_archive_statusdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_logdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_logicalmapdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_logicalsnapdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_replslotdir('') limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_tmpdir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_ls_waldir() limit 0; + name | size | modification +--+--+-- +(0 rows) + +select * from pg_stat_file('.') limit 0; + size | access | modification | change | creation | isdir +--++--++--+--- +(0 rows) + -- -- 
Test replication slot directory functions -- diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql index b07e9e8dbb3..d299f3d8949 100644 --- a/src/test/regress/sql/misc_functions.sql +++ b/src/test/regress/sql/misc_functions.sql @@ -166,6 +166,21 @@ select count(*) > 0 from where spcname = 'pg_default') pts join pg_database db on pts.pts = db.oid; +select * from (select pg_ls_dir('.', false, true) as name) as ls where ls.name='.'; -- include_dot_dirs=true +select * from (select pg_ls_dir('.', false, false) as name) as ls where ls.name='.'; -- include_dot_dirs=false +select pg_ls_dir('does not exist', true, false); -- ok with missingok=true +select pg_ls_dir('does not exist'); -- fails with missingok=false + +-- Check that expected columns are present +select * from pg_ls_archive_statusdir() limit 0; +select * from pg_ls_logdir() limit 0; +select * from pg_ls_logicalmapdir() limit 0; +select * from pg_ls_logicalsnapdir() limit 0; +select * from pg_ls_replslotdir('') limit 0; +select * from pg_ls_tmpdir() limit 0; +select * from pg_ls_waldir() limit 0; +select * from pg_stat_file('.') limit 0; + -- -- Test replication slot directory functions -- -- 2.25.1 >From 2fcff37d19869c83eab7c8116946854f52a20515 Mon Sep
Re: Support logical replication of DDLs
> Hi, authors on this thread. > > The patch v32-0001 is very large, so it will take some time to review > the code in detail. Thanks for reviewing! > Meanwhile, here are some general comments about the patch: > > == > > 1. It might be useful to add this thread to the commitfest, if only so > the cfbot can discover the latest patch set and alert about any rebase > problems. There is already a commitfest entry for the thread that I added back in March: https://commitfest.postgresql.org/40/3595/ > 2. User interface design/documentation? > > Please consider adding user interface documentation, so it is > available for review sooner than later. This comment was previously > posted 2 weeks ago [1] but no replies. > > I can only guess (from the test of patch 0004) that the idea is to use > another 'ddl' option for the 'publish' parameter: > CREATE PUBLICATION mypub FOR ALL TABLES with (publish = 'insert, > update, delete, ddl'); > > My first impression is that a blanket option like that could be > painful for some users who DO (for example) want the convenience of > the DDL replication automagically creating new tables on the fly, but > maybe they do NOT want the side-effect of replicating every other kind > of DDL as well. > > Maybe such scenarios can be handled by another publication parameter > which can allow more fine-grained DDL replication like: > CREATE PUBLICATION mypub FOR ALL TABLES WITH (ddl = 'tables') > > I also have lots of usability questions but probably the documentation > would give the answers to those. IMO the docs for the user interface > and runtime behaviours should not be deferred - they should form part > of this patch 0001. We've been deferring the discussion on user interface syntax (and documentation) until we get the DDL deparser in a good shape. I agree it's time to pick up the discussion again now that we're getting close to fully integrating the DDL deparser with DDL replication. 
I think it makes sense to introduce different DDL replication granularity levels; for example, I think the most important levels would be ddl = 'tables' and ddl = 'database' (or ddl = 'all').

> 5. File size
>
> As mentioned above in #4, the src/backend/commands/ddl_deparse.c is
> huge (9200+ lines as at v32-0001). It is already unwieldy. Is there
> some way to reduce this? For example, perhaps many of those
> "utility/helper" functions (even though they are static) would be
> better moved out to another file simply to get things down to a more
> manageable size.

Yes, I think we can split patch 0001 into a bare-bones patch for a few essential commands and a patch for the rest of the commands, for ease of review.

Another topic we haven't discussed is the ownership of the replicated objects. Currently all the replicated objects are owned by the subscription owner, regardless of their owners in the publisher database. I think we can consider making this user-configurable so that the ownership of the replicated objects matches that of their original owners, in certain use cases such as a full-database logical replica scenario. Otherwise the DBA will have to fix the ownership structure manually, which could be painful. Thoughts?

Regards,
Zheng
Re: Requiring 32 bit atomics
Hi, On 2022-10-27 19:44:13 -0400, Tom Lane wrote: > Turns out they have a pretty cute workaround for it, on HPPA and a couple of > other atomics-less arches they still support. They've written short > sequences that have the effect of CAS and are designed to store to memory > only at the end. To make them atomic, libc asks the kernel "pretty please, > if you happen to notice that I've been interrupted in the PC range from here > to here, would you reset the PC to the start of that before returning?". That sounds roughly like restartable sequences in the linux world - a pretty cool feature. It's too bad that it's not yet available everywhere, it does make some things a lot easier [to make performant]. > Anyway, I think the big picture here is that nowadays we could > assume that the platform offers this feature. Agreed. Greetings, Andres Freund
Re: Requiring 32 bit atomics
I wrote: > But wait, you say, what about mamba-nee-gaur, my HPPA dinosaur? sigh ... s/mamba/chickadee/. Got too many NetBSD machines, perhaps. regards, tom lane
Re: Requiring 32 bit atomics
Thomas Munro writes: > We have fallback code for computers that don't have 32 bit atomic ops. > Of course all modern ISAs have 32 bit atomics, but various comments > imagine that a new architecture might be born that we don't have > support for yet, so the fallback provides a way to bring a new system > up by implementing only the spinlock operations and emulating the > rest. This seems pretty strange to me: by the time someone brings an > SMP kernel up on a hypothetical new architecture and gets around to > porting relational databases, it's hard to imagine that the compiler > builtins and C11 atomic support wouldn't be working. Fair point. Another point you could make is that we no longer have any test coverage for machines without 32-bit atomic ops. But wait, you say, what about mamba-nee-gaur, my HPPA dinosaur? The only actual hardware support there is equivalent to TAS(); nonetheless, if you read mamba's configure report you'll see it claims to have atomic ops. I wondered if NetBSD was implementing that by using kernel calls to disable interrupts, or something equally badly-performing. Turns out they have a pretty cute workaround for it, on HPPA and a couple of other atomics-less arches they still support. They've written short sequences that have the effect of CAS and are designed to store to memory only at the end. To make them atomic, libc asks the kernel "pretty please, if you happen to notice that I've been interrupted in the PC range from here to here, would you reset the PC to the start of that before returning?". At least on HPPA, this is implemented for 8-bit, 16-bit, and 32-bit CAS and then all the other standard atomics are implemented on top of that, so that the kernel doesn't spend too much time checking for these address ranges when it takes an interrupt. Of course this only works on single-CPU machines. On multi-CPU there's a completely different implementation that I've not spent time looking at ... but I assume the performance is a lot worse. 
Anyway, I think the big picture here is that nowadays we could assume that the platform offers this feature. regards, tom lane
Re: Proposal to use JSON for Postgres Parser format
Hi, On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
> There are certainly reasons to think about changing the node tree
> storage format; but if we change it, I'd like to see it go to something
> more compact not more verbose.

Very much seconded - the various pg_node_trees are a quite significant fraction of the overall size of an empty database. And they're not particularly useful for a human either. IIRC it's not just catalog storage that's affected; the format is also relevant for parallel query.

My pet peeve is the way datums are output: as individual bytes, each printed as an integer - which for narrow fixed-width datums includes a lot of 0's for bytes that aren't even used in the datum.

> Maybe a compromise could be found whereby we provide a conversion function
> that converts whatever the catalog storage format is to some JSON
> equivalent. That would address the needs of external code that doesn't want
> to write a custom parser, while not tying us directly to JSON.

+1

Greetings, Andres Freund
Re: Support logical replication of DDLs
Hi, authors on this thread. The patch v32-0001 is very large, so it will take some time to review the code in detail. Meanwhile, here are some general comments about the patch:

==

1. It might be useful to add this thread to the commitfest, if only so the cfbot can discover the latest patch set and alert about any rebase problems.

~~~

2. User interface design/documentation?

Please consider adding user interface documentation, so it is available for review sooner rather than later. This comment was previously posted 2 weeks ago [1] but got no replies. I can only guess (from the test of patch 0004) that the idea is to use another 'ddl' option for the 'publish' parameter:

CREATE PUBLICATION mypub FOR ALL TABLES with (publish = 'insert, update, delete, ddl');

My first impression is that a blanket option like that could be painful for some users who DO (for example) want the convenience of the DDL replication automagically creating new tables on the fly, but maybe they do NOT want the side-effect of replicating every other kind of DDL as well. Maybe such scenarios can be handled by another publication parameter which can allow more fine-grained DDL replication like:

CREATE PUBLICATION mypub FOR ALL TABLES WITH (ddl = 'tables')

I also have lots of usability questions, but probably the documentation would give the answers to those. IMO the docs for the user interface and runtime behaviours should not be deferred - they should form part of this patch 0001.

~~~

3. Why all-or-nothing?

The current strategy for this thread appears to be to implement *everything* in the underlying code, and then figure out what to do with it. I'm not sure if the all-or-nothing approach is best - it just feels risky to me, so I hope it will not result in a ton of work that ends up unused. Why not instead implement just some core set of the DDL replications that are the most wanted ones (e.g. create/alter/drop tables/whatever?) and try to get that subset committed first?
Then the remainder can be added incrementally. But this harks back to comment #2: the user interface would need to allow flexibility to do it like this.

~~~

4. Function name inconsistency.

Even if it was not obvious from the posts, it is clear there are multiple authors. As an example, the file src/backend/commands/ddl_deparse.c is 9200+ lines (and growing rapidly), and the functions in this module are a mixture of many different naming conventions, scattered around the source file in different ways. I suggest this all needs to be brought under some control ASAP, by introducing some strict naming convention and sticking to it. For example, you might do something like:

* xxx_util_foo()
* xxx_util_bah()
* xxx_deparse_alter()
* xxx_deparse_create()
* xxx_whatever()

where xxx is the main object (table, sequence, schema, etc). Then order everything alphabetically, so that related stuff ends up together. IMO grouping functions like this will also make reviewing different objects far easier.

~~~

5. File size

As mentioned above in #4, the src/backend/commands/ddl_deparse.c is huge (9200+ lines as at v32-0001). It is already unwieldy. Is there some way to reduce this? For example, perhaps many of those "utility/helper" functions (even though they are static) would be better moved out to another file, simply to get things down to a more manageable size.

--

[1] https://www.postgresql.org/message-id/CAHut%2BPtKiTmcQ7zXs6YvR-qtuMQ9wgffnfamqCAVpM_ETa2LCg%40mail.gmail.com

Kind Regards, Peter Smith. Fujitsu Australia
Re: pg_recvlogical prints bogus error when interrupted
Hi, On 2022-10-24 08:15:11 +0530, Bharath Rupireddy wrote:
> I came up with the attached v2 patch, please have a look.

Thanks for working on this!

> + /* When we get SIGINT/SIGTERM, we exit */
> + if (ready_to_exit)
> + {
> + /*
> + * Try informing the server about our exit, but don't wait around
> + * or retry on failure.
> + */
> + (void) PQputCopyEnd(conn, NULL);
> + (void) PQflush(conn);
> + time_to_abort = ready_to_exit;

This doesn't strike me as great - because ready_to_exit isn't checked in the loop around StreamLogicalLog(), we'll reconnect if something else causes StreamLogicalLog() to return. Why do we need both time_to_abort and ready_to_exit?

Perhaps worth noting that time_to_abort is still a sig_atomic_t, but isn't modified in a signal handler, which seems a bit unnecessarily confusing.

Greetings, Andres Freund
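For contrast, the case where sig_atomic_t *is* warranted — a flag genuinely set from a signal handler — looks like this minimal sketch (names are ours for illustration, not pg_recvlogical's actual code):

```c
#include <assert.h>
#include <signal.h>

/* a flag set from a signal handler must be volatile sig_atomic_t */
static volatile sig_atomic_t got_exit_signal = 0;

static void
exit_handler(int signo)
{
	got_exit_signal = 1;		/* the only async-signal-safe thing we do */
}

/* install the handler, simulate Ctrl-C, and report what the main loop sees */
static int
signal_flag_demo(void)
{
	signal(SIGINT, exit_handler);
	raise(SIGINT);				/* delivered synchronously to this thread */
	return got_exit_signal;
}
```

A variable only ever written from normal code, like time_to_abort here, can simply be a plain bool — using sig_atomic_t for it suggests a signal-handler interaction that doesn't exist.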
Re: Documentation for building with meson
Hi, On 2022-10-27 14:15:32 -0700, Jacob Champion wrote: > On Thu, Oct 27, 2022 at 1:04 AM John Naylor > wrote: > > This does not work for me in a fresh install until running > > > > meson test --suite setup > > > > In fact, we see in > > > > https://wiki.postgresql.org/wiki/Meson > > > > meson test --suite setup --suite main > > (Is there a way to declare a dependency on the setup suite in Meson, > so that we don't have to specify it manually? I was bitten by this > recently; if you make a code change and forget to run setup, it'll > recompile locally but then skip reinstallation, giving false test > results.) Tests can have dependencies, and they're correctly built. The problem however is that, for historical reasons if I understand correctly, dependencies of tests are automatically included in the default 'all' target. Which means if you just type in 'ninja', it'd automatically create the test installation - which is probably not what we want, given that that's not a fast step on some platforms. Greetings, Andres Freund
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, Interestingly, I also needed something like pg_pwrite_zeros() today - exposed via smgr, for more efficient relation extensions.

On 2022-10-27 14:54:00 +0900, Michael Paquier wrote:
> Regarding 0002, using pg_pwrite_zeros() as a routine name, as
> suggested by Thomas, sounds good to me. However, I am not really a
> fan of its dependency with PGAlignedXLogBlock, because it should be
> able to work with any buffers of any sizes, as long as the input
> buffer is aligned, shouldn't it? For example, what about
> PGAlignedBlock? So, should we make this more extensible? My guess
> would be the addition of the block size and the block pointer to the
> arguments of pg_pwrite_zeros(), in combination with a check to make
> sure that the input buffer is MAXALIGN()'d (with an Assert() rather
> than just an elog/pg_log_error?).

I don't like passing in the buffer. That leads to code like in Bharath's latest version, where we now zero that buffer on every invocation of pg_pwrite_zeros() - not at all cheap. And every caller has to have provisions to provide that buffer.

The block sizes don't need to match, do they? As long as the block is properly aligned, we can change the iov_len of the final iov to match whatever size is being passed in, no?

Why don't we define a static PGAlignedBlock source_of_zeroes; in file_utils.c, and use that in pg_pwrite_zeros(), being careful to set the iov_len arguments correctly?

Greetings, Andres Freund
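The scheme suggested here — one static zeroed block reused for every call, with the final iov_len trimmed so arbitrary sizes work — might look roughly like this POSIX sketch (names, constants, and error handling are ours; this is not the real pg_pwrite_zeros()):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

#define ZBLCKSZ 8192			/* stand-in for a PGAlignedBlock-sized buffer */

/* static storage is zero-initialized once, for free - no per-call memset */
static const char zeroes[ZBLCKSZ];

static ssize_t
pwrite_zeros_sketch(int fd, size_t size, off_t offset)
{
	size_t		done = 0;

	while (done < size)
	{
		struct iovec iov[4];
		size_t		remaining = size - done;
		int			n = 0;

		while (n < 4 && remaining > 0)
		{
			size_t		chunk = remaining < ZBLCKSZ ? remaining : ZBLCKSZ;

			iov[n].iov_base = (void *) zeroes;
			iov[n].iov_len = chunk;	/* the last iov is trimmed to the tail */
			remaining -= chunk;
			n++;
		}

		ssize_t		w = pwritev(fd, iov, n, (off_t) (offset + done));

		if (w < 0)
			return -1;
		done += (size_t) w;		/* short writes simply loop again */
	}
	return (ssize_t) done;
}

/* write an odd, non-block-multiple number of zero bytes and check the size */
static long
pwrite_zeros_demo(void)
{
	char		path[] = "/tmp/pwrite_zeros_demo_XXXXXX";
	int			fd = mkstemp(path);
	struct stat st;

	if (fd < 0)
		return -1;
	unlink(path);
	if (pwrite_zeros_sketch(fd, 20001, 0) != 20001)
	{
		close(fd);
		return -1;
	}
	fstat(fd, &st);
	close(fd);
	return (long) st.st_size;
}
```

The key property is that no caller needs to supply (or zero) a buffer, and the requested size need not be a multiple of the block size.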
Requiring 32 bit atomics
Hi, We have fallback code for computers that don't have 32 bit atomic ops. Of course all modern ISAs have 32 bit atomics, but various comments imagine that a new architecture might be born that we don't have support for yet, so the fallback provides a way to bring a new system up by implementing only the spinlock operations and emulating the rest. This seems pretty strange to me: by the time someone brings an SMP kernel up on a hypothetical new architecture and gets around to porting relational databases, it's hard to imagine that the compiler builtins and C11 atomic support wouldn't be working. I suppose this could be considered in the spirit of recent cleanup of obsolete code in v16.

The specific reason I'm interested is that I have a couple of different experimental patches in development that would like to use atomic ops from a signal handler, which is against the law if they're emulated with spinlocks due to self-deadlock. Not sure if it's really a blocker, I can surely find some way to code around the limitation (I want to collapse a lot of flags into a single word and set them with fetch_or), but it seemed a little weird to have to do so for such an unlikely hypothetical consideration. (64 bit atomics are another matter, real hardware exists that doesn't have them.)

No patch yet, just running a flame-proof flag up the pole before investing effort...
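The flags-in-one-word pattern mentioned above might look like this C11 sketch (the flag names are invented for illustration); the point is that atomic_fetch_or from a signal handler is only safe when the atomics are real, lock-free instructions rather than spinlock emulation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* hypothetical flag bits collapsed into a single 32 bit word */
#define FLAG_LATCH_SET		(1U << 0)
#define FLAG_PROC_SIGNAL	(1U << 1)
#define FLAG_TIMEOUT		(1U << 2)

static _Atomic uint32_t pending_flags;

/* could be called from a signal handler: one lock-free RMW, no deadlock risk */
static void
set_flag(uint32_t flag)
{
	atomic_fetch_or(&pending_flags, flag);
}

/* the main loop grabs and clears all pending flags in one atomic swap */
static uint32_t
consume_flags(void)
{
	return atomic_exchange(&pending_flags, 0);
}

static int
flags_demo(void)
{
	set_flag(FLAG_LATCH_SET);
	set_flag(FLAG_TIMEOUT);
	if (consume_flags() != (FLAG_LATCH_SET | FLAG_TIMEOUT))
		return 0;
	return consume_flags() == 0;	/* everything consumed in one go */
}
```

With spinlock-emulated atomics, set_flag() called from a handler that interrupted another atomic op on the same word would self-deadlock — which is exactly the restriction being complained about.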
Re: Documentation for building with meson
On Thu, Oct 27, 2022 at 1:04 AM John Naylor wrote: > This does not work for me in a fresh install until running > > meson test --suite setup > > In fact, we see in > > https://wiki.postgresql.org/wiki/Meson > > meson test --suite setup --suite main (Is there a way to declare a dependency on the setup suite in Meson, so that we don't have to specify it manually? I was bitten by this recently; if you make a code change and forget to run setup, it'll recompile locally but then skip reinstallation, giving false test results.) --Jacob
Re: Have nodeSort.c use datum sorts single-value byref types
On Wed, 26 Oct 2022 at 23:35, David Rowley wrote: > I think this is a fairly trivial patch, so if nobody objects, I plan > to push it in the next few days. Pushed. David
Re: Allow single table VACUUM in transaction block
On Thu, Oct 27, 2022 at 10:31:31AM +0100, Simon Riggs wrote:
> Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL.

Maybe I misunderstood what you meant: you said "not VACUUM FULL", but with your patch, that works:

postgres=# begin; VACUUM FULL pg_class; commit;
BEGIN
VACUUM
COMMIT

Actually, I've thought before that it was a bit weird that CLUSTER can be run within a transaction, but VACUUM FULL cannot (even though it does a CLUSTER behind the scenes). VACUUM FULL can process multiple relations, whereas CLUSTER can't, but it seems nice to allow vacuum full for the case of a single relation.

I haven't checked the rest of the patch, but +1 for allowing VACUUM FULL within a user txn. Maybe the error message needs to be qualified "...when multiple relations are specified".

ERROR: VACUUM cannot run inside a transaction block

-- Justin
Re: Reducing planning time on tables with many indexes
On 2022-Aug-19, David Geier wrote: > Beyond that I did some off-CPU profiling to precisely track down which lock > serializes execution. It turned out to be the MyProc::fpInfoLock lightweight > lock. This lock is used in the fast path of the heavyweight lock. In the > contenting case, fpInfoLock is acquired in LW_EXCLUSIVE mode to (1) check if > there is no other process holding a stronger lock, and if not, to reserve a > process local fast path lock slot and (2) to return the fast path lock slots > all in one go. To do so, the current implementation always linearly iterates > over all lock slots. Ah, so this is the aspect that you mentioned to me today. I definitely think that this analysis deserves its own thread, and the fix is its own separate patch. > I have attached the patch to improve the heavyweight lock fast path. It also > for now contains moving out _bt_getrootheight(). For workloads where the > same set of locks is used over and over again, it only needs on average a > single loop iteration to find the relation (instead of a linear scan > before). This allows to increase the number of fast path locks by a lot. In > this patch I increased them from 16 to 64. The code can be further improved > for cases where to be locked relations change frequently and therefore the > chance of not finding a relation and because of that having to linearly > search the whole array is higher. I suggest to put each change in a separate patch: 1. improve fast-path lock algorithm to find the element, perhaps together with increasing the number of elements in the array 2. change _bt_getrootheight However, since patch (1) may have nontrivial performance implications, you would also need to justify the change: not only that improves the case where many locks are acquired, but also that it does not make the case with few locks worse. I strongly suggest to not include C++ comments or any other dirtiness in the patch, as that might deter some potential reviewers. 
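As a toy illustration of the fast-path idea discussed above (not the actual patch — PostgreSQL currently has 16 slots and different data structures; the names and constants here are ours), starting the probe at a position derived from the relation id lets a stable workload find its slot in roughly one step instead of a linear scan over all slots:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FP_SLOTS 64				/* the patch raises the slot count to 64 */

typedef struct
{
	uint32_t	relid;
	int			used;
} FpSlot;

/*
 * Probe starting at hash(relid): a workload that locks the same relations
 * over and over hits its slot on the first iteration on average, instead
 * of linearly scanning the whole array as the current code does.
 */
static int
fp_find_or_insert(FpSlot *slots, uint32_t relid)
{
	uint32_t	start = relid % FP_SLOTS;
	int			free_idx = -1;
	int			i;

	for (i = 0; i < FP_SLOTS; i++)
	{
		int			idx = (int) ((start + i) % FP_SLOTS);

		if (slots[idx].used && slots[idx].relid == relid)
			return idx;			/* found on the first probe in the stable case */
		if (!slots[idx].used && free_idx < 0)
			free_idx = idx;		/* remember the first free slot */
	}
	if (free_idx >= 0)
	{
		slots[free_idx].relid = relid;
		slots[free_idx].used = 1;
	}
	return free_idx;			/* -1 if the array is full */
}

static int
fp_demo(void)
{
	FpSlot		slots[FP_SLOTS];
	int			a,
				b;

	memset(slots, 0, sizeof(slots));
	a = fp_find_or_insert(slots, 16385);
	b = fp_find_or_insert(slots, 16385);	/* re-acquire: same slot, one probe */
	return a >= 0 && a == b && a == 16385 % FP_SLOTS;
}
```

The worst case (relation not present, array full) is still a full scan, which matches the caveat in the quoted mail about workloads whose locked relations change frequently.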
-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "It takes less than 2 seconds to get to 78% complete; that's a good sign. A few seconds later it's at 90%, but it seems to have stuck there. Did somebody make percentages logarithmic while I wasn't looking?" http://smylers.hates-software.com/2005/09/08/1995c749.html
heavily contended lwlocks with long wait queues scale badly
Hi, I am working on posting a patch series making relation extension more scalable. As part of that I was running some benchmarks for workloads that I thought should be unaffected or just positively impacted - but I was wrong, there was some very significant degradation at very high client counts. After pulling my hair out for quite a while to try to understand that behaviour, I figured out that it's just a side-effect of *removing* some other contention. This morning, turns out sleeping helps, I managed to reproduce it in an unmodified postgres.

$ cat ~/tmp/txid.sql
SELECT txid_current();
$ for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";pgbench -n -M prepared -f ~/tmp/txid.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

1      60174
2     116169
4     208119
8     373685
16    515247
32    554726
64    497508
128   415097
256   334923
512   243679
768   192959
1024  157734
2048   82904
4096   32007

(I didn't properly round TPS, but that doesn't matter here)

Performance completely falls off a cliff starting at ~256 clients. There's actually plenty CPU available here, so this isn't a case of running out of CPU time. Rather, the problem is very bad contention on the "spinlock" for the lwlock wait list. I realized that something in that direction was off when trying to investigate why I was seeing spin delays of substantial duration (>100ms).

The problem isn't a fundamental issue with lwlocks, it's that LWLockDequeueSelf() does this:

LWLockWaitListLock(lock);

/*
 * Can't just remove ourselves from the list, but we need to iterate over
 * all entries as somebody else could have dequeued us.
 */
proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
{
    if (iter.cur == MyProc->pgprocno)
    {
        found = true;
        proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
        break;
    }
}

I.e. it iterates over the whole waitlist to "find itself". The longer the waitlist gets, the longer this takes.
And the longer it takes for LWLockWakeup() to actually wake up all waiters, the more likely it becomes that LWLockDequeueSelf() needs to be called.

We can't make the trivial optimization and use proclist_contains(), because PGPROC->lwWaitLink is also used for the list of processes to wake up in LWLockWakeup(). But I think we can solve that fairly reasonably nonetheless. We can change PGPROC->lwWaiting to not just be a boolean, but have three states:

0: not waiting
1: waiting in waitlist
2: waiting to be woken up

which we then can use in LWLockDequeueSelf() to only remove ourselves from the list if we're on it. As removal from that list is protected by the wait list lock, there's no race to worry about.

client  patched    HEAD
1         60109   60174
2        112694  116169
4        214287  208119
8        377459  373685
16       524132  515247
32       565772  554726
64       587716  497508
128      581297  415097
256      550296  334923
512      486207  243679
768      449673  192959
1024     410836  157734
2048     326224   82904
4096     250252   32007

Not perfect with the patch, but not awful either.

I suspect this issue might actually explain quite a few odd performance behaviours we've seen at the larger end in the past. I think it has gotten a bit worse with the conversion of lwlock.c to proclists (I see lots of expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely exists at least as far back as ab5194e6f61, in 9.5.

I guess there's an argument for considering this a bug that we should backpatch a fix for? But given the vintage, probably not? The only thing that gives me pause is that this is quite hard to pinpoint as happening.

I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc, but I wanted to get this out to discuss before spending further time.
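The proposed tri-state scheme can be modeled with a toy single-threaded sketch (the real code protects the list with the wait-list spinlock; names only loosely follow the patch):

```c
#include <assert.h>
#include <stddef.h>

/* the three proposed states for PGPROC->lwWaiting */
enum
{
	LW_WS_NOT_WAITING = 0,		/* not waiting for a lock */
	LW_WS_WAITING = 1,			/* on the lock's wait list */
	LW_WS_PENDING_WAKEUP = 2	/* removed from the list, wakeup pending */
};

typedef struct Proc
{
	int			lwWaiting;
	struct Proc *next;
} Proc;

typedef struct
{
	Proc	   *head;
} WaitList;

/* waker: detach everyone, marking them off-list before waking them */
static void
wakeup_all(WaitList *list)
{
	Proc	   *p = list->head;

	while (p)
	{
		Proc	   *next = p->next;

		p->next = NULL;
		p->lwWaiting = LW_WS_PENDING_WAKEUP;
		p = next;
	}
	list->head = NULL;
}

/* the fix: scan the list only if we are actually still on it */
static int
dequeue_self(WaitList *list, Proc *me)
{
	Proc	  **pp;

	if (me->lwWaiting != LW_WS_WAITING)
		return 0;				/* waker already removed us: O(1), no scan */
	for (pp = &list->head; *pp && *pp != me; pp = &(*pp)->next)
		;
	if (*pp)
		*pp = me->next;
	me->lwWaiting = LW_WS_NOT_WAITING;
	return 1;
}

static int
lw_demo(void)
{
	Proc		a = {LW_WS_WAITING, NULL};
	Proc		b = {LW_WS_WAITING, NULL};
	WaitList	list;

	a.next = &b;
	list.head = &a;
	wakeup_all(&list);
	/* both were woken first, so neither needs the O(n) list scan */
	return dequeue_self(&list, &a) == 0 &&
		dequeue_self(&list, &b) == 0 &&
		a.lwWaiting == LW_WS_PENDING_WAKEUP;
}
```

The common contended case — the waker got there first — now skips the list walk entirely, which is what flattens the cliff in the benchmark above a few hundred clients.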
Greetings, Andres Freund

diff --git i/src/include/storage/proc.h w/src/include/storage/proc.h
index 8d096fdeeb1..9a2615666a1 100644
--- i/src/include/storage/proc.h
+++ w/src/include/storage/proc.h
@@ -217,7 +217,8 @@ struct PGPROC
 	bool		recoveryConflictPending;
 
 	/* Info about LWLock the process is currently waiting for, if any. */
-	bool		lwWaiting;		/* true if waiting for an LW lock */
+	int			lwWaiting;		/* 0 if not waiting, 1 if on waitlist, 2 if
+								 * waiting to be woken */
 	uint8		lwWaitMode;		/* lwlock mode being waited for */
 	proclist_node lwWaitLink;	/* position in LW lock wait list */
diff --git i/src/backend/storage/lmgr/lwlock.c w/src/backend/storage/lmgr/lwlock.c
index d274c9b1dc9..ffb852c91d7 100644
--- i/src/backend/storage/lmgr/lwlock.c
+++ w/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,9 @@ LWLockWakeup(LWLock *lock)
 			wokeup_somebody = true;
 		}
 
+		/* signal that the process isn't on the wait list anymore */
+		waiter->lwWaiting = 2;
+
Re: Code checks for App Devs, using new options for transaction behavior
On Thu, 27 Oct 2022 at 12:09, Simon Riggs wrote:
> Comments please

Update from patch tester results.

-- Simon Riggs http://www.EnterpriseDB.com/

002_nested_xacts.v7.patch Description: Binary data
001_psql_parse_only.v1.patch Description: Binary data
003_rollback_on_commit.v1.patch Description: Binary data
004_add_params_to_sample.v1.patch Description: Binary data
Re: Allow single table VACUUM in transaction block
On Thu, Oct 27, 2022 at 9:49 PM Simon Riggs wrote: > > On Thu, 27 Oct 2022 at 10:31, Simon Riggs > wrote: > > > Tests, docs. > > The patch tester says that a pg_upgrade test is failing on Windows, > but works for me. > > t/002_pg_upgrade.pl .. ok > > Anybody shed any light on that, much appreciated. Please see a recent thread on pg_upgrade failure - https://www.postgresql.org/message-id/Y04mN0ZLNzJywrad%40paquier.xyz. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Allow single table VACUUM in transaction block
On Thu, 27 Oct 2022 at 10:31, Simon Riggs wrote:
> Tests, docs.

The patch tester says that a pg_upgrade test is failing on Windows, but works for me.

t/002_pg_upgrade.pl .. ok

Anybody shed any light on that, much appreciated.

-- Simon Riggs http://www.EnterpriseDB.com/
Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE
Hi, Tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE action was missing, this patch adds the tab completion for the same. Regards, Vignesh

From 4b0473799a0d6af126ccd3d7802e5d0cbb83b944 Mon Sep 17 00:00:00 2001
From: "vignesh.c"
Date: Thu, 27 Oct 2022 14:00:46 +0530
Subject: [PATCH v1] Tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE.

Tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE action was missing,
added the tab completion for the same.
---
 src/bin/psql/tab-complete.c | 48 +
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a64571215b..bf3c5a7aae 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1805,12 +1805,52 @@ psql_completion(const char *text, int start, int end)
 		else
 			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
 	}
-	/* ALTER FUNCTION,PROCEDURE,ROUTINE (...) */
-	else if (Matches("ALTER", "FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny))
+	/* ALTER FUNCTION (...) */
+	else if (Matches("ALTER", "FUNCTION", MatchAny, MatchAny))
 	{
 		if (ends_with(prev_wd, ')'))
-			COMPLETE_WITH("OWNER TO", "RENAME TO", "SET SCHEMA",
-						  "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION");
+			COMPLETE_WITH("CALLED ON NULL INPUT", "COST",
+						  "DEPENDS ON EXTENSION", "EXTERNAL SECURITY",
+						  "IMMUTABLE", "LEAKPROOF", "NO DEPENDS ON EXTENSION",
+						  "NOT LEAKPROOF", "OWNER TO", "PARALLEL", "RENAME TO",
+						  "RETURNS NULL ON NULL INPUT ", "ROWS", "SECURITY",
+						  "SET SCHEMA", "STABLE", "STRICT", "SUPPORT",
+						  "VOLATILE");
+		else
+			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
+	}
+	/* ALTER FUNCTION|ROUTINE (...) PARALLEL */
+	else if (Matches("ALTER", "FUNCTION|ROUTINE", MatchAny, MatchAny,
+					 "PARALLEL"))
+		COMPLETE_WITH("RESTRICTED", "SAFE", "UNSAFE");
+	/*
+	 * ALTER FUNCTION|PROCEDURE|ROUTINE (...)
+	 * [EXTERNAL] SECURITY
+	 */
+	else if (Matches("ALTER", "FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny,
+					 "SECURITY") ||
+			 Matches("ALTER", "FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny,
+					 "EXTERNAL", "SECURITY"))
+		COMPLETE_WITH("DEFINER" ,"INVOKER");
+	/* ALTER PROCEDURE (...) */
+	else if (Matches("ALTER", "PROCEDURE", MatchAny, MatchAny))
+	{
+		if (ends_with(prev_wd, ')'))
+			COMPLETE_WITH("DEPENDS ON EXTENSION", "EXTERNAL SECURITY",
+						  "NO DEPENDS ON EXTENSION", "OWNER TO", "RENAME TO",
+						  "SECURITY", "SET SCHEMA");
+		else
+			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
+	}
+	/* ALTER ROUTINE (...) */
+	else if (Matches("ALTER", "ROUTINE", MatchAny, MatchAny))
+	{
+		if (ends_with(prev_wd, ')'))
+			COMPLETE_WITH("COST", "DEPENDS ON EXTENSION", "EXTERNAL SECURITY",
+						  "IMMUTABLE", "LEAKPROOF", "NO DEPENDS ON EXTENSION",
+						  "NOT LEAKPROOF", "OWNER TO", "PARALLEL", "RENAME TO",
+						  "ROWS", "SECURITY", "SET SCHEMA", "STABLE",
+						  "VOLATILE");
 		else
 			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
 	}
-- 
2.32.0
Re: Proposal to use JSON for Postgres Parser format
On Wed, Sep 21, 2022 at 11:04 AM Matthias van de Meent < boekewurm+postg...@gmail.com> wrote:
> On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera wrote:
> >
> > On 2022-Sep-20, Matthias van de Meent wrote:
> >
> > > Allow me to add: compressability
> > >
> > > In the thread surrounding [0] there were complaints about the size of
> > > catalogs, and specifically the template database. Significant parts of
> > > that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
> > > consists mostly of serialized Nodes. If we're going to replace our
> > > current NodeToText infrastructure, we'd better know we can effectively
> > > compress this data.
> >
> > True. Currently, the largest ev_action values compress pretty well. I
> > think if we wanted this to be more succint, we would have to invent some
> > binary format -- perhaps something like Protocol Buffers: it'd be stored
> > in the binary format in catalogs, but for output it would be converted
> > into something easy to read (we already do this for
> > pg_statistic_ext_data for example). We'd probably lose compressibility,
> > but that'd be okay because the binary format would already remove most
> > of the redundancy by nature.
> >
> > Do we want to go there?
>
> I don't think that a binary format would be much better for
> debugging/fixing than an optimization of the current textual format
> when combined with compression.

I agree, JSON is not perfect, but it compresses and it's usable everywhere. My personal need for this is purely developer experience and, as Tom pointed out, a "niche" need for sure, but we are starting to do some serious work with Dan Lynch's plpgsql deparser tool to generate RLS policies from meta schema models, and having the same format come out of the parser would make a complete end-to-end solution for us, especially if we can get this data from a function in a ddl_command_start event trigger.
Dan also writes a popular deparser for Javascript, and unifying the formats across these tools would be a big win for us.

> As I mentioned in that thread, there is a lot of improvement possible
> with the existing format, and I think any debugging of serialized nodes
> would greatly benefit from using a textual format.

Agreed.

> Then again, I also agree that this argument doesn't hold its weight
> when storage and output formats are going to be different. I trust
> that any new tooling introduced as a result of this thread will be
> better than what we have right now.

Separating formats seems like a lot of work to me, to get what might not be a huge improvement over compressing JSON, for what seems unlikely to be more than a few megabytes of parsed SQL.

> As for best format: I don't know. The current format is usable, and a
> better format would not store any data for default values. JSON can do
> that, but I could think of many formats that could do the same (Smile,
> BSON, xml, etc.).
>
> I do not think that protobuf is the best choice for storage, though,
> because it has its own rules on what it considers a default value and
> what it does or does not serialize: zero is considered the only
> default for numbers, as is the empty string for text, etc.
> I think it is allright for general use, but with e.g. `location: -1`
> in just about every parse node we'd probably want to select our own
> values to ignore during (de)serialization of fields.

Agreed. Thank you everyone who has contributed to this thread, I'm pleased that it got a very spirited debate and I apologize for the delay in getting back to everyone. I'd like to spike on a proposed patch that:

- Converts the existing text format to JSON (or possibly jsonb, considering feedback from this thread)
- Can be stored compressed
- Can be passed to a ddl_command_start event trigger with a function.

Thoughts?

-Michel
Re: Reducing duplicativeness of EquivalenceClass-derived clauses
Hi, On Oct 27, 2022, 21:29 +0800, Tom Lane , wrote: > Zhang Mingli writes: > > How about combine ec->ec_sources and ec->derives as one list for less codes? > > Keeping them separate is required for the broken-EC code paths. > Even if it weren't, I wouldn't merge them just to save a couple > of lines of code --- I think it's useful to be able to tell which > clauses the EC started from. > > regards, tom lane Got it, thanks. Regards, Zhang Mingli
Re: Reducing duplicativeness of EquivalenceClass-derived clauses
Zhang Mingli writes: > How about combine ec->ec_sources and ec->derives as one list for less codes? Keeping them separate is required for the broken-EC code paths. Even if it weren't, I wouldn't merge them just to save a couple of lines of code --- I think it's useful to be able to tell which clauses the EC started from. regards, tom lane
Re: [PROPOSAL] : Use of ORDER BY clause in insert.sql
Nishant Sharma writes: > We would like to share a proposal of a patch, where we have added order by > clause in two select statements in src/test/regress/sql/insert.sql file and > respective changes in src/test/regress/expected/insert.out output file. > This would help in generating output in consistent sequence, as sometimes > we have observed change in sequence in output. Please be specific about the circumstances in which the output is unstable for you. With zero information to go on, it seems about as likely that this change is masking a bug as that it's a good idea. regards, tom lane
Re: Reducing duplicativeness of EquivalenceClass-derived clauses
HI, On Oct 26, 2022, 06:09 +0800, Tom Lane , wrote: > While fooling with my longstanding outer-join variables changes > (I am making progress on that, honest), I happened to notice that > equivclass.c is leaving some money on the table by generating > redundant RestrictInfo clauses. It already attempts to not generate > the same clause twice, which can save a nontrivial amount of work > because we cache selectivity estimates and so on per-RestrictInfo. > I realized though that it will build and return a clause like > "a.x = b.y" even if it already has "b.y = a.x". This is just > wasteful. It's always been the case that equivclass.c will > produce clauses that are ordered according to its own whims. > Consumers that need the operands in a specific order, such as > index scans or hash joins, are required to commute the clause > to be the way they want it while building the finished plan. > Therefore, it shouldn't matter which order of the operands we > return, and giving back the commutator clause if available could > potentially save as much as half of the selectivity-estimation > work we do with these clauses subsequently. > > Hence, PFA a patch that adjusts create_join_clause() to notice > commuted as well as exact matches among the EquivalenceClass's > existing clauses. This results in a number of changes visible in > regression test cases, but they're all clearly inconsequential. > > The only thing that I think might be controversial here is that > I dropped the check for matching operator OID. To preserve that, > we'd have needed to use get_commutator() in the reverse-match cases, > which it seemed to me would be a completely unjustified expenditure > of cycles. The operators we select for freshly-generated clauses > will certainly always match those of previously-generated clauses. 
> Maybe there's a chance that they'd not match those of ec_sources
> clauses (that is, the user-written clauses we started from), but
> if they don't and that actually makes any difference then surely
> we are talking about a buggy opclass definition.
>
> I've not bothered to make any performance tests to see if there's
> actually an easily measurable gain here. Saving some duplicative
> selectivity estimates could be down in the noise ... but it's
> surely worth the tiny number of extra tests added here.
>
> Comments?
>
> regards, tom lane
>

Makes sense. How about combining ec->ec_sources and ec->ec_derives into one list, for less code?

```
foreach(lc, list_union(ec->ec_sources, ec->ec_derives))
{
    rinfo = (RestrictInfo *) lfirst(lc);
    if (rinfo->left_em == leftem &&
        rinfo->right_em == rightem &&
        rinfo->parent_ec == parent_ec)
        return rinfo;
    if (rinfo->left_em == rightem &&
        rinfo->right_em == leftem &&
        rinfo->parent_ec == parent_ec)
        return rinfo;
}
```

I have a try, it will change some in join.out and avoid changes in tidscan.out.

Regards, Zhang Mingli
Re: [PATCHES] Post-special page storage TDE support
Hi

On Mon, 24 Oct 2022, 19:56 David Christensen, < david.christen...@crunchydata.com> wrote:
> > Discussion is welcome and encouraged!

Did you read the related thread with related discussion from last June, "Re: better page-level checksums" [0]? In that I argued that space at the end of a page is already allocated for the AM, and that reserving variable space at the end of the page for non-AM usage is wasting the AM's performance potential.

Apart from that: Is this variable-sized 'metadata' associated with smgr infrastructure only, or is it also available for AM features? If not, then this is a strong -1. The amount of tasks smgr needs to do on a page is generally much less than the amount of tasks an AM needs to do; so in my view the AM has priority in prime page real estate, not smgr or related infrastructure.

re: PageFeatures
I'm not sure I understand the goal, nor the reasoning. Shouldn't this be part of the storage manager (smgr) implementation / can't this be part of the smgr of the relation?

re: use of pd_checksum
I mentioned this in the above-mentioned thread too, in [1], that we could use pd_checksum as an extra area marker for this storage-specific data, which would be located between pd_upper and pd_special.

Re: patch contents

0001:

> + specialSize = MAXALIGN(specialSize) + reserved_page_size;

This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or an assertion that reserved_page_size is MAXALIGNED, would be better.

> PageValidateSpecialPointer(Page page)
> {
> Assert(page);
> - Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> + Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);

This check is incorrect. With your code it would allow pd_special past the end of the block.
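The boundary arithmetic behind this review point is easy to check numerically. A hypothetical sketch with BLCKSZ = 8192 (function names are ours, purely for illustration) shows that the patch's check admits a pd_special that overlaps the reserved tail of the page, while the corrected form rejects it:

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192

/* the patch's check: pd_special - reserved_page_size <= BLCKSZ */
static int
patch_check(uint32_t pd_special, uint32_t reserved_page_size)
{
	return pd_special - reserved_page_size <= BLCKSZ;
}

/* the corrected check: pd_special + reserved_page_size <= BLCKSZ */
static int
fixed_check(uint32_t pd_special, uint32_t reserved_page_size)
{
	return pd_special + reserved_page_size <= BLCKSZ;
}
```

With 64 reserved bytes, a special area starting at offset 8192 would clobber the reserved region, yet the subtraction form accepts it (8192 - 64 <= 8192); the addition form correctly caps pd_special at BLCKSZ - reserved_page_size.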
If you want to put the reserved_space_size effectively inside the special area, this check should instead be:

+Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));

Or, equally valid:

+Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

> + * +-+-++-+
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +---++-+

Could you fix the table display if / when you revise the patchset? It seems to me that the corners don't line up with the column borders.

0002:

> Make the output of "select_views" test stable
> Changing the reserved_page_size has resulted in non-stable results for this test.

This makes sense, but what kind of instability are we talking about? Are there different results for runs with the same binary, or is this across compilations?

0003 and up were not yet reviewed in depth.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/flat/CA%2BTgmoaCeQ2b-BVgVfF8go8zFoceDjJq9w4AFQX7u6Acfdn2uA%40mail.gmail.com
[1] https://postgr.es/m/CAEze2Wi5wYinU7nYxyKe_C0DRc6uWYa8ivn5%3Dzg63nKtHBnn8A%40mail.gmail.com
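To make the bound check concrete, here is a minimal stand-alone sketch of the two equivalent assertion forms suggested above. MockPageHeader and special_pointer_ok are illustrative stand-ins, not PostgreSQL source; only the pd_special bound is modeled. (Note why the reviewed form was wrong: `pd_special - reserved_page_size <= BLCKSZ` lets pd_special grow as large as BLCKSZ + reserved_page_size, i.e. past the end of the block.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Simplified stand-in for PostgreSQL's PageHeaderData. */
typedef struct MockPageHeader
{
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;    /* offset to the start of the special space */
} MockPageHeader;

/*
 * With reserved_page_size carved out of the end of the page, pd_special
 * may not extend past BLCKSZ - reserved_page_size.  The two forms below
 * match the two equivalent checks proposed in the review.
 */
static bool
special_pointer_ok(const MockPageHeader *ph, uint16_t reserved_page_size)
{
    bool form_a = ph->pd_special <= (BLCKSZ - reserved_page_size);
    /* cast avoids any overflow concern before comparing */
    bool form_b = (uint32_t) ph->pd_special + reserved_page_size <= BLCKSZ;

    assert(form_a == form_b);   /* the two forms always agree */
    return form_a;
}
```
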
[PROPOSAL] : Use of ORDER BY clause in insert.sql
Hi,

We would like to share a proposal for a patch that adds an ORDER BY clause to two SELECT statements in src/test/regress/sql/insert.sql, with the corresponding changes in the src/test/regress/expected/insert.out output file. This helps generate output in a consistent order, as we have sometimes observed the row order changing. Please find the patch attached.

Regards,
Nishant Sharma
EDB: http://www.enterprisedb.com

Proposal_OrderBy_insert.sql.out.patch
Description: Binary data
Re: pg_recvlogical prints bogus error when interrupted
On Mon, Oct 24, 2022 at 8:15 AM Bharath Rupireddy wrote: > > On Fri, Oct 21, 2022 at 7:52 AM Kyotaro Horiguchi > wrote: > > > > > +1. How about emitting a message like its friend pg_receivewal, like > > > the attached patch? > > > > I'm not a fan of treating SIGINT as an error in this case. It calls > > prepareToTerminate() when time_to_abort and everything goes fine after > > then. So I think we should do the same thing after receiving an > > interrupt. This also does file-sync naturally as a part of normal > > shutdown. I'm also not a fan of doing fsync at error. > > I think the pg_recvlogical can gracefully exit on both SIGINT and > SIGTERM to keep things simple. > > > > > I also then noticed that we don't fsync the output file in cases of > > > > errors - > > > > that seems wrong to me? Looks to me like that block should be moved > > > > till after > > > > the error:? > > > > > > How about something like the attached patch? > > The attached patch (pg_recvlogical_graceful_interrupt.text) has a > couple of problems, I believe. We're losing prepareToTerminate() with > keepalive true and we're not skipping pg_log_error("unexpected > termination of replication stream: %s" upon interrupt, after all we're > here discussing how to avoid it. > > I came up with the attached v2 patch, please have a look. FWIW, I added it to CF - https://commitfest.postgresql.org/40/3966/. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
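The graceful-shutdown behavior both versions of the patch circle around is the standard flag-based signal-handler pattern: the handler only sets a flag, and the main receive loop notices it and performs a normal shutdown (including syncing the output file) instead of reporting an "unexpected termination" error. A minimal stand-alone sketch follows; the name time_to_abort mirrors pg_recvlogical's variable, but the code is illustrative, not the actual patch:

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Set by the handler; polled by the main loop. */
static volatile sig_atomic_t time_to_abort = false;

static void
sigexit_handler(int signum)
{
    (void) signum;          /* async-signal-safe: only set the flag */
    time_to_abort = true;
}

/* Treat SIGINT and SIGTERM identically, as proposed in the thread. */
static void
install_graceful_handlers(void)
{
    signal(SIGINT, sigexit_handler);
    signal(SIGTERM, sigexit_handler);
}
```

The main loop would then check `time_to_abort` on each iteration and fall through to its normal end-of-stream cleanup path.
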
Code checks for App Devs, using new options for transaction behavior
In the past, developers have wondered how we can provide "--dry-run" functionality: https://www.postgresql.org/message-id/15791.1450383201%40sss.pgh.pa.us

This is important for application developers, especially when migrating programs to Postgres. Presented here are 3 features aimed at developers, each of which is being actively used by me in a large and complex migration project.

* psql --parse-only

Checks the syntax of all SQL in a script, but without actually executing it. This is very important in the early stages of complex migrations because we need to see whether the code would generate syntax errors before we attempt to execute it. When there are many dependencies between objects, actual execution fails very quickly if we run in a single transaction, yet running outside of a transaction can leave a difficult cleanup task. Fixing errors iteratively is difficult when there are long chains of dependencies between objects, since there is no easy way to predict how long it will take to make everything work unless you know how many syntax errors exist in the script.

001_psql_parse_only.v1.patch

* nested transactions = off (default) | all | outer

Handles nested BEGIN/COMMIT, which can cause chaos on failure. This is an important part of guaranteeing that everything that gets executed is part of a single atomic transaction, which can then be rolled back - this is a prerequisite for the last feature.

002_nested_xacts.v7.patch

The default behavior is unchanged (off). Setting "all" treats nested BEGIN/COMMIT as subtransactions, allowing some parts to fail without rolling back the outer transaction. Setting "outer" flattens nested BEGIN/COMMIT into one single outer transaction, so that any failure rolls back the entire transaction.

* rollback_on_commit = off (default) | on

Forces transactions to fail their final commit, ensuring that no lasting change is made when a script is tested, i.e. COMMIT is accepted, but a rollback is performed instead.
003_rollback_on_commit.v1.patch

We will probably want to review these on separate threads, but the common purpose of these features is hopefully clear from these notes. 001 and 003 are fairly small patches; 002 is longer.

Comments please

--
Simon Riggs
http://www.EnterpriseDB.com/

001_psql_parse_only.v1.patch
Description: Binary data

002_nested_xacts.v7.patch
Description: Binary data

003_rollback_on_commit.v1.patch
Description: Binary data
Avoid using list_delete_first in simplify_or/and_arguments
Hi hackers,

While trying to measure whether there is any gain from the change discussed in [1], I happened to notice another place that is slowed down by list_delete_first. I'm using the query below:

(n=100; printf "explain (summary on) select * from t where "
for ((i=1;i<$n;i++)); do printf "a = $i or "; done;
printf "a = $n;" ) | psql

And I noticed that a large part of the planning time is spent on the list_delete_first calls inside simplify_or_arguments(). I think the issue here is clear and straightforward: list_delete_first has an O(N) cost due to data movement, and I believe a similar issue has been discussed several times before. I wondered if we could improve it by using list_delete_last instead, so I tried the following change:

--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -3612,9 +3612,9 @@ simplify_or_arguments(List *args,
 	unprocessed_args = list_copy(args);
 	while (unprocessed_args)
 	{
-		Node *arg = (Node *) linitial(unprocessed_args);
+		Node *arg = (Node *) llast(unprocessed_args);

-		unprocessed_args = list_delete_first(unprocessed_args);
+		unprocessed_args = list_delete_last(unprocessed_args);

With this change, on my box the planning time for the query above is reduced from 64257.784 ms to 1411.666 ms, a big improvement. The side effect is that it causes a lot of plan diffs in the regression tests, but they are all just a different order of OR arguments.

I believe simplify_and_arguments() can also benefit from similar changes. But I'm not sure whether such long AND/OR argument lists occur in the real world. So is this worth doing?

[1] https://www.postgresql.org/message-id/CAMbWs4-RXhgz0i4O1z62gt%2BbTLTM5vXYyYhgnius0j_txLH7hg%40mail.gmail.com

Thanks
Richard
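The cost asymmetry described above can be shown with a toy sketch. Since PostgreSQL's List became array-backed, list_delete_first() must shift the remaining N-1 cells down (O(N) per call, so O(N^2) to drain a list from the front), while list_delete_last() merely shrinks the length (O(1)). The IntList type below is an illustrative stand-in, not the real pg_list.h API:

```c
#include <assert.h>
#include <string.h>

/* Tiny fixed-capacity, array-backed list, mimicking the relevant shape. */
typedef struct IntList
{
    int items[64];
    int length;
} IntList;

/* O(N): every remaining element must be moved down one slot. */
static void
int_list_delete_first(IntList *list)
{
    memmove(&list->items[0], &list->items[1],
            (size_t) (list->length - 1) * sizeof(int));
    list->length--;
}

/* O(1): no data movement at all, just shrink the length. */
static void
int_list_delete_last(IntList *list)
{
    list->length--;
}
```

Processing the arguments from the tail, as the proposed diff does, visits the same set of elements in a different order, which is why the only regression fallout is reordered OR arms in the plans.
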
Allow single table VACUUM in transaction block
It is a common user annoyance to have a script fail because someone added a VACUUM, especially when using the --single-transaction option. Fix, so that this works without issue:

BEGIN;
VACUUM (ANALYZE) vactst;
COMMIT;

Allows both ANALYZE and vacuum of toast tables, but not VACUUM FULL. When in an xact block, we do not set PROC_IN_VACUUM, nor update datfrozenxid. Tests, docs.

--
Simon Riggs
http://www.EnterpriseDB.com/

single_table_vacuum.v1.patch
Description: Binary data
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On Tue, Oct 4, 2022 at 2:58 PM Bharath Rupireddy wrote: > > On Thu, Sep 29, 2022 at 7:43 PM Bharath Rupireddy > wrote: > > > > Please see the attached v1 patch. > > FWIW, I'm attaching Nathan's patch that introduced the new function > pg_walfile_offset_lsn as 0002 in the v1 patch set. Here's the CF entry > - https://commitfest.postgresql.org/40/3909/. > > Please consider this for further review. I'm attaching the v2 patch set after rebasing on to the latest HEAD. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 87da2d0ae8fb0f92d2d88d8638aee1ca58332522 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Thu, 27 Oct 2022 09:06:27 + Subject: [PATCH v2] Add LSN along with offset to error messages reported for WAL file read/write/validate header failures --- src/backend/access/transam/xlog.c | 8 ++-- src/backend/access/transam/xlogreader.c | 16 +++- src/backend/access/transam/xlogrecovery.c | 13 - src/backend/access/transam/xlogutils.c| 10 ++ src/include/access/xlogreader.h | 1 + 5 files changed, 32 insertions(+), 16 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f10effe3a..1467001b4f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2214,6 +2214,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) { char xlogfname[MAXFNAMELEN]; int save_errno; + XLogRecPtr lsn; if (errno == EINTR) continue; @@ -,11 +2223,14 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) XLogFileName(xlogfname, tli, openLogSegNo, wal_segment_size); errno = save_errno; + XLogSegNoOffsetToRecPtr(openLogSegNo, startoffset, + wal_segment_size, lsn); ereport(PANIC, (errcode_for_file_access(), errmsg("could not write to log file %s " - "at offset %u, length %zu: %m", - xlogfname, startoffset, nleft))); + "at offset %u, LSN %X/%X, length %zu: %m", + xlogfname, startoffset, + LSN_FORMAT_ARGS(lsn), 
nleft))); } nleft -= written; from += written; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 93f667b254..a321c55d8f 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -1226,9 +1226,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize); report_invalid_record(state, - "invalid magic number %04X in WAL segment %s, offset %u", + "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u", hdr->xlp_magic, fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1240,9 +1241,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize); report_invalid_record(state, - "invalid info bits %04X in WAL segment %s, offset %u", + "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u", hdr->xlp_info, fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1281,9 +1283,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, /* hmm, first page of file doesn't have a long header? 
*/ report_invalid_record(state, - "invalid info bits %04X in WAL segment %s, offset %u", + "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u", hdr->xlp_info, fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1300,9 +1303,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize); report_invalid_record(state, - "unexpected pageaddr %X/%X in WAL segment %s, offset %u", + "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u", LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1325,10 +1329,11 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize); report_invalid_record(state, - "out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u", + "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u", hdr->xlp_tli, state->latestPageTLI, fname, + LSN_FORMAT_ARGS(recptr), offset); return false; } @@ -1553,6 +1558,7 @@ WALRead(XLogReaderState *state, errinfo->wre_req = segbyt
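The two pieces of machinery this patch leans on are small enough to sketch in full. The real versions are macros in PostgreSQL's xlog_internal.h and xlogdefs.h: XLogSegNoOffsetToRecPtr() turns a segment number plus an in-segment byte offset into a 64-bit LSN, and LSN_FORMAT_ARGS() splits an LSN into the high/low 32-bit halves printed in the familiar "%X/%X" form. The functions below are simplified re-implementations for illustration, not the actual macros:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t XLogRecPtr;

/* Mirrors XLogSegNoOffsetToRecPtr: an LSN is a byte position in the WAL. */
static XLogRecPtr
segno_offset_to_lsn(uint64_t segno, uint32_t offset, uint64_t segment_size)
{
    return segno * segment_size + offset;
}

/* Mirrors LSN_FORMAT_ARGS: high 32 bits, then low 32 bits, as "%X/%X". */
static void
format_lsn(XLogRecPtr lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%X/%X",
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}
```

For example, offset 0x2000 within segment 1 of a default 16 MB WAL segment corresponds to LSN 0/1002000, which is exactly the extra context the improved error messages report alongside the file name and offset.
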
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Thu, Oct 27, 2022 at 11:24 AM Michael Paquier wrote: > > On Thu, Sep 29, 2022 at 08:09:56PM -0700, Nathan Bossart wrote: > > Looks reasonable to me. > > 0001, to move pg_pwritev_with_retry() to a new home, seems fine, so > applied. Thanks. > Regarding 0002, using pg_pwrite_zeros() as a routine name, as > suggested by Thomas, sounds good to me. Changed. > However, I am not really a > fan of its dependency with PGAlignedXLogBlock, because it should be > able to work with any buffers of any sizes, as long as the input > buffer is aligned, shouldn't it? For example, what about > PGAlignedBlock? So, should we make this more extensible? My guess > would be the addition of the block size and the block pointer to the > arguments of pg_pwrite_zeros(), in combination with a check to make > sure that the input buffer is MAXALIGN()'d (with an Assert() rather > than just an elog/pg_log_error?). +1 to pass in the aligned buffer, its size and an assertion on the buffer size. Please see the attached v7 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 1d81929de5f5b56d39fc7f1273cdb4c1e36337bb Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Thu, 27 Oct 2022 08:51:07 + Subject: [PATCH v7] Use pg_pwritev_with_retry() instead of write() in walmethods.c Use pg_pwritev_with_retry() while prepadding a WAL segment instead of write() in walmethods.c dir_open_for_write() to avoid partial writes. As the pg_pwritev_with_retry() function uses pwritev, we can avoid explicit lseek() on non-Windows platforms to seek to the beginning of the WAL segment. It looks like on Windows, we need an explicit lseek() call here despite using pwrite() implementation from win32pwrite.c. Otherwise an error occurs. These changes are inline with how core postgres initializes the WAL segment in XLogFileInitInternal(). 
Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Michael Paquier Reviewed-by: Thomas Munro Discussion: https://www.postgresql.org/message-id/CALj2ACUq7nAb7%3DbJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA%40mail.gmail.com --- src/backend/access/transam/xlog.c | 35 +++--- src/bin/pg_basebackup/walmethods.c | 28 ++- src/common/file_utils.c| 77 ++ src/include/common/file_utils.h| 5 ++ 4 files changed, 105 insertions(+), 40 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f10effe3a..638d6d448b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2921,7 +2921,6 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, bool *added, char *path) { char tmppath[MAXPGPATH]; - PGAlignedXLogBlock zbuffer; XLogSegNo installed_segno; XLogSegNo max_segno; int fd; @@ -2965,14 +2964,12 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, (errcode_for_file_access(), errmsg("could not create file \"%s\": %m", tmppath))); - memset(zbuffer.data, 0, XLOG_BLCKSZ); - pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); save_errno = 0; if (wal_init_zero) { - struct iovec iov[PG_IOV_MAX]; - int blocks; + PGAlignedXLogBlock zbuffer; + ssize_t rc; /* * Zero-fill the file. With this setting, we do this the hard way to @@ -2983,29 +2980,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * indirect blocks are down on disk. Therefore, fdatasync(2) or * O_DSYNC will be sufficient to sync future writes to the log file. */ + rc = pg_pwrite_zeros(fd, wal_segment_size, zbuffer.data, + sizeof(zbuffer.data)); - /* Prepare to write out a lot of copies of our zero buffer at once. */ - for (int i = 0; i < lengthof(iov); ++i) - { - iov[i].iov_base = zbuffer.data; - iov[i].iov_len = XLOG_BLCKSZ; - } - - /* Loop, writing as many blocks as we can for each system call. 
*/ - blocks = wal_segment_size / XLOG_BLCKSZ; - for (int i = 0; i < blocks;) - { - int iovcnt = Min(blocks - i, lengthof(iov)); - off_t offset = i * XLOG_BLCKSZ; - - if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0) - { -save_errno = errno; -break; - } - - i += iovcnt; - } + if (rc < 0) + save_errno = errno; } else { @@ -3014,7 +2993,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * enough. */ errno = 0; - if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1) + if (pg_pwrite(fd, "\0", 1, wal_segment_size - 1) != 1) { /* if write didn't set errno, assume no disk space */ save_errno = errno ? errno : ENOSPC; diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index bc2e83d02b..f7af73c01b 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -220,22 +220,26 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, /* Do pre-padding on non-compressed files */ if (pad_to_size
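To make the retry semantics under discussion concrete, here is a sketch of what a pg_pwrite_zeros()-style helper has to do: repeatedly pwrite() a caller-supplied zeroed buffer until the requested number of bytes is written, retrying on EINTR and continuing after partial writes. This is illustrative only; the real helper proposed for src/common/file_utils.c is built on pg_pwritev_with_retry() and writes several iovecs per system call:

```c
#define _GNU_SOURCE             /* expose pwrite/mkstemp under strict modes */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Zero-fill [0, size) of fd using the caller's zeroed, reusable buffer. */
static int
pwrite_zeros(int fd, size_t size, const char *zbuf, size_t zbuf_len)
{
    size_t offset = 0;

    while (offset < size)
    {
        size_t  chunk = (size - offset < zbuf_len) ? size - offset : zbuf_len;
        ssize_t rc = pwrite(fd, zbuf, chunk, (off_t) offset);

        if (rc < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;          /* caller inspects errno */
        }
        offset += (size_t) rc;  /* partial write: resume where it stopped */
    }
    return 0;
}
```

Because pwrite() takes an explicit offset, no lseek() back to the start of the segment is needed, which matches the rationale given in the commit message for non-Windows platforms.
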
Re: Documentation for building with meson
+# Run the main pg_regress and isolation tests
+meson test --suite main

This does not work for me in a fresh install until first running:

meson test --suite setup

In fact, https://wiki.postgresql.org/wiki/Meson shows:

meson test --suite setup --suite main

That was just an eyeball check from a naive user -- it would be good to try running everything documented here.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: GUC values - recommended way to declare the C variables?
On Thu, Oct 27, 2022 at 1:33 PM Michael Paquier wrote:
>
> ...
>
> Anyway, per my previous comments in my last message of this thread as
> of https://www.postgresql.org/message-id/y1nnwftrnl3it...@paquier.xyz,
> I don't see a need for DYNAMIC_DEFAULT from the other thread, nor do I
> see a need for a style like that:
> +/* GUC variable */
> +bool update_process_title =
> +#ifdef WIN32
> + false;
> +#else
> + true;
> +#endif
>
> I think that it would be cleaner to use the same approach as
> checking_after_flush and similar GUCs with a centralized definition,
> rather than spreading such style in two places for each GUC that this
> patch touches (aka its declaration and its default value in
> guc_tables.c). In any case, the patch of this thread still needs some
> adjustments IMO.

PSA patch v6. The GUC defaults in guc_tables.c and the modified GUC C variable declarations now share the same common #define'd value (instead of cut-and-paste preprocessor code), per Michael's suggestion [1] to use centralized definitions.

--
[1] https://www.postgresql.org/message-id/Y1nuDNZDncx7%2BA1j%40paquier.xyz

Kind Regards,
Peter Smith.
Fujitsu Australia

v6-0001-GUC-C-variable-sanity-check.patch
Description: Binary data
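The shape of the centralized-definition approach is simple to sketch: define the platform-dependent default exactly once, then use the same macro both as the C variable's initializer and as the boot value in its guc_tables.c entry, instead of duplicating the #ifdef in two files. The snippet below illustrates the pattern only; the macro name and layout are assumptions, not copied from the v6 patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Single source of truth for the platform-dependent default. */
#ifdef WIN32
#define DEFAULT_UPDATE_PROCESS_TITLE false
#else
#define DEFAULT_UPDATE_PROCESS_TITLE true
#endif

/* In the variable's home file (e.g. ps_status.c): */
bool update_process_title = DEFAULT_UPDATE_PROCESS_TITLE;

/* In guc_tables.c, the same macro would fill the boot_val slot;
 * modeled here as a plain constant for illustration: */
static const bool update_process_title_boot_val = DEFAULT_UPDATE_PROCESS_TITLE;
```

With both sites expanding the same macro, the declared value and the GUC table's boot value can never drift apart, which is exactly the sanity property the patch's check enforces.
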
Re: Simplifying our Trap/Assert infrastructure
On Wed, Oct 12, 2022 at 09:19:17PM +0200, Peter Eisentraut wrote: > I'm in favor of this. These variants are a distraction. Agreed, even if extensions could use these, it looks like any out-of-core code using what's removed here would also gain in clarity. This is logically fine (except for an indentation blip in miscadmin.h?), so I have marked this entry as ready for committer. Side note, rather unrelated to what's proposed here: would it be worth extending AssertPointerAlignment() for the frontend code? -- Michael signature.asc Description: PGP signature
Re: [patch] Have psql's \d+ indicate foreign partitions
On 2022-Oct-24, Justin Pryzby wrote:

> On Mon, Oct 24, 2022 at 09:44:18PM +0900, Ian Lawrence Barwick wrote:
> > + else if (child_relkind == RELKIND_FOREIGN_TABLE && is_partitioned)
> > + appendPQExpBuffer(&buf, ", server: \"%s\"", PQgetvalue(result, i, 4));

> To avoid the clutter that you mentioned, I suggest that this should show
> that the table *is* foreign, but without the server - if you want to
> know the server (or its options), you can run another \d command for
> that (or run a SQL query).

But 'server "%s"' is not much longer than "foreign", and it's not like you're saving any vertical space at all (you're just using space that would otherwise be empty), so I'm not sure it is better. I would vote for showing the server.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"You don't solve a bad join with SELECT DISTINCT" #CupsOfFail
https://twitter.com/connor_mc_d/status/1431240081726115845