Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
Hi, On 3/29/23 2:09 AM, Michael Paquier wrote: On Tue, Mar 28, 2023 at 05:43:26PM +0900, Kyotaro Horiguchi wrote: No. Fine by me, except that "block read requests" seems to suggest kernel read() calls, maybe because it's not clear to me whether "block" refers to our buffer blocks or file blocks. If it is generally clear, I'm fine with the proposal. Okay. Would somebody like to draft a patch? Please find a draft attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c809ff1ba4..ae5d9a0226 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5724,8 +5724,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 bigint
-Returns the number of buffers fetched for table or index, in the current
-transaction.
+Returns the number of block read requests for table or index, in the
+current transaction. This number minus
+pg_stat_get_xact_blocks_hit gives the number of kernel
+read() calls.
@@ -5738,8 +5740,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 bigint
-Returns the number of buffer hits for table or index, in the current
-transaction.
+Returns the number of block read requests for table or index, in the
+current transaction, found in cache (not triggering kernel
+read() calls).
Re: logical decoding and replication of sequences, take 2
On Wed, Mar 29, 2023 at 3:34 AM Tomas Vondra wrote: > > On 3/28/23 18:34, Masahiko Sawada wrote: > > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > > wrote: > >> > >> > >> > >> On 3/27/23 03:32, Masahiko Sawada wrote: > >>> Hi, > >>> > >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > >>> wrote: > > I merged the earlier "fixup" patches into the relevant parts, and left > two patches with new tweaks (deducing the correct "WAL" state from the > current state read by copy_sequence), and the interlock discussed here. > > >>> > >>> Apart from that, how does the publication having sequences work with > >>> subscribers who are not able to handle sequence changes, e.g. in a > >>> case where the PostgreSQL version of the publication is newer than the > >>> subscriber? As far as I tested the latest patches, the subscriber > >>> (v15) errors out with the error 'invalid logical replication message > >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible > >>> behavior. I think we should instead either (1) deny starting the > >>> replication if the subscriber isn't able to handle sequence changes > >>> and the publication includes that, or (2) not send sequence changes to > >>> such subscribers. > >>> > >> > >> I agree the "invalid message" error is not great, but it's not clear to > >> me how to do either (1). The trouble is we don't really know if the > >> publication contains (or will contain) sequences. I mean, what would > >> happen if the replication starts and then someone adds a sequence? > >> > >> For (2), I think that's not something we should do - silently discarding > >> some messages seems error-prone. If the publication includes sequences, > >> presumably the user wanted to replicate those. If they want to replicate > >> to an older subscriber, create a publication without sequences.
> >> > >> Perhaps the right solution would be to check if the subscriber supports > >> replication of sequences in the output plugin, while attempting to write > >> the "Q" message. And error-out if the subscriber does not support it. > > > > It might be related to this topic; do we need to bump the protocol > > version? The commit 64824323e57d introduced new streaming callbacks > > and bumped the protocol version. I think the same seems to be true for > > this change as it adds the sequence_cb callback. > > > > It's not clear to me what should be the exact behavior? > > I mean, imagine we're opening a connection for logical replication, and > the subscriber does not handle sequences. What should the publisher do? > > (Note: The correct commit hash is 464824323e57d.) Thanks. > > I don't think the streaming is a good match for sequences, because of a > couple of important differences ... > > Firstly, streaming determines *how* the changes are replicated, not what > gets replicated. It doesn't (silently) filter out "bad" events that the > subscriber doesn't know how to apply. If the subscriber does not know > how to deal with streamed xacts, it'll still get the same changes > exactly per the publication definition. > > Secondly, the default value is "streaming=off", i.e. the subscriber has > to explicitly request streaming when opening the connection. And we > simply check it against the negotiated protocol version, i.e. the check > in pgoutput_startup() protects against a subscriber requesting protocol > v1 but also streaming=on. > > I don't think we can/should do more checks at this point - we don't know > what's included in the requested publications at that point, and I doubt > it's worth adding because we certainly can't predict if the publication > will be altered to include/decode sequences in the future. True. That's a valid argument.
> > Speaking of precedents, TRUNCATE is probably a better one, because it's > a new action and it determines *what* the subscriber can handle. But > that does exactly the thing we do for sequences - if you open a > connection from a PG10 subscriber (truncate was added in PG11), and the > publisher decodes a truncate, the subscriber will do: > > 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical > replication message type "T" > 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical > replication worker for subscription 16390 (PID 2357609) exited with > exit code 1 > > I don't see why sequences should do anything else. If you need to > replicate to such a subscriber, create a publication that does not have > 'sequence' in the publish option ... > I hadn't checked the TRUNCATE case; yes, it is a good precedent for sequences. So it seems we don't need to do anything. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Using Ephemeral Named Relation like a temporary table
On Wed, Mar 29, 2023 at 12:54 AM Yugo NAGATA wrote: > Hello, > > > Temporary tables are often used to store transient data in > batch processing and the contents can be accessed multiple > times. However, frequent use of temporary tables has a problem > that the system catalog tends to bloat. I know there has been > several proposals to attack this problem, but I would like to > propose a new one. > > The idea is to use Ephemeral Named Relation (ENR) like a > temporary table. ENR information is not stored into the system > catalog, but in QueryEnvironment, so it never bloat the system > catalog. > > Although we cannot perform insert, update or delete on ENR, > I wonder it could be beneficial if we need to reference to a > result of a query multiple times in a batch processing. > > The attached is a concept patch. This adds a new syntax > "OPEN cursor INTO TABLE tablename" to pl/pgSQL, that stores > a result of the cursor query into a ENR with specified name. > However, this is a tentative interface to demonstrate the > concept of feature. > > Here is an example; > > postgres=# \sf fnc > CREATE OR REPLACE FUNCTION public.fnc() > RETURNS TABLE(sum1 integer, avg1 integer, sum2 integer, avg2 integer) > LANGUAGE plpgsql > AS $function$ > DECLARE > sum1 integer; > sum2 integer; > avg1 integer; > avg2 integer; > curs CURSOR FOR SELECT aid, bid, abalance FROM pgbench_accounts >WHERE abalance BETWEEN 100 AND 200; > BEGIN > OPEN curs INTO TABLE tmp_accounts; > SELECT count(abalance) , avg(abalance) INTO sum1, avg1 >FROM tmp_accounts; > SELECT count(bbalance), avg(bbalance) INTO sum2, avg2 >FROM tmp_accounts a, pgbench_branches b WHERE a.bid = b.bid; > RETURN QUERY SELECT sum1,avg1,sum2,avg2; > END; > $function$ > > postgres=# select fnc(); > fnc > > (541,151,541,3937) > (1 row) > > As above, we can use the same query result for multiple > aggregations, and also join it with other tables. > > What do you think of using ENR for this way? 
> > Regards, > Yugo Nagata > > -- > Yugo NAGATA > This looks like a slightly more flexible version of the Oracle PL/SQL table type. For those not familiar, PL/SQL can have record types, and in-memory collections of record types, and you can either build up multiple records in a collection manually, or you can bulk-collect them from a query. Then, you can later reference that collection in a regular SQL query with FROM TABLE(collection_name). It's a neat system for certain types of workloads. Example link (I'm sure there are better ones out there): https://oracle-base.com/articles/12c/using-the-table-operator-with-locally-defined-types-in-plsql-12cr1 My first take is that there are likely customers out there that will want this. However, those customers will want to manually add/delete rows from the ENR, so we'll want a way to do that. I haven't looked at ENRs in a while; when would the memory from that ENR get freed?
Re: Using Ephemeral Named Relation like a temporary table
Hi On Wed, Mar 29, 2023 at 6:54 AM, Yugo NAGATA wrote: > Hello, > > > Temporary tables are often used to store transient data in > batch processing and the contents can be accessed multiple > times. However, frequent use of temporary tables has a problem > that the system catalog tends to bloat. I know there has been > several proposals to attack this problem, but I would like to > propose a new one. > > The idea is to use Ephemeral Named Relation (ENR) like a > temporary table. ENR information is not stored into the system > catalog, but in QueryEnvironment, so it never bloat the system > catalog. > > Although we cannot perform insert, update or delete on ENR, > I wonder it could be beneficial if we need to reference to a > result of a query multiple times in a batch processing. > > The attached is a concept patch. This adds a new syntax > "OPEN cursor INTO TABLE tablename" to pl/pgSQL, that stores > a result of the cursor query into a ENR with specified name. > However, this is a tentative interface to demonstrate the > concept of feature. > > Here is an example; > > postgres=# \sf fnc > CREATE OR REPLACE FUNCTION public.fnc() > RETURNS TABLE(sum1 integer, avg1 integer, sum2 integer, avg2 integer) > LANGUAGE plpgsql > AS $function$ > DECLARE > sum1 integer; > sum2 integer; > avg1 integer; > avg2 integer; > curs CURSOR FOR SELECT aid, bid, abalance FROM pgbench_accounts >WHERE abalance BETWEEN 100 AND 200; > BEGIN > OPEN curs INTO TABLE tmp_accounts; > SELECT count(abalance) , avg(abalance) INTO sum1, avg1 >FROM tmp_accounts; > SELECT count(bbalance), avg(bbalance) INTO sum2, avg2 >FROM tmp_accounts a, pgbench_branches b WHERE a.bid = b.bid; > RETURN QUERY SELECT sum1,avg1,sum2,avg2; > END; > $function$ > > postgres=# select fnc(); > fnc > > (541,151,541,3937) > (1 row) > > As above, we can use the same query result for multiple > aggregations, and also join it with other tables. > > What do you think of using ENR for this way? > The idea looks pretty good.
I think it can be very useful. I am not sure this design is intuitive, though. If I remember correctly, Oracle has a similar feature, and it would be nice if we used the same or similar syntax (although I am not sure how implementable that is). I think the PL/SQL design has an advantage, because you don't need to solve the scope of the table assigned to the cursor. OPEN curs INTO TABLE tmp_accounts; -- this looks a little strange. It is missing the information that tmp_accounts is not a normal table. What about OPEN curs INTO CURSOR TABLE xxx; or OPEN curs FOR CURSOR TABLE xxx? Regards Pavel > > Regards, > Yugo Nagata > > -- > Yugo NAGATA >
[PATCH] Allow Postgres to pick an unused port to listen
Hi, I would like to suggest a patch against master (although it may be worth backporting it) that makes it possible to listen on any unused port. The main motivation is running colocated instances of Postgres (such as test benches) without having to coordinate port allocation in an unnecessarily complicated way. Instead, with this patch, one can specify `port` as `0` (the "wildcard" port) and retrieve the assigned port from postmaster.pid. I believe there is no significant performance or other impact, as it is a tiny bit of conditional functionality executed during startup. The patch builds and `make check` succeeds. The patch does not add a test; however, I am trying to figure out whether this behaviour can be tested automatically. -- http://omnigres.org Y. V1-0001-Allow-listening-port-to-be-0.patch Description: Binary data
Using Ephemeral Named Relation like a temporary table
Hello, Temporary tables are often used to store transient data in batch processing and the contents can be accessed multiple times. However, frequent use of temporary tables has the problem that the system catalog tends to bloat. I know there have been several proposals to attack this problem, but I would like to propose a new one. The idea is to use an Ephemeral Named Relation (ENR) like a temporary table. ENR information is not stored in the system catalog, but in QueryEnvironment, so it never bloats the system catalog. Although we cannot perform insert, update or delete on an ENR, I wonder if it could be beneficial when we need to reference the result of a query multiple times in batch processing. Attached is a concept patch. This adds a new syntax "OPEN cursor INTO TABLE tablename" to pl/pgSQL that stores the result of the cursor query into an ENR with the specified name. However, this is a tentative interface to demonstrate the concept of the feature. Here is an example; postgres=# \sf fnc CREATE OR REPLACE FUNCTION public.fnc() RETURNS TABLE(sum1 integer, avg1 integer, sum2 integer, avg2 integer) LANGUAGE plpgsql AS $function$ DECLARE sum1 integer; sum2 integer; avg1 integer; avg2 integer; curs CURSOR FOR SELECT aid, bid, abalance FROM pgbench_accounts WHERE abalance BETWEEN 100 AND 200; BEGIN OPEN curs INTO TABLE tmp_accounts; SELECT count(abalance) , avg(abalance) INTO sum1, avg1 FROM tmp_accounts; SELECT count(bbalance), avg(bbalance) INTO sum2, avg2 FROM tmp_accounts a, pgbench_branches b WHERE a.bid = b.bid; RETURN QUERY SELECT sum1,avg1,sum2,avg2; END; $function$ postgres=# select fnc(); fnc (541,151,541,3937) (1 row) As above, we can use the same query result for multiple aggregations, and also join it with other tables. What do you think of using ENR in this way?
Regards, Yugo Nagata -- Yugo NAGATA diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c index e3a170c38b..27e94ecc87 100644 --- a/src/backend/executor/spi.c +++ b/src/backend/executor/spi.c @@ -20,6 +20,7 @@ #include "access/xact.h" #include "catalog/heap.h" #include "catalog/pg_type.h" +#include "commands/portalcmds.h" #include "commands/trigger.h" #include "executor/executor.h" #include "executor/spi_priv.h" @@ -3378,3 +3379,30 @@ SPI_register_trigger_data(TriggerData *tdata) return SPI_OK_TD_REGISTER; } + +int +SPI_register_portal(Portal portal, char *name) +{ + int rc; + EphemeralNamedRelation enr; + + if (portal == NULL || name == NULL) + return SPI_ERROR_ARGUMENT; + + PortalCreateHoldStore(portal); + PersistHoldablePortal(portal); + + enr = palloc(sizeof(EphemeralNamedRelationData)); + + enr->md.name = name; + enr->md.reliddesc = InvalidOid; + enr->md.tupdesc = portal->tupDesc; + enr->md.enrtype = ENR_NAMED_TUPLESTORE; + enr->md.enrtuples = tuplestore_tuple_count(portal->holdStore); + enr->reldata = portal->holdStore; + rc = SPI_register_relation(enr); + if (rc != SPI_OK_REL_REGISTER) + return rc; + + return SPI_OK_TD_REGISTER; +} diff --git a/src/include/executor/spi.h b/src/include/executor/spi.h index d1de139a3b..40753dc78a 100644 --- a/src/include/executor/spi.h +++ b/src/include/executor/spi.h @@ -199,6 +199,7 @@ extern void SPI_cursor_close(Portal portal); extern int SPI_register_relation(EphemeralNamedRelation enr); extern int SPI_unregister_relation(const char *name); extern int SPI_register_trigger_data(TriggerData *tdata); +extern int SPI_register_portal(Portal portal, char *name); extern void SPI_start_transaction(void); extern void SPI_commit(void); diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c index b0a2cac227..5d561b39bd 100644 --- a/src/pl/plpgsql/src/pl_exec.c +++ b/src/pl/plpgsql/src/pl_exec.c @@ -4776,6 +4776,9 @@ exec_stmt_open(PLpgSQL_execstate *estate, PLpgSQL_stmt_open *stmt) elog(ERROR, "could 
not open cursor: %s", SPI_result_code_string(SPI_result)); + if (stmt->tablename) + SPI_register_portal(portal, stmt->tablename); + /* * If cursor variable was NULL, store the generated portal name in it, * after verifying it's okay to assign to. diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y index edeb72c380..576ea86943 100644 --- a/src/pl/plpgsql/src/pl_gram.y +++ b/src/pl/plpgsql/src/pl_gram.y @@ -185,7 +185,7 @@ static void check_raise_parameters(PLpgSQL_stmt_raise *stmt); %type foreach_slice %type for_control -%type any_identifier opt_block_label opt_loop_label opt_label +%type any_identifier opt_block_label opt_into_table opt_loop_label opt_label %type option_value %type proc_sect stmt_elsifs stmt_else @@ -2064,7 +2064,17 @@ stmt_dynexecute : K_EXECUTE ; -stmt_open : K_OPEN cursor_variable +opt_into_table : + { + $$ = NULL; + } +| K_INTO K_TABLE any_identifier + { + $$ = $3; + } +; + +stmt_open : K_OPEN cursor_variable opt_into_table {
RE: doc: add missing "id" attributes to extension packaging page
Dear my comrade Brar, > Thanks, I actually was not aware of this. > > Also, kudos for capturing the missing id and advocating for consistency > regarding ids even before this is actively enforced. This nurtures my > optimism that consistency might actually be achievable without > everybody getting angry at me because my patch will enforce it. > > Regarding the overlap, I currently try to make it as easy as possible > for a potential committer and I'm happy to rebase my patch upon request > or if Kuroda-san's patch gets applied first. FYI - my patch is pushed [1]. Could you please rebase your patch? I think it's ok to just remove changes from logical-replication.sgml, ref/alter_subscription.sgml, and ref/create_subscription.sgml. [1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=de5a47af2d8003dee123815bb7e58913be9a03f3 Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: Should vacuum process config file reload more often
At Wed, 29 Mar 2023 12:09:08 +0900 (JST), Kyotaro Horiguchi wrote in > timing perfectly. I think we might accidentally add a reload > timing. In that case, the assumption could break. In most cases, I > think we use snapshotting in various ways to avoid unintended variable > changes. (And I believe the analyze code also does that.) Okay, I was missing the following code. autovacuum.c:2893 /* * If any of the cost delay parameters has been set individually for * this table, disable the balancing algorithm. */ tab->at_dobalance = !(avopts && (avopts->vacuum_cost_limit > 0 || avopts->vacuum_cost_delay > 0)); So, sorry for the noise. I'll review it taking this into consideration. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Data is copied twice when specifying both child and parent table in publication
On Wed, Mar 29, 2023 at 7:44 AM Peter Smith wrote: > > A minor review comment for v25-0001. > > == > src/backend/commands/subscriptioncmds.c > > 1. > @@ -1936,21 +1936,56 @@ fetch_table_list(WalReceiverConn *wrconn, List > *publications) > WalRcvExecResult *res; > StringInfoData cmd; > TupleTableSlot *slot; > - Oid tableRow[3] = {TEXTOID, TEXTOID, NAMEARRAYOID}; > + Oid tableRow[3] = {TEXTOID, TEXTOID, InvalidOid}; > > The patch could be slightly less invasive if you did not make this > change, but instead, only overwrite tableRow[2] for the >= PG16 case. > > Or vice versa, if you prefer. > The current coding pattern looks neat to me. -- With Regards, Amit Kapila.
Re: refactoring relation extension and BufferAlloc(), faster COPY
Hi, On 2023-03-27 15:32:47 +0900, Kyotaro Horiguchi wrote: > At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund wrote > in > > Hi, > > > > Attached is v5. Lots of comment polishing, a bit of renaming. I extracted > > the > > relation extension related code in hio.c back into its own function. > > > > While reviewing the hio.c code, I did realize that too much stuff is done > > while holding the buffer lock. See also the pre-existing issue > > https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de > > 0001, 0002 look fine to me. > > 0003 adds the new function FileFallocate, but we already have > AllocateFile. Although fd.c contains functions with varying word > orders, it could be confusing that closely named functions have > different naming conventions. The syscall is named fallocate, I don't think we'd gain anything by inventing a different name for it? Given that there's a number of File$syscall operations, I think it's clear enough that it just fits into that. Unless you have a better proposal? > + /* > + * Return in cases of a "real" failure, if fallocate is not supported, > + * fall through to the FileZero() backed implementation. > + */ > + if (returnCode != EINVAL && returnCode != EOPNOTSUPP) > + return returnCode; > > I'm not entirely sure, but man 2 fallocate tells that ENOSYS also can > be returned. Some googling indicates that ENOSYS might need the same > amendment as EOPNOTSUPP. However, I'm not clear on why man > posix_fallocate doesn't mention the former. posix_fallocate() and its errors are specified by posix, I guess. I think glibc etc will map ENOSYS to EOPNOTSUPP. I really dislike this bit from the posix_fallocate manpage: EINVAL offset was less than 0, or len was less than or equal to 0, or the underlying filesystem does not support the operation. Why oh why would you add the "or .." portion into EINVAL, when there's also EOPNOTSUPP?
> > + (returnCode != EINVAL && returnCode != EINVAL)) > :) > > > FileGetRawDesc(File file) > { > Assert(FileIsValid(file)); > + > + if (FileAccess(file) < 0) > + return -1; > + > > The function's comment is provided below. > > > * The returned file descriptor will be valid until the file is closed, but > > * there are a lot of things that can make that happen. So the caller should > > * be careful not to do much of anything else before it finishes using the > > * returned file descriptor. > > So, the responsibility to make sure the file is valid seems to lie > with the callers, although I'm not sure since there aren't any > users of the function in the tree. Except, as I think you realized as well, external callers *can't* call FileAccess(); it's static. > I'm unclear as to why FileSize omits the case lruLessRecently != file. Not quite following - why would FileSize() deal with lruLessRecently itself? Or do you mean why FileSize() uses FileIsNotOpen() itself, rather than relying on FileAccess() doing that internally? > When examining similar functions, such as FileGetRawFlags and > FileGetRawMode, I'm puzzled to find that neither FileAccess() nor > BasicOpenFilePerm() sets the struct members referred to by those > functions. Those aren't involved with the LRU mechanism, IIRC. Note that BasicOpenFilePerm() returns an actual fd, not a File. So you can't call FileGetRawMode() on it. As BasicOpenFilePerm() says: * This is exported for use by places that really want a plain kernel FD, * but need to be proof against running out of FDs. ... I don't think FileAccess() needs to set those struct members, that's already been done in PathNameOpenFilePerm(). > This makes me question the usefulness of these functions, including > FileGetRawDesc(). It's quite weird that we have FileGetRawDesc(), but don't allow using it in a safe way... > Regardless, since the > patchset doesn't use FileGetRawDesc(), I don't believe the fix is > necessary in this patch set. Yea.
It was used in an earlier version, but not anymore. > > + if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber) > > I'm not sure it is appropriate to assume InvalidBlockNumber equals > MaxBlockNumber + 1 in this context. Hm. This is just the multi-block equivalent of what mdextend() already does. It's not pretty, indeed. I'm not sure there's really a better thing to do for mdzeroextend(), given the mdextend() precedent? mdzeroextend() (just as mdextend()) will be called with blockNum == InvalidBlockNumber, if you try to extend past the size limit. > > + * However, we don't use FileAllocate() for small extensions, > as it > + * defeats delayed allocation on some filesystems. Not clear > where > + * that decision should be made though? For now just use a > cutoff of > + * 8, anything between 4 and 8 worked OK in some local testing. > > The choice is quite similar to the one FileFallocate() makes. However, I'm > not sure FileFallocate() itself
Re: refactoring relation extension and BufferAlloc(), faster COPY
Hi, On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote: > Below is my review of a slightly older version than you just posted -- > much of it you may have already addressed. Far from all of it is addressed already - thanks for the review! > From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001 > From: Andres Freund > Date: Mon, 20 Mar 2023 21:57:40 -0700 > Subject: [PATCH v5 01/15] createdb-using-wal-fixes > > This could use a more detailed commit message -- I don't really get what > it is doing It's a fix for a bug that I encountered while hacking on this; it has since been committed. commit 5df319f3d55d09fadb4f7e4b58c5b476a3aeceb4 Author: Andres Freund Date: 2023-03-20 21:57:40 -0700 Fix memory leak and inefficiency in CREATE DATABASE ... STRATEGY WAL_LOG RelationCopyStorageUsingBuffer() did not free the strategies used to access the source / target relation. The memory was released at the end of the transaction, but when using a template database with a lot of relations, the temporary leak can become prohibitively big. RelationCopyStorageUsingBuffer() acquired the buffer for the target relation with RBM_NORMAL, therefore requiring a read of a block guaranteed to be zero. Use RBM_ZERO_AND_LOCK instead.
Reviewed-by: Robert Haas Discussion: https://postgr.es/m/20230321070113.o2vqqxogjykwg...@awork3.anarazel.de Backpatch: 15-, where STRATEGY WAL_LOG was introduced > From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001 > From: Andres Freund > Date: Wed, 1 Jul 2020 19:06:45 -0700 > Subject: [PATCH v5 02/15] Add some error checking around pinning > > --- > src/backend/storage/buffer/bufmgr.c | 40 - > src/include/storage/bufmgr.h| 1 + > 2 files changed, 29 insertions(+), 12 deletions(-) > > diff --git a/src/backend/storage/buffer/bufmgr.c > b/src/backend/storage/buffer/bufmgr.c > index 95212a3941..fa20fab5a2 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer) > LW_EXCLUSIVE); > } > > +void > +BufferCheckOneLocalPin(Buffer buffer) > +{ > +if (BufferIsLocal(buffer)) > +{ > +/* There should be exactly one pin */ > +if (LocalRefCount[-buffer - 1] != 1) > +elog(ERROR, "incorrect local pin count: %d", > +LocalRefCount[-buffer - 1]); > +} > +else > +{ > +/* There should be exactly one local pin */ > +if (GetPrivateRefCount(buffer) != 1) > > I'd rather this be an else if (was already like this, but, no reason not > to change it now) I don't like that much - it'd break the symmetry between local / non-local. > +/* > + * Zero a region of the file. > + * > + * Returns 0 on success, -1 otherwise. In the latter case errno is set to the > + * appropriate error. 
> + */ > +int > +FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info) > +{ > +intreturnCode; > +ssize_twritten; > + > +Assert(FileIsValid(file)); > +returnCode = FileAccess(file); > +if (returnCode < 0) > +return returnCode; > + > +pgstat_report_wait_start(wait_event_info); > +written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset); > +pgstat_report_wait_end(); > + > +if (written < 0) > +return -1; > +else if (written != amount) > > this doesn't need to be an else if You mean it could be a "bare" if instead? I don't really think that's clearer. > +{ > +/* if errno is unset, assume problem is no disk space */ > +if (errno == 0) > +errno = ENOSPC; > +return -1; > +} > > +int > +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) > +{ > +#ifdef HAVE_POSIX_FALLOCATE > +intreturnCode; > + > +Assert(FileIsValid(file)); > +returnCode = FileAccess(file); > +if (returnCode < 0) > +return returnCode; > + > +pgstat_report_wait_start(wait_event_info); > +returnCode = posix_fallocate(VfdCache[file].fd, offset, amount); > +pgstat_report_wait_end(); > + > +if (returnCode == 0) > +return 0; > + > +/* for compatibility with %m printing etc */ > +errno = returnCode; > + > +/* > +* Return in cases of a "real" failure, if fallocate is not supported, > +* fall through to the FileZero() backed implementation. > +*/ > +if (returnCode != EINVAL && returnCode != EOPNOTSUPP) > +return returnCode; > > I'm pretty sure you can just delete the below if statement > > +if (returnCode == 0 || > +(returnCode != EINVAL && returnCode != EINVAL)) > +return returnCode; Hm. I don't see how - wouldn't that lead us to call FileZero(), even if FileFallocate() succeeded or failed (rather than not being supported)? > > +/* > + *mdzeroextend() -- Add ew zeroed out blocks to
Re: Should vacuum process config file reload more often
At Tue, 28 Mar 2023 20:35:28 -0400, Melanie Plageman wrote in > On Tue, Mar 28, 2023 at 4:21 AM Kyotaro Horiguchi > wrote: > > > > At Mon, 27 Mar 2023 14:12:03 -0400, Melanie Plageman > > wrote in > > > So, I've attached an alternate version of the patchset which takes the > > > approach of having one commit which only enables cost-based delay GUC > > > refresh for VACUUM and another commit which enables it for autovacuum > > > and makes the changes to balancing variables. > > > > > > I still think the commit which has workers updating their own > > > wi_cost_delay in vacuum_delay_point() is a bit weird. It relies on no one > > > else emulating our bad behavior and reading from wi_cost_delay without a > > > lock and on no one else deciding to ever write to wi_cost_delay (even > > > though it is in shared memory [this is the same as master]). It is only > > > safe because our process is the only one (right now) writing to > > > wi_cost_delay, so when we read from it without a lock, we know it isn't > > > being written to. And everyone else takes a lock when reading from > > > wi_cost_delay right now. So, it seems...not great. > > > > > > This approach also introduces a function that is only around for > > > one commit until the next commit obsoletes it, which seems a bit silly. > > > > (I'm not sure what this refers to, though..) I don't think it's silly > > if a later patch removes something that the preceding patches > > introduced, as long as that contributes to readability. Ultimately, > > they will be merged together on committing. > > I was under the impression that reviewers thought config reload and > worker balance changes should be committed in separate commits. > > Either way, the ephemeral function is not my primary concern. I felt > more uncomfortable with increasing how often we update a double in > shared memory which is read without acquiring a lock.
> > > Basically, I think it is probably better to just have one commit > > > enabling guc refresh for VACUUM and then another which correctly > > > implements what is needed for autovacuum to do the same. > > > Attached v9 does this. > > > > > > I've provided both complete versions of both approaches (v9 and v8). > > > > I took a look at v9 and have a few comments. > > > > 0001: > > > > I don't believe it is necessary, as mentioned in the commit > > message. It appears that we are resetting it at the appropriate times. > > VacuumCostBalance must be zeroed out when we disable vacuum cost. > Previously, we did not reenable VacuumCostActive once it was disabled, > but now that we do, I think it is good practice to always zero out > VacuumCostBalance when we disable vacuum cost. I will squash this commit > into the one introducing VacuumCostInactive, though. > > > > 0002: > > > > I felt a bit uneasy about this. It seems somewhat complex (and makes the > > succeeding patches complex), > > Even if we introduced a second global variable to indicate that failsafe > mode has been engaged, we would still require the additional checks > of VacuumCostInactive. > > > has confusing names, > > I would be happy to rename the values of the enum to make them less > confusing. Are you thinking "force" instead of "locked"? > maybe: > VACUUM_COST_FORCE_INACTIVE and > VACUUM_COST_INACTIVE > ? > > > and doesn't seem self-contained. > > By changing the variable from VacuumCostActive to VacuumCostInactive, I > have kept all non-vacuum code from having to distinguish between it > being inactive due to failsafe mode or due to user settings. My concern is that VacuumCostActive is logic-inverted and turned into a ternary variable in a subtle way. The expression "!VacuumCostInactive" is quite confusing. (I sometimes feel the same way about "!XLogRecPtrIsInvalid(lsn)", and I believe most people write it with another macro like "lsn != InvalidXLogRecPtr").
Additionally, the constraint in this patch will be implemented as open code. So I wanted to suggest something like the attached. The main idea is to use a wrapper function to enforce the restriction, and by doing so, we eliminated the need to make the variable into a ternary without a good reason. > > I think it'd be simpler to add a global boolean (maybe > > VacuumCostActiveForceDisable or such) that forces VacuumCostActive to > > be false and set VacuumCostActive using a setter function that follows > > the boolean. > > I think maintaining an additional global variable is more brittle than > including the information in a single variable. > > > 0003: > > > > +* Reload the configuration file if requested. This allows changes > > to > > +* vacuum_cost_limit and vacuum_cost_delay to take effect while a > > table is > > +* being vacuumed or analyzed. Analyze should not reload > > configuration > > +* file if it is in an outer transaction, as GUC values shouldn't be > > +* allowed to refer to some uncommitted state (e.g. database objects > > +* created
Re: hio.c does visibilitymap_pin()/IO while holding buffer lock
Hi, On 2023-03-28 18:21:02 -0700, Andres Freund wrote: > Here's a draft patch. Attached is v2, with a stupid bug fixed and a bit of comment / pgindent polish. Greetings, Andres Freund >From 6f27018765f167a8d2a52bc7c9068cb09ca8d82c Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 24 Oct 2022 12:28:06 -0700 Subject: [PATCH v2 1/2] hio: Release extension lock before initializing page / pinning VM PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3 started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we release the page lock. PageInit() zeroes the page, which isn't that cheap, so deferring it until after the extension lock is released seems like a good idea. Doing visibilitymap_pin() while holding the extension lock, introduced in 7db0cd2145f2, looks like an accident. Due to the restrictions on HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems better to move it out. --- src/backend/access/heap/hio.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index e152807d2dc..7479212d4e0 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -623,6 +623,13 @@ loop: */ buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate); + /* + * Release the file-extension lock; it's now OK for someone else to extend + * the relation some more. + */ + if (needLock) + UnlockRelationForExtension(relation, ExclusiveLock); + /* * We need to initialize the empty new page. Double-check that it really * is empty (this should never happen, but if it does we don't want to @@ -647,13 +654,6 @@ loop: visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer); } - /* - * Release the file-extension lock; it's now OK for someone else to extend - * the relation some more. - */ - if (needLock) - UnlockRelationForExtension(relation, ExclusiveLock); - /* * Lock the other buffer. 
It's guaranteed to be of a lower page number * than the new page. To conform with the deadlock prevent rules, we ought -- 2.38.0 >From 5a33df025ce466343d88bbc57ac1bc80d883a230 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 28 Mar 2023 18:17:11 -0700 Subject: [PATCH v2 2/2] WIP: relation extension: Don't pin the VM while holding buffer lock --- src/backend/access/heap/hio.c | 120 +++--- 1 file changed, 81 insertions(+), 39 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 7479212d4e0..8b3dfa0ccae 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -134,14 +134,17 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock, * buffer2 may be InvalidBuffer, if only one buffer is involved. buffer1 * must not be InvalidBuffer. If both buffers are specified, block1 must * be less than block2. + * + * Returns whether buffer locks were temporarily released. */ -static void +static bool GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, BlockNumber block1, BlockNumber block2, Buffer *vmbuffer1, Buffer *vmbuffer2) { bool need_to_pin_buffer1; bool need_to_pin_buffer2; + bool released_locks = false; Assert(BufferIsValid(buffer1)); Assert(buffer2 == InvalidBuffer || block1 <= block2); @@ -155,9 +158,10 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, && PageIsAllVisible(BufferGetPage(buffer2)) && !visibilitymap_pin_ok(block2, *vmbuffer2); if (!need_to_pin_buffer1 && !need_to_pin_buffer2) - return; + break; /* We must unlock both buffers before doing any I/O. 
*/ + released_locks = true; LockBuffer(buffer1, BUFFER_LOCK_UNLOCK); if (buffer2 != InvalidBuffer && buffer2 != buffer1) LockBuffer(buffer2, BUFFER_LOCK_UNLOCK); @@ -183,6 +187,8 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, || (need_to_pin_buffer1 && need_to_pin_buffer2)) break; } + + return released_locks; } /* @@ -345,6 +351,7 @@ RelationGetBufferForTuple(Relation relation, Size len, BlockNumber targetBlock, otherBlock; bool needLock; + bool unlockedTargetBuffer; len = MAXALIGN(len); /* be conservative */ @@ -630,6 +637,9 @@ loop: if (needLock) UnlockRelationForExtension(relation, ExclusiveLock); + unlockedTargetBuffer = false; + targetBlock = BufferGetBlockNumber(buffer); + /* * We need to initialize the empty new page. Double-check that it really * is empty (this should never happen, but if it does we don't want to @@ -639,75 +649,107 @@ loop: if (!PageIsNew(page)) elog(ERROR, "page %u of relation \"%s\" should be empty but is not", - BufferGetBlockNumber(buffer), + targetBlock, RelationGetRelationName(relation)); PageInit(page, BufferGetPageSize(buffer), 0); MarkBufferDirty(buffer); /* - * The page is empty, pin vmbuffer to set
Re: Data is copied twice when specifying both child and parent table in publication
A minor review comment for v25-0001. == src/backend/commands/subscriptioncmds.c 1. @@ -1936,21 +1936,56 @@ fetch_table_list(WalReceiverConn *wrconn, List *publications) WalRcvExecResult *res; StringInfoData cmd; TupleTableSlot *slot; - Oid tableRow[3] = {TEXTOID, TEXTOID, NAMEARRAYOID}; + Oid tableRow[3] = {TEXTOID, TEXTOID, InvalidOid}; The patch could be slightly less invasive if you did not make this change, but instead, only overwrite tableRow[2] for the >= PG16 case. Or vice versa, if you prefer. The point is, there are only 2 cases, so you might as well initialize a default tableRow[2] that is valid for one case and overwrite it only for the other case, instead of overwriting it in 2 places. YMMV. -- Kind Regards, Peter Smith. Fujitsu Australia
pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value
Hello,

Attached patch introduces a function pg_column_toast_chunk_id that returns the chunk ID of a TOASTed value. Recently, one of our clients needed a way to show which columns are actually TOASTed, because they would like to know how much updates on the original table affect its TOAST table, specifically with regard to autovacuum. We could not find a function for this purpose in current PostgreSQL, so I would like to propose pg_column_toast_chunk_id. This function returns the chunk ID of a TOASTed value, or NULL if the value is not TOASTed. Here is an example:

postgres=# \d val
               Table "public.val"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
 t      | text |           |          |

postgres=# select length(t), pg_column_size(t), pg_column_compression(t), pg_column_toast_chunk_id(t), tableoid from val;
 length | pg_column_size | pg_column_compression | pg_column_toast_chunk_id | tableoid
--------+----------------+-----------------------+--------------------------+----------
      3 |              4 |                       |                          |    16388
   3000 |             46 | pglz                  |                          |    16388
  32000 |            413 | pglz                  |                          |    16388
    305 |            309 |                       |                          |    16388
  64000 |          64000 |                       |                    16393 |    16388
(5 rows)

postgres=# select chunk_id, chunk_seq from pg_toast.pg_toast_16388;
 chunk_id | chunk_seq
----------+-----------
    16393 |         0
    16393 |         1
    16393 |         2
(snip)
    16393 |        30
    16393 |        31
    16393 |        32
(33 rows)

This function is also useful to identify a problematic row when an error like "ERROR: unexpected chunk number ... (expected ...) for toast value" occurs.

The patch is just a concept patch and does not include documentation and tests. What do you think about this feature?

Regards, Yugo Nagata -- Yugo NAGATA

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 5778e3f0ef..4ddcf885cd 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -5070,6 +5070,47 @@ pg_column_compression(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(result));
 }
 
+/*
+ * Return the chunk id of the TOASTed value.
+ * Return NULL for unTOASTed value.
+ */
+Datum
+pg_column_toast_chunk_id(PG_FUNCTION_ARGS)
+{
+	int			typlen;
+	struct varlena *attr;
+	struct varatt_external toast_pointer;
+
+	/* On first call, get the input type's typlen, and save at *fn_extra */
+	if (fcinfo->flinfo->fn_extra == NULL)
+	{
+		/* Lookup the datatype of the supplied argument */
+		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
+
+		typlen = get_typlen(argtypeid);
+		if (typlen == 0)		/* should not happen */
+			elog(ERROR, "cache lookup failed for type %u", argtypeid);
+
+		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
+													  sizeof(int));
+		*((int *) fcinfo->flinfo->fn_extra) = typlen;
+	}
+	else
+		typlen = *((int *) fcinfo->flinfo->fn_extra);
+
+	if (typlen != -1)
+		PG_RETURN_NULL();
+
+	attr = (struct varlena *) DatumGetPointer(PG_GETARG_DATUM(0));
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		PG_RETURN_NULL();
+
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+	PG_RETURN_OID(toast_pointer.va_valueid);
+}
+
 /*
  * string_agg - Concatenates values and returns string.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7c358cff16..4c214f4c63 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7414,6 +7414,9 @@
 { oid => '2121', descr => 'compression method for the compressed datum',
   proname => 'pg_column_compression', provolatile => 's', prorettype => 'text',
   proargtypes => 'any', prosrc => 'pg_column_compression' },
+{ oid => '8393', descr => 'chunk ID of TOASTed value',
+  proname => 'pg_column_toast_chunk_id', provolatile => 's', prorettype => 'oid',
+  proargtypes => 'any', prosrc => 'pg_column_toast_chunk_id' },
 { oid => '2322', descr => 'total disk space usage for the specified tablespace',
   proname => 'pg_tablespace_size', provolatile => 'v', prorettype => 'int8',
Re: allow_in_place_tablespaces vs. pg_basebackup
On Tue, Mar 28, 2023 at 05:15:25PM +1300, Thomas Munro wrote:
> The commit message explains pretty well, and it seems to work in > simple testing, and yeah commit c6f2f016 was not a work of art. > pg_basebackup --format=plain "worked", but your way is better. I > guess we should add a test of -Fp too, to keep it working? Here's one > of those.

The patch is larger than it is complicated. Particularly, in xlog.c, this is just adding one block for the PGFILETYPE_DIR case to store a relative path.

> I know it's not your patch's fault, but I wonder if this might break > something. We have this strange beast ti->rpath (commit b168c5ef in > 2014): > > +/* > + * Relpath holds the relative path of the tablespace directory > + * when it's located within PGDATA, or NULL if it's located > + * elsewhere. > + */ > > That's pretty confusing, because relative paths have been banned since > the birth of tablespaces (commit 2467394ee15, 2004):

I think that this comes down to people manipulating pg_tblspc/ by themselves post-creation with CREATE, because we don't track the location anywhere post-creation and rely on the structure of pg_tblspc/ for any future decision taken by the backend? I recall this specific case, where users were creating tablespaces in PGDATA itself to make the structure of the data folder "cleaner" for them, even if it makes no sense as the mount point is the same.. 33cb8ff added a warning about that in tablespace.c, but we could be more aggressive and outright drop this case entirely when in-place tablespaces are not involved. Tablespace maps defined in pg_basebackup -T require both the old and new locations to be absolute, so it seems shaky to me to assume that this should always be fine in the backend..

> I guess what I'm wondering here is if there is a hazard where we > confuse these outlawed tablespaces that supposedly roam the plains > somewhere in your code that is assuming that "relative implies > in-place". Or if not now, maybe in future changes.
> I wonder if these "semi-supported-but-don't-tell-anyone" relative symlinks > are worthy of a defensive test (or is it in there somewhere already?).

Yeah, it makes for surprising and sometimes undocumented behaviors, which is confusing for users and developers at the end. I'd like to think that we'd live happier long-term if the borders are clearer, aka switch to more aggressive ERRORs instead of WARNINGs in some places where the boundaries are blurry. A good thing about in-place tablespaces taken in isolation is that the borders of what you can do are well-defined, which comes down to the absolute vs relative path handling.

Looking at the patch, nothing really stands out..
-- Michael
Re: hio.c does visibilitymap_pin()/IO while holding buffer lock
Hi,

On 2023-03-25 11:17:07 -0700, Andres Freund wrote:
> I don't see how that's easily possible with the current lock ordering > rules. At least without giving up using RBM_ZERO_AND_LOCK for extending or > stuffing even more things to happen with the extension lock held, which I > don't think we want to. I don't think INSERT_FROZEN is worth that price.

I think I might have been thinking of this too narrowly. It's extremely unlikely that another backend would discover the page. And we can use visibilitymap_pin_ok() to amortize the cost to almost nothing - there's a lot of bits in an 8k block...

Here's a draft patch. The bulk relation patch I am polishing has a similar issue, except that there the problem is inserting into the FSM, instead of pinning a VM page. Hence the patch above makes the infrastructure a bit more general than required for the HEAP_INSERT_FROZEN case alone (where we currently shouldn't ever have a valid otherBuffer).

The way the parameter ordering for GetVisibilityMapPins() works makes it somewhat unwieldy - see e.g. the existing

if (otherBuffer == InvalidBuffer || targetBlock <= otherBlock)
	GetVisibilityMapPins(relation, buffer, otherBuffer,
						 targetBlock, otherBlock,
						 vmbuffer, vmbuffer_other);
else
	GetVisibilityMapPins(relation, otherBuffer, buffer,
						 otherBlock, targetBlock,
						 vmbuffer_other, vmbuffer);

which I now duplicated in yet another place. Perhaps we just ought to switch buffer1/block1 with buffer2/block2 inside GetVisibilityMapPins(), to avoid duplicating that code elsewhere?

Because we now track whether the *targetBuffer* was ever unlocked, we can be a bit more narrow about the possibility of there not being sufficient space.

The patch could be narrowed for backpatching. But as there's likely no practical problem at this point, I wouldn't want to backpatch anyway?
Greetings, Andres Freund >From 6f27018765f167a8d2a52bc7c9068cb09ca8d82c Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 24 Oct 2022 12:28:06 -0700 Subject: [PATCH v1 1/2] hio: Release extension lock before initializing page / pinning VM PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3 started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we release the page lock. PageInit() zeroes the page, which isn't that cheap, so deferring it until after the extension lock is released seems like a good idea. Doing visibilitymap_pin() while holding the extension lock, introduced in 7db0cd2145f2, looks like an accident. Due to the restrictions on HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems better to move it out. --- src/backend/access/heap/hio.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index e152807d2dc..7479212d4e0 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -623,6 +623,13 @@ loop: */ buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate); + /* + * Release the file-extension lock; it's now OK for someone else to extend + * the relation some more. + */ + if (needLock) + UnlockRelationForExtension(relation, ExclusiveLock); + /* * We need to initialize the empty new page. Double-check that it really * is empty (this should never happen, but if it does we don't want to @@ -647,13 +654,6 @@ loop: visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer); } - /* - * Release the file-extension lock; it's now OK for someone else to extend - * the relation some more. - */ - if (needLock) - UnlockRelationForExtension(relation, ExclusiveLock); - /* * Lock the other buffer. It's guaranteed to be of a lower page number * than the new page. 
To conform with the deadlock prevent rules, we ought -- 2.38.0 >From 5aeb7f5d042eda6ba4c70d8f475400c51607fac5 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 28 Mar 2023 18:17:11 -0700 Subject: [PATCH v1 2/2] WIP: relation extension: Don't pin the VM while holding buffer lock --- src/backend/access/heap/hio.c | 115 +++--- 1 file changed, 78 insertions(+), 37 deletions(-) diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 7479212d4e0..11ef053705e 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -134,14 +134,17 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock, * buffer2 may be InvalidBuffer, if only one buffer is involved. buffer1 * must not be InvalidBuffer. If both buffers are specified, block1 must * be less than block2. + * +
RE: PGdoc: add missing ID attribute to create_subscription.sgml
Dear Amit-san,

> There is no need to break the link line. See attached.

Understood; I think that is better.

Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: [RFC] Add jit deform_counter
On Sun, 12 Jun 2022 at 21:14, Dmitry Dolgov <9erthali...@gmail.com> wrote: > I've noticed that JIT performance counter generation_counter seems to include > actions, relevant for both jit_expressions and jit_tuple_deforming options. It > means one can't directly see what is the influence of jit_tuple_deforming > alone, which would be helpful when adjusting JIT options. To make it better a > new counter can be introduced, does it make sense? I'm not so sure about this idea. As of now, if I look at EXPLAIN ANALYZE's JIT summary, the individual times add up to the total time. If we add this deform time, then that's no longer going to be true as the "Generation" time includes the newly added deform time. master: JIT: Functions: 600 Options: Inlining false, Optimization false, Expressions true, Deforming true Timing: Generation 37.758 ms, Inlining 0.000 ms, Optimization 6.736 ms, Emission 172.244 ms, Total 216.738 ms 37.758 + 6.736 + 172.244 = 216.738 I think if I was a DBA wondering why JIT was taking so long, I'd probably either be very astonished or I'd report a bug if I noticed that all the individual component JIT times didn't add up to the total time. I don't think the solution is to subtract the deform time from the generation time either. Can't users just get this by looking at EXPLAIN ANALYZE with and without jit_tuple_deforming? David
Re: Add LZ4 compression in pg_dump
On 3/28/23 00:34, gkokola...@pm.me wrote:
> ... >> Got it. In that case I agree it's fine to do that in a single commit. > For what it's worth, I think that this patch should get a +1 and get in. It > solves the empty-writes problem and includes a test for a previously untested > case.

Pushed, after updating / rewording the commit message a little bit. Thanks!

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Should vacuum process config file reload more often
On Tue, Mar 28, 2023 at 4:21 AM Kyotaro Horiguchi wrote:
> > At Mon, 27 Mar 2023 14:12:03 -0400, Melanie Plageman > wrote in > > So, I've attached an alternate version of the patchset which takes the > > approach of having one commit which only enables cost-based delay GUC > > refresh for VACUUM and another commit which enables it for autovacuum > > and makes the changes to balancing variables. > > > > I still think the commit which has workers updating their own > > wi_cost_delay in vacuum_delay_point() is a bit weird. It relies on no one > > else emulating our bad behavior and reading from wi_cost_delay without a > > lock and on no one else deciding to ever write to wi_cost_delay (even > > though it is in shared memory [this is the same as master]). It is only > > safe because our process is the only one (right now) writing to > > wi_cost_delay, so when we read from it without a lock, we know it isn't > > being written to. And everyone else takes a lock when reading from > > wi_cost_delay right now. So, it seems...not great. > > > > This approach also introduces a function that is only around for > > one commit until the next commit obsoletes it, which seems a bit silly. > > (I'm not sure what this refers to, though..) I don't think it's silly > if a later patch removes something that the preceding patches > introduced, as long as that contributes to readability. Ultimately, > they will be merged together on committing.

I was under the impression that reviewers thought config reload and worker balance changes should be committed in separate commits.

Either way, the ephemeral function is not my primary concern. I felt more uncomfortable with increasing how often we update a double in shared memory which is read without acquiring a lock.

> > Basically, I think it is probably better to just have one commit > > enabling guc refresh for VACUUM and then another which correctly > > implements what is needed for autovacuum to do the same. > > Attached v9 does this.
> > > > I've provided both complete versions of both approaches (v9 and v8). > > I took a look at v9 and have a few comments. > > 0001: > > I don't believe it is necessary, as mentioned in the commit > message. It appears that we are resetting it at the appropriate times.

VacuumCostBalance must be zeroed out when we disable vacuum cost. Previously, we did not reenable VacuumCostActive once it was disabled, but now that we do, I think it is good practice to always zero out VacuumCostBalance when we disable vacuum cost. I will squash this commit into the one introducing VacuumCostInactive, though.

> 0002: > > I felt a bit uneasy on this. It seems somewhat complex (and makes the > succeeding patches complex),

Even if we introduced a second global variable to indicate that failsafe mode has been engaged, we would still require the additional checks of VacuumCostInactive.

> has confusing names,

I would be happy to rename the values of the enum to make them less confusing. Are you thinking "force" instead of "locked"? maybe: VACUUM_COST_FORCE_INACTIVE and VACUUM_COST_INACTIVE ?

> and doesn't seem self-contained.

By changing the variable from VacuumCostActive to VacuumCostInactive, I have kept all non-vacuum code from having to distinguish between it being inactive due to failsafe mode or due to user settings.

> I think it'd be simpler to add a global boolean (maybe > VacuumCostActiveForceDisable or such) that forces VacuumCostActive to > be false and set VacuumCostActive using a setter function that follows > the boolean.

I think maintaining an additional global variable is more brittle than including the information in a single variable.

> 0003: > > +* Reload the configuration file if requested. This allows changes to > +* vacuum_cost_limit and vacuum_cost_delay to take effect while a > table is > +* being vacuumed or analyzed.
Analyze should not reload configuration > +* file if it is in an outer transaction, as GUC values shouldn't be > +* allowed to refer to some uncommitted state (e.g. database objects > +* created in this transaction). > > I'm not sure GUC reload is or should be related to transactions. For > instance, work_mem can be changed by a reload during a transaction > unless it has been set in the current transaction. I don't think we > need to deliberately suppress changes in variables caused by reloads > during transactions only for analyze. If analyze doesn't like changes > to certain GUC variables, their values should be snapshotted before > starting the process.

Currently, we only reload the config file in top-level statements. We don't reload the configuration file from within a nested transaction command. BEGIN starts a transaction but not a transaction command. So BEGIN; ANALYZE; probably wouldn't violate this rule. But it is simpler to just forbid reloading when it is not a top-level transaction command. I have updated the comment to reflect this.

> 0004: -
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Tue, Mar 28, 2023 at 02:56:28PM +0900, Michael Paquier wrote:
> Hmm. This is a good point. It is true that the patch feels > incomplete on this side. I don't see why we could not be flexible, > and allow a value of 0 in a partitioned table's relam to mean that we > would pick up the database default in this case when a partition is > created on it. This would have the advantage to be consistent with > older versions where we fallback on the default. We cannot be > completely consistent with the reltablespace of the leaf partitions > unfortunately, as relam should always be set if a relation has > storage. And allowing a value of 0 means that there are likely other > tricky cases with dumps? > > Another thing: would it make sense to allow an empty string in > default_table_access_method so as we'd always fallback to a database > default?

FYI, I am not sure that I will be able to look more at this patch by the end of the commit fest, and there are quite a few more points to consider. Perhaps at this stage we'd better mark it as returned with feedback? I understand that I've arrived late at this party :/
-- Michael
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
On Tue, Mar 28, 2023 at 05:43:26PM +0900, Kyotaro Horiguchi wrote:
> No. Fine by me, except that "block read requests" seems to suggest > kernel read() calls, maybe because it's not clear whether "block" > refers to our buffer blocks or file blocks to me.. If it is generally > clear, I'm fine with the proposal.

Okay. Would somebody like to draft a patch?
-- Michael
Re: Why mark empty pages all visible?
On Tue, Mar 28, 2023 at 3:29 PM Andres Freund wrote: > Why is that? It's as clear a signal of concurrent activity on the buffer > you're going to get. Not immediately getting a cleanup lock in VACUUM is informative to the extent that you only care about what is happening that very nanosecond. If you look at which pages it happens to in detail, what you seem to end up with is a whole bunch of noise, which (on its own) tells you exactly nothing about what VACUUM really ought to be doing with those pages. In almost all cases we could get a cleanup lock by waiting for one millisecond and retrying. I suspect that the cleanup lock thing might be a noisy, unreliable proxy for the condition that you actually care about, in the context of your work on relation extension. I bet there is a better signal to go on, if you look for one. > It's interesting to understand *why* we are doing what we are. I think it'd > make sense to propose changing how things work around this, but I just don't > feel like I have a good enough grasp for why we do some of the things we > do. And given there's not a lot of comments around it and some of the comments > that do exist are inconsistent with themselves, looking at the history seems > like the next best thing? I think that I know why Heikki had no comments about PageIsEmpty() pages when the VM first went in, back in 2009: because it was just so obvious that you'd treat them the same as any other initialized page, it didn't seem to warrant a comment at all. The difference between a page with 0 tuples and 1 tuple is the same difference between a page with 1 tuple and a page with 2 tuples. A tiny difference (one extra tuple), of no particular consequence. I think that you don't see it that way now because you're focussed on the hio.c view of things. That code looked very different back in 2009, and in any case is very far removed from vacuumlazy.c. I can tell you what I was thinking of with lazy_scan_new_or_empty: I hate special cases. 
I will admit to being a zealot about it.

> > I gather that you *don't* want to do anything about the > > lazy_vacuum_heap_page behavior with setting empty pages all-visible (just > > the lazy_scan_new_or_empty behavior). > > Not in the short-medium term, at least. In the long term I do suspect we might > want to do something about it. We have a *crapton* of contention in the FSM > and caused by the FSM in bulk workloads. With my relation extension patch > disabling the FSM nearly doubles concurrent load speed.

I've seen the same effect myself. There is no question that that's a big problem. I think that the problem is that the system doesn't have any firm understanding of pages as things that are owned by particular transactions and/or backends, at least to a limited, scoped extent. It's all really low level, when it actually needs to be high level and take lots of context that comes from the application into account.

> At the same time, the fact that we might lose knowledge about all the > existing free space in case of promotion or crash and never rediscover that > space (because the pages are frozen), seems decidedly not great.

Unquestionably.

> I don't know what the path forward is, but it seems somewhat clear that we > ought to do something. I suspect having a not crash-safe FSM isn't really > acceptable anymore - it probably is fine to not persist a *reduction* in free > space, but we can't permanently lose track of free space, no matter how many > crashes.

Strongly agreed. It's a terrible false economy. If we bit the bullet and made relation extension and the FSM crash safe, we'd gain so much more than we'd lose.

> One thing I am fairly certain about is that using the FSM to tell other > backends about newly bulk extended pages is not a good solution, even though > we're stuck with it for the moment.

Strongly agreed.
> > I think that you must be arguing for making the early lifetime of a > > heap page special to VACUUM, since AFAICT you want to change VACUUM's > > behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not* > > with pages that have one remaining LP_UNUSED item, but are otherwise > > empty (which could be set all-visible/all-frozen in either the first > > or second heap pass, even if we disabled the lazy_scan_new_or_empty() > > behavior you're complaining about). You seem to want to distinguish > > between very new pages (that also happen to be empty), and old pages > > that happen to be empty. Right? > > I think that might be worthwhile, yes. The retry code in > RelationGetBufferForTuple() is quite hairy and almost impossible to test. If > we can avoid the complexity, at a fairly bound cost (vacuum needing to > re-visit new/empty pages if they're currently pinned), it'd imo be more than > worth the price.

Short term, you could explicitly say that PageIsEmpty() means that the page is qualitatively different to other empty pages that were left behind by VACUUM's second phase.

> But perhaps the better
Re: Request for comment on setting binary format output per session
On Tue, 2023-03-28 at 10:22 -0400, Dave Cramer wrote: > OK, IIUC what you are proposing here is that there would be a > separate pool for > database, user, and OIDs. This doesn't seem too flexible. For > instance if I create a UDT and then want it to be returned > as binary then I have to reconfigure the pool to be able to accept a > new list of OID's. There are two ways that I could imagine the connection pool working: 1. Accept whatever clients connect, and pass along the binary_formats setting to the outbound (server) connection. The downside here is that if you have many different clients (or different versions) that have different binary_formats settings, then it creates too many pools and doesn't share well enough. 2. Some kind of configuration setting (or maybe it can be done automatically) that organizes based on a common subset of binary formats that many clients can understand. These can evolve once the protocol extension is in place. Regards, Jeff Davis
Re: running logical replication as the subscription owner
On Tue, 2023-03-28 at 13:55 -0400, Robert Haas wrote: > Oh, interesting. I hadn't realized we were doing that, and I do think > that narrows the vulnerability quite a bit. It's good to know I wasn't missing something obvious. > Absent logical replication, you don't really need to worry > about whether that function is written securely -- it will run with > privileges of the person performing the DML, and not impact your > account at all. That's not strictly true. See the example at the bottom of this email. The second trigger is SECURITY INVOKER, and it captures the leaked secret number that was stored in the tuple by an earlier SECURITY DEFINER trigger. Perhaps I'm being pedantic, but my point is that SECURITY INVOKER is not a magical shield that protects users from themselves without bound. It protects users against some kinds of attacks, sure, but there's a limit to what it has to offer. > They have reason to be scared of you, but not the > reverse. However, if the people doing DML on the table can arrange to > perform that DML as you, then any security vulnerabilities in your > function potentially allow them to compromise your account. OK, but I'd like to be clear that you've moved from your prior statement: "in theory they might be able to protect themselves, but in practice they are unlikely to be careful enough" To something very different, where it seems that we can't think of a concrete problem even for not-so-careful users. The dangerous cases seem to be something along the lines of a security-invoker trigger function that builds and executes arbitrary SQL based on the row contents. And then the table owner would then still need to set ENABLE ALWAYS TRIGGER. Do we really want to take that case on as our security responsibility?
> It would also affect someone who uses a default > expression or other means of associating executable code with the > table, and something like a default expression doesn't require any > explicit setting to make it apply in case of replication, so maybe > the > risk of someone messing up is a bit higher. Perhaps, but I still don't understand that case. Unless I'm missing something, the table owner would have to write a pretty weird default expression or check constraint or whatever to end up with something dangerous. > But this definitely makes it more of a judgment call than I thought > initially. And I'm fine if the judgement is that we just require SET ROLE to be conservative and make sure we don't over-promise in 16. That's a totally reasonable thing: it's easier to loosen restrictions later than to tighten them. Furthermore, just because you and I can't think of exploitable problems doesn't mean they don't exist. I have found this discussion enlightening, so thank you for going through these details with me. > I'm still inclined to leave the patch checking for SET ROLE +0.5. I want the patch to go in either way, and can carry on the discussion later and improve things more for 17. But I want to say generally that I believe we spend too much effort trying to protect unreasonable users from themselves, which interferes with our ability to protect reasonable users and solve real use cases. > However, there might be an argument > that we ought to do something else, like have a REPLICATE privilege Sounds cool. Not sure I have an opinion yet, but probably 17 material. > What I would like to change in the > patch in this release is to add a new subscription property along the > lines of what I proposed to Jelte in my earlier email: let's provide > an escape hatch that turns off the user-switching behavior. I believe switching to the table owner (as done in your patch) is the right best practice, and should be the default. 
I'm fine with an escape hatch here to ease migrations or whatever. But please do it in an unobtrusive way such that we (as hackers) can mostly forget that the non-switching behavior exists. At least for me, our system is already pretty complex without two kinds of subscription security models. Regards, Jeff Davis

-- example follows --

-- user foo
CREATE TABLE secret(I INT);
INSERT INTO secret VALUES(42);
CREATE TABLE r(i INT, secret INT);

CREATE OR REPLACE FUNCTION a_func() RETURNS TRIGGER
SECURITY DEFINER LANGUAGE plpgsql AS $$
BEGIN
  SELECT i INTO NEW.secret FROM secret;
  RETURN NEW;
END;
$$;

CREATE OR REPLACE FUNCTION z_func() RETURNS TRIGGER
SECURITY INVOKER LANGUAGE plpgsql AS $$
BEGIN
  IF NEW.secret <> abs(NEW.secret) THEN
    RAISE EXCEPTION 'no negative secret numbers allowed';
  END IF;
  RETURN NEW;
END;
$$;

CREATE TRIGGER a_trigger BEFORE INSERT ON r
  FOR EACH ROW EXECUTE PROCEDURE a_func();
CREATE TRIGGER z_trigger BEFORE INSERT ON r
  FOR EACH ROW EXECUTE PROCEDURE z_func();

GRANT INSERT ON r TO bar;
GRANT USAGE ON SCHEMA foo TO bar;

-- user bar
CREATE OR REPLACE FUNCTION bar.abs(i INT4) RETURNS INT4
Re: Add pg_walinspect function with block info columns
On Tue, Mar 28, 2023 at 11:15:17AM -0700, Peter Geoghegan wrote: > On Mon, Mar 27, 2023 at 7:47 PM Bharath Rupireddy > wrote: >> Hm, agreed. Changed in the attached v7-0002 patch. We can as well >> write a case statement in the create function SQL to output forkname >> instead of forknumber, but I'd stop doing that to keep in sync with >> pg_buffercache. > > I just don't see much value in any textual representation of fork > name, however generated. In practice it's just not adding very much > useful information. It is mostly useful as a way of filtering block > references, which makes simple integers more natural. I disagree with this argument. Personally, I have a *much* better experience with textual representation because there is no need to cross-check the internals of the code in case you don't remember what a given number means in an enum or in a set of bits, especially if you're in a hurry while looking at a production or customer deployment. In short, it makes for fewer mistakes because you don't have to think about some extra mapping between some integers and what they actually mean through text. The clauses you'd apply for a group by on the forks, or for a filter with IN clauses don't change, they're just made easier to understand for the common user, and that includes experienced people. We'd better think about that like the text[] arrays we use for the flag values, like the FPI flags, or why we've introduced text[] for the HEAP_* flags in the heap functions of pageinspect. There's even more consistency with pageinspect in using a fork name, where we can pass down a fork name to get a raw page. -- Michael
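[Editor's note: to make the CASE mapping Bharath mentioned concrete, here is a hypothetical sketch. The number-to-name mapping follows PostgreSQL's ForkNumber enum and forkNames[] array (0 = main, 1 = fsm, 2 = vm, 3 = init); the function name and column names are assumed from this patch series, and the LSN arguments are placeholders, not a committed definition.]

```sql
-- Hypothetical sketch: translate internal fork numbers into the text
-- fork names used elsewhere in the system (per the ForkNumber enum:
-- 0 = main, 1 = fsm, 2 = vm, 3 = init). Function and column names
-- are assumptions for illustration.
SELECT CASE b.forknum
           WHEN 0 THEN 'main'
           WHEN 1 THEN 'fsm'
           WHEN 2 THEN 'vm'
           WHEN 3 THEN 'init'
       END AS forkname,
       b.*
FROM pg_get_wal_block_info('0/1000000', '0/2000000') AS b;  -- LSN bounds are placeholders
```

Emitting the name directly, as Michael argues, saves every reader from re-deriving this mapping by hand.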
Re: zstd compression for pg_dump
On Wed, Mar 15, 2023 at 9:50 PM Justin Pryzby wrote: > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > > It looks like pg_dump's meson.build is missing dependencies on zstd > > (meson couldn't find the headers in the subproject without them). > > I saw that this was added for LZ4, but I hadn't added it for zstd since > I didn't run into an issue without it. Could you check that what I've > added works for your case? I thought I replied to this, sorry -- your newest patchset builds correctly with subprojects, so the new dependency looks good to me. Thanks! > > Hm. Best I can tell, the CloneArchive() machinery is supposed to be > > handling safety for this, by isolating each thread's state. I don't feel > > comfortable pronouncing this new addition safe or not, because I'm not > > sure I understand what the comments in the format-specific _Clone() > > callbacks are saying yet. > > My line of reasoning for unix is that pg_dump forks before any calls to > zstd. Nothing zstd does ought to affect the pg_dump layer. But that > doesn't apply to pg_dump under windows. This is an open question. If > there's no solid answer, I could disable/ignore the option (maybe only > under windows). To (maybe?) move this forward a bit, note that pg_backup_custom's _Clone() function makes sure that there is no active compressor state at the beginning of the new thread. pg_backup_directory's implementation has no such provision. And I don't think it can, because the parent thread might have concurrently set one up -- see the directory-specific implementation of _CloseArchive(). Perhaps we should just NULL out those fields after the copy, instead? To illustrate why I think this is tough to characterize: if I've read the code correctly, the _Clone() and CloneArchive() implementations are running concurrently with code that is actively modifying the ArchiveHandle and the lclContext.
So safety is only ensured to the extent that we keep track of which fields threads are allowed to touch, and I don't have that mental model. --Jacob
Re: Why mark empty pages all visible?
Hi, On 2023-03-28 13:05:19 -0700, Peter Geoghegan wrote: > On Tue, Mar 28, 2023 at 12:01 PM Andres Freund wrote: > > I will probably make that argument - so far I was just trying to understand > > the intent of the current code. There aren't really comments explaining why > > we > > want to mark currently-pinned empty/new pages all-visible and enter them > > into > > the FSM. > > I don't think that not being able to immediately get a cleanup lock on > a page signifies much of anything. I never have. Why is that? It's as clear a signal of concurrent activity on the buffer you're going to get. > > Historically we did *not* enter currently pinned empty/new pages into the > > FSM > > / VM. Afaics that's new as of 44fa84881fff. > > Of course that's true, but I don't know why that history is > particularly important. It's interesting to understand *why* we are doing what we are. I think it'd make sense to propose changing how things work around this, but I just don't feel like I have a good enough grasp for why we do some of the things we do. And given there's not a lot of comments around it and some of the comments that do exist are inconsistent with themselves, looking at the history seems like the next best thing? > I actually think that I might agree with the substance of much of what > you're saying, but at the same time I don't think that you're defining > the problem in a way that's particularly helpful. Likely because the goals of the existing code aren't clear to me. So I don't feel like I have a firm grasp... > I gather that you *don't* want to do anything about the > lazy_vacuum_heap_page behavior with setting empty pages all-visible (just > the lazy_scan_new_or_empty behavior). Not in the short-medium term, at least. In the long term I do suspect we might want to do something about it. We have a *crapton* of contention in the FSM and caused by the FSM in bulk workloads. With my relation extension patch disabling the FSM nearly doubles concurrent load speed. 
At the same time, the fact that we might lose knowledge about all the existing free space in case of promotion or crash and never rediscover that space (because the pages are frozen), seems decidedly not great. I don't know what the path forward is, but it seems somewhat clear that we ought to do something. I suspect having a not crash-safe FSM isn't really acceptable anymore - it probably is fine to not persist a *reduction* in free space, but we can't permanently lose track of free space, no matter how many crashes. I know that you strongly dislike the way the FSM works, although I forgot some of the details. One thing I am fairly certain about is that using the FSM to tell other backends about newly bulk extended pages is not a good solution, even though we're stuck with it for the moment. > I actually agree that VACUUM is way too unconcerned about interfering > with concurrent activity in terms of how it manages free space in the > FSM. But this seems like just about the least important example of > that (outside the context of your RelationGetBufferForTuple work). The > really important case (that VACUUM gets wrong) all involve recently > dead tuples. But I don't think that you want to talk about that right > now. You want to talk about RelationGetBufferForTuple. That's indeed the background. By now I'd also like to add a few comments explaining why we do what we currently do, because I don't find all of it obvious. > > The reason I'm looking at this is that there's a lot of complexity at the > > bottom of RelationGetBufferForTuple(), related to needing to release the > > lock > > on the newly extended page and then needing to recheck whether there still > > is > > free space on the page. And that it's not complicated enough > > (c.f. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
> > > > As far as I can tell, if we went back to not entering new/empty pages into > > either VM or FSM, we could rely on the page not getting filled while just > > pinning, not locking it. > > What you're essentially arguing for is inventing a new rule that makes > the early lifetime of a page (what we currently call a PageIsEmpty() > page, and new pages) special, to avoid interference from VACUUM. I > have made similar arguments myself on quite a few occasions, so I'm > actually sympathetic. I just think that you should own it. And no, I'm > not just reflexively defending my work in 44fa84881fff; I actually > think that framing the problem as a case of restoring a previous > behavior is confusing and ahistorical. If there was a useful behavior > that was lost, then it was quite an accidental behavior all along. The > difference matters because now you have to reconcile what you're > saying with the lazy_vacuum_heap_page no-cleanup-lock behavior added > in 14. I really don't have a position to own yet, not on firm enough ground. > I think that you must be arguing for making the early lifetime of a > heap page special
Re: Add LZ4 compression in pg_dump
On 3/28/23 18:07, gkokola...@pm.me wrote: > > --- Original Message --- > On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me > wrote: > >> >> --- Original Message --- >> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra >> tomas.von...@enterprisedb.com wrote: >> >>> This leaves the empty-data issue (which we have a fix for) and the >>> switch to LZ4F. And then the zstd part. >> >> Please expect promptly a patch for the switch to frames. > > Please find the expected patch attached. Note that the bulk of the > patch is code unification, variable renaming to something more > appropriate, and comment addition. These are changes that are not > strictly necessary to switch to LZ4F. I do believe they are > essential for code hygiene after the switch and they do belong > on the same commit. > I think the patch is fine, but I'm wondering if the renames shouldn't go a bit further. It removes references to the LZ4File struct, but there's a bunch of functions with an LZ4File_ prefix. Why not simply use an LZ4_ prefix? We don't have GzipFile either. Sure, it might be a bit confusing because lz4.h uses an LZ4_ prefix, but then we probably should not define LZ4_compressor_init ... Also, maybe the comments shouldn't use "File API" when compress_io.c calls that the "Compressed stream API". regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Support logical replication of DDLs
On Sun, Mar 26, 2023 at 5:22 PM Tom Lane wrote: > I spent some time looking through this thread to try to get a sense > of the state of things, and I came away quite depressed. The patchset > has ballooned to over 2MB, which is a couple orders of magnitude > larger than anyone could hope to meaningfully review from scratch. > Despite that, it seems that there are fundamental semantics issues > remaining, not to mention clear-and-present security dangers, not > to mention TODO comments all over the code. Thanks for looking into this! > I'm also less than sold on the technical details, specifically > the notion of "let's translate utility parse trees into JSON and > send that down the wire". You can probably make that work for now, > but I wonder if it will be any more robust against cross-version > changes than just shipping the outfuncs.c representation. (Perhaps > it can be made more robust than the raw parse trees, but I see no > evidence that anyone's thought much about how.) I explored the idea of using the outfuncs.c representation in [1] and found this existing format is not designed to be portable between different major versions. So it can't be directly used for replication without serious modification. I think the DDL deparser is a necessary tool if we want to be able to handle cross-version DDL syntax differences by providing the capability to machine-edit the JSON representation. > And TBH, I don't think that I quite believe the premise in the > first place. The whole point of using logical rather than physical > replication is that the subscriber installation(s) aren't exactly like > the publisher. Given that, how can we expect that automated DDL > replication is going to do the right thing often enough to be a useful > tool rather than a disastrous foot-gun? 
The more you expand the scope > of what gets replicated, the worse that problem becomes --- for > example, I don't buy for one second that "let's replicate roles" > is a credible solution for the problems that come from the roles > not being the same on publisher and subscriber. I agree that a full-fledged DDL deparser and DDL replication is too big of a task for one patch. I think we may consider approaching this feature in the following ways: 1. Phased development and testing as discussed in other emails. Probably support table commands first (as they are the most common DDLs), then the other commands in multiple phases. 2. Provide a subscription option to receive the DDL change, raise a notice, and skip applying the change. The users can listen to the DDL notice and implement application logic to apply the change if needed. The idea is we can start gathering user feedback by providing a somewhat useful feature (compared to doing nothing about DDLs), but also avoid heading straight into the potential footgun situation caused by automatically applying any malformed DDL. 3. As cross-version DDL syntax differences are expected to be uncommon (in real workloads), maybe we can think about other options to handle such edge cases instead of fully automating it? For example, what about letting the user specify how a DDL should be replicated on the subscriber by explicitly providing two versions of DDL commands in some way? > I'm not sure how we get from here to a committable and useful feature, > but I don't think we're close to that yet, and I'm not sure that minor > iterations on a 2MB patchset will accomplish much. About 1 MB of the patch are testing output files for the DDL deparser (postgres/src/test/modules/test_ddl_deparse_regress/expected/). Regards, Zane [1] https://www.postgresql.org/message-id/CAAD30U%2Boi6e6Vh_zAzhuXzkqUhagmLGD%2B_iyn2N9w_sNRKsoag%40mail.gmail.com
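[Editor's note: to make option 2 above concrete, the user-facing shape could look roughly like this. The syntax and option names here are entirely hypothetical, sketched only to illustrate the idea; they are not part of the patch set.]

```sql
-- Hypothetical syntax sketch (option names invented for illustration):
-- publish DDL for tables, but let the subscriber receive each DDL change
-- as a NOTICE and skip applying it automatically.
CREATE PUBLICATION pub_ddl FOR ALL TABLES WITH (ddl = 'table');
CREATE SUBSCRIPTION sub_ddl
    CONNECTION 'dbname=src host=publisher'
    PUBLICATION pub_ddl
    WITH (apply_ddl = off);
```

An application could then listen for those notices and decide per statement whether to apply, edit, or ignore the change.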
Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.
On Mon, Mar 27, 2023 at 4:52 PM Peter Geoghegan wrote: > This is fine, as far as it goes. Obviously it fixes the immediate problem. OK, I've committed and back-patched this fix to v14, just like the erroneous commit that created the issue. -- Robert Haas EDB: http://www.enterprisedb.com
Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
On Tue, Mar 28, 2023 at 3:25 PM Isaac Morland wrote: > On Mon, 19 Dec 2022 at 17:57, Corey Huinker > wrote: > >> >> Attached is my work in progress to implement the changes to the CAST() >> function as proposed by Vik Fearing. >> >> CAST(expr AS typename NULL ON ERROR) >> will use error-safe functions to do the cast of expr, and will return >> NULL if the cast fails. >> >> CAST(expr AS typename DEFAULT expr2 ON ERROR) >> will use error-safe functions to do the cast of expr, and will return >> expr2 if the cast fails. >> > > Is there any difference between NULL and DEFAULT NULL? > What I think you're asking is: is there a difference between these two statements: SELECT CAST(my_string AS integer NULL ON ERROR) FROM my_table; SELECT CAST(my_string AS integer DEFAULT NULL ON ERROR) FROM my_table; And as I understand it, the answer would be no, there is no practical difference. The first case is just a convenient shorthand, whereas the second case tees you up for a potentially complex expression. Before you ask, I believe the ON ERROR syntax could be made optional. As I implemented it, both cases create a default expression which is then typecast to integer, and in both cases that expression would be a const-null, so the optimizer steps would very quickly collapse those steps into a plain old constant.
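[Editor's note: the intended semantics can be approximated today with the error-safe input check pg_input_is_valid() added in the PostgreSQL 16 cycle. This is a rough sketch of the behavior only, not the patch's implementation, which performs the error-safe cast directly rather than validating and casting twice.]

```sql
-- Approximates CAST(my_string AS integer DEFAULT -1 ON ERROR):
-- if my_string is not valid input for integer, fall back to the
-- DEFAULT expression instead of raising an error.
SELECT CASE WHEN pg_input_is_valid(my_string, 'integer')
            THEN my_string::integer
            ELSE -1   -- the DEFAULT expression
       END
FROM my_table;
```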
Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
On Tue, Mar 28, 2023 at 2:53 PM Gregory Stark (as CFM) wrote: > On Tue, 3 Jan 2023 at 14:16, Tom Lane wrote: > > > > Vik Fearing writes: > > > > > I don't think we should add that syntax until I do get it through the > > > committee, just in case they change something. > > > > Agreed. So this is something we won't be able to put into v16; > > it'll have to wait till there's something solid from the committee. > > I guess I'll mark this Rejected in the CF then. Who knows when the SQL > committee will look at this... > > -- > Gregory Stark > As Commitfest Manager > Yes, for now. I'm in touch with the pg-people on the committee and will resume work when there's something to act upon.
Re: Why mark empty pages all visible?
On Tue, Mar 28, 2023 at 12:01 PM Andres Freund wrote: > I will probably make that argument - so far I was just trying to understand > the intent of the current code. There aren't really comments explaining why we > want to mark currently-pinned empty/new pages all-visible and enter them into > the FSM. I don't think that not being able to immediately get a cleanup lock on a page signifies much of anything. I never have. > Historically we did *not* enter currently pinned empty/new pages into the FSM > / VM. Afaics that's new as of 44fa84881fff. Of course that's true, but I don't know why that history is particularly important. Either way, holding a pin was never supposed to work as an interlock against a page being concurrently set all-visible, or having its space recorded in the FSM. It's true that my work on VACUUM has shaken out a couple of bugs where we accidentally relied on that being true. But those were all due to the change in lazy_vacuum_heap_page() made in Postgres 14 -- not the addition of lazy_scan_new_or_empty() in Postgres 15. I actually think that I might agree with the substance of much of what you're saying, but at the same time I don't think that you're defining the problem in a way that's particularly helpful. I gather that you *don't* want to do anything about the lazy_vacuum_heap_page behavior with setting empty pages all-visible (just the lazy_scan_new_or_empty behavior). So clearly this isn't really about marking empty pages all-visible, with or without a cleanup lock. It's actually about something rather more specific: the interactions with RelationGetBufferForTuple. I actually agree that VACUUM is way too unconcerned about interfering with concurrent activity in terms of how it manages free space in the FSM. But this seems like just about the least important example of that (outside the context of your RelationGetBufferForTuple work). The really important cases (that VACUUM gets wrong) all involve recently dead tuples.
But I don't think that you want to talk about that right now. You want to talk about RelationGetBufferForTuple. > The reason I'm looking at this is that there's a lot of complexity at the > bottom of RelationGetBufferForTuple(), related to needing to release the lock > on the newly extended page and then needing to recheck whether there still is > free space on the page. And that it's not complicated enough > (c.f. INSERT_FROZEN doing visibilitymap_pin() with the page locked). > > As far as I can tell, if we went back to not entering new/empty pages into > either VM or FSM, we could rely on the page not getting filled while just > pinning, not locking it. What you're essentially arguing for is inventing a new rule that makes the early lifetime of a page (what we currently call a PageIsEmpty() page, and new pages) special, to avoid interference from VACUUM. I have made similar arguments myself on quite a few occasions, so I'm actually sympathetic. I just think that you should own it. And no, I'm not just reflexively defending my work in 44fa84881fff; I actually think that framing the problem as a case of restoring a previous behavior is confusing and ahistorical. If there was a useful behavior that was lost, then it was quite an accidental behavior all along. The difference matters because now you have to reconcile what you're saying with the lazy_vacuum_heap_page no-cleanup-lock behavior added in 14. I think that you must be arguing for making the early lifetime of a heap page special to VACUUM, since AFAICT you want to change VACUUM's behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not* with pages that have one remaining LP_UNUSED item, but are otherwise empty (which could be set all-visible/all-frozen in either the first or second heap pass, even if we disabled the lazy_scan_new_or_empty() behavior you're complaining about). 
You seem to want to distinguish between very new pages (that also happen to be empty), and old pages that happen to be empty. Right? > I also do wonder whether the different behaviour of PageIsEmpty() and > PageIsNew() pages makes sense. The justification for not marking PageIsNew() > pages as all-visible holds just as true for empty pages. There's just as much > free space there. What you say here makes a lot of sense to me. I'm just not sure what I'd prefer to do about it. -- Peter Geoghegan
Re: Some revises in adding sorting path
On Tue, 21 Feb 2023 at 22:02, Richard Guo wrote: > Looking at the codes now I have some concern that what we do in > create_ordered_paths for partial paths may have already been done in > generate_useful_gather_paths, especially when query_pathkeys is equal to > sort_pathkeys. Not sure if this is a problem. And the comment there > mentions generate_gather_paths(), but ISTM we should mention what > generate_useful_gather_paths has done. I think you need to write some tests for this. I've managed to come up with something to get that code to perform a Sort, but I've not managed to get it to perform an incremental sort.

create or replace function parallel_safe_volatile(a int) returns int
as $$ begin return a; end; $$
parallel safe volatile language plpgsql;

create table abc(a int, b int, c int);
insert into abc select x,y,z
from generate_Series(1,100)x, generate_Series(1,100)y, generate_Series(1,100)z;

set parallel_tuple_cost=0;

Without making those parallel paths, we get:

postgres=# explain select * from abc where a=1 order by a,b,parallel_safe_volatile(c);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=13391.49..13417.49 rows=10400 width=16)
   Sort Key: b, (parallel_safe_volatile(c))
   ->  Gather  (cost=1000.00..12697.58 rows=10400 width=16)
         Workers Planned: 2
         ->  Parallel Seq Scan on abc  (cost=0.00..11697.58 rows=4333 width=16)
               Filter: (a = 1)
(6 rows)

but with, the plan is:

postgres=# explain select * from abc where a=1 order by a,b,parallel_safe_volatile(c);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather Merge  (cost=12959.35..13060.52 rows=8666 width=16)
   Workers Planned: 2
   ->  Sort  (cost=11959.32..11970.15 rows=4333 width=16)
         Sort Key: b, (parallel_safe_volatile(c))
         ->  Parallel Seq Scan on abc  (cost=0.00..11697.58 rows=4333 width=16)
               Filter: (a = 1)
(6 rows)

I added the parallel safe and volatile function so that get_useful_pathkeys_for_relation() wouldn't include all of the query_pathkeys.
If you write some tests for this code, it will be useful to prove that it actually does something, and also that it does not break again in the future. I don't really want to just blindly copy the pattern used in 3c6fc5820 for creating incremental sort paths if it's not useful here. It would be good to see tests that make an Incremental Sort path using the code you're changing. Same for the 0003 patch. I'll mark this as waiting on author while you work on that. David
Re: POC: Better infrastructure for automated testing of concurrency issues
On Tue, Mar 28, 2023 at 9:44 PM Gregory Stark (as CFM) wrote: > On Wed, 31 Aug 2022 at 20:52, Alexander Korotkov wrote: > > > > I'll continue work on this patch. The rebased patch is attached. It > > implements stop events as configure option (not runtime GUC option). > > It looks like this patch isn't going to be ready this commitfest. And > it hasn't received much discussion since August 2022. If I'm wrong say > something but otherwise I'll mark it Returned With Feedback. It can be > resurrected (and moved to the next commitfest) when you're free to > work on it again. I'm good with that. -- Regards, Alexander Korotkov
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
Hi,

Patch v15-0001-ALTER-TABLE-MERGE-PARTITIONS-command.patch

Applies nicely. One warning on meson compile (configure -Dssl=openssl -Dldap=enabled -Dauto_features=enabled -DPG_TEST_EXTRA='ssl,ldap,kerberos' -Dbsd_auth=disabled -Dbonjour=disabled -Dpam=disabled -Dpltcl=disabled -Dsystemd=disabled -Dzstd=disabled -Db_coverage=true):

../../src/pgmergesplit/src/test/modules/test_ddl_deparse/test_ddl_deparse.c: In function ‘get_altertable_subcmdinfo’:
../../src/pgmergesplit/src/test/modules/test_ddl_deparse/test_ddl_deparse.c:112:17: warning: enumeration value ‘AT_MergePartitions’ not handled in switch [-Wswitch]
  112 |         switch (subcmd->subtype)
      |         ^~

It should be the same with 0002... meson test is perfect, and patch coverage is very good.

Patch v15-0002-ALTER-TABLE-SPLIT-PARTITION-command.patch

Doesn't apply on 326a33a289c7ba2dbf45f17e610b7be98dc11f67.

Patch v15-0003-Documentation-for-ALTER-TABLE-SPLIT-PARTITION-ME.patch

Applies with one warning: 1 line adds whitespace errors (translated from the French "warning: 1 ligne a ajouté des erreurs d'espace").

v15-0003-Documentation-for-ALTER-TABLE-SPLIT-PARTITION-ME.patch:54: trailing whitespace.
One of the new partitions partition_name1,

The comments are OK for me, as a non-native English speaker. Perhaps you could add some remarks in ddl.html and alter-ddl.html.

Stéphane

The new status of this patch is: Waiting on Author
Re: Schema variables - new implementation for Postgres 15
Hi

On Sun, Mar 26, 2023 at 19:53, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sun, Mar 26, 2023 at 07:32:05PM +0800, Julien Rouhaud wrote: > > Hi, > > > > I just have a few minor wording improvements for the various comments / > > documentation you quoted. > > Talking about documentation I've noticed that the implementation > contains a few limitations that are not mentioned in the docs. Examples > are WITH queries: > > WITH x AS (LET public.svar = 100) SELECT * FROM x; > ERROR: LET not supported in WITH query >

The LET statement doesn't support the RETURNING clause, so using it inside a CTE does not make any sense. Do you have any tips on where this behaviour should be mentioned?

> and using with set-returning functions (haven't found any related tests). >

There it is:

+CREATE VARIABLE public.svar AS int;
+-- should be ok
+LET public.svar = generate_series(1, 1);
+-- should fail
+LET public.svar = generate_series(1, 2);
+ERROR: expression returned more than one row
+LET public.svar = generate_series(1, 0);
+ERROR: expression returned no rows
+DROP VARIABLE public.svar;

> > Another small note is about this change in the rowsecurity: > > /* > -* For SELECT, UPDATE and DELETE, add security quals to enforce > the USING > -* policies. These security quals control access to existing > table rows. > -* Restrictive policies are combined together using AND, and > permissive > -* policies are combined together using OR. > +* For SELECT, LET, UPDATE and DELETE, add security quals to > enforce the > +* USING policies. These security quals control access to > existing table > +* rows. Restrictive policies are combined together using AND, and > +* permissive policies are combined together using OR. > */ > > From this commentary one may think that LET command supports row level > security, but I don't see it being implemented. A wrong commentary? >

I don't think so. Row level security should be supported.
I tested it on the example from the doc:

CREATE TABLE public.accounts (
    manager text,
    company text,
    contact_email text
);

CREATE VARIABLE public.v AS text;

COPY public.accounts (manager, company, contact_email) FROM stdin;
t1role	xxx	t1r...@xxx.org
t2role	yyy	t2r...@yyy.org
\.

CREATE POLICY account_managers ON public.accounts USING ((manager = CURRENT_USER));

ALTER TABLE public.accounts ENABLE ROW LEVEL SECURITY;

GRANT SELECT,INSERT ON TABLE public.accounts TO t1role;
GRANT SELECT,INSERT ON TABLE public.accounts TO t2role;
GRANT ALL ON VARIABLE public.v TO t1role;
GRANT ALL ON VARIABLE public.v TO t2role;

[pavel@localhost postgresql.master]$ psql
Assertions: on
psql (16devel)
Type "help" for help.

(2023-03-28 21:32:33) postgres=# set role to t1role;
SET
(2023-03-28 21:32:40) postgres=# select * from accounts ;
┌─────────┬─────────┬────────────────┐
│ manager │ company │ contact_email  │
╞═════════╪═════════╪════════════════╡
│ t1role  │ xxx     │ t1r...@xxx.org │
└─────────┴─────────┴────────────────┘
(1 row)

(2023-03-28 21:32:45) postgres=# let v = (select company from accounts);
LET
(2023-03-28 21:32:58) postgres=# select v;
┌─────┐
│  v  │
╞═════╡
│ xxx │
└─────┘
(1 row)

(2023-03-28 21:33:03) postgres=# set role to default;
SET
(2023-03-28 21:33:12) postgres=# set role to t2role;
SET
(2023-03-28 21:33:19) postgres=# select * from accounts ;
┌─────────┬─────────┬────────────────┐
│ manager │ company │ contact_email  │
╞═════════╪═════════╪════════════════╡
│ t2role  │ yyy     │ t2r...@yyy.org │
└─────────┴─────────┴────────────────┘
(1 row)

(2023-03-28 21:33:22) postgres=# let v = (select company from accounts);
LET
(2023-03-28 21:33:26) postgres=# select v;
┌─────┐
│  v  │
╞═════╡
│ yyy │
└─────┘
(1 row)

Regards

Pavel
Re: Documentation for building with meson
Hi, On Wed, Mar 15, 2023 at 4:28 AM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > [PATCH v8 1/5] Make minor additions and corrections to meson docs > > The last hunk revealed that there is some mixing up between meson setup > and meson configure. This goes a bit further. For example, earlier it > says that to get a list of meson setup options, call meson configure > --help and look at https://mesonbuild.com/Commands.html#configure, which > are both wrong. Also later throughout the text it uses one or the > other. I think this has the potential to be very confusing, and we > should clean this up carefully. > > The text about additional meson test options maybe should go into the > regress.sgml chapter? > I tried to make the meson setup and meson configure usage consistent. I've removed the text for the test options. > > > > [PATCH v8 2/5] Add data layout options sub-section in installation > docs > > This makes sense. Please double check your patch for correct title > casing, vertical spacing of XML, to keep everything looking consistent. > Thanks for noticing. Made it consistent on both sides. > > This text isn't yours, but since your patch emphasizes it, I wonder if > it could use some clarification: > > + These options affect how PostgreSQL lays out data on disk. > + Note that changing these breaks on-disk database compatibility, > + meaning you cannot use pg_upgrade to upgrade to > + a build with different values of these options. > > This isn't really correct. What breaking on-disk compatibility means is > that you can't use a server compiled one way with a data directory > initialized by binaries compiled another way. pg_upgrade may well have > the ability to upgrade between one or the other; that's up to pg_upgrade > to figure out but not an intrinsic property. (I wonder why pg_upgrade > cares about the WAL block size.) > Fixed. > > > > [PATCH v8 3/5] Remove Anti-Features section from Installation from > source docs > > Makes sense. 
But is "--disable-thread-safety" really a developer > feature? I think not. > > Moved to PostgreSQL features. Do you think there's a better place for it? > > > [PATCH v8 4/5] Re-organize Miscellaneous section > > This moves the Miscellaneous section after Developer Features. I think > Developer Features should be last. > > Maybe should remove this section and add the options to the regular > PostgreSQL Features section. > Yes, that makes sense. Made this change. > > Also consider the grouping in meson_options.txt, which is slightly > different yet. Removed Misc options section from meson_options.txt too. > > > > [PATCH v8 5/5] Change Short Version for meson installation guide > > +# create working directory > +mkdir postgres > +cd postgres > + > +# fetch source code > +git clone https://git.postgresql.org/git/postgresql.git src > > This comes after the "Getting the Source" section, so at this point they > already have the source and don't need to do "git clone" etc. again. > > +# setup and enter build directory (done only first time) > +## Unix based platforms > +meson setup build src --prefix=$PWD/install > + > +## Windows > +meson setup build src --prefix=%cd%/install > > Maybe some people work this way, but to me the directory structures you > create here are completely weird. > I'd like to discuss what you think is a good directory structure to work with. I've mentioned some of the drawbacks I see with the current structure for the short version. I know this structure can feel different but it feeling weird is not ideal. Do you have a directory structure in mind which is different but doesn't feel odd to you? > > +# Initialize a new database > +../install/bin/initdb -D ../data > + > +# Start database > +../install/bin/pg_ctl -D ../data/ -l logfile start > + > +# Connect to the database > +../install/bin/psql -d postgres > > The terminology here needs to be tightened up. You are using "database" > here to mean three different things. 
> I'll address this together once we are aligned on the overall directory structure etc. There are a few reasons why I had done this. Some reasons Andres has > described in his previous email and I'll add a few specific examples on why > having the same section for both might not be a good idea. > > * Having readline and zlib as required for building PostgreSQL is now > wrong because they are not required for meson builds. Also, the name of the > configs are different for make and meson and the current section only lists > the make ones. > * There are many references to configure in that section which don't > apply to meson. > * Last I checked Flex and Bison were always required to build via meson > but not for make and the current section doesn't explain those differences. > > I spent a good amount of time thinking if we could have a single section, > clarify these differences to make it correct and not confuse the users. I > couldn't find a way to do all
Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
On Mon, 19 Dec 2022 at 17:57, Corey Huinker wrote: > > Attached is my work in progress to implement the changes to the CAST() > function as proposed by Vik Fearing. > > CAST(expr AS typename NULL ON ERROR) > will use error-safe functions to do the cast of expr, and will return > NULL if the cast fails. > > CAST(expr AS typename DEFAULT expr2 ON ERROR) > will use error-safe functions to do the cast of expr, and will return > expr2 if the cast fails. > Is there any difference between NULL and DEFAULT NULL?
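Until the committee-blessed CAST(... ON ERROR) syntax lands, the proposed NULL/DEFAULT-on-error behavior can be approximated with the error-safe input checks that did make it into core. A hedged sketch — pg_input_is_valid() is available as of PostgreSQL 16, and the -1 default here is purely illustrative:

```sql
-- Approximation of CAST(v AS integer DEFAULT -1 ON ERROR) using the
-- error-safe validity check available since PostgreSQL 16:
SELECT v,
       CASE WHEN pg_input_is_valid(v, 'integer')
            THEN v::integer
            ELSE -1              -- the "DEFAULT -1 ON ERROR" part
       END AS safe_cast
FROM (VALUES ('123'), ('not a number')) AS t(v);
```

Replacing -1 with NULL in the ELSE branch gives the "NULL ON ERROR" variant, which also shows why NULL and DEFAULT NULL would behave identically here.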
Re: Schema variables - new implementation for Postgres 15
ne 26. 3. 2023 v 13:32 odesílatel Julien Rouhaud napsal: > Hi, > > I just have a few minor wording improvements for the various comments / > documentation you quoted. > > On Sun, Mar 26, 2023 at 08:53:49AM +0200, Pavel Stehule wrote: > > út 21. 3. 2023 v 17:18 odesílatel Peter Eisentraut < > > peter.eisentr...@enterprisedb.com> napsal: > > > > > - What is the purpose of struct Variable? It seems very similar to > > >FormData_pg_variable. At least a comment would be useful. > > > > > > > I wrote comment there: > > > > > > /* > > * The Variable struct is based on FormData_pg_variable struct. Against > > * FormData_pg_variable it can hold node of deserialized expression used > > * for calculation of default value. > > */ > > Did you mean "Unlike" rather than "Against"? > fixed > > > > 0002 > > > > > > expr_kind_allows_session_variables() should have some explanation > > > about criteria for determining which expression kinds should allow > > > variables. > > > > > > > I wrote comment there: > > > > /* > > * Returns true, when expression of kind allows using of > > * session variables. > > + * The session's variables can be used everywhere where > > + * can be used external parameters. Session variables > > + * are not allowed in DDL. Session's variables cannot be > > + * used in constraints. > > + * > > + * The identifier can be parsed as an session variable > > + * only in expression's kinds where session's variables > > + * are allowed. This is the primary usage of this function. > > + * > > + * Second usage of this function is for decision if > > + * an error message "column does not exist" or "column > > + * or variable does not exist" should be printed. When > > + * we are in expression, where session variables cannot > > + * be used, we raise the first form or error message. > > */ > > Maybe > > /* > * Returns true if the given expression kind is valid for session variables > * Session variables can be used everywhere where external parameters can > be > * used. 
Session variables are not allowed in DDL commands or in > constraints. > * > * An identifier can be parsed as a session variable only for expression > kinds > * where session variables are allowed. This is the primary usage of this > * function. > * > * Second usage of this function is to decide whether "column does not > exist" or > * "column or variable does not exist" error message should be printed. > * When we are in an expression where session variables cannot be used, we > raise > * the first form or error message. > */ > changed > > > > session_variables_ambiguity_warning: There needs to be more > > > information about this. The current explanation is basically just, > > > "warn if your query is confusing". Why do I want that? Why would I > > > not want that? What is the alternative? What are some examples? > > > Shouldn't there be a standard behavior without a need to configure > > > anything? > > > > > > > I enhanced this entry: > > > > + > > +The session variables can be shadowed by column references in a > > query. This > > +is an expected feature. The existing queries should not be > broken > > by creating > > +any session variable, because session variables are shadowed > > always if the > > +identifier is ambiguous. The variables should be named without > > possibility > > +to collision with identifiers of other database objects (column > > names or > > +record field names). The warnings enabled by setting > > session_variables_ambiguity_warning > > +should help with finding identifier's collisions. > > Maybe > > Session variables can be shadowed by column references in a query, this is > an > expected behavior. Previously working queries shouldn't error out by > creating > any session variable, so session variables are always shadowed if an > identifier > is ambiguous. 
Variables should be referenced using an unambiguous > identifier > without any possibility for a collision with identifier of other database > objects (column names or record fields names). The warning messages > emitted > when enabling session_variables_ambiguity_warning can > help > finding such identifier collision. > > > + > > + > > +This feature can significantly increase size of logs, and then > it > > is > > +disabled by default, but for testing or development > environments it > > +should be enabled. > > Maybe > > This feature can significantly increase log size, so it's disabled by > default. > For testing or development environments it's recommended to enable it if > you > use session variables. > replaced Thank you very much for these language correctures Regards Pavel p.s. I'll send updated patch after today or tomorrow - I have to fix broken dependency check after rebase
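For readers following along, a minimal sketch of the shadowing behavior that session_variables_ambiguity_warning is meant to surface. Note this uses the patch's proposed CREATE VARIABLE / LET syntax and GUC, none of which is in core:

```sql
CREATE TABLE t (v int);
INSERT INTO t VALUES (1);
CREATE VARIABLE v AS int;
LET v = 100;

SET session_variables_ambiguity_warning = on;

-- The identifier "v" is ambiguous here; per the patch, the column always
-- wins (variables are always shadowed), and a warning is emitted:
SELECT v FROM t;
```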
Re: Why mark empty pages all visible?
Hi,

On 2023-03-27 21:51:09 -0700, Peter Geoghegan wrote:
> On Mon, Mar 27, 2023 at 9:32 PM Andres Freund wrote:
> > > You haven't said anything about the leading cause of marking empty
> > > pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> > > marking empty pages all-frozen?
> >
> > It's not obvious that it should - but it's not as clear a case as the
> > ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
> > latter, we know
> > a) that we don't have to do any work to be able to advance the horizon
> > b) we know that somebody else has the page pinned
> >
> > What's the point in marking it all-visible at that point? The pin quite
> > likely comes from RelationGetBufferForTuple() having extended the relation
> > and then briefly needing to release the lock (to acquire the lock on
> > otherBuffer or in GetVisibilityMapPins()).
>
> I think that there is significant value in avoiding special cases, on
> general principle. If we stopped doing this in
> lazy_scan_new_or_empty() we'd be inventing a new thing that cleanup
> locks are supposed to protect against. Maybe something like that would
> make sense, but if so then make that argument, and make it explicitly
> represented in the code.

I will probably make that argument - so far I was just trying to understand the intent of the current code. There aren't really comments explaining why we want to mark currently-pinned empty/new pages all-visible and enter them into the FSM.

Historically we did *not* enter currently pinned empty/new pages into the FSM / VM. Afaics that's new as of 44fa84881fff.

The reason I'm looking at this is that there's a lot of complexity at the bottom of RelationGetBufferForTuple(), related to needing to release the lock on the newly extended page and then needing to recheck whether there still is free space on the page. And as if that weren't complicated enough (c.f. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
As far as I can tell, if we went back to not entering new/empty pages into either VM or FSM, we could rely on the page not getting filled while just pinning, not locking it. > > I don't follow. In the ConditionalLockBufferForCleanup() -> > > lazy_scan_new_or_empty() case we are dealing with an new or empty > > page. Whereas lazy_vacuum_heap_page() deals with a page that definitely > > has dead tuples on it. How are those two cases comparable? > > It doesn't have dead tuples anymore, though. > > ISTM that there is an issue here with the definition of an empty page. > You're concerned about PageIsEmpty() pages. And not just any PageIsEmpty() page, ones that are currently pinned. I also do wonder whether the different behaviour of PageIsEmpty() and PageIsNew() pages makes sense. The justification for not marking PageIsNew() pages as all-visible holds just as true for empty pages. There's just as much free space there. Greetings, Andres Freund
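The VM bits being discussed here can be observed from SQL with the contrib pg_visibility module, which is handy when experimenting with how VACUUM treats empty/new pages. A sketch (the table name is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- per-page all-visible / all-frozen bits straight from the visibility map:
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('some_table');

-- pages whose VM bit claims all-visible but whose heap contents disagree:
SELECT * FROM pg_check_visible('some_table');
```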
Re: [PATCH]Feature improvement for MERGE tab completion
Ah, another thread with a bouncing email address... Please respond to to thread from this point to avoid bounces. -- Gregory Stark As Commitfest Manager
Re: [PATCH]Feature improvement for MERGE tab completion
It looks like this remaining work isn't going to happen this CF and therefore this release. There hasn't been an update since January when Dean Rasheed posted a review. Unless there's any updates soon I'll move this on to the next commitfest or mark it returned with feedback. -- Gregory Stark As Commitfest Manager
Re: running logical replication as the subscription owner
On Tue, 28 Mar 2023 at 18:13, Robert Haas wrote: > > On Mon, Mar 27, 2023 at 6:08 PM Jelte Fennema wrote: > > For high privileged subscription owners, > > we downgrade to the permissions of the table owner, but for low > > privileged ones we care about permissions of the subscription owner > > itself. > > Hmm. This is an interesting idea. A variant on this theme could be: > what if we made this an explicit configuration option? Sounds perfect to me. Let's do that. As long as the old no-superuser tests continue to pass when disabling the new switch-to-table-owner behaviour, that sounds totally fine to me. I think it's probably easiest if you use the tests from my v2 patch when you add that option, since that was by far the thing I spent most time on to get right in the v2 patch. On Tue, 28 Mar 2023 at 18:13, Robert Haas wrote: > > On Mon, Mar 27, 2023 at 6:08 PM Jelte Fennema wrote: > > Attached is an updated version of your patch with what I had in mind > > (admittedly it needed one more line than "just" the return to make it > > work). But as you can see all previous tests for a lowly privileged > > subscription owner that **cannot** SET ROLE to the table owner > > continue to work as they did before. While still downgrading to the > > table owners role when the subscription owner **can** SET ROLE to the > > table owner. > > > > Obviously this needs some comments explaining what's going on and > > probably some code refactoring and/or variable renaming, but I hope > > it's clear what I meant now: For high privileged subscription owners, > > we downgrade to the permissions of the table owner, but for low > > privileged ones we care about permissions of the subscription owner > > itself. > > Hmm. This is an interesting idea. A variant on this theme could be: > what if we made this an explicit configuration option? 
> > I'm worried that if we just try to switch users and silently fall back > to not doing so when we don't have enough permissions, the resulting > behavior is going to be difficult to understand and troubleshoot. I'm > thinking that maybe if you let people pick the behavior they want that > becomes more comprehensible. It's also a nice insurance policy: say > for the sake of argument we make switch-to-table-owner the new > default. If that new behavior causes something to happen to somebody > that they don't like, they can always turn it off, even if they are a > highly privileged user who doesn't "need" to turn it off. > > -- > Robert Haas > EDB: http://www.enterprisedb.com
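The "can the subscription owner SET ROLE to the table owner" condition underlying this proposal can be previewed from plain SQL with pg_has_role(). A sketch — the role name is hypothetical, and at the time of this thread role membership is what gated SET ROLE:

```sql
-- For each table in the target schema: could role sub_owner
-- switch to the table's owner?
SELECT schemaname, tablename, tableowner,
       pg_has_role('sub_owner', tableowner, 'MEMBER') AS can_set_role
FROM pg_tables
WHERE schemaname = 'public';
```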
Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions
On Tue, 3 Jan 2023 at 14:16, Tom Lane wrote: > > Vik Fearing writes: > > > I don't think we should add that syntax until I do get it through the > > committee, just in case they change something. > > Agreed. So this is something we won't be able to put into v16; > it'll have to wait till there's something solid from the committee. I guess I'll mark this Rejected in the CF then. Who knows when the SQL committee will look at this... -- Gregory Stark As Commitfest Manager
Re: Request for comment on setting binary format output per session
FYI I attached this thread to https://commitfest.postgresql.org/42/3777 which I believe is the same issue. I mistakenly had this listed as a CF entry with no discussion for a long time due to that missing link. -- Gregory Stark As Commitfest Manager
Re: Add LZ4 compression in pg_dump
On Tue, Mar 28, 2023 at 06:40:03PM +0200, Tomas Vondra wrote: > On 3/28/23 18:07, gkokola...@pm.me wrote: > > --- Original Message --- > > On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me > > wrote: > > > >> --- Original Message --- > >> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra > >> tomas.von...@enterprisedb.com wrote: > >> > >>> This leaves the empty-data issue (which we have a fix for) and the > >>> switch to LZ4F. And then the zstd part. > >> > >> Please expect promptly a patch for the switch to frames. > > > > Please find the expected patch attached. Note that the bulk of the > > patch is code unification, variable renaming to something more > > appropriate, and comment addition. These are changes that are not > > strictly necessary to switch to LZ4F. I do believe that they are > > essential for code hygiene after the switch and they do belong > > on the same commit. > > Thanks! > > I agree the renames & cleanup are appropriate - it'd be silly to stick > to misleading naming etc. Would it make sense to split the patch into > two, to separate the renames and the switch to lz4f? > That'd make the changes necessary for the lz4f switch clearer. I don't think so. Did you mean separate commits only for review? The patch is pretty readable - the File API has just some renames, and the compressor API is what's being replaced, which isn't going to be any more clear. @Georgeos: did you consider using a C union in LZ4State, to separate the parts used by the different APIs? -- Justin
Re: zstd compression for pg_dump
On 3/28/23 20:03, Kirk Wolak wrote: > On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: > > On 3/27/23 19:28, Justin Pryzby wrote: > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > >> On 3/16/23 05:50, Justin Pryzby wrote: > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion > mailto:jchamp...@timescale.com>> wrote: > > I did some smoke testing against zstd's GitHub release on > Windows. To > ... > OK. I don't have access to a Windows machine so I can't test that. Is it > possible to disable the zstd threading, until we figure this out? > > Thomas since I appear to be one of the few windows users (I use both), > can I help? > I can test pg_dump... for you, easy to do. I do about 5-10 pg_dumps a > day on windows while developing. > Perhaps. But I'll leave the details up to Justin - it's his patch, and I'm not sure how to verify the threading is OK. I'd try applying this patch, build with --with-zstd and then run the pg_dump TAP tests, and perhaps do some manual tests. And perhaps do the same for --with-lz4 - there's a thread [1] suggesting we don't detect lz4 stuff on Windows, so the TAP tests do nothing. https://www.postgresql.org/message-id/zajl96n9zw84u...@msg.df7cb.de > Also, I have an AWS instance I created to build PG/Win with readline > back in November. > I could give you access to that... (you are not the only person who has > made this statement here). > I've made such instances available for other Open Source developers, to > support them. > > Obvi I would share connection credentials privately. > I'd rather leave the Windows stuff up to someone with more experience with that platform. I have plenty of other stuff on my plate atm. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: POC: Better infrastructure for automated testing of concurrency issues
On Wed, 31 Aug 2022 at 20:52, Alexander Korotkov wrote: > > I'll continue work on this patch. The rebased patch is attached. It > implements stop events as configure option (not runtime GUC option). It looks like this patch isn't going to be ready this commitfest. And it hasn't received much discussion since August 2022. If I'm wrong say something but otherwise I'll mark it Returned With Feedback. It can be resurrected (and moved to the next commitfest) when you're free to work on it again. -- Gregory Stark As Commitfest Manager
Re: "variable not found in subplan target list"
Alvaro Herrera writes: > So I'm back home and found a couple more weird errors in the log: > ERROR: mismatching PartitionPruneInfo found at part_prune_index 0 > DETALLE: plan node relids (b 1), pruneinfo relids (b 36) This one reproduces for me. > select > pg_catalog.pg_stat_get_buf_fsync_backend() as c9 > from > public.tenk2 as ref_0 > where (ref_0.stringu2 is NULL) > and (EXISTS ( > select 1 from fkpart5.fk1 as ref_1 > where pg_catalog.current_date() < (select pg_catalog.max(filler3) > from public.mcv_lists))) ; > ERROR: subplan "InitPlan 1 (returns $1)" was not initialized > CONTEXTO: parallel worker Hmph, I couldn't reproduce that, not even with other settings of debug_parallel_query. Are you running it with non-default planner parameters? > select 1 as c0 > ... > ERROR: could not find commutator for operator 53286 I got a slightly different error: ERROR: missing support function 1(195306,195306) in opfamily 1976 where regression=# select 195306::regtype; regtype int8alias1 (1 row) So that one is related to the intentionally-somewhat-broken int8 opclass configuration that equivclass.sql leaves behind. I've always had mixed emotions about whether leaving that set up that way was a good idea or not. In principle nothing really bad should happen, but it can lead to confusing errors like this one. Maybe it'd be better to roll that back? regards, tom lane
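For anyone reproducing these, the objects the planner complained about can be inspected directly in the catalogs. A sketch — the OIDs are the ones from the reported errors and will differ in any other installation:

```sql
-- "could not find commutator for operator 53286": what does pg_operator say?
SELECT oprname, oprleft::regtype, oprright::regtype,
       oprcom::regoperator AS commutator
FROM pg_operator
WHERE oid = 53286;

-- "missing support function 1(195306,195306) in opfamily 1976":
SELECT amprocnum, amproc
FROM pg_amproc
WHERE amprocfamily = 1976
  AND amproclefttype = 195306
  AND amprocrighttype = 195306;
```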
Re: logical decoding and replication of sequences, take 2
On 3/28/23 18:34, Masahiko Sawada wrote: > On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra > wrote: >> >> >> >> On 3/27/23 03:32, Masahiko Sawada wrote: >>> Hi, >>> >>> On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra >>> wrote: I merged the earlier "fixup" patches into the relevant parts, and left two patches with new tweaks (deducing the corrent "WAL" state from the current state read by copy_sequence), and the interlock discussed here. >>> >>> Apart from that, how does the publication having sequences work with >>> subscribers who are not able to handle sequence changes, e.g. in a >>> case where PostgreSQL version of publication is newer than the >>> subscriber? As far as I tested the latest patches, the subscriber >>> (v15) errors out with the error 'invalid logical replication message >>> type "Q"' when receiving a sequence change. I'm not sure it's sensible >>> behavior. I think we should instead either (1) deny starting the >>> replication if the subscriber isn't able to handle sequence changes >>> and the publication includes that, or (2) not send sequence changes to >>> such subscribers. >>> >> >> I agree the "invalid message" error is not great, but it's not clear to >> me how to do either (1). The trouble is we don't really know if the >> publication contains (or will contain) sequences. I mean, what would >> happen if the replication starts and then someone adds a sequence? >> >> For (2), I think that's not something we should do - silently discarding >> some messages seems error-prone. If the publication includes sequences, >> presumably the user wanted to replicate those. If they want to replicate >> to an older subscriber, create a publication without sequences. >> >> Perhaps the right solution would be to check if the subscriber supports >> replication of sequences in the output plugin, while attempting to write >> the "Q" message. And error-out if the subscriber does not support it. 
> > It might be related to this topic; do we need to bump the protocol > version? The commit 64824323e57d introduced new streaming callbacks > and bumped the protocol version. I think the same seems to be true for > this change as it adds sequence_cb callback. > It's not clear to me what the exact behavior should be. I mean, imagine we're opening a connection for logical replication, and the subscriber does not handle sequences. What should the publisher do? (Note: The correct commit hash is 464824323e57d.) I don't think streaming is a good match for sequences, because of a couple of important differences ... Firstly, streaming determines *how* the changes are replicated, not what gets replicated. It doesn't (silently) filter out "bad" events that the subscriber doesn't know how to apply. If the subscriber does not know how to deal with streamed xacts, it'll still get the same changes exactly per the publication definition. Secondly, the default value is "streaming=off", i.e. the subscriber has to explicitly request streaming when opening the connection. And we simply check it against the negotiated protocol version, i.e. the check in pgoutput_startup() protects against a subscriber requesting protocol v1 but also streaming=on. I don't think we can/should do more checks at this point - we don't know what's included in the requested publications at that point, and I doubt it's worth adding because we certainly can't predict if the publication will be altered to include/decode sequences in the future. Speaking of precedents, TRUNCATE is probably a better one, because it's a new action and it determines *what* the subscriber can handle. 
But that does exactly the thing we do for sequences - if you open a connection from PG10 subscriber (truncate was added in PG11), and the publisher decodes a truncate, subscriber will do: 2023-03-28 20:29:46.921 CEST [2357609] ERROR: invalid logical replication message type "T" 2023-03-28 20:29:46.922 CEST [2356534] LOG: worker process: logical replication worker for subscription 16390 (PID 2357609) exited with exit code 1 I don't see why sequences should do anything else. If you need to replicate to such subscriber, create a publication that does not have 'sequence' in the publish option ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
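A sketch of the suggested workaround for older subscribers. Note that the 'sequence' publish action comes from the patch under discussion and is not in core; the other actions shown are the existing ones:

```sql
-- Publication for a subscriber that cannot decode sequence changes:
-- simply leave 'sequence' out of the publish list, mirroring how a
-- pre-PG11 subscriber would be spared TRUNCATE messages:
CREATE PUBLICATION pub_no_seq FOR ALL TABLES
    WITH (publish = 'insert, update, delete, truncate');
```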
Re: "variable not found in subplan target list"
So I'm back home and found a couple more weird errors in the log: MERGE INTO public.idxpart2 as target_0 USING (select 1 from public.xmltest2 as ref_0 inner join public.prt1_l_p1 as sample_0 inner join fkpart4.droppk as ref_1 on (sample_0.a = ref_1.a ) on (true) limit 50) as subq_0 left join information_schema.transforms as ref_2 left join public.transition_table_status as sample_1 on (ref_2.transform_type is not NULL) on (true) ON target_0.a = sample_1.level WHEN MATCHED THEN UPDATE set a = target_0.a; ERROR: mismatching PartitionPruneInfo found at part_prune_index 0 DETALLE: plan node relids (b 1), pruneinfo relids (b 36) This one is probably my fault, will look later. select pg_catalog.pg_stat_get_buf_fsync_backend() as c9 from public.tenk2 as ref_0 where (ref_0.stringu2 is NULL) and (EXISTS ( select 1 from fkpart5.fk1 as ref_1 where pg_catalog.current_date() < (select pg_catalog.max(filler3) from public.mcv_lists))) ; ERROR: subplan "InitPlan 1 (returns $1)" was not initialized CONTEXTO: parallel worker select 1 as c0 from (select subq_0.c9 as c5, subq_0.c8 as c9 from public.iso8859_5_inputs as ref_0, lateral (select ref_1.ident as c2, ref_0.description as c8, ref_1.used_bytes as c9 from pg_catalog.pg_backend_memory_contexts as ref_1 where true ) as subq_0 where subq_0.c2 is not NULL) as subq_1 inner join pg_catalog.pg_class as sample_0 on (subq_1.c5 = public.int8alias1in( cast(case when subq_1.c9 is not NULL then null end as cstring))) where true; ERROR: could not find commutator for operator 53286 There were quite a few of those "variable not found" ones, both mentioning singular "targetlist" and others that said "targetlists". I reran them with your patch and they no longer error out, so I guess it's all the same bug. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "I must say, I am absolutely impressed with what pgsql's implementation of VALUES allows me to do. It's kind of ridiculous how much "work" goes away in my code. 
Too bad I can't do this at work (Oracle 8/9)." (Tom Allison) http://archives.postgresql.org/pgsql-general/2007-06/msg00016.php
Re: allow_in_place_tablespaces vs. pg_basebackup
On Tue, Mar 28, 2023 at 5:15 PM Thomas Munro wrote: > I guess we should add a test of -Fp too, to keep it working? Oops, that was already tested in existing tests, so I take that back.
Re: Add pg_walinspect function with block info columns
On Mon, Mar 27, 2023 at 7:47 PM Bharath Rupireddy wrote: > Hm, agreed. Changed in the attached v7-0002 patch. We can as well > write a case statement in the create function SQL to output forkname > instead forknumber, but I'd stop doing that to keep in sync with > pg_buffercache. I just don't see much value in any textual representation of fork name, however generated. In practice it's just not adding very much useful information. It is mostly useful as a way of filtering block references, which makes simple integers more natural. > Oh, thanks for finding it out. Fixed in the attached v7-0001 patch. I > also removed the "invalid fork number" error as users can figure that > out if at all the fork number is wrong. Pushed just now. > On the ordering of the columns, I kept start_lsn, end_lsn and prev_lsn > first and then the rel** columns (this rel** columns order follows > pg_buffercache) and then block data related columns. Michael and > Kyotaro are of the opinion that it's better to keep LSNs first to be > consistent and also given that this function is WAL related, it makes > sense to have LSNs first. Right, but I didn't change that part in the revision of the patch I posted. Those columns still came first, and were totally consistent with the pg_get_wal_record_info function. I think that there was a "mid air collision" here, where we both posted patches that we each called v7 within minutes of each other. Just to be clear, I ended up with a column order as described here in my revision: https://postgr.es/m/CAH2-WzmzO-AU4QSbnzzANBkrpg=4cuod3scvtv+7x65e+qk...@mail.gmail.com It now occurs to me that "fpi_data" should perhaps be called "block_fpi_data". What do you think? -- Peter Geoghegan
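If a textual fork name is ever wanted, the CASE mapping mentioned above is trivial to write on the client side instead of baking it into the function. A sketch against the block-info function as discussed in this thread — the function and column names follow the in-flight patches and may differ in the committed version; the fork numbers follow relpath.h:

```sql
SELECT start_lsn, relblocknumber,
       CASE relforknumber
            WHEN 0 THEN 'main'   -- main fork
            WHEN 1 THEN 'fsm'    -- free space map
            WHEN 2 THEN 'vm'     -- visibility map
            WHEN 3 THEN 'init'   -- init fork
       END AS forkname
FROM pg_get_wal_block_info('0/1000000', '0/1100000');
```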
Re: zstd compression for pg_dump
On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra wrote: > On 3/27/23 19:28, Justin Pryzby wrote: > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > >> On 3/16/23 05:50, Justin Pryzby wrote: > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion < > jchamp...@timescale.com> wrote: > > I did some smoke testing against zstd's GitHub release on Windows. To > ... > OK. I don't have access to a Windows machine so I can't test that. Is it > possible to disable the zstd threading, until we figure this out? > > Thomas since I appear to be one of the few windows users (I use both), can I help? I can test pg_dump... for you, easy to do. I do about 5-10 pg_dumps a day on windows while developing. Also, I have an AWS instance I created to build PG/Win with readline back in November. I could give you access to that... (you are not the only person who has made this statement here). I've made such instances available for other Open Source developers, to support them. Obvi I would share connection credentials privately. Regards, Kirk
Re: running logical replication as the subscription owner
On Tue, Mar 28, 2023 at 1:38 AM Jeff Davis wrote: > > If they do have those things then in > > theory they might be able to protect themselves, but in practice they > > are unlikely to be careful enough. I judge that practically every > > installation where table owners use triggers would be easily > > compromised. Only the most security-conscious of table owners are > > going to get this right. > > This is the interesting part. > > Commit 11da97024ab set the subscription process's search path as empty. > It seems it was done to protect the subscription owner against users > writing into the public schema. But after your apply-as-table-owner > patch, it seems to also offer some protection for table owners against > a malicious subscription owner, too. > > But I must be missing something obvious here -- if the subscription > process has an empty search_path, what can an attacker do to trick it? Oh, interesting. I hadn't realized we were doing that, and I do think that narrows the vulnerability quite a bit. But I still feel pretty uncomfortable with the idea of requiring only INSERT/UPDATE/DELETE permissions on the target table. Let's suppose that you create a table and attach a trigger to it which is SECURITY INVOKER. Absent logical replication, you don't really need to worry about whether that function is written securely -- it will run with privileges of the person performing the DML, and not impact your account at all. They have reason to be scared of you, but not the reverse. However, if the people doing DML on the table can arrange to perform that DML as you, then any security vulnerabilities in your function potentially allow them to compromise your account. Now, there might not be any, but there also might be some, depending on exactly what your function does. 
And if logical replication into a table requires only I/U/D permission then it's basically a back-door way to run, as the table owner, functions that someone could otherwise execute only as themselves, and that's scary. Now, I don't know how much to worry about that. As you just pointed out to me in some out-of-band discussion, this is only going to affect a table owner who writes a trigger, makes it insecure in some way other than failure to sanitize the search_path, and sets it ENABLE ALWAYS TRIGGER or ENABLE REPLICA TRIGGER. And maybe we could say that if you do that last part, you kind of need to think about the consequences for logical replication. If so, we could document that problem away. It would also affect someone who uses a default expression or other means of associating executable code with the table, and something like a default expression doesn't require any explicit setting to make it apply in case of replication, so maybe the risk of someone messing up is a bit higher. But this definitely makes it more of a judgment call than I thought initially. Functions that are vulnerable to search_path exploits are a dime a dozen; other kinds of exploits are going to be less common, and more dependent on exactly what the function is doing. I'm still inclined to leave the patch checking for SET ROLE, because after all, we're thinking of switching user identities to the table owner, and checking whether we can SET ROLE is the most natural way of doing that, and definitely doesn't leave room to escalate to any privilege you don't already have. However, there might be an argument that we ought to do something else, like have a REPLICATE privilege on the table that the owner can grant to users that they trust to replicate into that table. Because time is short, I'd like to leave that idea for a future release.
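[Editorial illustration of the search_path class of exploit mentioned above — all object names here are invented:]

```sql
-- A SECURITY INVOKER trigger function with an unqualified call is
-- resolved through the caller's search_path, so anyone who can create
-- an earlier-resolving log_change() can hijack it.
CREATE FUNCTION audit_row() RETURNS trigger AS $$
BEGIN
  PERFORM log_change(NEW.id);  -- unqualified: resolved via search_path
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- The usual hardening pins the search path so resolution is fixed:
ALTER FUNCTION audit_row() SET search_path = myschema, pg_temp;
```

As noted above, this particular class is mitigated by the empty search_path the apply worker already uses; the residual risk is in whatever else the function body does.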
What I would like to change in the patch in this release is to add a new subscription property along the lines of what I proposed to Jelte in my earlier email: let's provide an escape hatch that turns off the user-switching behavior. If we do that, then anyone who feels themselves worse off after this patch can switch back to the old behavior. Most people will be better off, I believe, and the opportunity to make things still better in the future is not foreclosed. -- Robert Haas EDB: http://www.enterprisedb.com
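[Editorial sketch of what such an escape hatch could look like at the SQL level. The option name is invented here for illustration; the actual spelling would be settled by the patch:]

```sql
-- Hypothetical: opt out of the new user-switching behavior and keep
-- applying changes as the subscription owner, as before.
CREATE SUBSCRIPTION sub
  CONNECTION 'host=primary dbname=app'
  PUBLICATION pub
  WITH (run_as_owner = true);
```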
Re: Non-superuser subscription owners
On Fri, 2023-03-24 at 00:17 -0700, Jeff Davis wrote: > The other patch you posted seems like it makes a lot of progress in > that direction, and I think that should go in first. That was one of > the items I suggested previously[2], so thank you for working on > that. The above is not a hard objection. I still hold the opinion that the non-superuser subscriptions work feels premature without the apply-as-table-owner work. It would be great if the other patch ends up ready quickly, which would moot the commit-ordering question. Regards, Jeff Davis
Re: Data is copied twice when specifying both child and parent table in publication
On Tue, Mar 28, 2023 at 2:59 AM wangw.f...@fujitsu.com wrote: > The scenario of this bug is to subscribe to two publications at the same time, > and these two publications publish parent table and child table respectively. > And option via_root is specified in both publications or only in the > publication > of the parent table. Ah, reading the initial mail again, that makes sense. I came to this thread with the alternative reproduction in mind (subscribing to one publication with viaroot=true, and another publication with viaroot=false) and misread the report accordingly... In the end, I'm comfortable with the current level of coverage. > > Would it > > be enough to just replace that whole thing with gpt.attrs? > > Make sense. > Changed as suggested. LGTM, by inspection. Thanks! --Jacob
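[Editorial sketch of the scenario being described, for readers joining the thread — object names invented:]

```sql
-- Publisher: parent and child published separately, with via_root set
-- on the parent's publication.
CREATE TABLE parent (a int) PARTITION BY RANGE (a);
CREATE TABLE child PARTITION OF parent FOR VALUES FROM (0) TO (10);
CREATE PUBLICATION pub_parent FOR TABLE parent
  WITH (publish_via_partition_root = true);
CREATE PUBLICATION pub_child FOR TABLE child;

-- Subscriber: subscribing to both publications at once is what
-- triggered the double copy of child's rows during initial sync:
-- CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub_parent, pub_child;
```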
Re: Add LZ4 compression in pg_dump
On 3/28/23 18:07, gkokola...@pm.me wrote: > > > > > > --- Original Message --- > On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me > wrote: > >> >> --- Original Message --- >> On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra >> tomas.von...@enterprisedb.com wrote: >> >>> This leaves the empty-data issue (which we have a fix for) and the >>> switch to LZ4F. And then the zstd part. >> >> Please expect promptly a patch for the switch to frames. > > Please find the expected patch attached. Note that the bulk of the > patch is code unification, variable renaming to something more > appropriate, and comment addition. These are changes that are not > strictly necessary to switch to LZ4F. I do believe that they are > essential for code hygiene after the switch and they do belong > on the same commit. > Thanks! I agree the renames & cleanup are appropriate - it'd be silly to stick to misleading naming etc. Would it make sense to split the patch into two, to separate the renames and the switch to LZ4F? That'd make the changes necessary for the LZ4F switch clearer. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On Mon, Mar 27, 2023 at 11:46 PM Tomas Vondra wrote: > > > > On 3/27/23 03:32, Masahiko Sawada wrote: > > Hi, > > > > On Fri, Mar 24, 2023 at 7:26 AM Tomas Vondra > > wrote: > >> > >> I merged the earlier "fixup" patches into the relevant parts, and left > >> two patches with new tweaks (deducing the corrent "WAL" state from the > >> current state read by copy_sequence), and the interlock discussed here. > >> > > > > Apart from that, how does the publication having sequences work with > > subscribers who are not able to handle sequence changes, e.g. in a > > case where PostgreSQL version of publication is newer than the > > subscriber? As far as I tested the latest patches, the subscriber > > (v15) errors out with the error 'invalid logical replication message > > type "Q"' when receiving a sequence change. I'm not sure it's sensible > > behavior. I think we should instead either (1) deny starting the > > replication if the subscriber isn't able to handle sequence changes > > and the publication includes that, or (2) not send sequence changes to > > such subscribers. > > > > I agree the "invalid message" error is not great, but it's not clear to > me how to do either (1). The trouble is we don't really know if the > publication contains (or will contain) sequences. I mean, what would > happen if the replication starts and then someone adds a sequence? > > For (2), I think that's not something we should do - silently discarding > some messages seems error-prone. If the publication includes sequences, > presumably the user wanted to replicate those. If they want to replicate > to an older subscriber, create a publication without sequences. > > Perhaps the right solution would be to check if the subscriber supports > replication of sequences in the output plugin, while attempting to write > the "Q" message. And error-out if the subscriber does not support it. It might be related to this topic; do we need to bump the protocol version? 
The commit 64824323e57d introduced new streaming callbacks and bumped the protocol version. I think the same seems to be true for this change as it adds sequence_cb callback. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
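[Editorial sketch of the workaround suggested above — splitting sequences into their own publication. The sequence-publication syntax comes from the patchset under discussion, so it is not final:]

```sql
-- Tables only: safe to consume from an older subscriber that does not
-- understand the new "Q" sequence message.
CREATE PUBLICATION pub_tables FOR ALL TABLES;

-- Sequences split out, for subscribers new enough to handle them
-- (hypothetical syntax from the patchset):
CREATE PUBLICATION pub_seqs FOR ALL SEQUENCES;
```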
Re: zstd compression for pg_dump
On 3/27/23 19:28, Justin Pryzby wrote: > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: >> On 3/16/23 05:50, Justin Pryzby wrote: >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion wrote: > I did some smoke testing against zstd's GitHub release on Windows. To > build against it, I had to construct an import library, and put that > and the DLL into the `lib` folder expected by the MSVC scripts... > which makes me wonder if I've chosen a harder way than necessary? It looks like pg_dump's meson.build is missing dependencies on zstd (meson couldn't find the headers in the subproject without them). >>> >>> I saw that this was added for LZ4, but I hadn't added it for zstd since >>> I didn't run into an issue without it. Could you check that what I've >>> added works for your case ? >>> > Parallel zstd dumps seem to work as expected, in that the resulting > pg_restore output is identical to uncompressed dumps and nothing > explodes. I haven't inspected the threading implementation for safety > yet, as you mentioned. Hm. Best I can tell, the CloneArchive() machinery is supposed to be handling safety for this, by isolating each thread's state. I don't feel comfortable pronouncing this new addition safe or not, because I'm not sure I understand what the comments in the format-specific _Clone() callbacks are saying yet. >>> >>> My line of reasoning for unix is that pg_dump forks before any calls to >>> zstd. Nothing zstd does ought to affect the pg_dump layer. But that >>> doesn't apply to pg_dump under windows. This is an open question. If >>> there's no solid answer, I could disable/ignore the option (maybe only >>> under windows). >> >> I may be missing something, but why would the patch affect this? Why >> would it even affect safety of the parallel dump? And I don't see any >> changes to the clone stuff ... > > zstd supports using threads during compression, with -Z zstd:workers=N. 
> When unix forks, the child processes can't do anything to mess up the > state of the parent processes. > > But windows pg_dump uses threads instead of forking, so it seems > possible that the pg_dump -j threads that then spawn zstd threads could > "leak threads" and break the main thread. I suspect there's no issue, > but we still ought to verify that before declaring it safe. > OK. I don't have access to a Windows machine so I can't test that. Is it possible to disable the zstd threading, until we figure this out? >>> The function is first checking if it was passed a filename which already >>> has a suffix. And if not, it searches through a list of suffixes, >>> testing for an existing file with each suffix. The search with stat() >>> doesn't happen if it has a suffix. I'm having trouble seeing how the >>> hasSuffix() branch isn't dead code. Another open question. >> >> AFAICS it's done this way because of this comment in pg_backup_directory >> >> * ... >> * ".gz" suffix is added to the filenames. The TOC files are never >> * compressed by pg_dump, however they are accepted with the .gz suffix >> * too, in case the user has manually compressed them with 'gzip'. >> >> I haven't tried, but I believe that if you manually compress the >> directory, it may hit this branch. > > That would make sense, but when I tried, it didn't work like that. > The filenames were all uncompressed names. Maybe it worked differently > in an older release. Or maybe it changed during development of the > parallel-directory-dump patch and it's actually dead code. > Interesting. Would be good to find out. I wonder if a little bit of git-log digging could tell us more. Anyway, until we confirm it's dead code, we should probably do what .gz does and have the same check for .lz4 and .zst files. > This is rebased over the updated compression API. > > It seems like I misunderstood something you said before, so now I put > back "supports_compression()". > Thanks! 
I need to do a bit more testing and review, but it seems pretty much RFC to me, assuming we can figure out what to do about threading. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: running logical replication as the subscription owner
On Mon, Mar 27, 2023 at 6:08 PM Jelte Fennema wrote: > Attached is an updated version of your patch with what I had in mind > (admittedly it needed one more line than "just" the return to make it > work). But as you can see all previous tests for a lowly privileged > subscription owner that **cannot** SET ROLE to the table owner > continue to work as they did before. While still downgrading to the > table owners role when the subscription owner **can** SET ROLE to the > table owner. > > Obviously this needs some comments explaining what's going on and > probably some code refactoring and/or variable renaming, but I hope > it's clear what I meant now: For high privileged subscription owners, > we downgrade to the permissions of the table owner, but for low > privileged ones we care about permissions of the subscription owner > itself. Hmm. This is an interesting idea. A variant on this theme could be: what if we made this an explicit configuration option? I'm worried that if we just try to switch users and silently fall back to not doing so when we don't have enough permissions, the resulting behavior is going to be difficult to understand and troubleshoot. I'm thinking that maybe if you let people pick the behavior they want that becomes more comprehensible. It's also a nice insurance policy: say for the sake of argument we make switch-to-table-owner the new default. If that new behavior causes something to happen to somebody that they don't like, they can always turn it off, even if they are a highly privileged user who doesn't "need" to turn it off. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Commitfest 2023-03 starting tomorrow!
Status summary: Needs review: 116. Waiting on Author: 30. Ready for Committer: 32. Committed: 94. Moved to next CF: 17. Returned with Feedback: 6. Rejected: 6. Withdrawn: 18. Total: 319.

Ok, here are the patches that have been stuck in "Waiting on Author" for a while. I divided them into three groups.

* The first group have been stuck for over a week and mostly look like they should be RwF. Some I guess just moved to next CF. But some of them I'm uncertain if I should leave or if they really should be RfC or NR.
* The other two groups have had some updates in the last week (actually I used 10 days). But some of them still look like they're pretty much dead for this CF and should either be moved forward or RwF or Rejected.

So here's the triage list. I'm going to send emails and start clearing out the patches pretty much right away. Some of these are pretty clearcut.

Nothing in over a week:
--

* Better infrastructure for automated testing of concurrency issues - Consensus that this is desirable. But it's not clear what it's actually waiting on Author for. RwF?
* Provide the facility to set binary format output for specific OID's per session - I think Dave was looking for feedback and got it from Tom and Peter. I don't actually see a specific patch here but there are two patches linked in the original message. There seems to be enough feedback to proceed but nobody's working on it. RwF?
* pg_visibility's pg_check_visible() yields false positive when working in parallel with autovacuum - Bug, but tentatively a false positive...
* CAST( ... ON DEFAULT) - "it'll have to wait till there's something solid from the committee" -- Rejected?
* Fix tab completion MERGE - Partly committed but v9-0002-psql-Add-PartialMatches-macro-for-better-tab-complet.patch remains. There was a review from Dean Rasheed. Move to next CF?
* Fix recovery conflict SIGUSR1 handling - This looks like a suite of bug fixes and looks like it should be Needs Review or Ready for Commit
* Prefetch the next tuple's memory during seqscans - David Rowley said it was dependent on "heapgettup() refactoring" which has been refactored. So is it now Needs Review or Ready for Commit? Is it likely to happen this CF?
* Pluggable toaster - This seems to have digressed from the original patch. There were patches early on and a lot of feedback. Is the result that the original patches are Rejected or are they still live?
* psql - refactor echo code - "I think this patch requires an up-to-date summary and explanation" from Peter. But it seems like Tom was ok with it and just had some additional improvements he wanted that were added. It sounds like this might be "Ready for Commit" if someone familiar with the patch looked at it.
* Push aggregation down to base relations and joins - Needs a significant rebase (since March 1).
* Remove self join on a unique column - An offer of a Bounty! There was one failing test which was apparently fixed? But it looks like this should be in Needs Review or Ready for Commit.
* Split index and table statistics into different types of stats - Was stuck on "Generate pg_stat_get_xact*() functions with Macros" which was committed. So "Ready for Commit" now?
* Default to ICU during initdb - Partly committed, 0001 waiting until after CF
* suppressing useless wakeups in logical/worker.c - Got feedback March 17. Doesn't look like it's going to be ready this CF.
* explain analyze rows=%.0f - Patch updated January, but I think Tom still had some simple if tedious changes he asked for
* Fix order of checking ICU options in initdb and create database - Feedback last November but no further questions or progress
* Introduce array_shuffle() and array_sample() functions - Feedback from Tom last September. No further questions or progress

Status Updates in last week:

* Some revises in adding sorting path - Got feedback Feb 21 and author responded but doesn't look like it's going to be ready this CF
* Add TAP tests for psql \g piped into program - Peter Eisentraut asked for a one-line change, otherwise it looks like it's Ready for Commit?
* Improvements to Meson docs - Some feedback March 15 but no response. I assume this is still in play

Emails in last week:
---

* RADIUS tests and improvements - last feedback March 20, last patch March 4. Should probably be moved to the next CF unless there's progress soon.
* Direct SSL Connections - (This is mine) Code for SSL is pretty finished. The last patch for ALPN support needs a bit of polish. I'll be doing that momentarily.
* Fix alter subscription concurrency errors - "patch as-submitted is pretty uninteresting" and "patch that I don't care much about" ... I guess this is Rejected or Withdrawn
* Fix improper qual pushdown after applying outer join identity 3 - Tom Lane's patch. Active discussion as of
Re: Add LZ4 compression in pg_dump
--- Original Message --- On Friday, March 24th, 2023 at 10:30 AM, gkokola...@pm.me wrote: > > --- Original Message --- > On Thursday, March 23rd, 2023 at 6:10 PM, Tomas Vondra > tomas.von...@enterprisedb.com wrote: > > > This leaves the empty-data issue (which we have a fix for) and the > > switch to LZ4F. And then the zstd part. > > Please expect promptly a patch for the switch to frames. Please find the expected patch attached. Note that the bulk of the patch is code unification, variable renaming to something more appropriate, and comment addition. These are changes that are not strictly necessary to switch to LZ4F. I do believe that they are essential for code hygiene after the switch and they do belong on the same commit. Cheers, //Georgios > > Cheers, > //Georgios

From c289fb8d49b680ad180a44b20fff1dc9553b7494 Mon Sep 17 00:00:00 2001 From: Georgios Kokolatos Date: Tue, 28 Mar 2023 15:48:06 + Subject: [PATCH v1] Use LZ4 frames in pg_dump's compressor API. This change allows for greater compaction of data, especially so in very narrow relations, by avoiding at least a compaction header and footer per row. Since LZ4 frames are now used by both compression APIs, some code deduplication opportunities have become obvious and are also implemented. Reported by: Justin Pryzby --- src/bin/pg_dump/compress_lz4.c | 358 ++--- 1 file changed, 244 insertions(+), 114 deletions(-) diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c index fc2f4e116d..078dc35cd6 100644 --- a/src/bin/pg_dump/compress_lz4.c +++ b/src/bin/pg_dump/compress_lz4.c @@ -17,7 +17,6 @@ #include "compress_lz4.h" #ifdef USE_LZ4 -#include <lz4.h> #include <lz4frame.h> /* @@ -29,102 +28,279 @@ #endif /*-- - * Compressor API - *-- + * Common to both APIs */ -typedef struct LZ4CompressorState +/* + * State used for LZ4 (de)compression by both APIs. + */ +typedef struct LZ4State { - char *outbuf; - size_t outsize; -} LZ4CompressorState; + /* + * Used by the File API to keep track of the file stream. 
+ */ + FILE *fp; + + LZ4F_preferences_t prefs; + + LZ4F_compressionContext_t ctx; + LZ4F_decompressionContext_t dtx; + + /* + * Used by the File API's lazy initialization. + */ + bool inited; + + /* + * Used by the File API to distinguish between compression + * and decompression operations. + */ + bool compressing; + + /* + * Used by the Compressor API to mark if the compression + * headers have been written after initialization. + */ + bool needs_header_flush; + + size_t buflen; + char *buffer; + + /* + * Used by the File API to store already uncompressed + * data that the caller has not consumed. + */ + size_t overflowalloclen; + size_t overflowlen; + char *overflowbuf; + + /* + * Used by both APIs to keep track of the compressed + * data length stored in the buffer. + */ + size_t compressedlen; + + /* + * Used by both APIs to keep track of error codes. + */ + size_t errcode; +} LZ4State; + +/* + * Initialize the required LZ4State members for compression. Write the LZ4 frame + * header in a buffer keeping track of its length. Users of this function can + * choose when and how to write the header to a file stream. + * + * Returns true on success. In case of a failure returns false, and stores the + * error code in state->errcode. + */ +static bool +LZ4_compression_state_init(LZ4State *state) +{ + size_t status; + + state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs); + + /* + * LZ4F_compressBegin requires a buffer that is greater or equal to + * LZ4F_HEADER_SIZE_MAX. Verify that the requirement is met. 
+ */ + if (state->buflen < LZ4F_HEADER_SIZE_MAX) + state->buflen = LZ4F_HEADER_SIZE_MAX; + + status = LZ4F_createCompressionContext(&state->ctx, LZ4F_VERSION); + if (LZ4F_isError(status)) + { + state->errcode = status; + return false; + } + + state->buffer = pg_malloc(state->buflen); + status = LZ4F_compressBegin(state->ctx, +state->buffer, state->buflen, + &state->prefs); + if (LZ4F_isError(status)) + { + state->errcode = status; + return false; + } + + state->compressedlen = status; + + return true; +} + +/*-- + * Compressor API + *-- + */ /* Private routines that support LZ4 compressed data I/O */ -static void ReadDataFromArchiveLZ4(ArchiveHandle *AH, CompressorState *cs); -static void WriteDataToArchiveLZ4(ArchiveHandle *AH, CompressorState *cs, - const void *data, size_t dLen); -static void EndCompressorLZ4(ArchiveHandle *AH, CompressorState *cs); static void ReadDataFromArchiveLZ4(ArchiveHandle *AH, CompressorState *cs) { - LZ4_streamDecode_t lz4StreamDecode; - char *buf; - char *decbuf; - size_t buflen; - size_t cnt; - - buflen = DEFAULT_IO_BUFFER_SIZE; - buf = pg_malloc(buflen); - decbuf = pg_malloc(buflen); + size_t r; + size_t readbuflen; + char *outbuf; + char *readbuf; +
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
On Tue, 28 Mar 2023 at 16:54, Denis Laxalde wrote: > I had a look into your patchset (v16), did a quick review and played a > bit with the feature. > > Patch 2 is missing the documentation about PQcancelSocket() and contains > a few typos; please find attached a (fixup) patch to correct these. Thanks, applied that patch and attached a new patchset > Namely, I wonder why it returns a PGcancelConn and what's the > point of requiring the user to call PQcancelStatus() to see if something > went wrong. Maybe it could be defined as: > > int PQcancelSend(PGcancelConn *cancelConn); > > where the return value would be status? And the user would only need to > call PQcancelErrorMessage() in case of error. This would leave only one > single way to create a PGcancelConn value (i.e. PQcancelConn()), which > seems less confusing to me. To clarify what you mean, the API would then be like this: PGcancelConn *cancelConn = PQcancelConn(conn); if (PQcancelSend(cancelConn) == CONNECTION_BAD) { printf("ERROR %s\n", PQcancelErrorMessage(cancelConn)); exit(1); } Instead of: PGcancelConn *cancelConn = PQcancelSend(conn); if (PQcancelStatus(cancelConn) == CONNECTION_BAD) { printf("ERROR %s\n", PQcancelErrorMessage(cancelConn)); exit(1); } Those are so similar, that I have no preference either way. If more people prefer one over the other I'm happy to change it, but for now I'll keep it as is. > As part of my testing, I've implemented non-blocking cancellation in > Psycopg, based on v16 on this patchset. Overall this worked fine and > seems useful; if you want to try it: > > https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel That's great to hear! I'll try to take a closer look at that change tomorrow. > (The only thing I found slightly inconvenient is the need to convey the > connection encoding (from PGconn) when handling error message from the > PGcancelConn.) Could you expand a bit more on this? And if you have any idea on how to improve the API with regards to this? 
Attachments (binary): v17-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch, v17-0002-Copy-and-store-addrinfo-in-libpq-owned-private-m.patch, v17-0003-Return-2-from-pqReadData-on-EOF.patch, v17-0004-Add-non-blocking-version-of-PQcancel.patch, v17-0005-Start-using-new-libpq-cancel-APIs.patch
Re: Memory leak from ExecutorState context?
On 3/28/23 15:17, Jehan-Guillaume de Rorthais wrote: > On Tue, 28 Mar 2023 00:43:34 +0200 > Tomas Vondra wrote: > >> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote: >>> Please, find in attachment a patch to allocate bufFiles in a dedicated >>> context. I picked up your patch, backpatch'd it, went through it and did >>> some minor changes to it. I have some comments/questions though. >>> >>> 1. I'm not sure why we must allocate the "HashBatchFiles" new context >>> under ExecutorState and not under hashtable->hashCxt? >>> >>> The only references I could find was in hashjoin.h:30: >>> >>>/* [...] >>> * [...] (Exception: data associated with the temp files lives in the >>> * per-query context too, since we always call buffile.c in that >>> context.) >>> >>> And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this >>> original comment in the patch): >>> >>>/* [...] >>> * Note: it is important always to call this in the regular executor >>> * context, not in a shorter-lived context; else the temp file buffers >>> * will get messed up. >>> >>> >>> But these are not an explanation of why BufFile related allocations must be >>> under a per-query context. >>> >> >> Doesn't that simply describe the current (unpatched) behavior where >> BufFile is allocated in the per-query context? > > I wasn't sure. The first quote from hashjoin.h seems to describe a stronger > rule about «**always** call buffile.c in per-query context». But maybe it > ought > to be «always call buffile.c from one of the sub-query context»? I assume the > aim is to enforce the tmp files removal on query end/error? > I don't think we need this info for tempfile cleanup - CleanupTempFiles relies on the VfdCache, which does malloc/realloc (so it's not using memory contexts at all). I'm not very familiar with this part of the code, so I might be missing something. But you can try that - just stick an elog(ERROR) somewhere into the hashjoin code, and see if that breaks the cleanup. 
Not an explicit proof, but if there was some hard-wired requirement in which memory context to allocate BufFile stuff, I'd expect it to be documented in buffile.c. But that actually says this: * Note that BufFile structs are allocated with palloc(), and therefore * will go away automatically at query/transaction end. Since the underlying * virtual Files are made with OpenTemporaryFile, all resources for * the file are certain to be cleaned up even if processing is aborted * by ereport(ERROR). The data structures required are made in the * palloc context that was current when the BufFile was created, and * any external resources such as temp files are owned by the ResourceOwner * that was current at that time. which I take as confirmation that it's legal to allocate BufFile in any memory context, and that cleanup is handled by the cache in fd.c. >> I mean, the current code calls BufFileCreateTemp() without switching the >> context, so it's in the ExecutorState. But with the patch it very clearly is >> not. >> >> And I'm pretty sure the patch should do >> >> hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt, >> "HashBatchFiles", >> ALLOCSET_DEFAULT_SIZES); >> >> and it'd still work. Or why do you think we *must* allocate it under >> ExecutorState? > > That was actually my very first patch and it indeed worked. But I was confused > about the previous quoted code comments. That's why I kept your original code > and decided to raise the discussion here. > IIRC I was just lazy when writing the experimental patch, there was not much thought about stuff like this. > Fixed in new patch in attachment. > >> FWIW The comment in hashjoin.h needs updating to reflect the change. > > Done in the last patch. Is my rewording accurate? > >>> 2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context >>> switch seems fragile as it could be forgotten in future code paths/changes. 
>>> So I added an Assert() in the function to make sure the current memory >>> context is "HashBatchFiles" as expected. >>> Another way to tie this up might be to pass the memory context as >>> argument to the function. >>> ... Or maybe I'm over precautionary. >>> >> >> I'm not sure I'd call that fragile, we have plenty other code that >> expects the memory context to be set correctly. Not sure about the >> assert, but we don't have similar asserts anywhere else. > > I mostly sticked it there to stimulate the discussion around this as I needed > to scratch that itch. > >> But I think it's just ugly and overly verbose > > +1 > > Your patch was just a demo/debug patch by the time. It needed some cleanup now > :) > >> it'd be much nicer to e.g. pass the memory context as a parameter, and do >> the switch inside. > > That was a proposition in my previous mail, so I did it in the new patch. > Let's > see what other
Re: Initial Schema Sync for Logical Replication
On Tue, Mar 28, 2023 at 6:47 PM Amit Kapila wrote: > > On Mon, Mar 27, 2023 at 8:17 AM Masahiko Sawada wrote: > > > > On Fri, Mar 24, 2023 at 11:51 PM Kumar, Sachin wrote: > > > > > > > From: Amit Kapila > > > > > I think we won't be able to use same snapshot because the transaction > > > > > will > > > > > be committed. > > > > > In CreateSubscription() we can use the transaction snapshot from > > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am not sure > > > > > about this part maybe walrcv_disconnect() calls the commits > > > > > internally ?). > > > > > So somehow we need to keep this snapshot alive, even after transaction > > > > > is committed(or delay committing the transaction , but we can have > > > > > CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart > > > > > before tableSync is able to use the same snapshot.) > > > > > > > > > > > > > Can we think of getting the table data as well along with schema via > > > > pg_dump? Won't then both schema and initial data will correspond to the > > > > same snapshot? > > > > > > Right , that will work, Thanks! > > > > While it works, we cannot get the initial data in parallel, no? > > > > Another possibility is that we dump/restore the schema of each table > along with its data. One thing we can explore is whether the parallel > option of dump can be useful here. Do you have any other ideas? A downside of the idea of dumping both table schema and table data would be that we need to temporarily store data twice the size of the table (the dump file and the table itself) during the load. One might think that we can redirect the pg_dump output into the backend so that it can load it via SPI, but it doesn't work since "COPY tbl FROM stdin;" doesn't work via SPI. The --inserts option of pg_dump could help it out but it makes restoration very slow. 
>
> One related idea is that currently, we fetch the table list
> corresponding to publications in subscription and create the entries
> for those in pg_subscription_rel during Create Subscription, can we
> think of postponing that work till after the initial schema sync? We
> seem to be already storing publications list in pg_subscription, so it
> appears possible if we somehow remember the value of copy_data. If
> this is feasible then I think that may give us the flexibility to
> perform the initial sync at a later point by the background worker.

It sounds possible. With this idea, we will be able to have the apply
worker restore the table schemas (and create pg_subscription_rel
entries) as the first thing.

Another point we might need to consider is that the initial schema sync
(i.e. creating tables) and creating pg_subscription_rel entries need to
be done in the same transaction. Otherwise, we could end up committing
only one of the two changes. I think it depends on how we restore the
schema data.

> > > > > I think we can have same issues as you mentioned New table t1 is added
> > > > > to the publication , User does a refresh publication.
> > > > > pg_dump / pg_restore restores the table definition. But before
> > > > > tableSync can start, steps from 2 to 5 happen on the publisher.
> > > > > > 1. Create Table t1(c1, c2); --LSN: 90 2. Insert t1 (1, 1); --LSN 100
> > > > > > 3. Insert t1 (2, 2); --LSN 110 4. Alter t1 Add Column c3; --LSN 120
> > > > > > 5. Insert t1 (3, 3, 3); --LSN 130
> > > > > And table sync errors out
> > > > > There can be one more issue , since we took the pg_dump without
> > > > snapshot (wrt to replication slot). 
> > > > > > > > > > > > > To avoid both the problems mentioned for Refresh Publication, we can do > > > > one of the following: (a) create a new slot along with a snapshot for > > > > this > > > > operation and drop it afterward; or (b) using the existing slot, > > > > establish a > > > > new snapshot using a technique proposed in email [1]. > > > > > > > > > > Thanks, I think option (b) will be perfect, since we don’t have to create > > > a new slot. > > > > Regarding (b), does it mean that apply worker stops streaming, > > requests to create a snapshot, and then resumes the streaming? > > > > Shouldn't this be done by the backend performing a REFRESH publication? Hmm, I might be missing something but the idea (b) uses the existing slot to establish a new snapshot, right? What existing replication slot do we use for that? I thought it was the one used by the apply worker. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Memory leak from ExecutorState context?
Hi, Sorry for the late answer, I was reviewing the first patch and it took me some time to study and dig around. On Thu, 23 Mar 2023 08:07:04 -0400 Melanie Plageman wrote: > On Fri, Mar 10, 2023 at 1:51 PM Jehan-Guillaume de Rorthais > wrote: > > > So I guess the best thing would be to go through these threads, see what > > > the status is, restart the discussion and propose what to do. If you do > > > that, I'm happy to rebase the patches, and maybe see if I could improve > > > them in some way. > > > > OK! It took me some time, but I did it. I'll try to sum up the situation as > > simply as possible. > > Wow, so many memories! > > I'm excited that someone looked at this old work (though it is sad that > a customer faced this issue). And, Jehan, I really appreciate your great > summarization of all these threads. This will be a useful reference. Thank you! > > 1. "move BufFile stuff into separate context" > > [...] > >I suppose this simple one has been forgotten in the fog of all other > >discussions. Also, this probably worth to be backpatched. > > I agree with Jehan-Guillaume and Tomas that this seems fine to commit > alone. This is a WIP. > > 2. "account for size of BatchFile structure in hashJoin" > > [...] > > I think I would have to see a modern version of a patch which does this > to assess if it makes sense. But, I probably still agree with 2019 > Melanie :) I volunteer to work on this after the memory context patch, unless someone grab it in the meantime. > [...] > On Mon, Mar 20, 2023 at 10:12 AM Jehan-Guillaume de Rorthais > wrote: > > BNJL and/or other considerations are for 17 or even after. In the meantime, > > Melanie, who authored BNLJ, +1 the balancing patch as it can coexists with > > other discussed solutions. No one down vote since then. Melanie, what is > > your opinion today on this patch? Did you change your mind as you worked > > for many months on BNLJ since then? 
> > So, in order to avoid deadlock, my design of adaptive hash join/block > nested loop hash join required a new parallelism concept not yet in > Postgres at the time -- the idea of a lone worker remaining around to do > work when others have left. > > See: BarrierArriveAndDetachExceptLast() > introduced in 7888b09994 > > Thomas Munro had suggested we needed to battle test this concept in a > more straightforward feature first, so I implemented parallel full outer > hash join and parallel right outer hash join with it. > > https://commitfest.postgresql.org/42/2903/ > > This has been stalled ready-for-committer for two years. It happened to > change timing such that it made an existing rarely hit parallel hash > join bug more likely to be hit. Thomas recently committed our fix for > this in 8d578b9b2e37a4d (last week). It is my great hope that parallel > full outer hash join goes in before the 16 feature freeze. This is really interesting to follow. I kinda feel/remember how this could be useful for your BNLJ patch. It's good to see things are moving, step by step. Thanks for the pointers. > If it does, I think it could make sense to try and find committable > smaller pieces of the adaptive hash join work. As it is today, parallel > hash join does not respect work_mem, and, in some sense, is a bit broken. > > I would be happy to work on this feature again, or, if you were > interested in picking it up, to provide review and any help I can if for > you to work on it. I don't think I would be able to pick up such a large and complex patch. But I'm interested to help, test and review, as far as I can! Regards,
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Hi Jelte, I had a look into your patchset (v16), did a quick review and played a bit with the feature. Patch 2 is missing the documentation about PQcancelSocket() and contains a few typos; please find attached a (fixup) patch to correct these. --- a/src/interfaces/libpq/libpq-fe.h +++ b/src/interfaces/libpq/libpq-fe.h @@ -321,16 +328,28 @@ extern PostgresPollingStatusType PQresetPoll(PGconn *conn); /* Synchronous (blocking) */ extern void PQreset(PGconn *conn); +/* issue a cancel request */ +extern PGcancelConn * PQcancelSend(PGconn *conn); [...] Maybe I'm missing something, but this function above seems a bit strange. Namely, I wonder why it returns a PGcancelConn and what's the point of requiring the user to call PQcancelStatus() to see if something got wrong. Maybe it could be defined as: int PQcancelSend(PGcancelConn *cancelConn); where the return value would be status? And the user would only need to call PQcancelErrorMessage() in case of error. This would leave only one single way to create a PGcancelConn value (i.e. PQcancelConn()), which seems less confusing to me. Jelte Fennema wrote: > Especially since I ran into another use case that I would want to use > this patch for recently: Adding an async cancel function to Python > it's psycopg3 library. This library exposes both a Connection class > and an AsyncConnection class (using python its asyncio feature). But > one downside of the AsyncConnection type is that it doesn't have a > cancel method. As part of my testing, I've implemented non-blocking cancellation in Psycopg, based on v16 on this patchset. Overall this worked fine and seems useful; if you want to try it: https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel (The only thing I found slightly inconvenient is the need to convey the connection encoding (from PGconn) when handling error message from the PGcancelConn.) 
Cheers, Denis >From a5f9cc680ffa520b05fe34b7cac5df2e60a6d4ad Mon Sep 17 00:00:00 2001 From: Denis Laxalde Date: Tue, 28 Mar 2023 16:06:42 +0200 Subject: [PATCH] fixup! Add non-blocking version of PQcancel --- doc/src/sgml/libpq.sgml | 16 +++- src/interfaces/libpq/fe-connect.c| 2 +- src/test/modules/libpq_pipeline/libpq_pipeline.c | 2 +- 3 files changed, 17 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 29f08a4317..aa404c4d15 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -5795,7 +5795,7 @@ PGcancelConn *PQcancelSend(PGconn *conn); options as the the original PGconn. So when the original connection is encrypted (using TLS or GSS), the connection for the cancel request is encrypted in the same way. Any connection options - that only are only used during authentication or after authentication of + that are only used during authentication or after authentication of the client are ignored though, because cancellation requests do not require authentication and the connection is closed right after the cancellation request is submitted. @@ -5912,6 +5912,20 @@ ConnStatusType PQcancelStatus(const PGcancelConn *conn); + + PQcancelSocketPQcancelSocket + + + + A version of that can be used for + cancellation connections. + +int PQcancelSocket(PGcancelConn *conn); + + + + + PQcancelPollPQcancelPoll diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 74e337fddf..16af7303d4 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -808,7 +808,7 @@ PQcancelConn(PGconn *conn) /* * Cancel requests should not iterate over all possible hosts. The request * needs to be sent to the exact host and address that the original - * connection used. So we we manually create the host and address arrays + * connection used. 
So we manually create the host and address arrays * with a single element after freeing the host array that we generated * from the connection options. */ diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/test/modules/libpq_pipeline/libpq_pipeline.c index e8e904892c..6764ab513b 100644 --- a/src/test/modules/libpq_pipeline/libpq_pipeline.c +++ b/src/test/modules/libpq_pipeline/libpq_pipeline.c @@ -89,7 +89,7 @@ pg_fatal_impl(int line, const char *fmt,...) /* * Check that the query on the given connection got cancelled. * - * This is a function wrapped in a macrco to make the reported line number + * This is a function wrapped in a macro to make the reported line number * in an error match the line number of the invocation. */ #define confirm_query_cancelled(conn) confirm_query_cancelled_impl(__LINE__, conn) -- 2.30.2
Re: Support logical replication of DDLs
On 3/27/23 2:37 AM, Amit Kapila wrote:

On Mon, Mar 27, 2023 at 2:52 AM Tom Lane wrote:

And TBH, I don't think that I quite believe the premise in the first place. The whole point of using logical rather than physical replication is that the subscriber installation(s) aren't exactly like the publisher. Given that, how can we expect that automated DDL replication is going to do the right thing often enough to be a useful tool rather than a disastrous foot-gun?

One of the major use cases as mentioned in the initial email was for online version upgrades. And also, people would be happy to automatically sync the schema for cases where the logical replication is set up to get a subset of the data via features like row filters. Having said that, I agree with you that it is very important to define the scope of this feature if we want to see it becoming reality.

To echo Amit, this is actually one area where PostgreSQL replication lags behind (no pun intended) other mature RDBMSes. As Amit says, the principal use case is around major version upgrades, but also migration between systems or moving data/schemas between systems that speak the PostgreSQL protocol. All of these are becoming increasingly common as PostgreSQL is taking on more workloads that are sensitive to downtime or are distributed in nature.

There are definitely footguns with logical replication of DDL -- I've seen this from reading the manuals of other systems that support this feature and in my own experiments. However, like many features, users have strategies they use to avoid footgun scenarios. For example, in systems that use logical replication as part of their HA, users will either:

* Not replicate DDL, but use some sort of rolling orchestration process to place it on each instance
* Replicate DDL, but coordinate it with some kind of global lock
* Replicate only a subset of DDL, possibly with lock coordination

I'll comment on the patch scope further downthread. 
I agree it's very big -- I had given some of that feedback privately a few months back -- and it could benefit from the "step back, holistic review." For example, I was surprised that a fairly common pattern[1] did not work due to changes we made when addressing a CVE (some follow-up work was proposed but we haven't done it yet).

I do agree this patch would benefit from stepping back, and I do think we can work through many of the issues. From listening to users and prospective users, it's pretty clear we need to support DDL replication in some capacity.

Thanks,

Jonathan

[1] https://www.postgresql.org/message-id/263bea1c-a897-417d-3765-ba6e1e24711e%40postgresql.org

OpenPGP_signature Description: OpenPGP digital signature
Re: "variable not found in subplan target list"
I wrote: > The planner is reducing the scan of target_parted to > a dummy scan, as is reasonable, but it forgets to > provide ctid as an output from that scan; then the > parent join node is unhappy because it does have > a ctid output. So it looks like the problem is some > shortcut we take while creating the dummy scan. Oh, actually the problem is in distribute_row_identity_vars, which is supposed to handle this case, but it thinks it doesn't have to back-fill the rel's reltarget. Wrong. Now that I see the problem, I wonder if we can't reproduce a similar symptom without MERGE, which would mean that v14 has the issue too. The attached seems to fix it, but I'm going to look for a non-MERGE test case before pushing. regards, tom lane diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c index 9d377385f1..c1b1557570 100644 --- a/src/backend/optimizer/util/appendinfo.c +++ b/src/backend/optimizer/util/appendinfo.c @@ -21,6 +21,7 @@ #include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" #include "optimizer/pathnode.h" +#include "optimizer/planmain.h" #include "parser/parsetree.h" #include "utils/lsyscache.h" #include "utils/rel.h" @@ -994,9 +995,10 @@ distribute_row_identity_vars(PlannerInfo *root) * certainly process no rows. Handle this edge case by re-opening the top * result relation and adding the row identity columns it would have used, * as preprocess_targetlist() would have done if it weren't marked "inh". - * (This is a bit ugly, but it seems better to confine the ugliness and - * extra cycles to this unusual corner case.) We needn't worry about - * fixing the rel's reltarget, as that won't affect the finished plan. + * Then re-run build_base_rel_tlists() to ensure that the added columns + * get propagated to the relation's reltarget. (This is a bit ugly, but + * it seems better to confine the ugliness and extra cycles to this + * unusual corner case.) 
*/ if (root->row_identity_vars == NIL) { @@ -1006,6 +1008,8 @@ distribute_row_identity_vars(PlannerInfo *root) add_row_identity_columns(root, result_relation, target_rte, target_relation); table_close(target_relation, NoLock); + build_base_rel_tlists(root, root->processed_tlist); + /* There are no ROWID_VAR Vars in this case, so we're done. */ return; }
Re: Missing update of all_hasnulls in BRIN opclasses
Hi,

It took me a while but I finally got back to reworking this to use the bt_info bit, as proposed by Alvaro. And it turned out to work great, because (a) it's a tuple-level flag, i.e. the right place, and (b) it does not overload existing flags. This greatly simplified the code in add_values_to_range and (especially) union_tuples, making it much easier to understand, I think.

One disadvantage is we are unable to see which ranges are empty in current pageinspect, but 0002 addresses that by adding an "empty" column to the brin_page_items() output. That's a matter for master only, though. It's a trivial patch and it makes it easier/possible to test this, so we should consider squeezing it into PG16.

I did quite a bit of testing - the attached 0003 adds extra tests, but I don't propose to get this committed as is - it's rather overkill. Maybe some reduced version of it ...

The hardest thing to test is the union_tuples() part, as it requires concurrent operations with "correct" timing. Easy to simulate by breakpoints in GDB, not so much in plain regression/TAP tests. There's also a stress test, doing a lot of randomized summarizations, etc. Without the fix this failed in maybe 30% of runs; now I did ~100 runs without a single failure.

I haven't done any backporting, but I think it should be simpler than with the earlier approach. I wonder if we need to care about starting to use the previously unused bit - I don't think so, in the worst case we'll just ignore it, but maybe I'm missing something (e.g. when using physical replication).

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

From 10efaf9964806e5a30818994e3cfda879bb90171 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sun, 8 Jan 2023 16:43:06 +0100 Subject: [PATCH 1/3] Fix handling of NULLs in BRIN indexes

BRIN indexes did not properly distinguish between summaries for empty (no rows) and all-NULL ranges. 
All summaries were initialized with allnulls=true, and the opclasses simply reset allnulls to false when processing the first non-NULL value. This, however, fails if the range starts with a NULL value (or a sequence of NULL values), in which case we forget the range contains NULL values.

This happens because the allnulls flag is used for two separate purposes - to mark empty ranges (not representing any rows yet) and ranges containing only NULL values. Opclasses don't know which of these cases it is, and so don't know whether to set hasnulls=true. Setting hasnulls=true in both cases would make it correct, but it would also make BRIN indexes useless for queries with IS NULL clauses - all ranges start empty (and thus allnulls=true), so all ranges would end up with either allnulls=true or hasnulls=true.

The severity of the issue is somewhat reduced by the fact that it only happens when adding values to an existing summary with allnulls=true, not when the summarization is processing values in bulk (e.g. during CREATE INDEX or automatic summarization). In that case the flags were updated in a slightly different way, not forgetting the NULL values.

This introduces a new flag marking index tuples representing ranges with no rows. Luckily we have an unused bit in the BRIN tuple header that we can use for this. We still store index tuples for empty ranges, because otherwise we'd not be able to say whether a range is empty or not yet summarized, and we'd have to process them for any query.

Backpatch to 11. The issue exists since BRIN indexes were introduced in 9.5, but older releases are already EOL. 
Backpatch-through: 11 Reviewed-by: Alvaro Herrera, Justin Pryzby, Matthias van de Meent Discussion: https://postgr.es/m/402430e4-7d9d-6cf1-09ef-464d80aff...@enterprisedb.com --- src/backend/access/brin/brin.c| 115 +- src/backend/access/brin/brin_tuple.c | 15 ++- src/include/access/brin_tuple.h | 6 +- ...summarization-and-inprogress-insertion.out | 8 +- ...ummarization-and-inprogress-insertion.spec | 1 + 5 files changed, 137 insertions(+), 8 deletions(-) diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 53e4721a54e..162a0c052aa 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -592,6 +592,17 @@ bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm) bval = >bt_columns[attno - 1]; + /* + * If the BRIN tuple indicates that this range is empty, + * we can skip it: there's nothing to match. We don't + * need to examine the next columns. + */ + if (dtup->bt_empty_range) + { + addrange = false; + break; + } + /* * First check if there are any IS [NOT] NULL scan keys, * and if we're violating them. In that case we can @@ -1590,6 +1601,8 @@ form_and_insert_tuple(BrinBuildState *state) /* * Given two deformed tuples, adjust the first one so that it's consistent * with the summary values in
Re: Kerberos delegation support in libpq and postgres_fdw
Greetings, * Stephen Frost (sfr...@snowman.net) wrote: > * Greg Stark (st...@mit.edu) wrote: > > The CFBot says there's a function be_gssapi_get_proxy() which is > > undefined. Presumably this is a missing #ifdef or a definition that > > should be outside an #ifdef. > > Yup, just a couple of missing #ifdef's. > > Updated and rebased patch attached. ... and a few more. Apparently hacking on a plane without enough sleep leads to changing ... and unchanging configure flags before testing. Thanks, Stephen From 9808fd4eb4920e40468709af7325a27b18066cec Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Mon, 28 Feb 2022 20:17:55 -0500 Subject: [PATCH] Add support for Kerberos credential delegation Support GSSAPI/Kerberos credentials being delegated to the server by a client. With this, a user authenticating to PostgreSQL using Kerberos (GSSAPI) credentials can choose to delegate their credentials to the PostgreSQL server (which can choose to accept them, or not), allowing the server to then use those delegated credentials to connect to another service, such as with postgres_fdw or dblink or theoretically any other service which is able to be authenticated using Kerberos. Both postgres_fdw and dblink are changed to allow non-superuser password-less connections but only when GSSAPI credentials have been proxied to the server by the client and GSSAPI is used to authenticate to the remote system. 
Authors: Stephen Frost, Peifeng Qiu Discussion: https://postgr.es/m/co1pr05mb8023cc2cb575e0faad7df4f8a8...@co1pr05mb8023.namprd05.prod.outlook.com --- contrib/dblink/dblink.c | 127 --- contrib/dblink/expected/dblink.out| 4 +- contrib/postgres_fdw/connection.c | 72 +++- .../postgres_fdw/expected/postgres_fdw.out| 19 +- contrib/postgres_fdw/option.c | 3 + contrib/postgres_fdw/sql/postgres_fdw.sql | 3 +- doc/src/sgml/config.sgml | 17 + doc/src/sgml/dblink.sgml | 5 +- doc/src/sgml/libpq.sgml | 41 +++ doc/src/sgml/monitoring.sgml | 9 + doc/src/sgml/postgres-fdw.sgml| 5 +- src/backend/catalog/system_views.sql | 3 +- src/backend/foreign/foreign.c | 1 + src/backend/libpq/auth.c | 13 +- src/backend/libpq/be-gssapi-common.c | 51 +++ src/backend/libpq/be-secure-gssapi.c | 26 +- src/backend/utils/activity/backend_status.c | 1 + src/backend/utils/adt/pgstatfuncs.c | 20 +- src/backend/utils/init/postinit.c | 8 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 6 +- src/include/libpq/auth.h | 1 + src/include/libpq/be-gssapi-common.h | 3 + src/include/libpq/libpq-be.h | 2 + src/include/utils/backend_status.h| 1 + src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-auth.c| 15 +- src/interfaces/libpq/fe-connect.c | 17 + src/interfaces/libpq/fe-secure-gssapi.c | 23 +- src/interfaces/libpq/libpq-fe.h | 1 + src/interfaces/libpq/libpq-int.h | 2 + src/test/kerberos/Makefile| 3 + src/test/kerberos/t/001_auth.pl | 335 -- src/test/perl/PostgreSQL/Test/Utils.pm| 27 ++ src/test/regress/expected/rules.out | 11 +- 36 files changed, 752 insertions(+), 135 deletions(-) diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 78a8bcee6e..d4dd338055 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -48,6 +48,7 @@ #include "funcapi.h" #include "lib/stringinfo.h" #include "libpq-fe.h" +#include "libpq/libpq-be.h" #include "libpq/libpq-be-fe-helpers.h" #include "mb/pg_wchar.h" 
#include "miscadmin.h" @@ -111,7 +112,8 @@ static HeapTuple get_tuple_of_interest(Relation rel, int *pkattnums, int pknumat static Relation get_rel_from_relname(text *relname_text, LOCKMODE lockmode, AclMode aclmode); static char *generate_relation_name(Relation rel); static void dblink_connstr_check(const char *connstr); -static void dblink_security_check(PGconn *conn, remoteConn *rconn); +static bool dblink_connstr_has_pw(const char *connstr); +static void dblink_security_check(PGconn *conn, remoteConn *rconn, const char *connstr); static void dblink_res_error(PGconn *conn, const char *conname, PGresult *res, bool fail, const char *fmt,...) pg_attribute_printf(5, 6); static char *get_connect_string(const char *servername); @@ -213,7 +215,7 @@ dblink_get_conn(char *conname_or_str, errmsg("could not establish connection"), errdetail_internal("%s", msg))); } - dblink_security_check(conn, rconn); + dblink_security_check(conn, rconn, connstr); if
Re: Support logical replication of global object commands
> I think that there are some (possibly) tricky challenges that haven't > been discussed yet to support replicating global objects. > > First, as for publications having global objects (roles, databases, > and tablespaces), but storing them in database specific tables like > pg_publication doesn't make sense, because it should be at some shared > place where all databases can have access to it. Maybe we need to have > a shared catalog like pg_shpublication or pg_publication_role to store > publications related to global objects or the relationship between > such publications and global objects. Second, we might need to change > the logical decoding infrastructure so that it's aware of shared > catalog changes. Thanks for the feedback. This is insightful. > Currently we need to scan only db-specific catalogs. > Finally, since we process CREATE DATABASE in a different way than > other DDLs (by cloning another database such as template1), simply > replicating the CREATE DATABASE statement would not produce the same > results as the publisher. Also, since event triggers are not fired on > DDLs for global objects, always WAL-logging such DDL statements like > the proposed patch does is not a good idea. > Given that there seems to be some tricky problems and there is a > discussion for cutting the scope to make the initial patch small[1], I > think it's better to do this work after the first version. Agreed. Regards, Zane
Re: Request for comment on setting binary format output per session
On Sun, 26 Mar 2023 at 21:30, Tom Lane wrote:
> Dave Cramer writes:
> > On Sun, 26 Mar 2023 at 18:12, Tom Lane wrote:
> >> I would not expect DISCARD ALL to reset a session-level property.
>
> > Well if we can't reset it with DISCARD ALL how would that work with
> > pgbouncer, or any pool for that matter since it doesn't know which client
> > asked for which (if any) OID's to be binary.
>
> Well, it'd need to know that, just like it already needs to know
> which clients asked for which database or which login role.

OK, IIUC what you are proposing here is that there would be a separate pool for each combination of database, user, and OIDs. This doesn't seem very flexible. For instance, if I create a UDT and then want it to be returned as binary, I have to reconfigure the pool to be able to accept a new list of OIDs. Am I misunderstanding how this would potentially work?

Dave
Re: Move definition of standard collations from initdb to pg_collation.dat
On 28.03.23 13:33, Tom Lane wrote: While we're here, do we want to adopt some other spelling of "the root locale" than "und", in view of recent discoveries about the instability of that on old ICU versions? That issue was fixed by 3b50275b12, so we can keep using the "und" spelling.
Re: TAP output format in pg_regress
> On 28 Mar 2023, at 15:26, Daniel Gustafsson wrote: > I think the attached is a good candidate for going in, but I would like to > see it > for another spin in the CF bot first. Another candidate due to a thinko which raised a compiler warning. -- Daniel Gustafsson v19-0001-pg_regress-Emit-TAP-compliant-output.patch Description: Binary data
Re: Infinite Interval
On Mon, Mar 20, 2023 at 3:16 AM Joseph Koshakow wrote:
>
> On Sun, Mar 19, 2023 at 5:13 PM Tom Lane wrote:
> >
> > Did you actually write "if TIMESTAMP_IS_NOBEGIN(dt2)" and not
> > "if (TIMESTAMP_IS_NOBEGIN(dt2))"? If the former, I'm not surprised
> > that pgindent gets confused. The parentheses are required by the
> > C standard. Your code might accidentally work because the macro
> > has parentheses internally, but call sites have no business
> > knowing that. For example, it would be completely legit to change
> > TIMESTAMP_IS_NOBEGIN to be a plain function, and then this would be
> > syntactically incorrect.
>
> Oh duh. I've been doing too much Rust development and did this without
> thinking. I've attached a patch with a fix.

Thanks for fixing this. On this latest patch, I have one comment on the code:

@@ -3047,7 +3180,30 @@ timestamptz_pl_interval_internal(TimestampTz timestamp,
 	TimestampTz result;
 	int			tz;
 
-	if (TIMESTAMP_NOT_FINITE(timestamp))
+	/*
+	 * Adding two infinites with the same sign results in an infinite
+	 * timestamp with the same sign. Adding two infintes with different signs
+	 * results in an error.
+	 */
+	if (INTERVAL_IS_NOBEGIN(span))
+	{
+		if TIMESTAMP_IS_NOEND(timestamp)
+			ereport(ERROR,
+					(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+					 errmsg("interval out of range")));
+		else
+			TIMESTAMP_NOBEGIN(result);
+	}
+	else if (INTERVAL_IS_NOEND(span))
+	{
+		if TIMESTAMP_IS_NOBEGIN(timestamp)
+			ereport(ERROR,
+					(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
+					 errmsg("interval out of range")));
+		else
+			TIMESTAMP_NOEND(result);
+	}
+	else if (TIMESTAMP_NOT_FINITE(timestamp))

This code is duplicated in timestamp_pl_interval(). We could create a function to encode the infinity handling rules and then call it in these two places. The argument types are different, viz. Timestamp and TimestampTz, but both map to int64, so that shouldn't be a problem. But it will be slightly unreadable. Or use macros, but then it will be difficult to debug. What do you think? 
Next I will review the test changes and also make sure that every operator
that has interval as one of its operands or as its result has been covered
in the code. This is the following list:

#select oprname, oprcode from pg_operator
   where oprleft = 'interval'::regtype
      or oprright = 'interval'::regtype
      or oprresult = 'interval'::regtype;

 oprname |         oprcode
---------+-------------------------
 +       | date_pl_interval
 -       | date_mi_interval
 +       | timestamptz_pl_interval
 -       | timestamptz_mi
 -       | timestamptz_mi_interval
 =       | interval_eq
 <>      | interval_ne
 <       | interval_lt
 <=      | interval_le
 >       | interval_gt
 >=      | interval_ge
 -       | interval_um
 +       | interval_pl
 -       | interval_mi
 -       | time_mi_time
 *       | interval_mul
 *       | mul_d_interval
 /       | interval_div
 +       | time_pl_interval
 -       | time_mi_interval
 +       | timetz_pl_interval
 -       | timetz_mi_interval
 +       | interval_pl_time
 +       | timestamp_pl_interval
 -       | timestamp_mi
 -       | timestamp_mi_interval
 +       | interval_pl_date
 +       | interval_pl_timetz
 +       | interval_pl_timestamp
 +       | interval_pl_timestamptz
(30 rows)

--
Best Wishes,
Ashutosh Bapat
Re: Infinite Interval
On Sun, Mar 26, 2023 at 1:28 AM Tom Lane wrote:
>
> I think you can take it as read that simple C test programs on modern
> platforms will exhibit IEEE-compliant handling of float infinities.

For the record, I tried the attached. It gives a warning at compilation time.

$ gcc float_inf.c
float_inf.c: In function ‘main’:
float_inf.c:10:17: warning: division by zero [-Wdiv-by-zero]
   10 |     float inf = 1.0/0;
      |                 ^
float_inf.c:11:20: warning: division by zero [-Wdiv-by-zero]
   11 |     float n_inf = -1.0/0;
      |                    ^

$ ./a.out
inf = inf
-inf = -inf
inf + inf = inf
inf + -inf = -nan
-inf + inf = -nan
-inf + -inf = -inf
inf - inf = -nan
inf - -inf = inf
-inf - inf = -inf
-inf - -inf = -nan
float 0.0 equals 0.0
float 1.0 equals 1.0
 5.0 * inf = inf
 5.0 * - inf = -inf
 5.0 / inf = 0.00
 5.0 / - inf = -0.00
 inf / 5.0 = inf
 - inf / 5.0 = -inf

The changes in the patch are compliant with the observations above.

--
Best Wishes,
Ashutosh Bapat

#include <stdio.h>

void test_flag(int flag, char *flag_name);

int main(void)
{
	float inf = 1.0/0;
	float n_inf = -1.0/0;

	printf("inf = %f\n", inf);
	printf("-inf = %f\n", n_inf);

	/* Additions */
	printf("inf + inf = %f\n", inf + inf);
	printf("inf + -inf = %f\n", inf + n_inf);
	printf("-inf + inf = %f\n", n_inf + inf);
	printf("-inf + -inf = %f\n", n_inf + n_inf);

	/* Subtractions */
	printf("inf - inf = %f\n", inf - inf);
	printf("inf - -inf = %f\n", inf - n_inf);
	printf("-inf - inf = %f\n", n_inf - inf);
	printf("-inf - -inf = %f\n", n_inf - n_inf);

	if (0.0 == 0.0)
		printf("float 0.0 equals 0.0\n");
	if (0.5 + 0.5 == .64 + .36)
		printf("float 1.0 equals 1.0\n");

	/* Multiplication */
	printf(" 5.0 * inf = %f\n", 5.0 * inf);
	printf(" 5.0 * - inf = %f\n", 5.0 * n_inf);

	/* Division */
	printf(" 5.0 / inf = %f\n", 5.0 / inf);
	printf(" 5.0 / - inf = %f\n", 5.0 / n_inf);
	printf(" inf / 5.0 = %f\n", inf / 5.0);
	printf(" - inf / 5.0 = %f\n", n_inf / 5.0);
}
Re: SQL/JSON revisited
On 3/27/23 20:54, Alvaro Herrera wrote:

> Docs amended as I threatened. Other than that, this has required more
> [v12-0001-SQL-JSON-constructors.patch]
> [v12-0001-delta-uniqueifyJsonbObject-bugfix.patch]

In doc/src/sgml/func.sgml, some minor stuff:

'which specify the data type returned' should be
'which specifies the data type returned'

In the json_arrayagg() description, it says:
'If ABSENT ON NULL is specified, any NULL values are omitted.'
That's true, but as omitting NULL values is the default (i.e., also
without that clause), maybe it's better to say:
'Any NULL values are omitted unless NULL ON NULL is specified'

I've found no bugs in functionality.

Thanks,
Erik Rijkers
Re: Infinite Interval
On Sun, Mar 19, 2023 at 12:18 AM Joseph Koshakow wrote: > > The problem is that `secs = rint(secs)` rounds the seconds too early > and loses any fractional seconds. Do we have an overflow detecting > multiplication function for floats? We have float8_mul() which checks for overflow. typedef double float8; > > >+if (INTERVAL_NOT_FINITE(result)) > >+ereport(ERROR, > >+(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), > >+ errmsg("interval out of range"))); > > > >Probably, I added these kind of checks. But I don't remember if those are > >defensive checks or whether it's really possible that the arithmetic > > above > >these lines can yield an non-finite interval. > > These checks appear in `make_interval`, `justify_X`, > `interval_um_internal`, `interval_pl`, `interval_mi`, `interval_mul`, > `interval_div`. For all of these it's possible that the interval > overflows/underflows the non-finite ranges, but does not > overflow/underflow the data type. For example > `SELECT INTERVAL '2147483646 months' + INTERVAL '1 month'` would error > on this check. Without this patch postgres@64807=#SELECT INTERVAL '2147483646 months' + INTERVAL '1 month'; ?column? 178956970 years 7 mons (1 row) That result looks correct postgres@64807=#select 178956970 * 12 + 7; ?column? 2147483647 (1 row) So some backward compatibility break. I don't think we can avoid the backward compatibility break without expanding interval structure and thus causing on-disk breakage. But we can reduce the chances of breaking, if we change INTERVAL_NOT_FINITE to check all the three fields, instead of just month. > > > >+else > >+{ > >+result->time = -interval->time; > >+result->day = -interval->day; > >+result->month = -interval->month; > >+ > >+if (INTERVAL_NOT_FINITE(result)) > >+ereport(ERROR, > >+(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), > >+ errmsg("interval out of range"))); > > > >If this error ever gets to the user, it could be confusing. Can we > > elaborate by > >adding context e.g. 
errcontext("while negating an interval") or some > > such? > > Done. Thanks. Can we add relevant contexts at similar other places? Also if we use all the three fields, we will need to add such checks in interval_justify_hours() > > I replaced these checks with the following: > > + else if (interval->time == PG_INT64_MIN || interval->day == PG_INT32_MIN || > interval->month == PG_INT32_MIN) > + ereport(ERROR, > + (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), > + errmsg("interval out of range"))); > > I think this covers the same overflow check but is maybe a bit more > obvious. Unless, there's something I'm missing? Thanks. Your current version is closer to int4um(). Some more review comments in the following email. -- Best Wishes, Ashutosh Bapat
Re: Add standard collation UNICODE
On Tue, 2023-03-28 at 08:46 -0400, Joe Conway wrote:
> > As long as we still have to initialize the libc locale fields to some
> > language, I think it would be less confusing to keep the ICU locale on
> > the same language.
>
> I definitely agree with that.

Sounds good -- no changes then.

Regards,
Jeff Davis
Re: TAP output format in pg_regress
> On 15 Mar 2023, at 11:36, Peter Eisentraut wrote:
>
> On 24.02.23 10:49, Daniel Gustafsson wrote:
>> Another rebase on top of 337903a16f. Unless there are conflicting
>> reviews, I consider this patch to be done and ready for going in during
>> the next CF.
>
> I think this is just about as good as it's going to get, so I think we
> can consider committing this soon.
>
> A few comments along the way:
>
> 1) We can remove the gettext markers _() inside calls like note(),
> bail(), etc. If we really wanted to do translation, we would do that
> inside those functions (similar to errmsg() etc.).

Fixed. The attached also removes all explicit \n from output and leaves the
decision on when to add a linebreak to the TAP emitting function. I think
this better matches how we typically handle printing of output like this.
It also ensures that all bail messages follow the same syntax.

> 2) There are a few fprintf(stderr, ...) calls left. Should those be
> changed to something TAP-enabled?

Initially the patch kept errors happening before testing started as non-TAP
output; there were leftovers which are now converted.

> 3) Maybe these lines
>
> +++ isolation check in src/test/isolation +++
>
> should be changed to TAP format? Arguably, if I run "make -s check",
> then everything printed should be valid TAP, right?

Fixed.

> 4) From the introduction lines
>
> == creating temporary instance ==
> == initializing database system ==
> == starting postmaster ==
> running on port 61696 with PID 85346
> == creating database "regression" ==
> == running regression test queries ==
>
> you have kept
>
> # running on port 61696 with PID 85346
>
> which, well, is that the most interesting of those?
>
> The first three lines (creating, initializing, starting) take some
> noticeable amount of time, so they could be kept as a progress
> indicator. Or just delete all of them? I suppose some explicit
> indication about temp-instance versus existing instance would be useful.
This was discussed in 20220221164736.rq3ornzjdkmwk...@alap3.anarazel.de
which concluded that this was about the only thing of interest, and even at
that it was more of a maybe than a definite yes. As this patch stands, it
prints the above for a temp install and the host/port for an existing
install, but I don't have any problems removing that as well.

> 5) About the output format. Obviously, this will require some retraining
> of the eye. But my first impression is that there is a lot of whitespace
> without any guiding lines, so to speak, horizontally or vertically. It
> makes me a bit dizzy. ;-)
>
> I think instead
>
> # parallel group (2 tests): event_trigger oidjoins
> ok 210 event_trigger 131 ms
> ok 211 oidjoins 190 ms
> ok 212 fast_default 158 ms
> ok 213 tablespace 319 ms
>
> Something like this would be less dizzy-making:
>
> # parallel group (2 tests): event_trigger oidjoins
> ok 210 - + event_trigger 131 ms
> ok 211 - + oidjoins 190 ms
> ok 212 - fast_default 158 ms
> ok 213 - tablespace 319 ms

The current format is chosen to be close to the old format, while also
adding sufficient padding that it won't yield ragged columns. The wide
padding is needed to cope with long names in the isolation and ecpg test
suites and not so much the regress suite. In the attached I've dialled back
the padding a little bit to make it a bit more compact, but I doubt it's
enough.

> Just an idea, we don't have to get this exactly perfect right now, but
> it's something to think about.

I'm quite convinced that this will be revisited once this lands and is in
front of developers. I think the attached is a good candidate for going in,
but I would like to see it for another spin in the CF bot first.

--
Daniel Gustafsson

v18-0001-pg_regress-Emit-TAP-compliant-output.patch
Description: Binary data
Re: Memory leak from ExecutorState context?
On Tue, 28 Mar 2023 00:43:34 +0200 Tomas Vondra wrote: > On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote: > > Please, find in attachment a patch to allocate bufFiles in a dedicated > > context. I picked up your patch, backpatch'd it, went through it and did > > some minor changes to it. I have some comment/questions thought. > > > > 1. I'm not sure why we must allocate the "HashBatchFiles" new context > > under ExecutorState and not under hashtable->hashCxt? > > > > The only references I could find was in hashjoin.h:30: > > > >/* [...] > > * [...] (Exception: data associated with the temp files lives in the > > * per-query context too, since we always call buffile.c in that > > context.) > > > > And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this > > original comment in the patch): > > > >/* [...] > > * Note: it is important always to call this in the regular executor > > * context, not in a shorter-lived context; else the temp file buffers > > * will get messed up. > > > > > > But these are not explanation of why BufFile related allocations must be > > under a per-query context. > > > > Doesn't that simply describe the current (unpatched) behavior where > BufFile is allocated in the per-query context? I wasn't sure. The first quote from hashjoin.h seems to describe a stronger rule about «**always** call buffile.c in per-query context». But maybe it ought to be «always call buffile.c from one of the sub-query context»? I assume the aim is to enforce the tmp files removal on query end/error? > I mean, the current code calls BufFileCreateTemp() without switching the > context, so it's in the ExecutorState. But with the patch it very clearly is > not. > > And I'm pretty sure the patch should do > > hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt, > "HashBatchFiles", > ALLOCSET_DEFAULT_SIZES); > > and it'd still work. Or why do you think we *must* allocate it under > ExecutorState? 
That was actually my very first patch and it indeed worked. But I was confused about the previous quoted code comments. That's why I kept your original code and decided to rise the discussion here. Fixed in new patch in attachment. > FWIW The comment in hashjoin.h needs updating to reflect the change. Done in the last patch. Is my rewording accurate? > > 2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context > > switch seems fragile as it could be forgotten in futur code path/changes. > > So I added an Assert() in the function to make sure the current memory > > context is "HashBatchFiles" as expected. > > Another way to tie this up might be to pass the memory context as > > argument to the function. > > ... Or maybe I'm over precautionary. > > > > I'm not sure I'd call that fragile, we have plenty other code that > expects the memory context to be set correctly. Not sure about the > assert, but we don't have similar asserts anywhere else. I mostly sticked it there to stimulate the discussion around this as I needed to scratch that itch. > But I think it's just ugly and overly verbose +1 Your patch was just a demo/debug patch by the time. It needed some cleanup now :) > it'd be much nicer to e.g. pass the memory context as a parameter, and do > the switch inside. That was a proposition in my previous mail, so I did it in the new patch. Let's see what other reviewers think. > > 3. You wrote: > > > >>> A separate BufFile memory context helps, although people won't see it > >>> unless they attach a debugger, I think. Better than nothing, but I was > >>> wondering if we could maybe print some warnings when the number of batch > >>> files gets too high ... > > > > So I added a WARNING when batches memory are exhausting the memory size > > allowed. 
> >
> > +   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
> > +       elog(WARNING, "Growing number of hash batch is exhausting
> > memory");
> >
> > This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
> > overflows the memory budget. I realize now I should probably add the
> > memory limit, the number of current batches and their memory
> > consumption. The message is probably too cryptic for a user. It could
> > probably be reworded, but some doc or additional hint around this
> > message might help.
>
> Hmmm, not sure if WARNING is a good approach, but I don't have a better
> idea at the moment.

I stepped it down to NOTICE and added some more infos. Here is the output of
the last patch with a 1MB work_mem:

=# explain analyze select * from small join large using (id);
WARNING:  increasing number of batches from 1 to 2
WARNING:  increasing number of batches from 2 to 4
WARNING:  increasing number of batches from 4 to 8
WARNING:  increasing number of batches from 8 to 16
WARNING:  increasing
Re: "variable not found in subplan target list"
I wrote: > I reduced this down to > MERGE INTO public.target_parted as target_0 > USING public.itest1 as ref_0 > ON target_0.b = ref_0.a > WHEN NOT MATCHED >THEN INSERT VALUES (42, 13); > The critical moving part seems to just be that the MERGE target > is a partitioned table ... but surely somebody tested that before? Oh, it's not just any partitioned table: regression=# \d+ target_parted Partitioned table "public.target_parted" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description +-+---+--+-+-+-+--+- a | integer | | | | plain | | | b | integer | | | | plain | | | Partition key: LIST (a) Number of partitions: 0 The planner is reducing the scan of target_parted to a dummy scan, as is reasonable, but it forgets to provide ctid as an output from that scan; then the parent join node is unhappy because it does have a ctid output. So it looks like the problem is some shortcut we take while creating the dummy scan. I suppose that without the planner bug, this'd fail at runtime for lack of a partition to put (42,13) into. Because of that, the case isn't really interesting for production, which may explain the lack of reports. regards, tom lane
Re: Hash table scans outside transactions
Bumping it to attract some attention.

On Tue, Mar 21, 2023 at 12:51 PM Ashutosh Bapat wrote:
>
> Hi,
> Hash table scans (seq_scan_table/level) are cleaned up at the end of a
> transaction in AtEOXact_HashTables(). If a hash seq scan continues
> beyond transaction end it will meet "ERROR: no hash_seq_search scan
> for hash table" in deregister_seq_scan(). That seems to be limiting
> the hash table usage.
>
> Our use case is
> 1. Add/update/remove entries in hash table
> 2. Scan the existing entries and perform one transaction per entry
> 3. Close scan
>
> repeat the above steps in an infinite loop. Note that we do not
> add/modify/delete entries in step 2. We can't use linked lists since
> the entries need to be updated or deleted using hash keys. Because the
> hash seq scan is cleaned up at the end of the transaction, we
> encounter an error in the 3rd step. I don't see that the actual hash
> table scan depends upon the seq_scan_table/level[] which is cleaned up
> at the end of the transaction.
>
> I have the following questions
> 1. Is there a way to avoid cleaning up seq_scan_table/level() when the
> transaction ends?
> 2. Is there a way that we can use the hash table implementation in
> PostgreSQL code for our purpose?
>
> --
> Best Wishes,
> Ashutosh Bapat

--
Best Wishes,
Ashutosh Bapat
Re: "variable not found in subplan target list"
Alvaro Herrera writes: > I have to run now so can't dissect it, but while running sqlsmith on the > SQL/JSON patch after Justin's report, I got $SUBJECT in this query: I reduced this down to MERGE INTO public.target_parted as target_0 USING public.itest1 as ref_0 ON target_0.b = ref_0.a WHEN NOT MATCHED THEN INSERT VALUES (42, 13); The critical moving part seems to just be that the MERGE target is a partitioned table ... but surely somebody tested that before? regards, tom lane
Re: Add standard collation UNICODE
On 3/28/23 06:07, Peter Eisentraut wrote: On 23.03.23 21:16, Jeff Davis wrote: Another thought: for ICU, do we want the default collation to be UNICODE (root collation)? What we have now gets the default from the environment, which is consistent with the libc provider. But now that we have the UNICODE collation, it makes me wonder if we should just default to that. The server's environment doesn't necessarily say much about the locale of the data stored in it or the locale of the applications accessing it. As long as we still have to initialize the libc locale fields to some language, I think it would be less confusing to keep the ICU locale on the same language. I definitely agree with that. If we ever manage to get rid of that, then I would also support making the ICU locale the root collation by default. +1 -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGdoc: add missing ID attribute to create_subscription.sgml
On Tue, Mar 28, 2023 at 1:00 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Amit, > > Thank you for reviewing! PSA new version. > > > Isn't it better to move the link-related part to the next line > > wherever possible? Currently, it looks bit odd. > > Previously I preferred not to add a new line inside the tag, but it > caused > long-line. So I adjusted them not to be too short/long length. > There is no need to break the link line. See attached. -- With Regards, Amit Kapila. v7-0001-Add-XML-ID-attributes-to-create_subscription.sgml.patch Description: Binary data
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
> > I might misunderstand something, but I believe the v5 patch uses > copy_generic_opt_list and it does not add IGNORE_DATATYPE_ERRORS as a > keyword. > It modifies neither kwlist.h nor gram.y. > Sorry, didn't notice that. I think everything is alright now. Regards, Damir Belyalov Postgres Professional
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On 2023-03-27 23:28, Damir Belyalov wrote: Hi! I made the specified changes and my patch turned out the same as yours. The performance measurements were the same too. Thanks for your review and measurements. The only thing left to do is how not to add IGNORE_DATATYPE_ERRORS as a keyword. See how this is done for parameters such as FORCE_NOT_NULL, FORCE_NULL, FORCE_QUOTE. They are not in kwlist.h and are not as keywords in gram.y. I might misunderstand something, but I believe the v5 patch uses copy_generic_opt_list and it does not add IGNORE_DATATYPE_ERRORS as a keyword. It modifies neither kwlist.h nor gram.y. -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATION
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On Mon, Dec 19, 2022 at 12:37:19PM +0100, Alvaro Herrera wrote: > I have created one in the January commitfest, > https://commitfest.postgresql.org/41/ > and rebased the patch on current master. (I have not reviewed this.) I have spent some time on that, and here are some comments with an updated version of the patch attached. The checks in XLogRegisterData() seemed overcomplicated to me. In this context, I think that we should just care about making sure that mainrdata_len does not overflow depending on the length given by the caller, which is where pg_add_u32_overflow() becomes handy. XLogRegisterBufData() added a check on UINT16_MAX in an assert, though we already check for overflow a couple of lines down. This is not necessary, it seems. @@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, XLogRecord *rechdr; char *scratch = hdr_scratch; + /* ensure that any assembled record can be decoded */ + Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize))); A hardcoded check like that has no need to be in a code path triggered each time a WAL record is assembled. One place where this could be is InitXLogInsert(). It still means that it is called one time for each backend, but seeing it where the initialization of xloginsert.c feels natural, at least. A postmaster location would be enough, as well. XLogRecordMaxSize just needs to be checked once IMO, around the end of XLogRecordAssemble() once we know the total size of the record that will be fed to a XLogReader. One thing that we should be more careful of is to make sure that total_len does not overflow its uint32 value while assembling the record, as well. I have removed XLogErrorDataLimitExceeded(), replacing it with more context about the errors happening. Perhaps this has no need to be that much verbose, but it can be really useful for developers. Some comments had no need to be updated, and there were some typos. 
I am on board with the idea of a XLogRecordMaxSize that's bounded at 1020MB, leaving 4MB as room for the extra data needed by a XLogReader. At the end, I think that this is quite interesting long-term. For example, if we lift up XLogRecordMaxSize, we can evaluate the APIs adding buffer data or main data separately. Thoughts about this version? -- Michael From 16b40324a5bc5bbe6e06cbfb86251c907f400151 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 28 Mar 2023 20:34:31 +0900 Subject: [PATCH v10] Add protections in xlog record APIs against overflows Before this, it was possible for an extension to create malicious WAL records that were too large to replay; or that would overflow the xl_tot_len field, causing potential corruption in WAL record IO ops. Emitting invalid records was also possible through pg_logical_emit_message(), which allowed you to emit arbitrary amounts of data up to 2GB, much higher than the replay limit of approximately 1GB. This patch adds a limit to the size of an XLogRecord (1020MiB), checks against overflows, and ensures that the reader infrastructure can read any max-sized XLogRecords, such that the wal-generating backend will fail when it attempts to insert unreadable records, instead of that insertion succeeding but breaking any replication streams. Author: Matthias van de Meent --- src/include/access/xlogrecord.h | 11 src/backend/access/transam/xloginsert.c | 73 + 2 files changed, 74 insertions(+), 10 deletions(-) diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h index 0d576f7883..0a7b0bb6fc 100644 --- a/src/include/access/xlogrecord.h +++ b/src/include/access/xlogrecord.h @@ -54,6 +54,17 @@ typedef struct XLogRecord #define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c)) +/* + * XLogReader needs to allocate all the data of an xlog record in a single + * chunk. 
This means that a single XLogRecord cannot exceed MaxAllocSize + * in length if we ignore any allocation overhead of the XLogReader. + * + * To accommodate some overhead, this value allows for 4M of allocation + * overhead, that should be plenty enough for what + * DecodeXLogRecordRequiredSpace() expects as extra. + */ +#define XLogRecordMaxSize (1020 * 1024 * 1024) + /* * The high 4 bits in xl_info may be used freely by rmgr. The * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 008612e032..86fa6a5276 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -33,6 +33,7 @@ #include "access/xloginsert.h" #include "catalog/pg_control.h" #include "common/pg_lzcompress.h" +#include "common/int.h" #include "executor/instrument.h" #include "miscadmin.h" #include "pg_trace.h" @@ -351,11 +352,13 @@ void XLogRegisterData(char *data, uint32 len) { XLogRecData *rdata; + uint32 result = 0;
Re: "variable not found in subplan target list"
Alvaro Herrera writes: > I have to run now so can't dissect it, but while running sqlsmith on the > SQL/JSON patch after Justin's report, I got $SUBJECT in this query: Reproduces in HEAD and v15 too (once you replace pg_catalog.system_user with some function that exists in v15). So it's not the fault of the JSON patch, nor of my outer-join hacking which had been my first thought. regards, tom lane
Re: Move definition of standard collations from initdb to pg_collation.dat
Peter Eisentraut writes: > While working on [0], I was wondering why the collations ucs_basic and > unicode are not in pg_collation.dat. I traced this back through > history, and I think this was just lost in a game of telephone. > The initial commit for pg_collation.h (414c5a2ea6) has only the default > collation in pg_collation.h (pre .dat), with initdb handling everything > else. Over time, additional collations "C" and "POSIX" were moved to > pg_collation.h, and other logic was moved from initdb to > pg_import_system_collations(). But ucs_basic was untouched. Commit > 0b13b2a771 rearranged the relative order of operations in initdb and > added the current comment "We don't want to pin these", but looking at > the email[1], I think this was more a guess about the previous intent. Yeah, I was just loath to change the previous behavior in that patch. I can't see any strong reason not to pin these entries. > I suggest we fix this now; see attached patch. While we're here, do we want to adopt some other spelling of "the root locale" than "und", in view of recent discoveries about the instability of that on old ICU versions? regards, tom lane
"variable not found in subplan target list"
I have to run now so can't dissect it, but while running sqlsmith on the SQL/JSON patch after Justin's report, I got $SUBJECT in this query: MERGE INTO public.target_parted as target_0 USING (select subq_0.c5 as c0, subq_0.c0 as c1, ref_0.a as c2, subq_0.c1 as c3, subq_0.c9 as c4, (select c from public.prt2_m_p3 limit 1 offset 1) as c5, subq_0.c8 as c6, ref_0.a as c7, subq_0.c7 as c8, subq_0.c1 as c9, pg_catalog.system_user() as c10 from public.itest1 as ref_0 left join (select ref_1.matches as c0, ref_1.typ as c1, ref_1.colname as c2, (select slotname from public.iface limit 1 offset 44) as c3, ref_1.matches as c4, ref_1.op as c5, ref_1.matches as c6, ref_1.value as c7, ref_1.op as c8, ref_1.op as c9, ref_1.typ as c10 from public.brinopers_multi as ref_1 where cast(null as polygon) <@ (select polygon from public.tab_core_types limit 1 offset 22) ) as subq_0 on (cast(null as macaddr8) >= cast(null as macaddr8)) where subq_0.c10 > subq_0.c2 limit 49) as subq_1 ON target_0.b = subq_1.c2 WHEN MATCHED AND (cast(null as box) |>> cast(null as box)) or (cast(null as lseg) ?-| (select s from public.lseg_tbl limit 1 offset 6) ) THEN DELETE WHEN NOT MATCHED AND (EXISTS ( select 21 as c0, subq_2.c0 as c1 from public.itest14 as sample_0 tablesample system (3.6) inner join public.num_exp_sqrt as sample_1 tablesample bernoulli (0.3) on (cast(null as "char") <= cast(null as "char")), lateral (select sample_1.id as c0 from public.a as ref_2 where (cast(null as lseg) <@ cast(null as line)) or ((select b3 from public.bit_defaults limit 1 offset 80) <> (select b3 from public.bit_defaults limit 1 offset 4) ) limit 158) as subq_2 where (cast(null as name) !~ (select t from public.test_tsvector limit 1 offset 5) ) and ((select bool from public.tab_core_types limit 1 offset 61) < (select pg_catalog.bool_or(v) from public.rtest_view1) ))) or (18 is NULL) THEN INSERT VALUES ( pg_catalog.int4um( cast(public.func_with_bad_set() as int4)), 13) WHEN MATCHED AND ((24 is not NULL) or (true)) or 
(cast(null as "timestamp") <= cast(null as timestamptz)) THEN UPDATE set b = target_0.b Ugh. I got no more SQL/JSON related crashes so far. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)
Move definition of standard collations from initdb to pg_collation.dat
While working on [0], I was wondering why the collations ucs_basic and unicode are not in pg_collation.dat. I traced this back through history, and I think this was just lost in a game of telephone. The initial commit for pg_collation.h (414c5a2ea6) has only the default collation in pg_collation.h (pre .dat), with initdb handling everything else. Over time, additional collations "C" and "POSIX" were moved to pg_collation.h, and other logic was moved from initdb to pg_import_system_collations(). But ucs_basic was untouched. Commit 0b13b2a771 rearranged the relative order of operations in initdb and added the current comment "We don't want to pin these", but looking at the email[1], I think this was more a guess about the previous intent. I suggest we fix this now; see attached patch. [0]: https://www.postgresql.org/message-id/flat/1293e382-2093-a2bf-a397-c04e8f83d3c2%40enterprisedb.com [1]: https://www.postgresql.org/message-id/28195.1498172402%40sss.pgh.pa.usFrom 0d2c6b92a3340833f13bab395e0556ce1f045226 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 28 Mar 2023 12:04:34 +0200 Subject: [PATCH] Move definition of standard collations from initdb to pg_collation.dat --- src/bin/initdb/initdb.c | 15 +-- src/include/catalog/pg_collation.dat | 7 +++ 2 files changed, 8 insertions(+), 14 deletions(-) diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index bae97539fc..9ccbf998ec 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -1695,20 +1695,7 @@ setup_description(FILE *cmdfd) static void setup_collation(FILE *cmdfd) { - /* -* Add SQL-standard names. We don't want to pin these, so they don't go -* in pg_collation.dat. But add them before reading system collations, so -* that they win if libc defines a locale with the same name. 
-*/ - PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, colliculocale)" - "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, 'und');\n\n", - BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU); - - PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, collcollate, collctype)" - "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'ucs_basic', 'pg_catalog'::regnamespace, %u, '%c', true, %d, 'C', 'C');\n\n", - BOOTSTRAP_SUPERUSERID, COLLPROVIDER_LIBC, PG_UTF8); - - /* Now import all collations we can find in the operating system */ + /* Import all collations we can find in the operating system */ PG_CMD_PUTS("SELECT pg_import_system_collations('pg_catalog');\n\n"); } diff --git a/src/include/catalog/pg_collation.dat b/src/include/catalog/pg_collation.dat index f4bda1c769..14df398ad2 100644 --- a/src/include/catalog/pg_collation.dat +++ b/src/include/catalog/pg_collation.dat @@ -23,5 +23,12 @@ descr => 'standard POSIX collation', collname => 'POSIX', collprovider => 'c', collencoding => '-1', collcollate => 'POSIX', collctype => 'POSIX' }, +{ oid => '962', + descr => 'sorts using the Unicode Collation Algorithm with default settings', + collname => 'unicode', collprovider => 'i', collencoding => '-1', + colliculocale => 'und' }, +{ oid => '963', descr => 'sorts by Unicode code point', + collname => 'ucs_basic', collprovider => 'c', collencoding => '6', + collcollate => 'C', collctype => 'C' }, ] -- 2.40.0
Re: doc: add missing "id" attributes to extension packaging page
On 28.03.2023 at 00:11, Peter Smith wrote:
> FYI, there is a lot of overlap between this last attachment and the
> patches of Kuroda-san's current thread here [1] which is also adding ids
> to create_subscription.sgml. (Anyway, I guess you might already be aware
> of that other thread because your new ids look like they have the same
> names as those chosen by Kuroda-san)
>
> [1] https://www.postgresql.org/message-id/flat/CAHut%2BPvzo6%3DKKLqMR6-mAQdM%2Bj_dse0eUreGmrFouL7gbLbv2w%40mail.gmail.com#7da8d0e3b73096375847c16c856b4aed

Thanks, I actually was not aware of this. Also, kudos for capturing the
missing id and advocating for consistency regarding ids even before this is
actively enforced. This nurtures my optimism that consistency might
actually be achievable without everybody getting angry at me because my
patch will enforce it.

Regarding the overlap, I currently try to make it as easy as possible for a
potential committer, and I'm happy to rebase my patch upon request or if
Kuroda-san's patch gets applied first.

Regards,
Brar
RE: Data is copied twice when specifying both child and parent table in publication
On Tues, Mar 28, 2023 at 18:00 PM Wang, Wei/王 威 wrote: > Attach the new patch. Sorry, I attached the wrong patch. Here is the correct new version patch which addressed all comments so far. Regards, Wang Wei v25-0001-Avoid-syncing-data-twice-for-the-publish_via_par.patch Description: v25-0001-Avoid-syncing-data-twice-for-the-publish_via_par.patch