Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-19 Thread Michael Paquier
On Wed, Jun 19, 2024 at 08:37:17AM -0500, Nathan Bossart wrote:
> My understanding is that you basically have to restart the upgrade from
> scratch if that happens.  I suppose there could be a problem if you try to
> use the half-upgraded cluster after a crash, but I imagine you have a good
> chance of encountering other problems if you do that, too.  So I don't
> think we care...

It's never been assumed that it would be safe to redo a
pg_upgrade after a crash on a cluster initdb'd for the upgrade, so I
don't think we need to care about that either.

One failure I suspect would quickly be faced is OIDs getting reused
again, as these are currently kept consistent across the upgrade.
--
Michael




Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-06-19 Thread Michael Paquier
On Wed, Jun 19, 2024 at 11:00:00PM +0300, Alexander Lakhin wrote:
> Starting from af0e7deb4, the following script:
> [...]
> triggers a segfault:
> 2024-06-19 19:22:49.009 UTC [1607210:6] LOG:  server process (PID
> 1607671) was terminated by signal 11: Segmentation fault

Open item added for this issue.
--
Michael




Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Michael Paquier
On Thu, Jun 20, 2024 at 10:11:37AM +0900, Masahiko Sawada wrote:
> I agree with radixtree-fix-proposed.patch. Even if we zero more fields
> in the node it would not add noticeable overheads.

This needs to be tracked as an open item, so I have added one now.
--
Michael




Re: State of pg_createsubscriber

2024-06-19 Thread Amit Kapila
On Thu, Jun 20, 2024 at 2:35 AM Euler Taveira  wrote:
>
> I will open new threads soon if you don't.
>

Okay, thanks. I'll wait for you to start new threads and then we can
discuss the respective problems in those threads.

-- 
With Regards,
Amit Kapila.




Re: Pgoutput not capturing the generated columns

2024-06-19 Thread Peter Smith
Hi, here are my review comments for v8-0001.

==
Commit message.

1.
It seems like the patch name was accidentally omitted, so it became a
mess when it defaulted to the 1st paragraph of the commit message.

==
contrib/test_decoding/test_decoding.c

2.
+ data->include_generated_columns = true;

I previously posted a comment [1, #4] that this should default to
false; IMO it is unintuitive for the test_decoding to have an
*opposite* default behaviour compared to CREATE SUBSCRIPTION.

==
doc/src/sgml/ref/create_subscription.sgml

NITPICK - remove the inconsistent blank line in SGML

==
src/backend/commands/subscriptioncmds.c

3.
+#define SUBOPT_include_generated_columns 0x0001

I previously posted a comment [1, #6] that this should be UPPERCASE,
but it is not yet fixed.

==
src/bin/psql/describe.c

NITPICK - move and reword the bogus comment

~

4.
+ if (pset.sversion >= 17)
+ appendPQExpBuffer(&buf,
+ ", subincludegencols AS \"%s\"\n",
+ gettext_noop("include_generated_columns"));

4a.
For consistency with every other parameter, that column title should
be written in words "Include generated columns" (not
"include_generated_columns").

~

4b.
IMO this new column belongs with the other subscription parameter
columns (e.g. put it ahead of the "Conninfo" column).

==
src/test/subscription/t/011_generated.pl

NITPICK - fixed a comment

5.
IMO, it would be better for readability if all the matching CREATE
TABLE for publisher and subscriber are kept together, instead of the
current code which is creating all publisher tables and then creating
all subscriber tables.

~~~

6.
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab2");
+is( $result, qq(4|8
+5|10), 'confirm generated columns ARE replicated when the subscriber-side column is not generated');
+
...
+$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab3");
+is( $result, qq(4|24
+5|25), 'confirm generated columns are NOT replicated when the subscriber-side column is also generated');
+

6a.
These SELECTs all need an ORDER BY to protect against the SELECT *
returning rows in some unexpected order.

~

6b.
IMO there should be more comments here to explain how you can tell the
column was NOT replicated. E.g. it is because the result value of 'b'
is the subscriber-side computed value (which you made deliberately
different to the publisher-side computed value).

==

99.
Please also refer to the attached nitpicks top-up patch for minor
cosmetic stuff.

==
[1] 
https://www.postgresql.org/message-id/CAHv8RjLeZtTeXpFdoY6xCPO41HtuOPMSSZgshVdb%2BV%3Dp2YHL8Q%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index e8779dc..ee27a58 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -437,7 +437,6 @@ CREATE SUBSCRIPTION subscription_namefalse.
  
-
  
   If the subscriber-side column is also a generated column then this option
   has no effect; the subscriber column will be filled as normal with the
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 75a52ce..663015d 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -6609,12 +6609,12 @@ describeSubscriptions(const char *pattern, bool verbose)
		appendPQExpBuffer(&buf,
						  ", subskiplsn AS \"%s\"\n",
						  gettext_noop("Skip LSN"));
+
+		/* include_generated_columns is only supported in v18 and higher */
		if (pset.sversion >= 17)
			appendPQExpBuffer(&buf,
							  ", subincludegencols AS \"%s\"\n",
							  gettext_noop("include_generated_columns"));
-
-		// include_generated_columns
	}
 
/* Only display subscriptions in current database. */
diff --git a/src/test/subscription/t/011_generated.pl b/src/test/subscription/t/011_generated.pl
index 92b3dbf..cbd5015 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -24,7 +24,7 @@ $node_publisher->safe_psql('postgres',
	"CREATE TABLE tab1 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) STORED)"
 );
 
-# publisher-side tab2 has generated col 'b' but subscriber-side tab2 has NON-gnerated col 'b'.
+# publisher-side tab2 has generated col 'b' but subscriber-side tab2 has NON-generated col 'b'.
 $node_publisher->safe_psql('postgres',
	"CREATE TABLE tab2 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) STORED)"
 );


datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE

2024-06-19 Thread Noah Misch
https://postgr.es/m/20240512232923.aa.nmi...@google.com wrote:
> Separable, nontrivial things not fixed in the attached patch stack:

> - Trouble is possible, I bet, if the system crashes between the inplace-update
>   memcpy() and XLogInsert().  See the new XXX comment below the memcpy().

That comment:

/*--
 * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
 *
 * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
 * ["R" is a VACUUM tbl]
 * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
 * D: systable_getnext() returns pg_class tuple of tbl
 * R: memcpy() into pg_class tuple of tbl
 * D: raise pg_database.datfrozenxid, XLogInsert(), finish
 * [crash]
 * [recovery restores datfrozenxid w/o relfrozenxid]
 */

>   Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and
>   finally issuing memcpy() into the buffer.

That fix worked.  Along with that, I'm attaching a not-for-commit patch with a
test case and one with the fix rebased on that test case.  Apply on top of the
v2 patch stack from https://postgr.es/m/20240617235854.f8.nmi...@google.com.
This gets key testing from 027_stream_regress.pl; when I commented out some
memcpy lines of the heapam.c change, that test caught it.

This resolves the last inplace update defect known to me.

Thanks,
nm
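In rough terms, the ordering the fix establishes in heap_inplace_update_finish() is the following condensed sketch (illustrative only: the elided WAL-registration calls and the exact flag handling are assumptions on my part; the attached patch is the real change):

	/* prevent a checkpoint from sneaking in between WAL and the buffer change */
	MyProc->delayChkptFlags |= DELAY_CHKPT_START;
	START_CRIT_SECTION();

	/* WAL first ... (XLogBeginInsert()/XLogRegisterData() elided here) */
	recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);

	/* ... then mutate the shared buffer that readers may already be looking at */
	memcpy(dst, src, newlen);
	MarkBufferDirty(buffer);
	PageSetLSN(BufferGetPage(buffer), recptr);

	END_CRIT_SECTION();
	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;

The point of the checkpoint delay is that, with WAL written before the buffer change, a crash can no longer leave the buffer change unlogged while other sessions have already acted on it.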
Author: Noah Misch 
Commit: Noah Misch 

WAL-log inplace update before revealing it to other sessions.

A buffer lock won't stop a reader having already checked tuple
visibility.  If a vac_update_datfrozenxid() and then a crash happened
during inplace update of a relfrozenxid value, datfrozenxid could
overtake relfrozenxid.  Back-patch to v12 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index fb06ff2..aec8dcc 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -201,6 +201,4 @@ Inplace updates create an exception to the rule that tuple data won't change
 under a reader holding a pin.  A reader of a heap_fetch() result tuple may
 witness a torn read.  Current inplace-updated fields are aligned and are no
 wider than four bytes, and current readers don't need consistency across
-fields.  Hence, they get by with just fetching each field once.  XXX such a
-caller may also read a value that has not reached WAL; see
-heap_inplace_update_finish().
+fields.  Hence, they get by with just fetching each field once.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d7e417f..2a5fea5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6305,6 +6305,8 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
Relation	relation = scan->heap_rel;
uint32  oldlen;
uint32  newlen;
+   char   *dst;
+   char   *src;
int nmsgs = 0;
SharedInvalidationMessage *invalMessages = NULL;
bool		RelcacheInitFileInval = false;
@@ -6315,6 +6317,9 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
elog(ERROR, "wrong tuple length");
 
+   dst = (char *) htup + htup->t_hoff;
+   src = (char *) tuple->t_data + tuple->t_data->t_hoff;
+
/*
 * Construct shared cache inval if necessary.  Note that because we only
 * pass the new version of the tuple, this mustn't be used for any
@@ -6338,15 +6343,15 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
 */
PreInplace_Inval();
 
-   /* NO EREPORT(ERROR) from here till changes are logged */
-   START_CRIT_SECTION();
-
-   memcpy((char *) htup + htup->t_hoff,
-  (char *) tuple->t_data + tuple->t_data->t_hoff,
-  newlen);
-
/*--
-* XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
+* NO EREPORT(ERROR) from here till changes are complete
+*
+* Our buffer lock won't stop a reader having already pinned and checked
+* visibility for this tuple.  Hence, we write WAL first, then mutate the
+* buffer.  Like in MarkBufferDirtyHint() or RecordTransactionCommit(),
+* checkpoint delay makes that acceptable.  With the usual order of
+* changes, a crash after memcpy() and before XLogInsert() could allow
+* datfrozenxid to overtake relfrozenxid:
 *
 * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
 * ["R" is a VACUUM tbl]
@@ -6356,14 +6361,28 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
 * D: raise pg_database.datfrozenxid, XLogInsert(), finish
 * [crash]
 * 

Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Masahiko Sawada
Hi,

On Thu, Jun 20, 2024 at 7:54 AM Tom Lane  wrote:
>
> I wrote:
> > I've reproduced what looks like about the same thing on
> > my Pi 4 using Fedora 38: just run "make installcheck-parallel"
> > under valgrind, and kaboom.  Definitely needs investigation.
>
> The problem appears to be that RT_ALLOC_NODE doesn't bother to
> initialize the chunks[] array when making a RT_NODE_16 node.
> If we fill fewer than RT_FANOUT_16_MAX of the chunks[] entries,
> then when RT_NODE_16_SEARCH_EQ applies vector operations that
> read the entire array, it's operating partially on uninitialized
> data.  Now, that's actually OK because of the "mask off invalid
> entries" step, but aarch64 valgrind complains anyway.
>
> I hypothesize that the reason we're not seeing equivalent failures
> on x86_64 is one of
>
> 1. x86_64 valgrind is stupider than aarch64, and fails to track that
> the contents of the SIMD registers are only partially defined.
>
> 2. x86_64 valgrind is smarter than aarch64, and is able to see
> that the "mask off invalid entries" step removes all the
> potentially-uninitialized bits.
>
> The first attached patch, "radixtree-fix-minimal.patch", is enough
> to stop the aarch64 valgrind failure for me.  However, I think
> that the coding here is pretty penny-wise and pound-foolish,
> and that what we really ought to do is the second patch,
> "radixtree-fix-proposed.patch".  I do not believe that asking
> memset to zero the three-byte RT_NODE structure produces code
> that's either shorter or faster than having it zero 8 bytes
> (as for RT_NODE_4) or having it do that and then separately
> zero some more stuff (as for the larger node types).  Leaving
> RT_NODE_4's chunks[] array uninitialized is going to bite us
> someday, too, even if it doesn't right now.  So I think we
> ought to just zero the whole fixed-size part of the nodes,
> which is what radixtree-fix-proposed.patch does.

I agree with radixtree-fix-proposed.patch. Even if we zero more fields
in the node it would not add noticeable overheads.

>
> (The RT_NODE_48 case could be optimized a bit if we cared
> to swap the order of its slot_idxs[] and isset[] arrays;
> then the initial zeroing could just go up to slot_idxs[].
> I don't know if there's any reason why the current order
> is preferable.)

IIUC there is no particular reason for the current order in RT_NODE_48.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: DOCS: Generated table columns are skipped by logical replication

2024-06-19 Thread Peter Smith
On Wed, Jun 19, 2024 at 2:21 PM Amit Kapila  wrote:
>
...
>
> Thanks for sharing examples. Your proposed patch as-is looks good to
> me. We should back-patch this unless you or someone else thinks
> otherwise.
>

Hi Amit.

I modified the patch text slightly according to Peter E's suggestion [1].

I also tested the above examples against all older PostgreSQL versions
12,13,14,15,16,17. The logical replication behaviour of skipping
generated columns is the same for all of them.

Note that CREATE PUBLICATION column lists did not exist until PG15, so
a modified patch is needed for the versions before that.

~

The attached "HEAD" patch is appropriate for HEAD, PG17, PG16, PG15
The attached "PG14" patch is appropriate for PG14, PG13, PG12

==
[1] 
https://www.postgresql.org/message-id/2b291af9-929f-49ab-b378-5cbc029d348f%40eisentraut.org

Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-DOCS-logical-replication-of-generated-columns-PG14
Description: Binary data


v1-0001-DOCS-logical-replication-of-generated-columns-HEAD
Description: Binary data


Re: Pluggable cumulative statistics

2024-06-19 Thread Michael Paquier
On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote:
> - How should the persistence of the custom stats be achieved?
> Callbacks to give custom stats kinds a way to write/read their data,
> push everything into a single file, or support both?
> - Should this do like custom RMGRs and assign to each stats kinds ID
> that are set in stone rather than dynamic ones?

These two questions have been itching me in terms of how it would work
for extension developers, after noticing that custom RMGRs are used
more than I thought:
https://wiki.postgresql.org/wiki/CustomWALResourceManagers

The result is proving to be nicer, shorter by 300 lines in total, and
much simpler when it comes to thinking about the way stats are flushed,
because it is possible to achieve the same result as the first patch
set without manipulating any of the code paths doing the read and
write of the pgstats file.

In terms of implementation, pgstat.c's KindInfo data is divided into
two parts, for efficiency:
- The existing in-core stats with designated initializers, renamed as
built-in stats kinds.
- The custom stats kinds are saved in TopMemoryContext, and can only
be registered with shared_preload_libraries.  The patch reserves a set
of 128 hardcoded slots for all the custom kinds, making the lookups for
the KindInfos quite cheap.  Upon registration, a custom stats kind
needs to assign a unique ID, with uniqueness of the names and IDs
checked at registration (see the sketch below).
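As an illustration only (the function and field names here are my guesses at what such an API could look like; the attached patches are authoritative), registration from an extension's _PG_init() might resemble:

	/* hypothetical custom stats kind with a fixed, extension-chosen ID */
	#define PGSTAT_KIND_MYEXT	130

	static const PgStat_KindInfo myext_stats_info = {
		.name = "myext",
		.fixed_amount = true,
		/* shared size plus init/flush callbacks for the stats data go here */
	};

	void
	_PG_init(void)
	{
		if (!process_shared_preload_libraries_in_progress)
			elog(ERROR, "myext must be loaded via shared_preload_libraries");

		/* name and ID are checked for uniqueness at registration time */
		pgstat_register_kind(PGSTAT_KIND_MYEXT, &myext_stats_info);
	}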

The backend code does ID -> information lookups in the hotter paths,
meaning that the code only checks if an ID is built-in or custom, then
redirects to the correct array where the information is stored.
There is one code path that does a name -> information lookup for the
undocumented SQL function pg_stat_have_stats() used in the tests,
which is a bit less efficient now, but that does not strike me as an
issue.
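Roughly speaking, and again only as a sketch of the shape of the code rather than the patch text, the hot-path lookup then looks like:

	static const PgStat_KindInfo *
	pgstat_get_kind_info(PgStat_Kind kind)
	{
		if (pgstat_is_kind_builtin(kind))
			return &pgstat_kind_builtin_infos[kind];

		if (pgstat_is_kind_custom(kind))
			return pgstat_kind_custom_infos[kind - PGSTAT_KIND_CUSTOM_MIN];

		return NULL;			/* unknown or unregistered kind */
	}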

modules/injection_points/ works as previously as a template to show
how to use these APIs, with tests for the whole.

With that in mind, the patch set is more pleasant to the eye, and the
attached v2 consists of:
- 0001 and 0002 are some cleanups, same as previously to prepare for
the backend-side APIs.
- 0003 adds the backend support to plug-in custom stats.
- 0004 includes documentation.
- 0005 is an example of how to use them, with a TAP test providing
coverage.

Note that the patch I've proposed to make stats persistent at
checkpoint time, so that we don't discard everything after a crash, is
able to work with the custom stats proposed on this thread:
https://commitfest.postgresql.org/48/5047/
--
Michael
From 4c065f73cb744d1735c01e6f276d658853810f2e Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 13 Jun 2024 13:25:05 +0900
Subject: [PATCH v2 1/5] Switch PgStat_Kind from enum to uint32

A follow-up patch is planned to make this counter extensible, and
keeping a trace of the kind behind a type is useful in the internal
routines used by pgstats.  While on it, switch pgstat_is_kind_valid() to
use PgStat_Kind, to be more consistent with its callers.
---
 src/include/pgstat.h| 35 ++---
 src/backend/utils/activity/pgstat.c |  6 ++---
 2 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710..2d30fadaf1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,26 +32,25 @@
 #define PG_STAT_TMP_DIR		"pg_stat_tmp"
 
 /* The types of statistics entries */
-typedef enum PgStat_Kind
-{
-	/* use 0 for INVALID, to catch zero-initialized data */
-	PGSTAT_KIND_INVALID = 0,
+#define PgStat_Kind uint32
 
-	/* stats for variable-numbered objects */
-	PGSTAT_KIND_DATABASE,		/* database-wide statistics */
-	PGSTAT_KIND_RELATION,		/* per-table statistics */
-	PGSTAT_KIND_FUNCTION,		/* per-function statistics */
-	PGSTAT_KIND_REPLSLOT,		/* per-slot statistics */
-	PGSTAT_KIND_SUBSCRIPTION,	/* per-subscription statistics */
+/* use 0 for INVALID, to catch zero-initialized data */
+#define PGSTAT_KIND_INVALID 0
 
-	/* stats for fixed-numbered objects */
-	PGSTAT_KIND_ARCHIVER,
-	PGSTAT_KIND_BGWRITER,
-	PGSTAT_KIND_CHECKPOINTER,
-	PGSTAT_KIND_IO,
-	PGSTAT_KIND_SLRU,
-	PGSTAT_KIND_WAL,
-} PgStat_Kind;
+/* stats for variable-numbered objects */
+#define PGSTAT_KIND_DATABASE	1	/* database-wide statistics */
+#define PGSTAT_KIND_RELATION	2	/* per-table statistics */
+#define PGSTAT_KIND_FUNCTION	3	/* per-function statistics */
+#define PGSTAT_KIND_REPLSLOT	4	/* per-slot statistics */
+#define PGSTAT_KIND_SUBSCRIPTION	5	/* per-subscription statistics */
+
+/* stats for fixed-numbered objects */
+#define PGSTAT_KIND_ARCHIVER	6
+#define PGSTAT_KIND_BGWRITER	7
+#define PGSTAT_KIND_CHECKPOINTER	8
+#define PGSTAT_KIND_IO	9
+#define PGSTAT_KIND_SLRU	10
+#define PGSTAT_KIND_WAL	11
 
 #define PGSTAT_KIND_FIRST_VALID PGSTAT_KIND_DATABASE
 #define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 

Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Tom Lane
I wrote:
> I hypothesize that the reason we're not seeing equivalent failures
> on x86_64 is one of

> 1. x86_64 valgrind is stupider than aarch64, and fails to track that
> the contents of the SIMD registers are only partially defined.

> 2. x86_64 valgrind is smarter than aarch64, and is able to see
> that the "mask off invalid entries" step removes all the
> potentially-uninitialized bits.

Hah: it's the second case.  If I patch radixtree.h as attached,
then x86_64 valgrind complains about 

==00:00:00:32.759 247596== Conditional jump or move depends on uninitialised value(s)
==00:00:00:32.759 247596==    at 0x52F668: local_ts_node_16_search_eq (radixtree.h:1018)

showing that it knows that the result of vector8_highbit_mask is
only partly defined.  Kind of odd though that aarch64 valgrind
is getting the hard part right and not the easy(?) part.

regards, tom lane

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 338e1d741d..267ec6de03 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1015,12 +1015,15 @@ RT_NODE_16_SEARCH_EQ(RT_NODE_16 * node, uint8 chunk)
 	/* convert comparison to a bitfield */
 	bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
 
-	/* mask off invalid entries */
-	bitfield &= ((UINT64CONST(1) << count) - 1);
-
-	/* convert bitfield to index by counting trailing zeros */
 	if (bitfield)
-		slot_simd = &node->children[pg_rightmost_one_pos32(bitfield)];
+	{
+		/* mask off invalid entries */
+		bitfield &= ((UINT64CONST(1) << count) - 1);
+
+		/* convert bitfield to index by counting trailing zeros */
+		if (bitfield)
+			slot_simd = &node->children[pg_rightmost_one_pos32(bitfield)];
+	}
 
 	Assert(slot_simd == slot);
 	return slot_simd;


Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Tom Lane
I wrote:
> I hypothesize that the reason we're not seeing equivalent failures
> on x86_64 is one of

> 1. x86_64 valgrind is stupider than aarch64, and fails to track that
> the contents of the SIMD registers are only partially defined.

> 2. x86_64 valgrind is smarter than aarch64, and is able to see
> that the "mask off invalid entries" step removes all the
> potentially-uninitialized bits.

Side note: it struck me that this could also be a valgrind version
skew issue.  But the machine I'm seeing the failure on is running
valgrind-3.22.0-1.fc38.aarch64, which is the same upstream version
as valgrind-3.22.0-2.el8.x86_64, where I don't see it.  So apparently
not.  (There is a 3.23.0 out recently, but its release notes don't
mention anything that seems related.)

regards, tom lane




Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Tom Lane
I wrote:
> I've reproduced what looks like about the same thing on
> my Pi 4 using Fedora 38: just run "make installcheck-parallel"
> under valgrind, and kaboom.  Definitely needs investigation.

The problem appears to be that RT_ALLOC_NODE doesn't bother to
initialize the chunks[] array when making a RT_NODE_16 node.
If we fill fewer than RT_FANOUT_16_MAX of the chunks[] entries,
then when RT_NODE_16_SEARCH_EQ applies vector operations that
read the entire array, it's operating partially on uninitialized
data.  Now, that's actually OK because of the "mask off invalid
entries" step, but aarch64 valgrind complains anyway.

I hypothesize that the reason we're not seeing equivalent failures
on x86_64 is one of

1. x86_64 valgrind is stupider than aarch64, and fails to track that
the contents of the SIMD registers are only partially defined.

2. x86_64 valgrind is smarter than aarch64, and is able to see
that the "mask off invalid entries" step removes all the
potentially-uninitialized bits.

The first attached patch, "radixtree-fix-minimal.patch", is enough
to stop the aarch64 valgrind failure for me.  However, I think
that the coding here is pretty penny-wise and pound-foolish,
and that what we really ought to do is the second patch,
"radixtree-fix-proposed.patch".  I do not believe that asking
memset to zero the three-byte RT_NODE structure produces code
that's either shorter or faster than having it zero 8 bytes
(as for RT_NODE_4) or having it do that and then separately
zero some more stuff (as for the larger node types).  Leaving
RT_NODE_4's chunks[] array uninitialized is going to bite us
someday, too, even if it doesn't right now.  So I think we
ought to just zero the whole fixed-size part of the nodes,
which is what radixtree-fix-proposed.patch does.

(The RT_NODE_48 case could be optimized a bit if we cared
to swap the order of its slot_idxs[] and isset[] arrays;
then the initial zeroing could just go up to slot_idxs[].
I don't know if there's any reason why the current order
is preferable.)

regards, tom lane

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 338e1d741d..e21f7be3f9 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -849,8 +849,14 @@ RT_ALLOC_NODE(RT_RADIX_TREE * tree, const uint8 kind, const RT_SIZE_CLASS size_c
 	switch (kind)
 	{
 		case RT_NODE_KIND_4:
-		case RT_NODE_KIND_16:
 			break;
+		case RT_NODE_KIND_16:
+			{
+RT_NODE_16 *n16 = (RT_NODE_16 *) node;
+
+memset(n16->chunks, 0, sizeof(n16->chunks));
+break;
+			}
 		case RT_NODE_KIND_48:
 			{
 RT_NODE_48 *n48 = (RT_NODE_48 *) node;
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 338e1d741d..4510f7c4cd 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -845,27 +845,25 @@ RT_ALLOC_NODE(RT_RADIX_TREE * tree, const uint8 kind, const RT_SIZE_CLASS size_c
 
 	/* initialize contents */
 
-	memset(node, 0, sizeof(RT_NODE));
 	switch (kind)
 	{
 		case RT_NODE_KIND_4:
+			memset(node, 0, offsetof(RT_NODE_4, children));
+			break;
 		case RT_NODE_KIND_16:
+			memset(node, 0, offsetof(RT_NODE_16, children));
 			break;
 		case RT_NODE_KIND_48:
 			{
 RT_NODE_48 *n48 = (RT_NODE_48 *) node;
 
-memset(n48->isset, 0, sizeof(n48->isset));
+memset(n48, 0, offsetof(RT_NODE_48, children));
 memset(n48->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n48->slot_idxs));
 break;
 			}
 		case RT_NODE_KIND_256:
-			{
-RT_NODE_256 *n256 = (RT_NODE_256 *) node;
-
-memset(n256->isset, 0, sizeof(n256->isset));
-break;
-			}
+			memset(node, 0, offsetof(RT_NODE_256, children));
+			break;
 		default:
 			pg_unreachable();
 	}


Re: IPC::Run accepts bug reports

2024-06-19 Thread Noah Misch
On Tue, Jun 18, 2024 at 08:07:27PM -0700, Andres Freund wrote:
> > > > 1) Sometimes hangs hard on windows if started processes have not been 
> > > > shut
> > > >down before script exits.

> It reliably reproduces if I comment out
> the lines below
>   # explicitly shut down psql instances gracefully - to avoid hangs
>   # or worse on windows
> in 021_row_visibility.pl
> 
> The logfile ends in
> Warning: unable to close filehandle GEN25 properly: Bad file descriptor 
> during global destruction.
> Warning: unable to close filehandle GEN20 properly: Bad file descriptor 
> during global destruction.
> 
> 
> Even if I cancel the test, I can't rerun it because due to a leftover psql
> a) a new temp install can't be made (could be solved by rm -rf)
> b) the test's logfile can't be removed (couldn't even rename the directory)
> 
> The psql instance needs to be found and terminated first.

Thanks for that recipe.  I've put that in my queue to fix.

On Tue, Jun 18, 2024 at 12:00:13PM -0700, Andres Freund wrote:
> On 2024-06-18 10:10:17 -0700, Noah Misch wrote:
> > On Mon, Jun 17, 2024 at 11:11:17AM -0700, Andres Freund wrote:
> > > 2) If a subprocess dies in an inopportune moment, IPC::Run dies with "ack
> > >Broken pipe:" (in _do_filters()). There's plenty reports of this on the
> > >list, and I've hit this several times personally. It seems to be timing
> > >dependent, I've encountered it after seemingly irrelevant ordering 
> > > changes.
> > > 
> > >I suspect I could create a reproducer with a bit of time.
> > 
> > I've seen that one.  If the harness has data to write to a child, the child
> > exiting before the write is one way to reach that.  Perhaps before exec(),
> > IPC::Run should do a non-blocking write from each pending IO.  That way, 
> > small
> > writes never experience the timing-dependent behavior.
> 
> I think the question is rather, why is ipc run choosing to die in this
> situation and can that be fixed?

With default signal handling, the process would die to SIGPIPE.  Since
PostgreSQL::Test ignores SIGPIPE, this happens instead.  The IPC::Run source
tree has no discussion of ignoring SIGPIPE, so I bet it didn't get a conscious
decision.  Perhaps it can do better.
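For what it's worth, the mechanics described above are easy to see outside of IPC::Run with a few lines of C: once SIGPIPE is ignored, writing to a pipe whose reader has gone away fails with EPIPE instead of killing the process, and it becomes the writer's choice how to react (which is presumably where a library-level die such as the "ack Broken pipe" path comes in).  This is a standalone demonstration of the generic behaviour, not IPC::Run code:

	#include <errno.h>
	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int
	main(void)
	{
		int		fds[2];

		signal(SIGPIPE, SIG_IGN);	/* what PostgreSQL::Test effectively does */

		pipe(fds);
		close(fds[0]);				/* the reader goes away, like an exited child */

		if (write(fds[1], "x", 1) < 0)
			printf("write failed: %s\n", strerror(errno));	/* EPIPE; we survive */

		/* with default SIGPIPE handling, the write above would have killed us */
		return 0;
	}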




Re: Remove dependence on integer wrapping

2024-06-19 Thread Joseph Koshakow
On Thu, Jun 13, 2024 at 10:56 PM Joseph Koshakow  wrote:

On Thu, Jun 13, 2024 at 10:48 PM Joseph Koshakow wrote:
>I've attached
>v4-0002-Handle-overflow-in-money-arithmetic.patch which adds some
>overflow checks and tests. I didn't address the float multiplication
>because I didn't see any helper methods in int.h. I did see some
>useful helpers in float.h, but they raise an error directly instead
>of returning a bool. Would those be appropriate for use with the
>money type? If not I can refactor out the inner parts into a new
>method that returns a bool.

>v4-0001-Remove-dependence-on-integer-wrapping.patch is unchanged, I
>just incremented the version number.

I added overflow handling for float arithmetic to the `money` type.
v6-0002-Handle-overflow-in-money-arithmetic.patch is ready for review.

v6-0001-Remove-dependence-on-integer-wrapping.patch is unchanged, I
just incremented the version number.

Thanks,
Joe Koshakow
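For readers without the patch applied, the new helpers referenced below (pg_abs_s64(), pg_neg_s64_overflow(), and friends added to src/include/common/int.h) amount to something like the following sketch; the exact definitions in the attached patch may differ:

	/* Sketch only; see the attached patch for the real definitions. */
	static inline uint64
	pg_abs_s64(int64 a)
	{
		/* -PG_INT64_MIN does not fit in int64, so handle it without negating */
		if (unlikely(a == PG_INT64_MIN))
			return (uint64) PG_INT64_MAX + 1;
		return (uint64) (a < 0 ? -a : a);
	}

	/*
	 * Negate a, storing the result in *result.  Returns true on overflow,
	 * i.e. when a is PG_INT64_MIN and the negation is not representable.
	 */
	static inline bool
	pg_neg_s64_overflow(int64 a, int64 *result)
	{
	#if defined(HAVE__BUILTIN_OP_OVERFLOW)
		return __builtin_sub_overflow(0, a, result);
	#else
		if (unlikely(a == PG_INT64_MIN))
		{
			*result = 0;
			return true;
		}
		*result = -a;
		return false;
	#endif
	}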
From 6eec604618ee76227ee33fcddcc121d9915ff0ab Mon Sep 17 00:00:00 2001
From: Joseph Koshakow 
Date: Sat, 8 Jun 2024 22:16:46 -0400
Subject: [PATCH 1/2] Remove dependence on integer wrapping

This commit updates various parts of the code to no longer rely on
integer wrapping for correctness. Not all compilers support -fwrapv, so
it's best not to rely on it.
---
 src/backend/utils/adt/cash.c   |   7 +-
 src/backend/utils/adt/numeric.c|   5 +-
 src/backend/utils/adt/numutils.c   |  34 ---
 src/backend/utils/adt/timestamp.c  |  28 +-
 src/include/common/int.h   | 105 +
 src/interfaces/ecpg/pgtypeslib/timestamp.c |  11 +--
 src/test/regress/expected/timestamp.out|  13 +++
 src/test/regress/expected/timestamptz.out  |  13 +++
 src/test/regress/sql/timestamp.sql |   4 +
 src/test/regress/sql/timestamptz.sql   |   4 +
 10 files changed, 169 insertions(+), 55 deletions(-)

diff --git a/src/backend/utils/adt/cash.c b/src/backend/utils/adt/cash.c
index 32fbad2f57..f6f095a57b 100644
--- a/src/backend/utils/adt/cash.c
+++ b/src/backend/utils/adt/cash.c
@@ -352,8 +352,11 @@ cash_out(PG_FUNCTION_ARGS)
 
 	if (value < 0)
 	{
-		/* make the amount positive for digit-reconstruction loop */
-		value = -value;
+		/*
+		 * make the amount positive for digit-reconstruction loop, we can
+		 * leave INT64_MIN unchanged
+		 */
+		pg_neg_s64_overflow(value, &value);
 		/* set up formatting data */
 		signsymbol = (*lconvert->negative_sign != '\0') ? lconvert->negative_sign : "-";
 		sign_posn = lconvert->n_sign_posn;
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5510a203b0..4ea2d9b0b4 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8110,15 +8110,14 @@ int64_to_numericvar(int64 val, NumericVar *var)
 
 	/* int64 can require at most 19 decimal digits; add one for safety */
 	alloc_var(var, 20 / DEC_DIGITS);
+	uval = pg_abs_s64(val);
 	if (val < 0)
 	{
 		var->sign = NUMERIC_NEG;
-		uval = -val;
 	}
 	else
 	{
 		var->sign = NUMERIC_POS;
-		uval = val;
 	}
 	var->dscale = 0;
 	if (val == 0)
@@ -11222,7 +11221,7 @@ power_var_int(const NumericVar *base, int exp, int exp_dscale,
 	 * Now we can proceed with the multiplications.
 	 */
 	neg = (exp < 0);
-	mask = abs(exp);
+	mask = pg_abs_s32(exp);
 
	init_var(&base_prod);
	set_var_from_var(base, &base_prod);
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index adc1e8a4cb..a3d7d6bf01 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 
+#include "common/int.h"
 #include "port/pg_bitutils.h"
 #include "utils/builtins.h"
 
@@ -131,6 +132,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
 	uint16		tmp = 0;
 	bool		neg = false;
 	unsigned char digit;
+	int16		result;
 
 	/*
 	 * The majority of cases are likely to be base-10 digits without any
@@ -190,10 +192,9 @@ pg_strtoint16_safe(const char *s, Node *escontext)
 
 	if (neg)
 	{
-		/* check the negative equivalent will fit without overflowing */
-		if (unlikely(tmp > (uint16) (-(PG_INT16_MIN + 1)) + 1))
+		if (pg_neg_u16_overflow(tmp, &result))
 			goto out_of_range;
-		return -((int16) tmp);
+		return result;
 	}
 
 	if (unlikely(tmp > PG_INT16_MAX))
@@ -333,10 +334,9 @@ slow:
 
 	if (neg)
 	{
-		/* check the negative equivalent will fit without overflowing */
-		if (tmp > (uint16) (-(PG_INT16_MIN + 1)) + 1)
+		if (pg_neg_u16_overflow(tmp, &result))
 			goto out_of_range;
-		return -((int16) tmp);
+		return result;
 	}
 
 	if (tmp > PG_INT16_MAX)
@@ -393,6 +393,7 @@ pg_strtoint32_safe(const char *s, Node *escontext)
 	uint32		tmp = 0;
 	bool		neg = false;
 	unsigned char digit;
+	int32		result;
 
 	/*
 	 * The majority of cases are likely to be base-10 digits without any
@@ -452,10 +453,9 @@ pg_strtoint32_safe(const char *s, Node *escontext)
 
 	if (neg)
 	{
-		/* check the 

Re: fix pg_upgrade comment

2024-06-19 Thread Nathan Bossart
On Tue, Jun 18, 2024 at 10:20:06PM +0200, Daniel Gustafsson wrote:
> Nice catch, +1 for committing. 

Committed.

-- 
nathan




Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Tom Lane
Tomas Vondra  writes:
> On 6/19/24 17:11, Tom Lane wrote:
>> What's the test scenario that triggers this?

> I haven't investigated that yet, I just ran "make check".

I've reproduced what looks like about the same thing on
my Pi 4 using Fedora 38: just run "make installcheck-parallel"
under valgrind, and kaboom.  Definitely needs investigation.

regards, tom lane

==00:00:04:08.988 16548== Memcheck, a memory error detector
==00:00:04:08.988 16548== Copyright (C) 2002-2022, and GNU GPL'd, by Julian 
Seward et al.
==00:00:04:08.988 16548== Using Valgrind-3.22.0 and LibVEX; rerun with -h for 
copyright info
==00:00:04:08.988 16548== Command: postgres -F
==00:00:04:08.988 16548== Parent PID: 16278
==00:00:04:08.988 16548== 
==00:00:04:13.280 16548== Conditional jump or move depends on uninitialised 
value(s)
==00:00:04:13.280 16548==at 0x4AD2CC: local_ts_node_16_search_eq 
(radixtree.h:1025)
==00:00:04:13.280 16548==by 0x4AD2CC: local_ts_node_search 
(radixtree.h:1057)
==00:00:04:13.280 16548==by 0x4AF173: local_ts_get_slot_recursive 
(radixtree.h:1667)
==00:00:04:13.280 16548==by 0x4AF173: local_ts_set (radixtree.h:1744)
==00:00:04:13.280 16548==by 0x4AF173: TidStoreSetBlockOffsets 
(tidstore.c:427)
==00:00:04:13.280 16548==by 0x4FCEEF: dead_items_add (vacuumlazy.c:2892)
==00:00:04:13.280 16548==by 0x4FE5EF: lazy_scan_prune (vacuumlazy.c:1500)
==00:00:04:13.280 16548==by 0x4FE5EF: lazy_scan_heap (vacuumlazy.c:975)
==00:00:04:13.280 16548==by 0x4FE5EF: heap_vacuum_rel (vacuumlazy.c:497)
==00:00:04:13.280 16548==by 0x681527: table_relation_vacuum (tableam.h:1720)
==00:00:04:13.280 16548==by 0x681527: vacuum_rel (vacuum.c:2210)
==00:00:04:13.281 16548==by 0x682957: vacuum (vacuum.c:622)
==00:00:04:13.281 16548==by 0x7B7E77: autovacuum_do_vac_analyze 
(autovacuum.c:3100)
==00:00:04:13.281 16548==by 0x7B7E77: do_autovacuum (autovacuum.c:2417)
==00:00:04:13.281 16548==by 0x7B830B: AutoVacWorkerMain (autovacuum.c:1569)
==00:00:04:13.281 16548==by 0x7BB937: postmaster_child_launch 
(launch_backend.c:265)
==00:00:04:13.281 16548==by 0x7BD64F: StartChildProcess (postmaster.c:3928)
==00:00:04:13.281 16548==by 0x7BF9FB: StartAutovacuumWorker 
(postmaster.c:3997)
==00:00:04:13.281 16548==by 0x7BF9FB: process_pm_pmsignal 
(postmaster.c:3809)
==00:00:04:13.281 16548==by 0x7BF9FB: ServerLoop.isra.0 (postmaster.c:1667)
==00:00:04:13.281 16548==by 0x7C0EBF: PostmasterMain (postmaster.c:1372)
==00:00:04:13.281 16548== 
==00:00:04:13.673 16548== Conditional jump or move depends on uninitialised 
value(s)
==00:00:04:13.673 16548==at 0x4AD2CC: local_ts_node_16_search_eq 
(radixtree.h:1025)
==00:00:04:13.673 16548==by 0x4AD2CC: local_ts_node_search 
(radixtree.h:1057)
==00:00:04:13.673 16548==by 0x4AFD4F: local_ts_find (radixtree.h:)
==00:00:04:13.673 16548==by 0x4AFD4F: TidStoreIsMember (tidstore.c:443)
==00:00:04:13.673 16548==by 0x5110D7: btvacuumpage (nbtree.c:1235)
==00:00:04:13.673 16548==by 0x51171B: btvacuumscan (nbtree.c:1023)
==00:00:04:13.673 16548==by 0x51185B: btbulkdelete (nbtree.c:824)
==00:00:04:13.673 16548==by 0x683D3B: vac_bulkdel_one_index (vacuum.c:2498)
==00:00:04:13.673 16548==by 0x4FDA9B: lazy_vacuum_one_index 
(vacuumlazy.c:2443)
==00:00:04:13.673 16548==by 0x4FDA9B: lazy_vacuum_all_indexes 
(vacuumlazy.c:2026)
==00:00:04:13.673 16548==by 0x4FDA9B: lazy_vacuum (vacuumlazy.c:1944)
==00:00:04:13.673 16548==by 0x4FE7F3: lazy_scan_heap (vacuumlazy.c:1050)
==00:00:04:13.674 16548==by 0x4FE7F3: heap_vacuum_rel (vacuumlazy.c:497)
==00:00:04:13.674 16548==by 0x681527: table_relation_vacuum (tableam.h:1720)
==00:00:04:13.674 16548==by 0x681527: vacuum_rel (vacuum.c:2210)
==00:00:04:13.674 16548==by 0x682957: vacuum (vacuum.c:622)
==00:00:04:13.674 16548==by 0x7B7E77: autovacuum_do_vac_analyze 
(autovacuum.c:3100)
==00:00:04:13.674 16548==by 0x7B7E77: do_autovacuum (autovacuum.c:2417)
==00:00:04:13.674 16548==by 0x7B830B: AutoVacWorkerMain (autovacuum.c:1569)
==00:00:04:13.674 16548== 
==00:00:04:13.681 16548== Conditional jump or move depends on uninitialised 
value(s)
==00:00:04:13.681 16548==at 0x4AD2CC: local_ts_node_16_search_eq 
(radixtree.h:1025)
==00:00:04:13.681 16548==by 0x4AD2CC: local_ts_node_search 
(radixtree.h:1057)
==00:00:04:13.681 16548==by 0x4AFD4F: local_ts_find (radixtree.h:)
==00:00:04:13.681 16548==by 0x4AFD4F: TidStoreIsMember (tidstore.c:443)
==00:00:04:13.681 16548==by 0x51126B: btreevacuumposting (nbtree.c:1405)
==00:00:04:13.681 16548==by 0x51126B: btvacuumpage (nbtree.c:1249)
==00:00:04:13.681 16548==by 0x51171B: btvacuumscan (nbtree.c:1023)
==00:00:04:13.681 16548==by 0x51185B: btbulkdelete (nbtree.c:824)
==00:00:04:13.681 16548==by 0x683D3B: vac_bulkdel_one_index (vacuum.c:2498)
==00:00:04:13.681 16548==by 0x4FDA9B: lazy_vacuum_one_index 

Re: State of pg_createsubscriber

2024-06-19 Thread Euler Taveira
On Wed, Jun 19, 2024, at 12:51 AM, Amit Kapila wrote:
> Ideally, invalidated slots shouldn't create any problems but it is
> better that we discuss this also as a separate problem in new thread.

Ok.

> > Do you have any other scenarios in mind?
> >
> 
> No, so we have three issues to discuss (a) some unwarranted messages
> in --dry-run mode; (b) whether to remove pre-existing subscriptions
> during conversion; (c) whether to remove pre-existing replication
> slots.
> 
> Would you like to start three new threads for each of these or would
> you like Kuroda-San or me to start some or all?

I will open new threads soon if you don't.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-06-19 Thread Alexander Korotkov
On Fri, Jun 14, 2024 at 3:46 PM Alexander Korotkov  wrote:
> On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan
>  wrote:
> >
> > Hi, Alexander, Here, I made some improvements according to your
> > discussion with Heikki.
> >
> > On 2024-04-11 18:09, Alexander Korotkov wrote:
> > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas 
> > > wrote:
> > >> In a nutshell, it's possible for the loop in WaitForLSN to exit
> > >> without
> > >> cleaning up the process from the heap. I was able to hit that by
> > >> adding
> > >> a delay after the addLSNWaiter() call:
> > >>
> > >> > TRAP: failed Assert("!procInfo->inHeap"), File: 
> > >> > "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
> > >> > postgres: heikki postgres [local] 
> > >> > CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
> > >> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
> > >> > postgres: heikki postgres [local] 
> > >> > CALL(WaitForLSN+0x139)[0x55da1f2052cc]
> > >> > postgres: heikki postgres [local] 
> > >> > CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
> > >> > postgres: heikki postgres [local] 
> > >> > CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
> > >> > postgres: heikki postgres [local] 
> > >> > CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]
> > >>
> > >> I think there's a similar race condition if the timeout is reached at
> > >> the same time that the startup process wakes up the process.
> > >
> > > Thank you for catching this.  I think WaitForLSN() just needs to call
> > > deleteLSNWaiter() unconditionally after exit from the loop.
> >
> > Fix and add injection point test on this race condition.
> >
> > > On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas 
> > > wrote:
> > >> The docs could use some-copy-editing, but just to point out one issue:
> > >>
> > >> > There are also procedures to control the progress of recovery.
> > >>
> > >> That's copy-pasted from an earlier sentence at the table that lists
> > >> functions like pg_promote(), pg_wal_replay_pause(), and
> > >> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control
> > >> the
> > >> progress of recovery like those functions do, it only causes the
> > >> calling
> > >> backend to wait.
> >
> > Fix documentation and add extra tests on multi-standby replication
> > and cascade replication.
>
> Thank you for the revised patch.
>
> I see a couple of items which are not addressed in this revision.
>  * As Heikki pointed out, it's currently not possible in one round
> trip to call pg_wal_replay_wait() and do other work.  The attached
> patch addresses this.  It relaxes the requirement.  Now, it's not
> necessary to be in an atomic context.  It's only necessary to have no
> active snapshot.  This new requirement works for me so far.  I
> would appreciate feedback on this.
>  * As Alvaro pointed out, the multiple waiters case isn't covered by the
> test suite.  That leads to no coverage of some code paths.  The attached
> patch addresses this by adding a test case with multiple waiters.
>
> The rest looks good to me.

Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.

1. It's not clear why this test needs node_standby2 at all.  It seems useless.
2. The target LSN is set to pg_current_wal_insert_lsn() + 1.  This
location seems to be unachievable in this test.  So, it's not clear
what race condition this test could potentially detect.
3. I think it would make sense to check for the race condition
reported by Heikki.  That is to insert the injection point at the
beginning of WaitLSNSetLatches().

Links.
1. 
https://www.postgresql.org/message-id/flat/CAPpHfdvGRssjqwX1%2Bidm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw%40mail.gmail.com#557900e860457a9e24256c93a2ad4920

--
Regards,
Alexander Korotkov
Supabase




Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-06-19 Thread Alexander Lakhin

Hello Heikki,

11.03.2024 10:09, Heikki Linnakangas wrote:

On 10/03/2024 22:59, Thomas Munro wrote:

On Mon, Mar 11, 2024 at 9:30 AM Heikki Linnakangas  wrote:

Barring objections, I'll commit the attached.


+1


Pushed, thanks!


Please look at a new anomaly that I've discovered in master.

Starting from af0e7deb4, the following script:
numjobs=80
for ((i=1;i<=50;i++)); do
echo "I $i"

for ((j=1;j<=numjobs;j++)); do createdb db$j; done

for ((j=1;j<=numjobs;j++)); do
echo "
VACUUM FULL pg_class;
REINDEX INDEX pg_class_oid_index;
" | psql -d db$j >/dev/null 2>&1 &

echo "
CREATE TABLE tbl1 (t text);
DROP TABLE tbl1;
" | psql -d db$j >/dev/null 2>&1 &
done
wait

grep 'was terminated' server.log && break;
for ((j=1;j<=numjobs;j++)); do dropdb db$j; done

done

triggers a segfault:
2024-06-19 19:22:49.009 UTC [1607210:6] LOG:  server process (PID 1607671) was terminated by signal 11: Segmentation fault

with the following stack trace:
Core was generated by `postgres: law db50 [local] CREATE TABLE  
 '.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/1607671' in core file too small.
#0  0x55d04cb8232e in RelationReloadNailed (relation=0x7f7d0a1b1fd8) at relcache.c:2415
2415            relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
(gdb) bt
#0  0x55d04cb8232e in RelationReloadNailed (relation=0x7f7d0a1b1fd8) at 
relcache.c:2415
#1  0x55d04cb8278c in RelationClearRelation (relation=0x7f7d0a1b1fd8, 
rebuild=true) at relcache.c:2560
#2  0x55d04cb834e9 in RelationCacheInvalidate (debug_discard=false) at 
relcache.c:3048
#3  0x55d04cb72e3c in InvalidateSystemCachesExtended (debug_discard=false) 
at inval.c:680
#4  0x55d04cb73190 in InvalidateSystemCaches () at inval.c:794
#5  0x55d04c9754ad in ReceiveSharedInvalidMessages (
    invalFunction=0x55d04cb72eee ,
    resetFunction=0x55d04cb7317e ) at sinval.c:105
#6  0x55d04cb731b4 in AcceptInvalidationMessages () at inval.c:808
#7  0x55d04c97d2ed in LockRelationOid (relid=2662, lockmode=1) at lmgr.c:136
#8  0x55d04c404b1f in relation_open (relationId=2662, lockmode=1) at 
relation.c:55
#9  0x55d04c47941e in index_open (relationId=2662, lockmode=1) at 
indexam.c:137
#10 0x55d04c4787c2 in systable_beginscan (heapRelation=0x7f7d0a1b1fd8, 
indexId=2662, indexOK=true, snapshot=0x0,
    nkeys=1, key=0x7ffd456a8570) at genam.c:396
#11 0x55d04cb7e93d in ScanPgRelation (targetRelId=3466, indexOK=true, 
force_non_historic=false) at relcache.c:381
#12 0x55d04cb7fe15 in RelationBuildDesc (targetRelId=3466, insertIt=true) 
at relcache.c:1093
#13 0x55d04cb81c93 in RelationIdGetRelation (relationId=3466) at 
relcache.c:2108
#14 0x55d04c404b29 in relation_open (relationId=3466, lockmode=1) at 
relation.c:58
#15 0x55d04cb720a6 in BuildEventTriggerCache () at evtcache.c:129
#16 0x55d04cb71f6a in EventCacheLookup (event=EVT_SQLDrop) at evtcache.c:68
#17 0x55d04c61c037 in trackDroppedObjectsNeeded () at event_trigger.c:1342
#18 0x55d04c61bf02 in EventTriggerBeginCompleteQuery () at 
event_trigger.c:1284
#19 0x55d04c9ac744 in ProcessUtilitySlow (pstate=0x55d04d04aca0, 
pstmt=0x55d04d021540,
    queryString=0x55d04d020830 "CREATE TABLE tbl1 (t text);", 
context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    queryEnv=0x0, dest=0x55d04d021800, qc=0x7ffd456a8fb0) at utility.c:1107
#20 0x55d04c9ac64d in standard_ProcessUtility (pstmt=0x55d04d021540,
    queryString=0x55d04d020830 "CREATE TABLE tbl1 (t text);", 
readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL,
    params=0x0, queryEnv=0x0, dest=0x55d04d021800, qc=0x7ffd456a8fb0) at 
utility.c:1067
#21 0x55d04c9ab54d in ProcessUtility (pstmt=0x55d04d021540,
    queryString=0x55d04d020830 "CREATE TABLE tbl1 (t text);", 
readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL,
    params=0x0, queryEnv=0x0, dest=0x55d04d021800, qc=0x7ffd456a8fb0) at 
utility.c:523
#22 0x55d04c9a9dc6 in PortalRunUtility (portal=0x55d04d0c1020, 
pstmt=0x55d04d021540, isTopLevel=true,
    setHoldSnapshot=false, dest=0x55d04d021800, qc=0x7ffd456a8fb0) at 
pquery.c:1158
#23 0x55d04c9aa03d in PortalRunMulti (portal=0x55d04d0c1020, 
isTopLevel=true, setHoldSnapshot=false,
    dest=0x55d04d021800, altdest=0x55d04d021800, qc=0x7ffd456a8fb0) at 
pquery.c:1315
#24 0x55d04c9a9487 in PortalRun (portal=0x55d04d0c1020, 
count=9223372036854775807, isTopLevel=true, run_once=true,
    dest=0x55d04d021800, altdest=0x55d04d021800, qc=0x7ffd456a8fb0) at 
pquery.c:791
#25 0x55d04c9a2150 in exec_simple_query (query_string=0x55d04d020830 "CREATE 
TABLE tbl1 (t text);")
    at postgres.c:1273
#26 0x55d04c9a71f5 in PostgresMain (dbname=0x55d04d05dbf0 "db50", 
username=0x55d04d05dbd8 "law") at postgres.c:4675
#27 0x55d04c8bd8c8 in BackendRun (port=0x55d04d0540c0) at postmaster.c:4475
#28 0x55d04c8bcedd in BackendStartup (port=0x55d04d0540c0) at 
postmaster.c:4151
#29 

Re: Missing docs for new enable_group_by_reordering GUC

2024-06-19 Thread Alexander Korotkov
On Wed, Jun 19, 2024 at 6:02 PM Pavel Borisov  wrote:
> On Wed, 19 Jun 2024 at 05:27, Alexander Korotkov  wrote:
>>
>> On Tue, Jun 18, 2024 at 4:14 PM Pavel Borisov  wrote:
>> >> Controls if the query planner will produce a plan which will provide 
>> >> GROUP BY keys presorted in the order of keys of a 
>> >> child node of the plan, such as an index scan. When disabled, the query 
>> >> planner will produce a plan with GROUP BY keys only 
>> >> reordered to match
>> >> the ORDER BY clause, if any. When enabled, the planner 
>> >> will try to produce a more efficient plan. The default value is on.
> A correction of myself: presorted -> sorted, reordered -> sorted
>>
>> Thank you for your review.  I think all of this make sense.  Please,
>> check the revised patch attached.
>
> To me patch v3 looks good.

Ok, thank you. I'm going to push this if no objections.

--
Regards,
Alexander Korotkov
Supabase




Re: remove check hooks for GUCs that contribute to MaxBackends

2024-06-19 Thread Nathan Bossart
On Wed, Jun 19, 2024 at 03:14:16PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> The attached patch removes these hooks and enhances the error message to
>> look like this:
> 
>>  FATAL:  too many backends configured
>>  DETAIL:  "max_connections" (262100) plus "autovacuum_max_workers" (3) 
>> plus "max_worker_processes" (8) plus "max_wal_senders" (1) must be less 
>> than 262142.
> 
> BTW, I suggest writing it as "too many server processes configured",
> or perhaps "too many server processes required".  "Backend" is too
> much of an insider term.

Will do, thanks for reviewing.

-- 
nathan




Re: remove check hooks for GUCs that contribute to MaxBackends

2024-06-19 Thread Tom Lane
Nathan Bossart  writes:
> The attached patch removes these hooks and enhances the error message to
> look like this:

>   FATAL:  too many backends configured
>   DETAIL:  "max_connections" (262100) plus "autovacuum_max_workers" (3) 
> plus "max_worker_processes" (8) plus "max_wal_senders" (1) must be less 
> than 262142.

BTW, I suggest writing it as "too many server processes configured",
or perhaps "too many server processes required".  "Backend" is too
much of an insider term.

regards, tom lane




Re: remove check hooks for GUCs that contribute to MaxBackends

2024-06-19 Thread Tom Lane
Nathan Bossart  writes:
> While working on an idea from another thread [0], I noticed that each of
> max_connections, max_worker_processes, autovacuum_max_workers, and
> max_wal_senders has a check hook that verifies the sum of those GUCs does
> not exceed a certain value.  Then, in InitializeMaxBackends(), we do the
> same check once more.  Not only do the check hooks seem redundant, but I
> think they might sometimes be inaccurate since some values might not yet be
> initialized.

Yeah, these per-variable checks are inherently bogus.  If we can get
rid of them and make the net user experience actually better, that's a
win-win.

It seems easier to do for these because they can't change after server
start, so there can be one well-defined time to apply the consistency
check.  IIRC, we have some similar issues in other hooks for variables
that aren't PGC_POSTMASTER, so it's harder to see how we might get rid
of their cross-checks.  That doesn't make them less bogus though.

regards, tom lane




Re: Extension security improvement: Add support for extensions with an owned schema

2024-06-19 Thread Jelte Fennema-Nio
On Wed, 19 Jun 2024 at 19:55, Tom Lane  wrote:
>
> Jelte Fennema-Nio  writes:
> > On Wed, 19 Jun 2024 at 17:28, Robert Haas  wrote:
> >> I have a feeling that this might be pretty annoying to implement, and
> >> if that is true, then never mind.
>
> > Based on a quick look it's not trivial, but also not super bad.
> > Basically it seems like in src/backend/catalog/namespace.c, every time
> > we loop over activeSearchPath and CurrentExtensionObject is set, then
> > we should skip any item that's not stored in pg_catalog, unless
> > there's a DEPENDENCY_EXTENSION pg_depend entry for the item (and that
> > pg_depend entry references the extension or the requires list).
>
> We could change the lookup rules that apply during execution of
> an extension script, but we already restrict search_path at that
> time so I'm not sure how much further this'd move the goalposts.

The point I tried to make in my first email is that this restricted
search_path you mention is not very helpful at preventing privilege
escalations, since it's often possible for a non-superuser to create
functions in one of the schemas in this search_path, e.g. by having
the non-superuser first create the schema & create some functions in
it, and then asking the DBA/control plane to create the extension in
that schema.

My patch tries to address that problem by creating the schema of the
extension during extension creation, and failing if it already exists,
thus implicitly ensuring that a non-superuser cannot mess with the
schema.

The proposal from Robert tries to instead address this by changing the
lookup rules during execution of an extension script to be more strict
than they would be outside of it (i.e. even if a function/operator
matches search_path it might still not be picked).

> The *real* problem IMO is that if you create a PL function or
> (old-style) SQL function within an extension, execution of that
> function is not similarly protected.

That's definitely a big problem too, and that's the problem that [1]
tries to fix. But first the lookup in extension scripts would need to
be made secure, because it doesn't seem very useful (security wise) to
use the same lookup mechanism in functions as we do in extension
scripts, if the lookup in extension scripts is not secure in the first
place. I think the method of making the lookup secure in my patch
would transfer over well, because it adds a way for a safe search_path
to exist, so all that's needed is for the PL function to use that
search_path. Robert's proposal would be more difficult I think. When
executing a PL function from an extension we'd need to use the same
changed lookup rules that we'd use during the extension script of that
extension. I think that should be possible, but it's definitely more
involved.

[1]: 
https://www.postgresql.org/message-id/flat/CAE9k0P%253DFcZ%253Dao3ZpEq29BveF%252B%253D27KBcRT2HFowJxoNCv02dHLA%2540mail.gmail.com




remove check hooks for GUCs that contribute to MaxBackends

2024-06-19 Thread Nathan Bossart
While working on an idea from another thread [0], I noticed that each of
max_connections, max_worker_processes, autovacuum_max_workers, and
max_wal_senders has a check hook that verifies the sum of those GUCs does
not exceed a certain value.  Then, in InitializeMaxBackends(), we do the
same check once more.  Not only do the check hooks seem redundant, but I
think they might sometimes be inaccurate since some values might not yet be
initialized.  Furthermore, the error message is not exactly the most
descriptive:

$ pg_ctl -D . start -o "-c max_connections=262100 -c max_wal_senders=1"

FATAL:  invalid value for parameter "max_wal_senders": 1

The attached patch removes these hooks and enhances the error message to
look like this:

FATAL:  too many backends configured
DETAIL:  "max_connections" (262100) plus "autovacuum_max_workers" (3) plus "max_worker_processes" (8) plus "max_wal_senders" (1) must be less than 262142.

The downside of this change is that server startup progresses a little
further before it fails, but that might not be too concerning given this
_should_ be a relatively rare occurrence.

Thoughts?

[0] https://postgr.es/m/20240618213331.ef2spg3nasksisbi%40awork3.anarazel.de

-- 
nathan
>From 2ab34581879d2122f03be0e3f9d0d15edb501a7c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 19 Jun 2024 13:39:18 -0500
Subject: [PATCH v1 1/1] remove check hooks for GUCs that contribute to
 MaxBackends

---
 src/backend/utils/init/postinit.c   | 57 -
 src/backend/utils/misc/guc_tables.c |  8 ++--
 src/include/utils/guc_hooks.h   |  6 ---
 3 files changed, 11 insertions(+), 60 deletions(-)

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 0805398e24..8a629982c4 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -580,57 +580,14 @@ InitializeMaxBackends(void)
MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
max_worker_processes + max_wal_senders;
 
-   /* internal error because the values were all checked previously */
if (MaxBackends > MAX_BACKENDS)
-   elog(ERROR, "too many backends configured");
-}
-
-/*
- * GUC check_hook for max_connections
- */
-bool
-check_max_connections(int *newval, void **extra, GucSource source)
-{
-   if (*newval + autovacuum_max_workers + 1 +
-   max_worker_processes + max_wal_senders > MAX_BACKENDS)
-   return false;
-   return true;
-}
-
-/*
- * GUC check_hook for autovacuum_max_workers
- */
-bool
-check_autovacuum_max_workers(int *newval, void **extra, GucSource source)
-{
-   if (MaxConnections + *newval + 1 +
-   max_worker_processes + max_wal_senders > MAX_BACKENDS)
-   return false;
-   return true;
-}
-
-/*
- * GUC check_hook for max_worker_processes
- */
-bool
-check_max_worker_processes(int *newval, void **extra, GucSource source)
-{
-   if (MaxConnections + autovacuum_max_workers + 1 +
-   *newval + max_wal_senders > MAX_BACKENDS)
-   return false;
-   return true;
-}
-
-/*
- * GUC check_hook for max_wal_senders
- */
-bool
-check_max_wal_senders(int *newval, void **extra, GucSource source)
-{
-   if (MaxConnections + autovacuum_max_workers + 1 +
-   max_worker_processes + *newval > MAX_BACKENDS)
-   return false;
-   return true;
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("too many backends configured"),
+errdetail("\"max_connections\" (%d) plus 
\"autovacuum_max_workers\" (%d) plus \"max_worker_processes\" (%d) plus 
\"max_wal_senders\" (%d) must be less than %d.",
+  MaxConnections, 
autovacuum_max_workers,
+  max_worker_processes, 
max_wal_senders,
+  MAX_BACKENDS - 1)));
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c 
b/src/backend/utils/misc/guc_tables.c
index 46c258be28..07b575894d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2207,7 +2207,7 @@ struct config_int ConfigureNamesInt[] =
},
,
100, 1, MAX_BACKENDS,
-   check_max_connections, NULL, NULL
+   NULL, NULL, NULL
},
 
{
@@ -2923,7 +2923,7 @@ struct config_int ConfigureNamesInt[] =
},
_wal_senders,
10, 0, MAX_BACKENDS,
-   check_max_wal_senders, NULL, NULL
+   NULL, NULL, NULL
},
 
{
@@ -3153,7 +3153,7 @@ struct config_int ConfigureNamesInt[] =
},
_worker_processes,
8, 0, MAX_BACKENDS,
-   

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-06-19 Thread Melanie Plageman
Thanks for taking a look at my patches, Álvaro!

On Wed, Jun 19, 2024 at 12:51 PM Alvaro Herrera  wrote:
>
> On 2024-Jun-14, Melanie Plageman wrote:
>
> > Subject: [PATCH v21 12/20] Update variable names in bitmap scan descriptors
> >
> > The previous commit which added BitmapTableScanDesc and
> > BitmapHeapScanDesc used the existing member names from TableScanDescData
> > and HeapScanDescData for diff clarity. This commit renames the members
> > -- in many cases by removing the rs_ prefix which is not relevant or
> > needed here.
>
> *Cough*  Why?  It makes grepping for struct members useless.  I'd rather
> keep these prefixes, as they allow easier code exploration.  (Sometimes
> when I need to search for uses of some field with a name that's too
> common, I add a prefix to the name and let the compiler guide me to
> them.  But that's a waste of time ...)

If we want to make it possible to use no tools and only manually grep
for struct members, that means we can never reuse struct member names.
Across a project of our size, that seems like a very serious
restriction. Adding prefixes in struct members makes it harder to read
code -- both because it makes the names longer and because people are
more prone to abbreviate the meaningful parts of the struct member
name to make the whole name shorter.

While I understand we as a project want to make it possible to hack on
Postgres without an IDE or a set of vim plugins a mile long, I also
think we have to make some compromises for readability. Most commonly
used text editors have LSP (language server protocol) support and
should allow for meaningful identification of the usages of a struct
member even if it has the same name as a member of another struct.

That being said, I'm not unreasonable. If we have decided we can not
reuse struct member names, I will change my patch.

- Melanie




Re: Report runtimes in pg_upgrade verbose mode

2024-06-19 Thread Daniel Gustafsson
> On 19 Jun 2024, at 17:09, Nathan Bossart  wrote:

> I've been using 'ts -i' as Peter suggested

Oh nice, I had forgotten about that one, thanks for the reminder!

--
Daniel Gustafsson





Re: BitmapHeapScan streaming read user and prelim refactoring

2024-06-19 Thread Melanie Plageman
On Wed, Jun 19, 2024 at 12:38 PM Tomas Vondra
 wrote:
>
>
>
> On 6/19/24 17:55, Melanie Plageman wrote:
> > On Tue, Jun 18, 2024 at 6:02 PM Tomas Vondra
> >  wrote:
>
> > From your v22b-0017-review.patch
> >
> > diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
> > index 036ef29e7d5..9c711ce0eb0 100644
> > --- a/src/include/access/relscan.h
> > +++ b/src/include/access/relscan.h
> > @@ -52,6 +52,13 @@ typedef struct TableScanDescData
> >  } TableScanDescData;
> >  typedef struct TableScanDescData *TableScanDesc;
> >
> > +/*
> > + * XXX I don't understand why we should have this special node if we
> > + * don't have special nodes for other scan types.
> >
> > In this case, up until the final commit (using the read stream
> > interface), there are six fields required by bitmap heap scan that are
> > not needed by any other user of HeapScanDescData. There are also
> > several members of HeapScanDescData that are not needed by bitmap heap
> > scans and all of the setup in initscan() for those fields is not
> > required for bitmap heap scans.
> >
> > Also, because the BitmapHeapScanDesc needs information like the
> > ParallelBitmapHeapState and prefetch_maximum (for use in prefetching),
> > the scan_begin() callback would have to take those as parameters. I
> > thought adding so much bitmap table scan-specific information to the
> > generic table scan callbacks was a bad idea.
> >
> > Once we add the read stream API code, the number of fields required
> > for bitmap heap scan that are in the scan descriptor goes down to
> > three. So, perhaps we could justify having that many bitmap heap
> > scan-specific fields in the HeapScanDescData.
> >
> > Though, I actually think we could start moving toward having
> > specialized scan descriptors if the requirements for that scan type
> > are appreciably different. I can't think of new code that would be
> > added to the HeapScanDescData that would have to be duplicated over to
> > specialized scan descriptors.
> >
> > With the final read stream state, I can see the argument for bloating
> > the HeapScanDescData with three extra members and avoiding making new
> > scan descriptors. But, for the intermediate patches which have all of
> > the bitmap prefetch members pushed down into the HeapScanDescData, I
> > think it is really not okay. Six members only for bitmap heap scans
> > and two bitmap-specific members to begin_scan() seems bad.
> >
> > What I thought we plan to do is commit the refactoring patches
> > sometime after the branch for 18 is cut and leave the final read
> > stream patch uncommitted so we can do performance testing on it. If
> > you think it is okay to have the six member bloated HeapScanDescData
> > in master until we push the read stream code, I am okay with removing
> > the BitmapTableScanDesc and BitmapHeapScanDesc.
> >
>
> I admit I don't have a very good idea what the ideal / desired state
> looks like. My comment is motivated solely by the feeling that it seems
> strange to have one struct serving most scan types, and then a special
> struct for one particular scan type ...

I see what you are saying. We could make BitmapTableScanDesc inherit
from TableScanDescData which would be similar to what we do with other
things like the executor scan nodes themselves. We would waste space
and LOC with initializing the unneeded members, but it might seem less
weird.
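
(For illustration only, here is a minimal, self-contained C sketch of that
first-member embedding pattern; the struct and field names below are invented
for the example and are not the actual patch:)

#include <stdio.h>

typedef struct TableScanDescData
{
    int         rs_nblocks;         /* stand-in for the shared scan state */
} TableScanDescData;

typedef struct BitmapHeapScanDescData
{
    TableScanDescData rs_base;      /* base struct must be the first member */
    int         exact_pages;        /* bitmap-only state lives after it */
    int         lossy_pages;
} BitmapHeapScanDescData;

int
main(void)
{
    BitmapHeapScanDescData bscan = {{42}, 0, 0};

    /* A pointer to the derived struct can be used wherever the base type is
     * expected, which is what lets the generic table AM callbacks stay
     * unchanged. */
    TableScanDescData *base = (TableScanDescData *) &bscan;

    printf("%d\n", base->rs_nblocks);   /* prints 42 */
    return 0;
}

The cost is exactly the one mentioned above: the embedded base struct still
has to be initialized even when most of its members are unused by bitmap
scans.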

Whether we want the specialized scan descriptors at all is probably
the bigger question, though.

The top level BitmapTableScanDesc is motivated by wanting fewer bitmap
table scan specific members passed to scan_begin(). And the
BitmapHeapScanDesc is motivated by this plus wanting to avoid bloating
the HeapScanDescData.

If you look at HEAD~1 (with my patches applied) and think you would
be okay with
1) the contents of the BitmapHeapScanDesc being in the HeapScanDescData and
2) the extra bitmap table scan-specific parameters in scan_begin_bm()
being passed to scan_begin()

then I will remove the specialized scan descriptors.

The final state (with the read stream) will still have three bitmap
heap scan-specific members in the HeapScanDescData.

Would it help if I do a version like this so you can see what it is like?

> > + * XXX Also, maybe this should do the naming convention with Data at
> > + * the end (mostly for consistency).
> > + */
> >  typedef struct BitmapTableScanDesc
> >  {
> >  Relationrs_rd;/* heap relation descriptor */
> >
> > I really want to move away from these Data typedefs. I find them so
> > confusing as a developer, but it's hard to justify ripping out the
> > existing ones because of code churn. If we add new scan descriptors, I
> > had really hoped to start using a different pattern.
> >
>
> Perhaps, I understand that. I'm not a huge fan of Data structs myself,
> but I'm not sure it's a great idea to do both things in the same area of
> code. That's guaranteed to be confusing for everyone ...
>
> If we want to move away 

Re: meson vs Cygwin

2024-06-19 Thread Andrew Dunstan



On 2024-02-13 Tu 7:00 AM, Marco Atzeri wrote:

On 22/11/2023 13:02, Andrew Dunstan wrote:


I've been trying to get meson working with Cygwin. On a fresh install 
(Cygwin 3.4.9, gcc 11.4.0, meson 1.0.2, ninja 1.11.1) I get a bunch 
of errors like this:


ERROR:  incompatible library 
"/home/andrew/bf/buildroot/HEAD/pgsql.build/tmp_install/home/andrew/bf/buildroot/HEAD/inst/lib/postgresql/plperl.dll": 
missing magic block


Similar things happen if I try to build with python.

I'm not getting the same on a configure/make build. Not sure what 
would be different.



cheers


andrew



Hi Andrew,
sorry for jumping on this request so late

how are you configuring the build ?




Sorry for not replying in turn :-(

I just got this error again. All I did was:

    meson setup build .

    meson compile -C build

    meson test -C build


I don't get the error if I build using

    ./configure --with-perl --with-python

    make world-bin

    make check-world


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?

2024-06-19 Thread Peter Geoghegan
On Wed, Jun 19, 2024 at 1:39 PM Alvaro Herrera  wrote:
> FWIW I don't think of HEAP_XMAX_INVALID as purely a hint.
> HEAP_XMAX_COMMITTED is a hint, for sure, as is HEAP_XMIN_COMMITTED on
> its own; but as far as I recall, the INVALID flags must persist once
> set.

Seems we disagree on some pretty fundamental things in this area, then.

> Consider the HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID combination,
> which we use to represent HEAP_XMIN_FROZEN; if that didn't persist, we'd
> have a pretty serious data corruption issue, because we don't reset the
> Xmin field when freezing the tuple.

That's definitely true, but that's a case where the setting happens during a
WAL-logged operation. It doesn't involve HEAP_XMAX_INVALID at all.

FWIW I also don't think that even HEAP_XMIN_INVALID should be
considered anything more than a hint when it appears on its own. Only
HEAP_XMIN_FROZEN (by which I mean HEAP_XMIN_COMMITTED +
HEAP_XMIN_INVALID) are non-hint xact status infomask bits.
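
(As a reference point, a tiny standalone illustration of that bit
combination; the flag values are as I recall them from
src/include/access/htup_details.h, so treat them as assumptions and check the
real header:)

#include <stdio.h>
#include <stdint.h>

#define HEAP_XMIN_COMMITTED 0x0100
#define HEAP_XMIN_INVALID   0x0200
#define HEAP_XMIN_FROZEN    (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)

static int
xmin_frozen(uint16_t infomask)
{
    /* both bits together mean "frozen"; either bit alone is only a hint */
    return (infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN;
}

int
main(void)
{
    printf("%d %d %d\n",
           xmin_frozen(HEAP_XMIN_COMMITTED),    /* 0: hint only */
           xmin_frozen(HEAP_XMIN_INVALID),      /* 0: hint only */
           xmin_frozen(HEAP_XMIN_FROZEN));      /* 1: frozen */
    return 0;
}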

It's slightly annoying that we have this HEAP_XMIN_FROZEN case where a
hint bit isn't just a hint, but that's just a historical detail. And I
suppose that the same thing can be said of HEAP_XMAX_IS_MULTI itself
(we have no other way of distinguishing a Multi from an Xid, so
clearly that also has to be treated as a persistent non-hint by everybody).

> So if we fail to keep the flag, the
> tuple is no longer frozen.  (My point here is that some infomask bits
> are hints, but not all of them are only hints.)  So XMAX_INVALID gives
> certainty that the Xmax value must not be read.

"Certainty" seems far too strong here.

> That is to say, I think
> there are (or there were) situations in which we set the bit but don't
> bother to reset the actual Xmax field.

I'm sure that that's true, but that doesn't seem at all equivalent to
what you said about XMAX_INVALID "giving certainty" about the tuple.

> We should never try to read the
> Xmax field if the bit is set.

But that's exactly what FreezeMultiXactId does. It doesn't pay
attention to XMAX_INVALID (only to !MultiXactIdIsValid()).

Yura is apparently arguing that FreezeMultiXactId should notice
XMAX_INVALID and then tell its caller to "FRM_INVALIDATE_XMAX". That
does seem like a valid optimization. But if we were to do that then
we'd likely do it in a way that still resulted in ordinary processing
of the multi (it would not work by immediately setting
"FRM_INVALIDATE_XMAX" in the !MultiXactIdIsValid() path). That
approach to the optimization makes the most sense, because we'd likely
want to preserve the existing FreezeMultiXactId sanity checks.

> I think the problem being investigated in this thread is that
> HEAP_XMAX_IS_MULTI is being treated as persistent, that is, it can only
> be set if the xmax is not invalid, but apparently that's not always the
> case (or we wouldn't be having this conversation).

A multixact/HEAP_XMAX_IS_MULTI xmax doesn't start out as invalid.
Treating HEAP_XMAX_IS_MULTI as persistent doesn't mean that we should
treat XMAX_INVALID as persistent. In particular, XMAX_INVALID isn't
equivalent to !MultiXactIdIsValid() (you can make a similar statement
about xmax XIDs).


--
Peter Geoghegan




Re: Extension security improvement: Add support for extensions with an owned schema

2024-06-19 Thread Tom Lane
Jelte Fennema-Nio  writes:
> On Wed, 19 Jun 2024 at 17:28, Robert Haas  wrote:
>> I have a feeling that this might be pretty annoying to implement, and
>> if that is true, then never mind.

> Based on a quick look it's not trivial, but also not super bad.
> Basically it seems like in src/backend/catalog/namespace.c, every time
> we loop over activeSearchPath and CurrentExtensionObject is set, then
> we should skip any item that's not stored in pg_catalog, unless
> there's a DEPENDENCY_EXTENSION pg_depend entry for the item (and that
> pg_depend entry references the extension or the requires list).

We could change the lookup rules that apply during execution of
an extension script, but we already restrict search_path at that
time so I'm not sure how much further this'd move the goalposts.

The *real* problem IMO is that if you create a PL function or
(old-style) SQL function within an extension, execution of that
function is not similarly protected.

regards, tom lane




Re: Extension security improvement: Add support for extensions with an owned schema

2024-06-19 Thread Jelte Fennema-Nio
On Wed, 19 Jun 2024 at 17:28, Robert Haas  wrote:
> But I wonder if there might also be another possible approach: could
> we, somehow, prevent object references in extension scripts from
> resolving to anything other than the system catalogs and the contents
> of that extension?

This indeed does sound like the behaviour that pretty much every
existing extension wants to have. One small addition/clarification
that I would make to your definition: fully qualified references to
other objects should still be allowed.

I do think, even if we have this, there would be other good reasons to
use "owned schemas" for extension authors. At least the following two:
1. To have a safe search_path that can be used in SET search_path on a
function (see also [1]).
2. To make it easy for extension authors to avoid conflicts with other
extensions/UDFs.

> Perhaps with a control file setting to specify a
> list of trusted extensions which we're also allowed to reference?

I think we could simply use the already existing "requires" field from
the control file, i.e. you're allowed to reference only your own
extension and the extensions listed in its requires list.

> I have a feeling that this might be pretty annoying to implement, and
> if that is true, then never mind.

Based on a quick look it's not trivial, but also not super bad.
Basically it seems like in src/backend/catalog/namespace.c, every time
we loop over activeSearchPath while CurrentExtensionObject is set,
we should skip any item that's not stored in pg_catalog, unless
there's a DEPENDENCY_EXTENSION pg_depend entry for the item (and that
pg_depend entry references the extension or the requires list).

There's quite a few loops over activeSearchPath in namespace.c, but
they all seem pretty similar. So while a bunch of code would need to
be changed, the changes could probably be well encapsulated in a
function.
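
(To make the rule concrete, here is a toy, self-contained C model of that
lookup; none of the names below exist in namespace.c -- they are invented for
illustration only:)

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct Candidate
{
    const char *schema;
    const char *name;
    bool        belongs_to_current_ext;    /* via pg_depend, incl. "requires" */
} Candidate;

/*
 * Return the first candidate visible under the restricted extension-script
 * rules: pg_catalog is always visible, everything else only if the object
 * belongs to the extension being created (or one it requires).
 */
static const Candidate *
lookup_for_extension_script(const Candidate *cands, int ncands)
{
    for (int i = 0; i < ncands; i++)
    {
        if (strcmp(cands[i].schema, "pg_catalog") == 0 ||
            cands[i].belongs_to_current_ext)
            return &cands[i];
    }
    return NULL;
}

int
main(void)
{
    Candidate   cands[] = {
        {"public", "now", false},       /* attacker-created shadow: skipped */
        {"pg_catalog", "now", false},   /* system object: chosen */
    };
    const Candidate *hit = lookup_for_extension_script(cands, 2);

    printf("resolved to %s.%s\n", hit->schema, hit->name);
    return 0;
}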

[1]: 
https://www.postgresql.org/message-id/flat/00d8f046156e355ec0eb49585408bafc8012e4a5.camel%40j-davis.com#3ad7a8073d5ef50cfe44e305c38d




Docs: Order of json aggregate functions

2024-06-19 Thread Wolfgang Walther
The order of JSON-related aggregate functions in the docs is currently
like this:


[...]
json_agg
json_objectagg
json_object_agg
json_object_agg_strict
json_object_agg_unique
json_arrayagg
json_object_agg_unique_strict
max
min
range_agg
range_intersect_agg
json_agg_strict
[...]

json_arrayagg and json_agg_strict are out of place.

The attached patch puts them in the right spot. The same applies down to v16.

Best,

WolfgangFrom ad857a824d893a3e421c6c577c1215f71c1ebfe3 Mon Sep 17 00:00:00 2001
From: Wolfgang Walther 
Date: Wed, 19 Jun 2024 19:40:49 +0200
Subject: [PATCH v1] Fix order of json aggregate functions in docs

---
 doc/src/sgml/func.sgml | 96 +-
 1 file changed, 48 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2609269610b..c3b342d832f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21790,6 +21790,54 @@ SELECT NULLIF(value, '(none)') ...
No
   
 
+  
+   
+
+ json_agg_strict
+
+json_agg_strict ( anyelement )
+json
+   
+   
+
+ jsonb_agg_strict
+
+jsonb_agg_strict ( anyelement )
+jsonb
+   
+   
+Collects all the input values, skipping nulls, into a JSON array.
+Values are converted to JSON as per to_json
+or to_jsonb.
+   
+   No
+  
+
+  
+   
+json_arrayagg
+json_arrayagg (
+ value_expression 
+ ORDER BY sort_expression 
+ { NULL | ABSENT } ON NULL 
+ RETURNING data_type  FORMAT JSON  ENCODING UTF8   )
+   
+   
+Behaves in the same way as json_array
+but as an aggregate function so it only takes one
+value_expression parameter.
+If ABSENT ON NULL is specified, any NULL
+values are omitted.
+If ORDER BY is specified, the elements will
+appear in the array in that order rather than in the input order.
+   
+   
+SELECT json_arrayagg(v) FROM (VALUES(2),(1)) t(v)
+[2, 1]
+   
+   No
+  
+
   

  json_objectagg
@@ -21900,31 +21948,6 @@ SELECT NULLIF(value, '(none)') ...
No
   
 
-  
-   
-json_arrayagg
-json_arrayagg (
- value_expression 
- ORDER BY sort_expression 
- { NULL | ABSENT } ON NULL 
- RETURNING data_type  FORMAT JSON  ENCODING UTF8   )
-   
-   
-Behaves in the same way as json_array
-but as an aggregate function so it only takes one
-value_expression parameter.
-If ABSENT ON NULL is specified, any NULL
-values are omitted.
-If ORDER BY is specified, the elements will
-appear in the array in that order rather than in the input order.
-   
-   
-SELECT json_arrayagg(v) FROM (VALUES(2),(1)) t(v)
-[2, 1]
-   
-   No
-  
-
   

 
@@ -22033,29 +22056,6 @@ SELECT NULLIF(value, '(none)') ...
No
   
 
-  
-   
-
- json_agg_strict
-
-json_agg_strict ( anyelement )
-json
-   
-   
-
- jsonb_agg_strict
-
-jsonb_agg_strict ( anyelement )
-jsonb
-   
-   
-Collects all the input values, skipping nulls, into a JSON array.
-Values are converted to JSON as per to_json
-or to_jsonb.
-   
-   No
-  
-
   

 
-- 
2.45.1



Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?

2024-06-19 Thread Alvaro Herrera
On 2024-Jun-19, Peter Geoghegan wrote:

> On Wed, Jun 19, 2024 at 1:00 PM Yura Sokolov  wrote:
> > So these are quite different code paths, and one cannot be used
> > to decline or justify the other.
> 
> The point is that we shouldn't need to rely on what is formally a
> hint. It might be useful to use the hint to decide whether or not
> freezing now actually makes sense, but that isn't the same thing as
> relying on the hint (we could make the same decision for a number of
> different reasons).

FWIW I don't think of HEAP_XMAX_INVALID as purely a hint.
HEAP_XMAX_COMMITTED is a hint, for sure, as is HEAP_XMIN_COMMITTED on
its own; but as far as I recall, the INVALID flags must persist once
set.  Consider the HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID combination,
which we use to represent HEAP_XMIN_FROZEN; if that didn't persist, we'd
have a pretty serious data corruption issue, because we don't reset the
Xmin field when freezing the tuple.  So if we fail to keep the flag, the
tuple is no longer frozen.  (My point here is that some infomask bits
are hints, but not all of them are only hints.)  So XMAX_INVALID gives
certainty that the Xmax value must not be read.  That is to say, I think
there are (or there were) situations in which we set the bit but don't
bother to reset the actual Xmax field.  We should never try to read the
Xmax field if the bit is set.

I think the problem being investigated in this thread is that
HEAP_XMAX_IS_MULTI is being treated as persistent, that is, it can only
be set if the xmax is not invalid, but apparently that's not always the
case (or we wouldn't be having this conversation).

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?

2024-06-19 Thread Peter Geoghegan
On Wed, Jun 19, 2024 at 1:00 PM Yura Sokolov  wrote:
> So these are quite different code paths, and one cannot be used
> to decline or justify the other.

The point is that we shouldn't need to rely on what is formally a
hint. It might be useful to use the hint to decide whether or not
freezing now actually makes sense, but that isn't the same thing as
relying on the hint (we could make the same decision for a number of
different reasons).

> Moreover, certainly a test on HEAP_XMAX_INVALID could be used
> there in heap_prepare_freeze_tuple to set
> freeze_xmax = true;
> Why didn't you do it?

You might as well ask me why I didn't do any number of other things. I
actually wrote a patch that made FreezeMultiXactId() more aggressive
about this sort of thing (setting HEAP_XMAX_INVALID) that targeted
Postgres 16. That worked by noticing that every member XID was at
least before OldestXmin, even when the MXID itself was >= OldestMxact.
That didn't go anywhere, even though it was a perfectly valid
optimization.

It's true that FreezeMultiXactId() optimizations like this are
possible. So what?

> It is not a bug per se.
> But:
> - it is a missed opportunity for optimization,
> - it is an inconsistency in data handling.
> Inconsistency leads to bugs when people attempt to modify code.

In what sense is there an inconsistency?

I think that you must mean that we need to do the same thing for the
!MultiXactIdIsValid() case and the already-HEAP_XMAX_INVALID case. But
I don't think that that's any meaningful kind of inconsistency. It's
*also* what we do with plain XIDs. If anything, the problem is that
we're *too* consistent (ideally we *would* treat MultiXacts
differently).

> Yes, we changed a completely different place, mistakenly relying on a
> consistent reaction to this "hint", and that led to a bug in our
> patch.

Oooops!

> > HEAP_XMAX_INVALID is just a hint.
> >
>
> WTF is "just a hint"?
> I thought a hint is "yep, you can ignore it. But we already did some work
> and stored its result as this hint. And if you don't ignore this hint,
> then you can skip doing the work we did already".
>
> So every time we ignore a hint, we miss an opportunity for optimization.
> Why the hell shouldn't we optimize when we safely can?

This is the first email that anybody has used the word "optimization".
We've been discussing this as if it was a bug. You introduced the
topic of optimization 3 seconds ago. Remember?

> If we can't rely on a hint, then the hint is completely meaningless.

We don't actually trust the hints in any way. We always run checks
inside heap_pre_freeze_checks(), rather than assuming that the hints
are accurate.

-- 
Peter Geoghegan




Re: Don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid.

2024-06-19 Thread Alvaro Herrera
On 2024-Jun-14, Anton A. Melnikov wrote:

> Hello!
> 
> The src/backend/access/heap/README.tuplock says about HEAP_XMAX_INVALID bit
> that "Any tuple with this bit set does not have a valid value stored in XMAX."
> 
> Found that FreezeMultiXactId() tries to process such an invalid multi xmax
> and may looks for an update xid in the pg_multixact for it.
> 
> Maybe not do this work in FreezeMultiXactId() and exit immediately if the
> bit HEAP_XMAX_INVALID was already set?
> 
> For instance, like that:
> 
> master
> @@ -6215,6 +6215,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
> /* We should only be called in Multis */
> Assert(t_infomask & HEAP_XMAX_IS_MULTI);
> +   /* Xmax is already marked as invalid */
> +   if (MultiXactIdIsValid(multi) &&
> +   (t_infomask & HEAP_XMAX_INVALID))

Hmm, but why are we calling FreezeMultiXactId at all if the
HEAP_XMAX_INVALID bit is set?  We shouldn't do that.  I think the fix
should appear in heap_prepare_freeze_tuple() to skip work completely if
HEAP_XMAX_INVALID is set.  Then in FreezeMultiXactId you could simply
Assert() that the given tuple does not have HEAP_XMAX_INVALID set.
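
(A toy, self-contained sketch of that control flow; the flag values and names
mimic the heapam ones from memory, but everything below is invented for
illustration and is not a tested patch:)

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define HEAP_XMAX_INVALID   0x0800
#define HEAP_XMAX_IS_MULTI  0x1000

static int
FreezeMultiXactId_model(uint16_t t_infomask)
{
    /* per the suggestion, the caller never gets here with an invalid xmax */
    assert(t_infomask & HEAP_XMAX_IS_MULTI);
    assert((t_infomask & HEAP_XMAX_INVALID) == 0);
    return 1;                   /* pretend we resolved the multixact */
}

static void
prepare_freeze_model(uint16_t t_infomask)
{
    if (t_infomask & HEAP_XMAX_INVALID)
        printf("xmax already invalid: skip the pg_multixact lookup entirely\n");
    else if (t_infomask & HEAP_XMAX_IS_MULTI)
        printf("resolved multi: %d\n", FreezeMultiXactId_model(t_infomask));
    else
        printf("plain xid path\n");
}

int
main(void)
{
    prepare_freeze_model(HEAP_XMAX_IS_MULTI | HEAP_XMAX_INVALID);
    prepare_freeze_model(HEAP_XMAX_IS_MULTI);
    return 0;
}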

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html




Re: generic plans and "initial" pruning

2024-06-19 Thread Alvaro Herrera
I had occasion to run the same benchmark you described in the initial
email in this thread.  To do so I applied patch series v49 on top of
07cb29737a4e, which is just a commit that happened to have the same date
as v49.

I then used a script like this (against a server having
plan_cache_mode=force_generic_plan)

for numparts in 0 1 2 4 8 16 32 48 64 80 81 96 127 128 160 200 256 257 288 300 384 512 1024 1536 2048; do
pgbench testdb -i --partitions=$numparts 2>/dev/null
echo -ne "$numparts\t"
pgbench -n testdb -S -T30 -Mprepared | grep "^tps" | sed -e 's/^tps = \([0-9.]*\) .*/\1/'
done

and did the same with the commit mentioned above (that is, unpatched).
I got this table as result

 partitions │   patched│  07cb29737a  
┼──┼──
  0 │ 65632.090431 │ 68967.712741
  1 │ 68096.641831 │ 65356.587223
  2 │ 59456.507575 │ 60884.679464
  4 │62097.426 │ 59698.747104
  8 │ 58044.311175 │ 57817.104562
 16 │ 59741.926563 │ 52549.916262
 32 │ 59261.693449 │ 44815.317215
 48 │ 59047.125629 │ 38362.123652
 64 │ 59748.738797 │ 34051.158525
 80 │ 59276.839183 │ 32026.135076
 81 │ 62318.572932 │ 30418.122933
 96 │ 59678.857163 │ 28478.113651
127 │ 58761.960028 │ 24272.303742
128 │ 59934.268306 │ 24275.214593
160 │ 56688.790899 │ 21119.043564
200 │ 56323.188599 │ 18111.212849
256 │  55915.22466 │ 14753.953709
257 │ 57810.530461 │ 15093.497575
288 │ 56874.780092 │ 13873.332162
300 │ 57222.056549 │ 13463.768946
384 │  54073.77295 │ 11183.558339
512 │ 37503.766847 │   8114.32532
   1024 │ 42746.866448 │   4468.41359
   1536 │  39500.58411 │  3049.984599
   2048 │ 36988.519486 │  2269.362006

where already at 16 partitions we can see that things are going downhill
with the unpatched code.  (However, what happens when the table is not
partitioned looks a bit funny.)

I hope we can get this new executor code in 18.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)




Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?

2024-06-19 Thread Yura Sokolov

18.06.2024 18:47, Peter Geoghegan пишет:

On Tue, Jun 18, 2024 at 10:29 AM Maxim Orlov  wrote:

Maybe, I'm too bold, but looks like a kinda bug to me.  At least, I don't 
understand why we do not check the HEAP_XMAX_INVALID flag.
My guess is nobody noticed, that MultiXactIdIsValid call does not check the mentioned 
flag in the "first" condition, but it's all my speculation.


A related code path was changed in commit 02d647bbf0. That change made
the similar xmax handling that covers XIDs (not MXIDs) *stop* doing
what you're now proposing to do in the Multi path.


I don't agree that commit 02d647bbf0 is similar to the suggested change.
Commit 02d647bbf0 fixed the decision to set
freeze_xmax = false;
xmax_already_frozen = true;

The suggested change is for the decision to set
*flags |= FRM_INVALIDATE_XMAX;
pagefrz->freeze_required = true;
which leads to
freeze_xmax = true;

So these are quite different code paths, and one cannot be used
to decline or justify the other.

Moreover, certainly a test on HEAP_XMAX_INVALID could be used
there in heap_prepare_freeze_tuple to set
freeze_xmax = true;
Why didn't you do it?



Why do you think this is a bug?


It is not a bug per se.
But:
- it is a missed opportunity for optimization,
- it is an inconsistency in data handling.
Inconsistency leads to bugs when people attempt to modify code.

Yes, we changed a completely different place, mistakenly relying on a
consistent reaction to this "hint", and that led to a bug in our
patch.


Does anyone know if there are reasons to deliberately ignore the
HEAP_XMAX_INVALID flag? Or is this just an unfortunate oversight.


HEAP_XMAX_INVALID is just a hint.



WTF is "just a hint"?
I thought a hint is "yep, you can ignore it. But we already did some work
and stored its result as this hint. And if you don't ignore this hint,
then you can skip doing the work we did already".

So every time we ignore a hint, we miss an opportunity for optimization.
Why the hell shouldn't we optimize when we safely can?

If we can't rely on a hint, then the hint is completely meaningless.



have a nice day

Yura Sokolov




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-06-19 Thread Alvaro Herrera
On 2024-Jun-14, Melanie Plageman wrote:

> Subject: [PATCH v21 12/20] Update variable names in bitmap scan descriptors
>
> The previous commit which added BitmapTableScanDesc and
> BitmapHeapScanDesc used the existing member names from TableScanDescData
> and HeapScanDescData for diff clarity. This commit renames the members
> -- in many cases by removing the rs_ prefix which is not relevant or
> needed here.

*Cough*  Why?  It makes grepping for struct members useless.  I'd rather
keep these prefixes, as they allow easier code exploration.  (Sometimes
when I need to search for uses of some field with a name that's too
common, I add a prefix to the name and let the compiler guide me to
them.  But that's a waste of time ...)

Thanks,

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-06-19 Thread Tomas Vondra



On 6/19/24 17:55, Melanie Plageman wrote:
> On Tue, Jun 18, 2024 at 6:02 PM Tomas Vondra
>  wrote:
>>
>> I went through v22 to remind myself of what the patches do and do some
>> basic review - I have some simple questions / comments for now, nothing
>> major. I've kept the comments in separate 'review' patches, it does not
>> seem worth copying here.
> 
> Thanks so much for the review!
> 
> I've implemented your feedback in attached v23 except for what I
> mention below. I have not gone through each patch in the new set very
> carefully after making the changes because I think we should resolve
> the question of adding the new table scan descriptor before I do that.
> A change there will require a big rebase. Then I can go through each
> patch very carefully.
> 
> From your v22b-0005-review.patch:
> 
>  src/backend/executor/nodeBitmapHeapscan.c | 14 ++
>  src/include/access/tableam.h  |  2 ++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/src/backend/executor/nodeBitmapHeapscan.c
> b/src/backend/executor/nodeBitmapHeapscan.c
> index e8b4a754434..6d7ef9ced19 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -270,6 +270,20 @@ new_page:
> 
>  BitmapAdjustPrefetchIterator(node);
> 
> +/*
> + * XXX I'm a bit unsure if this needs to be handled using
> goto. Wouldn't
> + * it be simpler / easier to understand to have two nested loops?
> + *
> + * while (true)
> + *if (!table_scan_bitmap_next_block(...)) { break; }
> + *while (table_scan_bitmap_next_tuple(...)) {
> + *... process tuples ...
> + *}
> + *
> + * But I haven't tried implementing this.
> + */
>  if (!table_scan_bitmap_next_block(scan, >blockno, 
> >recheck,
>>lossy_pages,
> >exact_pages))
>  break;
> 
> We need to call table_scan_bimtap_next_block() the first time we call
> BitmapHeapNext() on each scan but all subsequent invocations of
> BitmapHeapNext() must call table_scan_bitmap_next_tuple() first each
> -- because we return from BitmapHeapNext() to yield a tuple even when
> there are more tuples on the page. I tried refactoring this a few
> different ways and personally found the goto most clear.
> 

OK, I haven't tried refactoring this myself, so you're probably right.

> From your v22b-0010-review.patch:
> 
> @@ -557,6 +559,8 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
>  table_rescan(node->ss.ss_currentScanDesc, NULL);
> 
>  /* release bitmaps and buffers if any */
> +/* XXX seems it should not be right after the comment, also shouldn't
> + * we still reset the prefetch_iterator field to NULL? */
>  tbm_end_iterate(>prefetch_iterator);
>  if (node->tbm)
>  tbm_free(node->tbm);
> 
> prefetch_iterator is a TBMIterator which is stored in the struct (as
> opposed to having a pointer to it stored in the struct).
> tbm_end_iterate() sets the actual private and shared iterator pointers
> to NULL.
> 

Ah, right.

> From your v22b-0017-review.patch
> 
> diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
> index 036ef29e7d5..9c711ce0eb0 100644
> --- a/src/include/access/relscan.h
> +++ b/src/include/access/relscan.h
> @@ -52,6 +52,13 @@ typedef struct TableScanDescData
>  } TableScanDescData;
>  typedef struct TableScanDescData *TableScanDesc;
> 
> +/*
> + * XXX I don't understand why we should have this special node if we
> + * don't have special nodes for other scan types.
> 
> In this case, up until the final commit (using the read stream
> interface), there are six fields required by bitmap heap scan that are
> not needed by any other user of HeapScanDescData. There are also
> several members of HeapScanDescData that are not needed by bitmap heap
> scans and all of the setup in initscan() for those fields is not
> required for bitmap heap scans.
> 
> Also, because the BitmapHeapScanDesc needs information like the
> ParallelBitmapHeapState and prefetch_maximum (for use in prefetching),
> the scan_begin() callback would have to take those as parameters. I
> thought adding so much bitmap table scan-specific information to the
> generic table scan callbacks was a bad idea.
> 
> Once we add the read stream API code, the number of fields required
> for bitmap heap scan that are in the scan descriptor goes down to
> three. So, perhaps we could justify having that many bitmap heap
> scan-specific fields in the HeapScanDescData.
> 
> Though, I actually think we could start moving toward having
> specialized scan descriptors if the requirements for that scan type
> are appreciably different. I can't think of new code that would be
> added to the HeapScanDescData that would have to be duplicated over to
> specialized scan descriptors.
> 
> With the final read stream state, I can see the argument 

Re: What is a typical precision of gettimeofday()?

2024-06-19 Thread Tom Lane
Peter Eisentraut  writes:
> AFAICT, pg_test_timing doesn't use gettimeofday(), so this doesn't 
> really address the original question.

It's not exactly hard to make it do so (see attached).

I tried this on several different machines, and my conclusion is that
gettimeofday() reports full microsecond precision on any platform
anybody is likely to be running PG on today.  Even my one surviving
pet dinosaur, NetBSD 10 on PowerPC Mac (mamba), shows results like
this:

$ ./pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 901.41 ns
Histogram of timing durations:
  < us   % of total  count
 1 10.46074 348148
 2 89.514952979181
 4  0.00574191
 8  0.00430143
16  0.00691230
32  0.00376125
64  0.00012  4
   128  0.00303101
   256  0.00027  9
   512  0.9  3
  1024  0.9  3

I also modified pg_test_timing to measure nanoseconds not
microseconds (second patch attached), and got this:

$ ./pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 805.50 ns
Histogram of timing durations:
  < ns   % of total  count
 1 19.84234 739008
 2  0.0  0
 4  0.0  0
 8  0.0  0
16  0.0  0
32  0.0  0
64  0.0  0
   128  0.0  0
   256  0.0  0
   512  0.0  0
  1024 80.140132984739
  2048  0.00078 29
  4096  0.00658245
  8192  0.00290108
 16384  0.00252 94
 32768  0.00250 93
 65536  0.00016  6
131072  0.00185 69
262144  0.8  3
524288  0.8  3
1048576  0.8  3

confirming that when the result changes it generally does so by 1usec.

Applying just the second patch, I find that clock_gettime on this
old hardware seems to be limited to 1us resolution, but on my more
modern machines (mac M1, x86_64) it can tick at 40ns or less.
Even a raspberry pi 4 shows

$ ./pg_test_timing 
Testing timing overhead for 3 seconds.
Per loop time including overhead: 69.12 ns
Histogram of timing durations:
  < ns   % of total  count
 1  0.0  0
 2  0.0  0
 4  0.0  0
 8  0.0  0
16  0.0  0
32  0.0  0
64 37.59583   16317040
   128 62.38568   27076131
   256  0.01674   7265
   512  0.2  8
  1024  0.0  0
  2048  0.0  0
  4096  0.00153662
  8192  0.00019 83
 16384  0.1  3
 32768  0.1  5

suggesting that the clock_gettime resolution is better than 64 ns.

So I concur with Hannu that it's time to adjust pg_test_timing to
resolve nanoseconds not microseconds.  I gather he's created a
patch that does more than mine below, so I'll wait for that.

regards, tom lane

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index a6fc1922f2..5509d23d2f 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -84,7 +84,7 @@ typedef struct instr_time
 
 /* Use clock_gettime() */
 
-#include 
+#include 
 
 /*
  * The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
@@ -111,10 +111,10 @@ static inline instr_time
 pg_clock_gettime_ns(void)
 {
 	instr_time	now;
-	struct timespec tmp;
+	struct timeval tmp;
 
-	clock_gettime(PG_INSTR_CLOCK, );
-	now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+	gettimeofday(, NULL);
+	now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_usec * 1000;
 
 	return now;
 }
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..ea2b565b14 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -133,7 +133,7 @@ test_timing(unsigned int duration)
 	total_time = duration > 0 ? duration * INT64CONST(100) : 0;
 
 	INSTR_TIME_SET_CURRENT(start_time);
-	cur = INSTR_TIME_GET_MICROSEC(start_time);
+	cur = INSTR_TIME_GET_NANOSEC(start_time);
 
 	while (time_elapsed < total_time)
 	{
@@ -142,7 +142,7 @@ test_timing(unsigned int duration)
 
 		prev = cur;
 		INSTR_TIME_SET_CURRENT(temp);
-		cur = INSTR_TIME_GET_MICROSEC(temp);
+		cur = INSTR_TIME_GET_NANOSEC(temp);
 		diff = cur - prev;
 
 		/* Did time go backwards? */
@@ -183,7 +183,7 @@ output(uint64 loop_count)
 {
 	int64		max_bit = 31,
 i;
-	char	   *header1 = _("< us");
+	char	   *header1 = _("< ns");
 	char	   *header2 = /* xgettext:no-c-format */ _("% of total");
 	char	   *header3 = _("count");
 	int			len1 = strlen(header1);


Re: Pgoutput not capturing the generated columns

2024-06-19 Thread Peter Eisentraut

On 19.06.24 13:22, Shubham Khanna wrote:

All the comments are handled.

The attached Patch contains all the suggested changes.


Please also take a look at the proposed patch for virtual generated 
columns [0] and consider how that would affect your patch.  I think your 
feature can only replicate *stored* generated columns.  So perhaps the 
documentation and terminology in your patch should reflect that.



[0]: 
https://www.postgresql.org/message-id/flat/a368248e-69e4-40be-9c07-6c3b5880b...@eisentraut.org






Re: SQL/JSON query functions context_item doc entry and type requirement

2024-06-19 Thread David G. Johnston
On Wed, Jun 19, 2024 at 8:29 AM jian he  wrote:

> On Mon, Jun 17, 2024 at 9:05 PM Chapman Flack  wrote:
> >
> > Hi,
> >
> > On 06/17/24 02:43, Amit Langote wrote:
> > > context_item expression can be a value of
> > > any type that can be cast to jsonb. This includes types
> > > such as char,  text, bpchar,
> > > character varying, and bytea (with
> > > ENCODING UTF8), as well as any domains over these types.
> >
> > Reading this message in conjunction with [0] makes me think that we are
> > really talking about a function that takes a first parameter of type
> jsonb,
> > and behaves exactly that way (so any cast required is applied by the
> system
> > ahead of the call). Under those conditions, this seems like an unusual
> > sentence to add in the docs, at least until we have also documented that
> > tan's argument can be of any type that can be cast to double precision.
> >
>
> I guess it would be fine to add an unusual sentence to the docs.
>
> imagine a function: array_avg(anyarray) returns anyelement.
> array_avg calculates the average of an array's elements, e.g.
> array_avg('{1,2,3}'::int[]) returns 2.
> but array_avg won't make sense if the input argument is a date array.
> so mentioning in the doc that array_avg can accept anyarray, but anyarray
> cannot be a date array,
> seems ok.
>

There is existing wording for this:

"The expression can be of any JSON type, any character string type, or
bytea in UTF8 encoding."

If you add this sentence to the paragraph, the link that already exists,
which simply points the reader to this sentence, becomes redundant and
should be removed.

As for table 9.16.3 - it is unwieldy already.  Let's try to make the core
syntax shorter, not longer.  We already have precedent in the subsequent
json_table section - give each major clause item a name, then define the
syntax and meaning for those names below the table.  Unlike in that
section - which probably should be modified too - context_item should have
its own description line.

David J.


Re: Better error message when --single is not the first arg to postgres executable

2024-06-19 Thread Nathan Bossart
On Wed, Jun 19, 2024 at 05:34:52PM +0200, Peter Eisentraut wrote:
> I'm not really sure all this here is worth solving.  I think requiring
> things like --single or --boot to be first seems ok, and the alternatives
> just make things more complicated.

Yeah, I'm fine with doing something more like what Greg originally
proposed at this point.

-- 
nathan




RE: AIX support

2024-06-19 Thread Srirama Kucherlapati
Thanks, Heikki, for going through the changes.


> +/* Commenting for XLC
> + * "IBM XL C/C++ for AIX, V12.1" miscompiles, for 32-bit, some inline
> + * expansions of ginCompareItemPointers() "long long" arithmetic.  To take
> + * advantage of inlining, build a 64-bit PostgreSQL.
> +#if defined(__ILP32__) && defined(__IBMC__)
> +#define PG_FORCE_DISABLE_INLINE
> +#endif
> + */
I can remove these unwanted comments.

I have to analyze the changes for the rest of your comment and will get back to 
you.

Warm regards,
Sriram.



From: Heikki Linnakangas 
Date: Wednesday, 19 June 2024 at 8:45 PM
To: Srirama Kucherlapati , Laurenz Albe 
, Bruce Momjian , Heikki 
Linnakangas 
Cc: Peter Eisentraut , Alvaro Herrera 
, pgsql-hack...@postgresql.org 
, Noah Misch , Michael Paquier 
, Andres Freund , Tom Lane 
, Thomas Munro , tvk1...@gmail.com 
, postgres-ibm-...@wwpdl.vnet.ibm.com 

Subject: [EXTERNAL] Re: AIX support
On 19/06/2024 17:55, Srirama Kucherlapati wrote:
> +/* Commenting for XLC
> + * "IBM XL C/C++ for AIX, V12.1" miscompiles, for 32-bit, some inline
> + * expansions of ginCompareItemPointers() "long long" arithmetic.  To take
> + * advantage of inlining, build a 64-bit PostgreSQL.
> +#if defined(__ILP32__) && defined(__IBMC__)
> +#define PG_FORCE_DISABLE_INLINE
> +#endif
> + */

This seems irrelevant.

> + * Ordinarily, we'd code the branches here using GNU-style local symbols, 
> that
> + * is "1f" referencing "1:" and so on.  But some people run gcc on AIX with
> + * IBM's assembler as backend, and IBM's assembler doesn't do local symbols.
> + * So hand-code the branch offsets; fortunately, all PPC instructions are
> + * exactly 4 bytes each, so it's not too hard to count.

Could you use GCC assembler to avoid this?

> @@ -662,6 +666,21 @@ tas(volatile slock_t *lock)
>
>  #if !defined(HAS_TEST_AND_SET)   /* We didn't trigger above, let's try 
> here */
>
> +#if defined(_AIX)/* AIX */
> +/*
> + * AIX (POWER)
> + */
> +#define HAS_TEST_AND_SET
> +
> +#include 
> +
> +typedef int slock_t;
> +
> +#define TAS(lock)_check_lock((slock_t *) (lock), 0, 1)
> +#define S_UNLOCK(lock)   _clear_lock((slock_t *) (lock), 0)
> +#endif/* _AIX */
> +
> +
>  /* These are in sunstudio_(sparc|x86).s */
>
>  #if defined(__SUNPRO_C) && (defined(__i386) || defined(__x86_64__) || 
> defined(__sparc__) || defined(__sparc))

What CPU/compiler/OS configuration is this for, exactly? Could we rely
on GCC-provided __sync_lock_test_and_set() builtin function instead?
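
(For reference, a minimal standalone sketch of what a TAS/S_UNLOCK pair built
on those GCC builtins could look like; this is only an illustration under the
assumption that gcc is the compiler, not the proposed port itself:)

#include <stdio.h>

typedef int slock_t;

static int
tas(volatile slock_t *lock)
{
    /* returns the previous value: nonzero means the lock was already held */
    return __sync_lock_test_and_set(lock, 1);
}

static void
s_unlock(volatile slock_t *lock)
{
    __sync_lock_release(lock);
}

int
main(void)
{
    slock_t     lock = 0;

    printf("%d\n", tas(&lock));     /* 0: acquired */
    printf("%d\n", tas(&lock));     /* 1: already held */
    s_unlock(&lock);
    printf("%d\n", tas(&lock));     /* 0: acquired again */
    return 0;
}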

> +# Allow platforms with buggy compilers to force restrict to not be
> +# used by setting $FORCE_DISABLE_RESTRICT=yes in the relevant
> +# template.

Surely we don't need that anymore? Or is the compiler still buggy?

Do you still care about 32-bit binaries on AIX? If not, let's make that
the default in configure or a check for it, and remove the instructions
on building 32-bit binaries from the docs.

Please try hard to remove any changes from the diff that are not
absolutely necessary.

- Heikki


Re: DOCS: Generated table columns are skipped by logical replication

2024-06-19 Thread Peter Eisentraut

On 18.06.24 08:40, Peter Smith wrote:

For now, I have provided just a simple patch for the "Generated
Columns" section [3]. Perhaps it is enough.


Makes sense.

+ Generated columns are skipped for logical replication, and cannot be
+ specified in a CREATE PUBLICATION column-list.

Maybe remove the comma, and change "column-list" to "column list".




Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Tomas Vondra



On 6/19/24 17:11, Tom Lane wrote:
> Tomas Vondra  writes:
>> I tried running master under valgrind on 64-bit ARM (rpi5 running debian
>> 12.5), and I got some suspicious reports, all related to the radixtree
>> code used by tidstore.
> 
> What's the test scenario that triggers this?
> 

I haven't investigated that yet, I just ran "make check".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: What is a typical precision of gettimeofday()?

2024-06-19 Thread Peter Eisentraut

On 18.06.24 07:47, Andrey M. Borodin wrote:




On 19 Mar 2024, at 13:28, Peter Eisentraut  wrote:

I feel that we don't actually have any information about this portability 
concern.  Does anyone know what precision we can expect from gettimeofday()?  
Can we expect the full microsecond precision usually?


At PGConf.dev Hannu Krosing drew attention to the pg_test_timing module. I’ve
tried this module (slightly modified to measure nanoseconds) on some systems,
and everywhere I found ~100ns resolution (95% of ticks fall into the 64ns and
128ns buckets).


AFAICT, pg_test_timing doesn't use gettimeofday(), so this doesn't 
really address the original question.
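
(For anyone who wants to measure it directly, here is a tiny standalone probe
of gettimeofday() granularity, independent of pg_test_timing; illustration
only, not part of any patch:)

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
    struct timeval prev,
                cur;
    long        min_step = -1;

    gettimeofday(&prev, NULL);
    for (int i = 0; i < 1000000; i++)
    {
        long        diff;

        gettimeofday(&cur, NULL);
        diff = (cur.tv_sec - prev.tv_sec) * 1000000L +
            (cur.tv_usec - prev.tv_usec);
        if (diff > 0 && (min_step < 0 || diff < min_step))
            min_step = diff;
        prev = cur;
    }
    printf("smallest observed nonzero step: %ld us\n", min_step);
    return 0;
}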






Re: Better error message when --single is not the first arg to postgres executable

2024-06-19 Thread Peter Eisentraut

On 19.06.24 16:04, Nathan Bossart wrote:

What about just inlining --version and --help e.g.

else if (strcmp(argv[i], "--version") == 0 || strcmp(argv[i], "-V") == 0)
{
  fputs(PG_BACKEND_VERSIONSTR, stdout);
  exit(0);
}

I'm fine with being more persnickety about the other options; they are much
rarer and not unixy.


That seems like it should work.  I'm not sure I agree that's the least
surprising behavior (e.g., what exactly is the user trying to tell us with
commands like "postgres --version --help --describe-config"?), but I also
don't feel too strongly about it.


There is sort of an existing convention that --help and --version behave 
like this, meaning they act immediately and exit without considering 
other arguments.


I'm not really sure all this here is worth solving.  I think requiring 
things like --single or --boot to be first seems ok, and the 
alternatives just make things more complicated.





Re: SQL/JSON query functions context_item doc entry and type requirement

2024-06-19 Thread jian he
On Mon, Jun 17, 2024 at 9:05 PM Chapman Flack  wrote:
>
> Hi,
>
> On 06/17/24 02:43, Amit Langote wrote:
> > context_item expression can be a value of
> > any type that can be cast to jsonb. This includes types
> > such as char,  text, bpchar,
> > character varying, and bytea (with
> > ENCODING UTF8), as well as any domains over these types.
>
> Reading this message in conjunction with [0] makes me think that we are
> really talking about a function that takes a first parameter of type jsonb,
> and behaves exactly that way (so any cast required is applied by the system
> ahead of the call). Under those conditions, this seems like an unusual
> sentence to add in the docs, at least until we have also documented that
> tan's argument can be of any type that can be cast to double precision.
>

I guess it would be fine to add an unusual sentence to the docs.

imagine a function: array_avg(anyarray) returns anyelement.
array_avg calculates the average of an array's elements, e.g.
array_avg('{1,2,3}'::int[]) returns 2.
but array_avg won't make sense if the input argument is a date array.
so mentioning in the doc that array_avg can accept anyarray, but anyarray
cannot be a date array,
seems ok.


> On the other hand, if the behavior of the functions were to be changed
> (perhaps using prosupport rewriting as suggested in [1]?) so that it was
> not purely describable as a function accepting exactly jsonb with a
> possible system-applied cast in front, then in that case such an added
> explanation in the docs might be very fitting.
>

prosupport won't work, I think, because json_exists, json_value,
json_query, and json_table don't have pg_proc entries.
These are more like expressions.




Re: Extension security improvement: Add support for extensions with an owned schema

2024-06-19 Thread Robert Haas
On Sat, Jun 1, 2024 at 8:08 PM Jelte Fennema-Nio  wrote:
> Writing the sql migration scripts that are run by CREATE EXTENSION and
> ALTER EXTENSION UPDATE are security minefields for extension authors.
> One big reason for this is that search_path is set to the schema of the
> extension while running these scripts, and thus if a user with lower
> privileges can create functions or operators in that schema they can do
> all kinds of search_path confusion attacks if not every function and
> operator that is used in the script is schema qualified. While doing
> such schema qualification is possible, it relies on the author to never
> make a mistake in any of the sql files. And sadly humans have a tendency
> to make mistakes.

I agree that this is a problem. I also think that the patch might be a
reasonable solution (but I haven't reviewed it).

But I wonder if there might also be another possible approach: could
we, somehow, prevent object references in extension scripts from
resolving to anything other than the system catalogs and the contents
of that extension? Perhaps with a control file setting to specify a
list of trusted extensions which we're also allowed to reference?

I have a feeling that this might be pretty annoying to implement, and
if that is true, then never mind. But if it isn't that annoying to
implement, it would make a lot of unsafe extensions safe by default,
without the extension author needing to take any action. Which could
be pretty cool. It would also make it possible for extensions to
safely share a schema, if desired.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Extension security improvement: Add support for extensions with an owned schema

2024-06-19 Thread David G. Johnston
On Wed, Jun 19, 2024 at 8:19 AM Jelte Fennema-Nio  wrote:


> Because part of it would
> only be relevant once we support upgrading from PG18. So for now I
> haven't actually run the upgrade code.
>

Does it apply against v16?  If so, branch off there, apply it, then upgrade
from the v16 branch to master.

David J.


Re: Extension security improvement: Add support for extensions with an owned schema

2024-06-19 Thread Jelte Fennema-Nio
Attached is an updated version of this patch that fixes a few issues
that CI reported (autoconf, compiler warnings and broken docs).

I also think I changed the pg_upgrade to do the correct thing, but I'm
not sure how to test this (even manually). Because part of it would
only be relevant once we support upgrading from PG18. So for now I
haven't actually run the upgrade code.
From 37b6fa45bf877bcc15ce76d7e342199b7ca76d50 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Fri, 31 May 2024 02:04:31 -0700
Subject: [PATCH v2] Add support for extensions with an owned schema

Writing the sql migration scripts that are run by CREATE EXTENSION and
ALTER EXTENSION UPDATE are security minefields for extension authors.
One big reason for this is that search_path is set to the schema of the
extension while running these scripts, and thus if a user with lower
privileges can create functions or operators in that schema they can do
all kinds of search_path confusion attacks if not every function and
operator that is used in the script is schema qualified. While doing
such schema qualification is possible, it relies on the author to never
make a mistake in any of the sql files. And sadly humans have a tendency
to make mistakes.

This patch adds a new "owned_schema" option to the extension control
file that can be set to true to indicate that this extension wants to
own the schema in which it is installed. What that means is that the
schema should not exist before creating the extension, and will be
created during extension creation. This thus gives the extension author
an easy way to use a safe search_path, while still allowing all objects
to be grouped together in a schema. The implementation also has the
pleasant side effect that the schema will be automatically dropped when
the extension is dropped.
---
 doc/src/sgml/extend.sgml  |  13 ++
 doc/src/sgml/ref/create_extension.sgml|   3 +-
 src/backend/commands/extension.c  | 141 +-
 src/backend/utils/adt/pg_upgrade_support.c|  20 ++-
 src/bin/pg_dump/pg_dump.c |  15 +-
 src/bin/pg_dump/pg_dump.h |   1 +
 src/include/catalog/pg_extension.h|   1 +
 src/include/catalog/pg_proc.dat   |   2 +-
 src/include/commands/extension.h  |   4 +-
 src/test/modules/test_extensions/Makefile |   7 +-
 .../expected/test_extensions.out  |  50 +++
 src/test/modules/test_extensions/meson.build  |   4 +
 .../test_extensions/sql/test_extensions.sql   |  27 
 .../test_ext_owned_schema--1.0.sql|   2 +
 .../test_ext_owned_schema.control |   5 +
 ...test_ext_owned_schema_relocatable--1.0.sql |   2 +
 .../test_ext_owned_schema_relocatable.control |   4 +
 17 files changed, 248 insertions(+), 53 deletions(-)
 create mode 100644 src/test/modules/test_extensions/test_ext_owned_schema--1.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_owned_schema.control
 create mode 100644 src/test/modules/test_extensions/test_ext_owned_schema_relocatable--1.0.sql
 create mode 100644 src/test/modules/test_extensions/test_ext_owned_schema_relocatable.control

diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index 218940ee5ce..36dc692abef 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -809,6 +809,19 @@ RETURNS anycompatible AS ...
   
  
 
+ 
+  owned_schema (boolean)
+  
+   
+An extension is owned_schema if it requires a
+new dedicated schema for its objects. Such a requirement can make
+security concerns related to search_path injection
+much easier to reason about. The default is false,
+i.e., the extension can be installed into an existing schema.
+   
+  
+ 
+
  
   schema (string)
   
diff --git a/doc/src/sgml/ref/create_extension.sgml b/doc/src/sgml/ref/create_extension.sgml
index ca2b80d669c..6e767c7bfca 100644
--- a/doc/src/sgml/ref/create_extension.sgml
+++ b/doc/src/sgml/ref/create_extension.sgml
@@ -102,7 +102,8 @@ CREATE EXTENSION [ IF NOT EXISTS ] extension_name

 The name of the schema in which to install the extension's
 objects, given that the extension allows its contents to be
-relocated.  The named schema must already exist.
+relocated.  The named schema must already exist if the extension's
+control file does not specify owned_schema.
 If not specified, and the extension's control file does not specify a
 schema either, the current default object creation schema is used.

diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 1643c8c69a0..c9586ad62a1 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -83,6 +83,8 @@ typedef struct ExtensionControlFile
 	 * MODULE_PATHNAME */
 	char	   *comment;		/* comment, if any */
 	

Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Tom Lane
Tomas Vondra  writes:
> I tried running master under valgrind on 64-bit ARM (rpi5 running debian
> 12.5), and I got some suspicous reports, all related to the radixtree
> code used by tidstore.

What's the test scenario that triggers this?

regards, tom lane




Re: Report runtimes in pg_upgrade verbose mode

2024-06-19 Thread Nathan Bossart
On Wed, Jun 19, 2024 at 04:50:59PM +0200, Daniel Gustafsson wrote:
> When doing performance hacking on pg_upgrade it's often important to see
> individual runtimes to isolate changes.  I've written versions of the attached
> patch numerous times, and I wouldn't be surprised if others have done the 
> same.

Indeed: https://postgr.es/m/flat/20230727235134.GA3658499%40nathanxps13

> Is there any interest in adding something like the attached to pg_upgrade?  
> The
> patch needs some cleaning and tidying up but I wanted to to gauge interest
> before investing time.  I've added it to verbose mode mainly since it's not
> really all that informative for regular users I think.

I've been using 'ts -i' as Peter suggested [0], and that has worked
decently well.  One other thing that I've noticed is that some potentially
long-running tasks don't have corresponding reports.  For example, the
initial get_db_rel_and_slot_infos() on the old cluster doesn't report
anything, but that is often one of the most time-consuming steps.

[0] https://postgr.es/m/32d24bcf-9ac4-b10e-4aa2-da6975312eb2%40eisentraut.org

-- 
nathan




Re: Missing docs for new enable_group_by_reordering GUC

2024-06-19 Thread Pavel Borisov
Hi, Alexander!

On Wed, 19 Jun 2024 at 05:27, Alexander Korotkov 
wrote:

> On Tue, Jun 18, 2024 at 4:14 PM Pavel Borisov 
> wrote:
> >> Controls if the query planner will produce a plan which will provide
> GROUP BY keys presorted in the order of keys of a child
> node of the plan, such as an index scan. When disabled, the query planner
> will produce a plan with GROUP BY keys only reordered to
> match
> >> the ORDER BY clause, if any. When enabled, the
> planner will try to produce a more efficient plan. The default value is on.
> > A correction of myself: presorted -> sorted, reordered ->sorted
>
> Thank you for your review.  I think all of this make sense.  Please,
> check the revised patch attached.
>
To me patch v3 looks good.
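
For anyone reading along, a minimal way to exercise the setting being
documented; the table and index are invented for the example, and whether the
presorted plan is actually chosen still depends on costs and statistics:

```
CREATE TABLE t (a int, b int, c int);
CREATE INDEX ON t (a, b);
SET enable_group_by_reordering = on;
-- the planner may reorder the GROUP BY keys (b, a) to (a, b) so it can reuse
-- the index's sort order instead of sorting or hashing
EXPLAIN SELECT b, a, count(*) FROM t GROUP BY b, a;
SET enable_group_by_reordering = off;
-- now the keys are only reordered to match an ORDER BY clause, if any
```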

Regards,
Pavel Borisov
Supabase


RE: AIX support

2024-06-19 Thread Srirama Kucherlapati
Hi Team,
Please find the attached patch, which resumes the AIX support with gcc alone.
We have removed the changes related to XLC on AIX.

We are also continuing to work on the XLC and IBM-clang (openXLC) specific patch.
Once we get approval for the above patch, we can submit a subsequent patch to
support the XLC/IBM-clang changes.

Kindly let us know your inputs/feedback.

Warm regards,
Sriram.


0001-AIX-support-revert-the-changes-from-0b16bb8776bb8-v2.patch
Description:  0001-AIX-support-revert-the-changes-from-0b16bb8776bb8-v2.patch


Report runtimes in pg_upgrade verbose mode

2024-06-19 Thread Daniel Gustafsson
When doing performance hacking on pg_upgrade it's often important to see
individual runtimes to isolate changes.  I've written versions of the attached
patch numerous times, and I wouldn't be surprised if others have done the same.
Is there any interest in adding something like the attached to pg_upgrade?  The
patch needs some cleaning and tidying up but I wanted to gauge interest
before investing time.  I've added it to verbose mode mainly since it's not
really all that informative for regular users I think.

--
Daniel Gustafsson



0001-Report-runtime-for-steps-in-pg_upgrade-verbose-mode.patch
Description: Binary data


suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Tomas Vondra
Hi,

I tried running master under valgrind on 64-bit ARM (rpi5 running debian
12.5), and I got some suspicous reports, all related to the radixtree
code used by tidstore. I'm used to valgrind on arm sometimes reporting
harmless issues, but this seems like it might be an actual issue.

I'm attaching a snippet with a couple example reports. I can provide the
complete report, but AFAIK it's all just repetitions of these cases. If
needed, I can probably provide access to the rpi5 machine.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

==103190== Conditional jump or move depends on uninitialised value(s)
==103190==at 0x21FAC4: local_ts_node_16_search_eq (radixtree.h:1022)
==103190==by 0x21FC0F: local_ts_node_search (radixtree.h:1057)
==103190==by 0x220E23: local_ts_get_slot_recursive (radixtree.h:1667)
==103190==by 0x221043: local_ts_set (radixtree.h:1744)
==103190==by 0x2253A7: TidStoreSetBlockOffsets (tidstore.c:427)
==103190==by 0x2912EF: dead_items_add (vacuumlazy.c:2892)
==103190==by 0x28F193: lazy_scan_prune (vacuumlazy.c:1500)
==103190==by 0x28E767: lazy_scan_heap (vacuumlazy.c:975)
==103190==by 0x28D8C7: heap_vacuum_rel (vacuumlazy.c:497)
==103190==by 0x4C0D2B: table_relation_vacuum (tableam.h:1720)
==103190==by 0x4C458F: vacuum_rel (vacuum.c:2210)
==103190==by 0x4C1F8B: vacuum (vacuum.c:622)
==103190==by 0x6A2877: autovacuum_do_vac_analyze (autovacuum.c:3100)
==103190==by 0x6A144F: do_autovacuum (autovacuum.c:2417)
==103190==by 0x69FEE7: AutoVacWorkerMain (autovacuum.c:1569)
==103190==by 0x6A739B: postmaster_child_launch (launch_backend.c:265)
==103190==by 0x6AE9F3: StartChildProcess (postmaster.c:3928)
==103190==by 0x6AEB63: StartAutovacuumWorker (postmaster.c:3997)
==103190==by 0x6AE7DF: process_pm_pmsignal (postmaster.c:3809)
==103190==by 0x6AA937: ServerLoop (postmaster.c:1667)
==103190==  Uninitialised value was created by a heap allocation
==103190==at 0x9DAE30: MemoryContextAlloc (mcxt.c:1201)
==103190==by 0x21F7D3: local_ts_alloc_node (radixtree.h:839)
==103190==by 0x2208A3: local_ts_grow_node_4 (radixtree.h:1484)
==103190==by 0x220ADF: local_ts_node_insert (radixtree.h:1547)
==103190==by 0x220E4F: local_ts_get_slot_recursive (radixtree.h:1675)
==103190==by 0x221043: local_ts_set (radixtree.h:1744)
==103190==by 0x2253A7: TidStoreSetBlockOffsets (tidstore.c:427)
==103190==by 0x2912EF: dead_items_add (vacuumlazy.c:2892)
==103190==by 0x28F193: lazy_scan_prune (vacuumlazy.c:1500)
==103190==by 0x28E767: lazy_scan_heap (vacuumlazy.c:975)
==103190==by 0x28D8C7: heap_vacuum_rel (vacuumlazy.c:497)
==103190==by 0x4C0D2B: table_relation_vacuum (tableam.h:1720)
==103190==by 0x4C458F: vacuum_rel (vacuum.c:2210)
==103190==by 0x4C1F8B: vacuum (vacuum.c:622)
==103190==by 0x6A2877: autovacuum_do_vac_analyze (autovacuum.c:3100)
==103190==by 0x6A144F: do_autovacuum (autovacuum.c:2417)
==103190==by 0x69FEE7: AutoVacWorkerMain (autovacuum.c:1569)
==103190==by 0x6A739B: postmaster_child_launch (launch_backend.c:265)
==103190==by 0x6AE9F3: StartChildProcess (postmaster.c:3928)
==103190==by 0x6AEB63: StartAutovacuumWorker (postmaster.c:3997)
==103190== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:local_ts_node_16_search_eq
   fun:local_ts_node_search
   fun:local_ts_get_slot_recursive
   fun:local_ts_set
   fun:TidStoreSetBlockOffsets
   fun:dead_items_add
   fun:lazy_scan_prune
   fun:lazy_scan_heap
   fun:heap_vacuum_rel
   fun:table_relation_vacuum
   fun:vacuum_rel
   fun:vacuum
   fun:autovacuum_do_vac_analyze
   fun:do_autovacuum
   fun:AutoVacWorkerMain
   fun:postmaster_child_launch
   fun:StartChildProcess
   fun:StartAutovacuumWorker
   fun:process_pm_pmsignal
   fun:ServerLoop
}
==103190== Conditional jump or move depends on uninitialised value(s)
==103190==at 0x21F18C: pg_rightmost_one_pos32 (pg_bitutils.h:114)
==103190==by 0x21FACF: local_ts_node_16_search_eq (radixtree.h:1023)
==103190==by 0x21FC0F: local_ts_node_search (radixtree.h:1057)
==103190==by 0x21FD57: local_ts_find (radixtree.h:)
==103190==by 0x225413: TidStoreIsMember (tidstore.c:443)
==103190==by 0x4C4DBF: vac_tid_reaped (vacuum.c:2545)
==103190==by 0x2A86D7: btvacuumpage (nbtree.c:1235)
==103190==by 0x2A829B: btvacuumscan (nbtree.c:1023)
==103190==by 0x2A7F1B: btbulkdelete (nbtree.c:824)
==103190==by 0x296C73: index_bulk_delete (indexam.c:758)
==103190==by 0x4C4C63: vac_bulkdel_one_index (vacuum.c:2498)
==103190==by 0x290957: lazy_vacuum_one_index (vacuumlazy.c:2443)
==103190==by 0x28FD7B: lazy_vacuum_all_indexes (vacuumlazy.c:2026)
==103190==by 0x28FB7B: lazy_vacuum (vacuumlazy.c:1944)
==103190==by 0x28E93F: lazy_scan_heap (vacuumlazy.c:1050)
==103190==by 0x28D8C7: heap_vacuum_rel (vacuumlazy.c:497)
==103190==by 0x4C0D2B: table_relation_vacuum (tableam.h:1720)
==103190== 

Re: Inconsistency between try_mergejoin_path and create_mergejoin_plan

2024-06-19 Thread Alexander Pyhalov

Richard Guo писал(а) 2024-06-19 16:30:

On Wed, Jun 19, 2024 at 12:49 PM Tom Lane  wrote:

Richard Guo  writes:
> It seems to me that the new operator is somewhat artificial, since it is
> designed to support a mergejoin but lacks a valid commutator.  So before
> we proceed to discuss the fix, I'd like to know whether this is a valid
> issue that needs fixing.



I do not think we should add a great deal of complexity or extra
planner cycles to deal with this; but if it can be fixed at low
cost, we should.


I think we can simply verify the validity of commutators for clauses in
the form "inner op outer" when selecting mergejoin/hash clauses.  If a
clause lacks a commutator, we should consider it unusable for the given
pair of outer and inner rels.  Please see the attached patch.



This seems to be working for my test cases.
--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Add pg_get_acl() function get the ACL for a database object

2024-06-19 Thread Isaac Morland
On Wed, 19 Jun 2024 at 07:35, Joel Jacobson  wrote:

> Hello hackers,
>
> Currently, obtaining the Access Control List (ACL) for a database object
> requires querying specific pg_catalog tables directly, where the user
> needs to know the name of the ACL column for the object.
>

I have no idea how often this would be useful, but I wonder if it could
work to have overloaded single-parameter versions for each of regprocedure
(pg_proc.proacl), regclass (pg_class.relacl), …. To call, just cast the OID
to the appropriate reg* type.

For example: To get the ACL for table 'example_table', call pg_get_acl
('example_table'::regclass)
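
A sketch of what the suggested overloads could look like from the caller's
side; these single-parameter variants are not part of the posted patch, and
each one would simply read the ACL column implied by the argument's reg* type:

```
SELECT pg_get_acl('test_schema.test_table'::regclass);   -- pg_class.relacl
SELECT pg_get_acl('lower(text)'::regprocedure);          -- pg_proc.proacl
```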


Re: Add pg_get_acl() function get the ACL for a database object

2024-06-19 Thread Joel Jacobson
On Wed, Jun 19, 2024, at 15:51, Ranier Vilela wrote:
> Regarding the patch, could it be written in the following style?

Thanks for the nice improvement. New version attached.

Best,
Joel

v2-0001-Add-pg_get_acl.patch
Description: Binary data


Re: Avoid orphaned objects dependencies, take 3

2024-06-19 Thread Bertrand Drouvot
Hi,

On Mon, Jun 17, 2024 at 05:57:05PM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Jun 17, 2024 at 12:24:46PM -0400, Robert Haas wrote:
> > So to try to sum up here: I'm not sure I agree with this design. But I
> > also feel like the design is not as clear and consistently implemented
> > as it could be. So even if I just ignored the question of whether it's
> > the right design, it feels like we're a ways from having something
> > potentially committable here, because of issues like the ones I
> > mentioned in the last paragraph.
> > 
> 
> Agree. I'll now move on with the "XXX Do we need a lock for 
> RelationRelationId?"
> comments that I put in v10 (see [1]) and study all the cases around them.

A. I went through all of them, did some testing for all, and reached the
conclusion that we must be in one of the two following cases, either of which
would already prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock on
its table, a table protected by a lock on a function that depends on
the table...).

So we don't need to add a lock if this is a RelationRelationId object class for
the cases above.

As a consequence, I replaced the "XXX" related comments that were in v10 by
another set of comments in v11 (attached) like "No need to call 
LockRelationOid()
(through LockNotPinnedObject())". Reason is to make it clear in the code
and also to ease the review.

B. I explained in [1] (while sharing v10) that the object locking is now outside
of the dependency code except for (and I explained why):

recordDependencyOnExpr() 
recordDependencyOnSingleRelExpr()
makeConfigurationDependencies()

So I also did some testing, on the RelationRelationId case, for those and I
reached the same conclusion as the one shared above.

For A. and B. the testing has been done by adding a "ereport(WARNING.." at
those places when a RelationRelationId is involved. Then I run "make check"
and went to the failed tests (output were not the expected ones due to the
extra "WARNING"), reproduced them with gdb and checked for the lock on the
relation producing the "WARNING". All of those were linked to 1. or 2.

Note that adding an assertion on an existing lock would not work for the cases
described in 2.

So, I'm now confident that we must be in 1. or 2. but it's also possible
that I've missed some cases (though I think the way the testing has been done is
not that weak).

To sum up, I did not see any cases that did not lead to 1. or 2., so I think
it's safe to not add an extra lock for the RelationRelationId case. If, for any
reason, there are still cases outside 1. or 2., then they may lead to
orphaned dependencies linked to the RelationRelationId class. I think that's
fine to take that "risk" given that a. that would not be worse than currently
and b. I did not see any of those in our fleet currently (while I have seen a
non-negligible amount outside of the RelationRelationId case).
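
For reference, scenario 1 from the patch's commit message (quoted below),
spelled out as the two psql sessions it describes; the schema and function
names are just placeholders:

```
-- setup (once): CREATE SCHEMA schem;
-- session 1:
BEGIN;
DROP SCHEMA schem;
-- session 2: with the patch, this waits for session 1 and then reports an
-- error once session 1 commits, instead of leaving an orphaned dependency
CREATE FUNCTION schem.f() RETURNS int LANGUAGE sql AS 'SELECT 1';
-- session 1:
COMMIT;
```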

> Once done, I think that it will easier to 1.remove ambiguity, 2.document and
> 3.do the "right" thing regarding the RelationRelationId object class.
> 

Please find attached v11, where I added more detailed comments in the commit
message and also in the code (I also removed the useless check on
AuthMemRelationId).

[1]: 
https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 50e11432960e0ad5d940d2e7d9557fc4770d8262 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v11] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch takes into account any type of objects except:

- the ones that 

Re: Better error message when --single is not the first arg to postgres executable

2024-06-19 Thread Nathan Bossart
On Tue, Jun 18, 2024 at 09:42:32PM -0400, Greg Sabino Mullane wrote:
> If I am reading your patch correctly, we have lost the behavior of least
> surprise in which the first "meta" argument overrides all others:
> 
> $ bin/postgres --version --boot --extrastuff
> postgres (PostgreSQL) 16.2

Right, with the patch we fail if there are multiple such options specified:

$ postgres --version --help
FATAL:  multiple server modes set
DETAIL:  Only one of --check, --boot, --describe-config, --single, 
--help/-?, --version/-V, -C may be set.

> What about just inlining --version and --help e.g.
> 
> else if (strcmp(argv[i], "--version") == 0 || strcmp(argv[i], "-V") == 0)
> {
>  fputs(PG_BACKEND_VERSIONSTR, stdout);
>  exit(0);
> }
> 
> I'm fine with being more persnickety about the other options; they are much
> rarer and not unixy.

That seems like it should work.  I'm not sure I agree that's the least
surprising behavior (e.g., what exactly is the user trying to tell us with
commands like "postgres --version --help --describe-config"?), but I also
don't feel too strongly about it.

-- 
nathan




Re: Add pg_get_acl() function get the ACL for a database object

2024-06-19 Thread Ranier Vilela
Em qua., 19 de jun. de 2024 às 10:28, Ranier Vilela 
escreveu:

> Em qua., 19 de jun. de 2024 às 10:26, Joel Jacobson 
> escreveu:
>
>> Hi Ranier,
>>
>> Thanks for looking at this.
>>
>> I've double-checked the patch I sent, and it works fine.
>>
>> I think I know the cause of your problem:
>>
>> Since this is a catalog change, you need to run `make clean`, to ensure
>> the catalog is rebuilt,
>> followed by the usual `make && make install`.
>>
>> You also need to run `initdb` to create a new database cluster, with the
>> new catalog version.
>>
>> Let me know if you need more specific instructions.
>>
> Sorry, sorry but I'm on Windows -> meson.
>
> Double checked with:
> ninja clean
> ninja
> ninja install
>
Sorry for the noise, now pg_get_acl is shown in the regress test.

Regarding the patch, could it be written in the following style?

Datum
pg_get_acl(PG_FUNCTION_ARGS)
{
    Oid         classId = PG_GETARG_OID(0);
    Oid         objectId = PG_GETARG_OID(1);
    Oid         catalogId;
    AttrNumber  Anum_oid;
    AttrNumber  Anum_acl;

    /* for "pinned" items in pg_depend, return null */
    if (!OidIsValid(classId) && !OidIsValid(objectId))
        PG_RETURN_NULL();

    catalogId = (classId == LargeObjectRelationId) ?
        LargeObjectMetadataRelationId : classId;
    Anum_oid = get_object_attnum_oid(catalogId);
    Anum_acl = get_object_attnum_acl(catalogId);

    if (Anum_acl != InvalidAttrNumber)
    {
        Relation    rel;
        HeapTuple   tup;
        Datum       datum;
        bool        isnull;

        rel = table_open(catalogId, AccessShareLock);

        tup = get_catalog_object_by_oid(rel, Anum_oid, objectId);
        if (!HeapTupleIsValid(tup))
            elog(ERROR, "cache lookup failed for object %u of catalog \"%s\"",
                 objectId, RelationGetRelationName(rel));

        datum = heap_getattr(tup, Anum_acl, RelationGetDescr(rel), &isnull);

        table_close(rel, AccessShareLock);

        if (!isnull)
            PG_RETURN_DATUM(datum);
    }

    PG_RETURN_NULL();
}

best regards,
Ranier Vilela


Re: Meson far from ready on Windows

2024-06-19 Thread Dave Page
Hi

On Tue, 18 Jun 2024 at 17:08, Andres Freund  wrote:

> > None of the dependencies include cmake files for distribution on Windows,
> > so there are no additional files to tell meson to search for. The same
> > applies to pkgconfig files, which is why the EDB team had to manually
> craft
> > them.
>
> Many of them do include at least cmake files on windows if you build them
> though?
>

The only one that does is libxml2 as far as I can see. And that doesn't
seem to work even if I use --cmake-prefix-path= as you suggested:

C:\Users\dpage\git\postgresql>meson setup --auto-features=disabled --wipe
-Dlibxml=enabled --cmake-prefix-path=C:\build64\lib\cmake\libxml2-2.11.8
build-libxml
The Meson build system
Version: 1.4.0
Source dir: C:\Users\dpage\git\postgresql
Build dir: C:\Users\dpage\git\postgresql\build-libxml
Build type: native build
Project name: postgresql
Project version: 17beta1
C compiler for the host machine: cl (msvc 19.39.33523 "Microsoft (R) C/C++
Optimizing Compiler Version 19.39.33523 for x64")
C linker for the host machine: link link 14.39.33523.0
Host machine cpu family: x86_64
Host machine cpu: x86_64
Run-time dependency threads found: YES
Library ws2_32 found: YES
Library secur32 found: YES
Program perl found: YES (C:\Strawberry\perl\bin\perl.EXE)
Program python found: YES (C:\Python312\python.EXE)
Program win_flex found: YES 2.6.4 2.6.4
(C:\ProgramData\chocolatey\bin\win_flex.EXE)
Program win_bison found: YES 3.7.4 3.7.4
(C:\ProgramData\chocolatey\bin\win_bison.EXE)
Program sed found: YES (C:\ProgramData\chocolatey\bin\sed.EXE)
Program prove found: YES (C:\Strawberry\perl\bin\prove.BAT)
Program tar found: YES (C:\Windows\system32\tar.EXE)
Program gzip found: YES (C:\ProgramData\chocolatey\bin\gzip.EXE)
Program lz4 found: NO
Program openssl found: YES (C:\build64\bin\openssl.EXE)
Program zstd found: NO
Program dtrace skipped: feature dtrace disabled
Program config/missing found: YES (sh
C:\Users\dpage\git\postgresql\config/missing)
Program cp found: YES (C:\Program Files (x86)\GnuWin32\bin\cp.EXE)
Program xmllint found: YES (C:\build64\bin\xmllint.EXE)
Program xsltproc found: YES (C:\build64\bin\xsltproc.EXE)
Program wget found: YES (C:\ProgramData\chocolatey\bin\wget.EXE)
Program C:\Python312\Scripts\meson found: YES
(C:\Python312\Scripts\meson.exe)
Check usable header "bsd_auth.h" skipped: feature bsd_auth disabled
Check usable header "dns_sd.h" skipped: feature bonjour disabled
Compiler for language cpp skipped: feature llvm disabled
Found pkg-config: YES (C:\ProgramData\chocolatey\bin\pkg-config.EXE) 0.28
Found CMake: C:\Program Files\Microsoft Visual
Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.EXE
(3.28.0)
Run-time dependency libxml-2.0 found: NO (tried pkgconfig and cmake)

meson.build:796:11: ERROR: Dependency "libxml-2.0" not found, tried
pkgconfig and cmake

A full log can be found at
C:\Users\dpage\git\postgresql\build-libxml\meson-logs\meson-log.txt


>
>
> Btw, I've been working with Bilal to add a few of the dependencies to the
> CI
> images so we can test those automatically. Using vcpkg. We got that nearly
> working, but he's on vacation this week...  That does ensure both cmake and
> .pc files are generated, fwiw.
>
> Currently builds gettext, icu, libxml2, libxslt, lz4, openssl, pkgconf,
> python3, tcl, zlib, zstd.


That appears to be using Mingw/Msys, which is quite different from a VC++
build, in part because it's a full environment with its own package manager
and packages that people have put a lot of effort into making work as they
do on unix.


> I'm *NOT* sure that vcpkg is the way to go, fwiw. It does seem
> advantageous to
> use one of the toolkits thats commonly built for building dependencies on
> windows, which seems to mean vcpkg or conan.
>

I don't think requiring or expecting vcpkg or conan is reasonable at all,
for a number of reasons:

- Neither supports all the dependencies at present.
- There are real supply chain verification concerns for vendors.
- That would be a huge change from what we've required in the last 19
years, with no deprecation notices or warnings for packagers etc.


> > And that's why we really need to be able to locate headers and libraries
> > easily by passing paths to meson, as we can't rely on pkgconfig, cmake,
> or
> > things being in some standard directory on Windows.
>
> Except that that often causes hard to diagnose breakages, because that
> doesn't
> allow including the necessary compiler/linker flags [2].  It's a bad
> model, we shouldn't
> perpetuate it.  If we want to forever make windows a complicated annoying
> stepchild, that's the way to go.
>

That is a good point, though I suspect it wouldn't solve your second
example of the Kerberos libraries, as you'll get both 32- and 64-bit libs if
you follow their standard process for building on Windows, so you still need
to have code to pick the right ones.


>
> FWIW, at least libzstd, libxml [3], lz4, zlib can generate cmake 

Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-06-19 Thread Bruce Momjian
On Tue, Jun 18, 2024 at 02:00:34PM -0400, Bruce Momjian wrote:
> While the license we are concerned about does not have this clause, it
> does have:
> 
>2. Redistributions in binary form must reproduce the above
>   copyright notice, this list of conditions and the following
>   disclaimer in the documentation and/or other materials provided
>   with the distribution.
> 
> I assume that must also include the name of the copyright holder.
> 
> I think that means we need to mention The Regents of the University of
> California in our copyright notice, which we do.  However several
> non-Regents of the University of California copyright holder licenses
> exist in our source tree, and accepting this AVX-512 patch would add
> another one.  Specifically, I see existing entries for:
> 
>   Aaron D. Gifford
>   Board of Trustees of the University of Illinois
>   David Burren
>   Eric P. Allman
>   Jens Schweikhardt
>   Marko Kreen
>   Sun Microsystems, Inc.
>   WIDE Project
>   
> Now, some of these are these names plus Berkeley, and some are just the
> names above.

In summary, either we are doing something wrong in how we list
copyrights in our documentation, or we don't need to make any changes for
this Intel patch.

Our license is at:

https://www.postgresql.org/about/licence/

The Intel copyright in the source code is:

 * Copyright 2017 The Chromium Authors
 * Copyright (c) 2024, Intel(r) Corporation
 *
 * Use of this source code is governed by a BSD-style license that can 
be
 * found in the Chromium source repository LICENSE file.
 * 
https://chromium.googlesource.com/chromium/src/+/refs/heads/main/LICENSE

and the URL contents are:

// Copyright 2015 The Chromium Authors
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions 
are
// met:
//
//* Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//* Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following 
disclaimer
// in the documentation and/or other materials provided with the
// distribution.
//* Neither the name of Google LLC nor the names of its
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
// OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Google LLC is added to clause three, and I assume Intel is also covered
by this because it is considered "the names of its contributors", maybe?

It would be good to know exactly what, if any, changes the Intel lawyers
want us to make to our license if we accept this patch.

There are also different versions of clause three in our source tree.
The Postgres license only lists the University of California in our
equivalent of clause three, meaning that there are three-clause BSD
licenses in our source tree that reference entities that we don't
reference in the Postgres license.  Oddly, the Postgres license doesn't
even disclaim warranties for the PostgreSQL Global Development Group,
only for Berkeley.

An even bigger issue is that we are distributing 3-clause BSD licensed
software under the Postgres license, which is not the 3-clause BSD
license.  I think we were functioning under the assumption that the
licenses are compatible, so they can be combined, which is true, but I don't
think we can assume the individual licenses can be covered by our one
license, can we?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-19 Thread Matthias van de Meent
On Wed, 19 Jun 2024 at 15:17, Robert Haas  wrote:
>
> On Fri, Jun 14, 2024 at 5:29 PM Nathan Bossart  
> wrote:
> > On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
> > > Actually, I think you are right that we need a manual checkpoint, except I
> > > think we need it to be after prepare_new_globals().  set_frozenxids()
> > > connects to each database (including template0) and updates a bunch of
> > > pg_class rows, and we probably want those on disk before we start copying
> > > the files to create all the user's databases.
> >
> > Here is an updated patch.
>
> OK, I have a (probably) stupid question. The comment says:
>
> + * In binary upgrade mode, we can skip this checkpoint because neither of
> + * these problems applies: we don't ever replay the WAL generated during
> + * pg_upgrade, and we don't concurrently modify template0 (not to mention
> + * that trying to take a backup during pg_upgrade is pointless).
>
> But what happens if the system crashes during pg_upgrade? Does this
> patch make things worse than they are today? And should we care?

As Nathan just said, AFAIK we don't have a way to resume progress from
a crashed pg_upgrade, which implies you currently have to start over
with a fresh dbinit.
This patch wouldn't change that for the worse, it'd only add one more
process that depends on that behaviour.

Maybe in the future that'll change if pg_upgrade and pg_dump are made
smart enough to resume their restore progress across forceful
disconnects, but for now this patch seems like a nice performance
improvement.
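
For reference, the statement form being discussed; with the patch, pg_upgrade's
restore step would issue something along these lines for each user database
(database name invented):

```
CREATE DATABASE mydb WITH TEMPLATE = template0 STRATEGY = FILE_COPY;
```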

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-19 Thread Nathan Bossart
On Wed, Jun 19, 2024 at 09:17:00AM -0400, Robert Haas wrote:
> OK, I have a (probably) stupid question. The comment says:
> 
> + * In binary upgrade mode, we can skip this checkpoint because neither of
> + * these problems applies: we don't ever replay the WAL generated during
> + * pg_upgrade, and we don't concurrently modify template0 (not to mention
> + * that trying to take a backup during pg_upgrade is pointless).
> 
> But what happens if the system crashes during pg_upgrade? Does this
> patch make things worse than they are today? And should we care?

My understanding is that you basically have to restart the upgrade from
scratch if that happens.  I suppose there could be a problem if you try to
use the half-upgraded cluster after a crash, but I imagine you have a good
chance of encountering other problems if you do that, too.  So I don't
think we care...

-- 
nathan




Re: Avoid orphaned objects dependencies, take 3

2024-06-19 Thread Ashutosh Sharma
Hi Robert,

On Wed, Jun 19, 2024 at 5:50 PM Robert Haas  wrote:
>
> On Wed, Jun 19, 2024 at 7:49 AM Ashutosh Sharma  wrote:
> > If the dependency is more, this can hit max_locks_per_transaction
> > limit very fast.
>
> Your experiment doesn't support this conclusion. Very few users would
> have 15 separate user-defined types in the same table, and even if
> they did, and dropped the table, using 23 locks is no big deal. By
> default, max_locks_per_transaction is 64, so the user would need to
> have more like 45 separate user-defined types in the same table in
> order to use more than 64 locks. So, yes, it is possible that if every
> backend in the system were simultaneously trying to drop a table and
> all of those tables had an average of at least 45 or so user-defined
> types, all different from each other, you might run out of lock table
> space.
>
> But probably nobody will ever do that in real life, and if they did,
> they could just raise max_locks_per_transaction.
>
> When posting about potential problems like this, it is a good idea to
> first do a careful thought experiment to assess how realistic the
> problem is. I would consider an issue like this serious if there were
> a realistic scenario under which a small number of backends could
> exhaust the lock table for the whole system, but I think you can see
> that this isn't the case here. Even your original scenario is more
> extreme than what most people are likely to hit in real life, and it
> only uses 23 locks.
>

I agree that based on the experiment I shared (which is somewhat
unrealistic), this doesn't seem to have any significant implications.
However, I was concerned that it could potentially increase the usage
of max_locks_per_transaction, which is why I wanted to mention it
here. Nonetheless, my experiment did not reveal any serious issues
related to this. Sorry for the noise.

--
With Regards,
Ashutosh Sharma.




Re: ON ERROR in json_query and the like

2024-06-19 Thread jian he
On Mon, Jun 17, 2024 at 9:07 PM Markus Winand  wrote:
>
>
> I think it affects the following feature IDs:
>
>   - T821, Basic SQL/JSON query operators
>  For JSON_VALUE, JSON_TABLE and JSON_EXISTS
>   - T828, JSON_QUERY
>
> Also, how hard would it be to add the functions that accept
> character strings? Is there, besides the effort, any thing else
> against it? I’m asking because I believe once released it might
> never be changed — for backward compatibility.
>

We have ExecEvalCoerceViaIOSafe, so it's doable.
I tried, but other things break, so it's not super easy, I think.

Because of eval_const_expressions_mutator, postgres will constantly
evaluate simple const expressions in order to simplify them. For
`SELECT JSON_QUERY('a', '$');`
postgres will try to cast the text literal 'a' to jsonb. That cast
fails, but it's hard to pass the "cast failed" information on to the
later code in ExecInterpExpr. In ExecEvalCoerceViaIOSafe, if the cast
coercion fails, the function's result value is zero and isnull is set
to true.

`SELECT JSON_QUERY('a', '$');`
will be equivalent to
`SELECT JSON_QUERY(NULL, '$');`

So making either of the next two examples return jsonb 1 would be hard
to implement.
SELECT JSON_QUERY('a', '$' default  '1' on empty);
SELECT JSON_QUERY('a', '$' default  '1' on error);
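
For contrast, a quick example of my own (assuming the SQL/JSON query functions
as committed for v17) where the clause is reachable because the context_item
really is jsonb; the two statements above instead fail, or become NULL with the
Safe coercion, during the text-to-jsonb cast, before ON EMPTY/ON ERROR can
apply:

```
-- context_item is valid jsonb, so the ON EMPTY branch can be taken
SELECT JSON_QUERY(jsonb '{"a": 1}', '$.b' DEFAULT '"missing"'::jsonb ON EMPTY);
-- expected to return "missing"
```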


--
If the standard says the context_item can be a text string (one that cannot be
cast to json successfully), and a future version makes that work,
then it should be fine?
It would be like before: the previous version was not fully compliant with the
standard, and the new version would be in full compliance with it.




Re: Inconsistency between try_mergejoin_path and create_mergejoin_plan

2024-06-19 Thread Richard Guo
On Wed, Jun 19, 2024 at 12:49 PM Tom Lane  wrote:
> Richard Guo  writes:
> > It seems to me that the new operator is somewhat artificial, since it is
> > designed to support a mergejoin but lacks a valid commutator.  So before
> > we proceed to discuss the fix, I'd like to know whether this is a valid
> > issue that needs fixing.

> I do not think we should add a great deal of complexity or extra
> planner cycles to deal with this; but if it can be fixed at low
> cost, we should.

I think we can simply verify the validity of commutators for clauses in
the form "inner op outer" when selecting mergejoin/hash clauses.  If a
clause lacks a commutator, we should consider it unusable for the given
pair of outer and inner rels.  Please see the attached patch.
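
To make the situation concrete, roughly the shape of operator the check is
aimed at, sketched with invented names and piggy-backing on the built-in
int84eq function; it is mergejoinable (MERGES plus btree opfamily membership)
but declares no COMMUTATOR, so an "inner op outer" clause cannot be flipped at
plan creation time:

```
CREATE FUNCTION int8_eq_int4(int8, int4) RETURNS bool
    LANGUAGE internal IMMUTABLE STRICT AS 'int84eq';
CREATE OPERATOR === (
    LEFTARG = int8, RIGHTARG = int4,
    FUNCTION = int8_eq_int4,
    MERGES                      -- no COMMUTATOR declared
);
-- needs superuser; purely for illustration
ALTER OPERATOR FAMILY integer_ops USING btree
    ADD OPERATOR 3 === (int8, int4);
```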

Thanks
Richard


v2-0001-Check-the-validity-of-commutators-for-merge-hash-clauses.patch
Description: Binary data


Re: Add pg_get_acl() function get the ACL for a database object

2024-06-19 Thread Ranier Vilela
Em qua., 19 de jun. de 2024 às 10:26, Joel Jacobson 
escreveu:

> Hi Ranier,
>
> Thanks for looking at this.
>
> I've double-checked the patch I sent, and it works fine.
>
> I think I know the cause of your problem:
>
> Since this is a catalog change, you need to run `make clean`, to ensure
> the catalog is rebuilt,
> followed by the usual `make && make install`.
>
> You also need to run `initdb` to create a new database cluster, with the
> new catalog version.
>
> Let me know if you need more specific instructions.
>
Sorry, sorry but I'm on Windows -> meson.

Double checked with:
ninja clean
ninja
ninja install

best regards,
Ranier Vilela


Re: Add pg_get_acl() function get the ACL for a database object

2024-06-19 Thread Joel Jacobson
Hi Ranier,

Thanks for looking at this.

I've double-checked the patch I sent, and it works fine.

I think I know the cause of your problem:

Since this is a catalog change, you need to run `make clean`, to ensure the 
catalog is rebuilt,
followed by the usual `make && make install`.

You also need to run `initdb` to create a new database cluster, with the new 
catalog version.

Let me know if you need more specific instructions.

Best,
Joel

On Wed, Jun 19, 2024, at 14:59, Ranier Vilela wrote:
> Em qua., 19 de jun. de 2024 às 08:35, Joel Jacobson  
> escreveu:
>> Hello hackers,
>> 
>> Currently, obtaining the Access Control List (ACL) for a database object
>> requires querying specific pg_catalog tables directly, where the user
>> needs to know the name of the ACL column for the object.
>> 
>> Consider:
>> 
>> ```
>> CREATE USER test_user;
>> CREATE USER test_owner;
>> CREATE SCHEMA test_schema AUTHORIZATION test_owner;
>> SET ROLE TO test_owner;
>> CREATE TABLE test_schema.test_table ();
>> GRANT SELECT ON TABLE test_schema.test_table TO test_user;
>> ```
>> 
>> To get the ACL we can do:
>> 
>> ```
>> SELECT relacl FROM pg_class WHERE oid = 
>> 'test_schema.test_table'::regclass::oid;
>> 
>>  relacl
>> -
>>  {test_owner=arwdDxtm/test_owner,test_user=r/test_owner}
>> ```
>> 
>> Attached patch adds a new SQL-callable function `pg_get_acl()`, so we can do:
>> 
>> ```
>> SELECT pg_get_acl('pg_class'::regclass, 
>> 'test_schema.test_table'::regclass::oid);
>>pg_get_acl
>> -
>>  {test_owner=arwdDxtm/test_owner,test_user=r/test_owner}
>> ```
>> 
>> The original idea for this function came from Alvaro Herrera,
>> in this related discussion:
>> https://postgr.es/m/261def6b-226e-4238-b7eb-ff240cb9c...@www.fastmail.com
>> 
>> On Thu, Mar 25, 2021, at 16:16, Alvaro Herrera wrote:
>> > On 2021-Mar-25, Joel Jacobson wrote:
>> >
>> >> pg_shdepend doesn't contain the aclitem info though,
>> >> so it won't work for pg_permissions if we want to expose
>> >> privilege_type, is_grantable and grantor.
>> >
>> > Ah, of course -- the only way to obtain the acl columns is by going
>> > through the catalogs individually, so it won't be possible.  I think
>> > this could be fixed with some very simple, quick function pg_get_acl()
>> > that takes a catalog OID and object OID and returns the ACL; then
>> > use aclexplode() to obtain all those details.
>> 
>> The pg_get_acl() function has been implemented by following
>> the guidance from Alvaro in the related discussion:
>> 
>> On Fri, Mar 26, 2021, at 13:43, Alvaro Herrera wrote:
>> > AFAICS the way to do it is like AlterObjectOwner_internal obtains data
>> > -- first do get_catalog_object_by_oid (gives you the HeapTuple that
>> > represents the object), then
>> > heap_getattr( ..., get_object_attnum_acl(), ..), and there you have the
>> > ACL which you can "explode" (or maybe just return as-is).
>> >
>> > AFAICS if you do this, it's just one cache lookups per object, or
>> > one indexscan for the cases with no by-OID syscache.  It should be much
>> > cheaper than the UNION ALL query.  And you use pg_shdepend to guide
>> > this, so you only do it for the objects that you already know are
>> > interesting.
>> 
>> Many thanks Alvaro for the very helpful instructions.
>> 
>> This function would then allow users to e.g. create a view to show the 
>> privileges
>> for all database objects, like the pg_privileges system view suggested in the
>> related discussion.
>> 
>> Tests and docs are added.
> Hi,
> For some reason, the function pg_get_acl, does not exist in generated 
> fmgrtab.c
>
> So, when install postgres, the function does not work.
>
> postgres=# SELECT pg_get_acl('pg_class'::regclass, 
> 'atest2'::regclass::oid);
> ERROR:  function pg_get_acl(regclass, oid) does not exist
> LINE 1: SELECT pg_get_acl('pg_class'::regclass, 'atest2'::regclass::...
>^
> HINT:  No function matches the given name and argument types. You might 
> need to add explicit type casts.
>
> best regards,
> Ranier Vilela

-- 
Kind regards,

Joel




Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-19 Thread Robert Haas
On Fri, Jun 14, 2024 at 5:29 PM Nathan Bossart  wrote:
> On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
> > Actually, I think you are right that we need a manual checkpoint, except I
> > think we need it to be after prepare_new_globals().  set_frozenxids()
> > connects to each database (including template0) and updates a bunch of
> > pg_class rows, and we probably want those on disk before we start copying
> > the files to create all the user's databases.
>
> Here is an updated patch.

OK, I have a (probably) stupid question. The comment says:

+ * In binary upgrade mode, we can skip this checkpoint because neither of
+ * these problems applies: we don't ever replay the WAL generated during
+ * pg_upgrade, and we don't concurrently modify template0 (not to mention
+ * that trying to take a backup during pg_upgrade is pointless).

But what happens if the system crashes during pg_upgrade? Does this
patch make things worse than they are today? And should we care?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-19 Thread Matthias van de Meent
On Fri, 14 Jun 2024 at 23:29, Nathan Bossart  wrote:
>
> On Wed, Jun 12, 2024 at 09:41:01AM -0500, Nathan Bossart wrote:
> > Actually, I think you are right that we need a manual checkpoint, except I
> > think we need it to be after prepare_new_globals().  set_frozenxids()
> > connects to each database (including template0) and updates a bunch of
> > pg_class rows, and we probably want those on disk before we start copying
> > the files to create all the user's databases.

Good catch, I hadn't thought of that.

> Here is an updated patch.

LGTM.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Add pg_get_acl() function get the ACL for a database object

2024-06-19 Thread Ranier Vilela
Em qua., 19 de jun. de 2024 às 08:35, Joel Jacobson 
escreveu:

> Hello hackers,
>
> Currently, obtaining the Access Control List (ACL) for a database object
> requires querying specific pg_catalog tables directly, where the user
> needs to know the name of the ACL column for the object.
>
> Consider:
>
> ```
> CREATE USER test_user;
> CREATE USER test_owner;
> CREATE SCHEMA test_schema AUTHORIZATION test_owner;
> SET ROLE TO test_owner;
> CREATE TABLE test_schema.test_table ();
> GRANT SELECT ON TABLE test_schema.test_table TO test_user;
> ```
>
> To get the ACL we can do:
>
> ```
> SELECT relacl FROM pg_class WHERE oid =
> 'test_schema.test_table'::regclass::oid;
>
>  relacl
> -
>  {test_owner=arwdDxtm/test_owner,test_user=r/test_owner}
> ```
>
> Attached patch adds a new SQL-callable function `pg_get_acl()`, so we can
> do:
>
> ```
> SELECT pg_get_acl('pg_class'::regclass,
> 'test_schema.test_table'::regclass::oid);
>pg_get_acl
> -
>  {test_owner=arwdDxtm/test_owner,test_user=r/test_owner}
> ```
>
> The original idea for this function came from Alvaro Herrera,
> in this related discussion:
> https://postgr.es/m/261def6b-226e-4238-b7eb-ff240cb9c...@www.fastmail.com
>
> On Thu, Mar 25, 2021, at 16:16, Alvaro Herrera wrote:
> > On 2021-Mar-25, Joel Jacobson wrote:
> >
> >> pg_shdepend doesn't contain the aclitem info though,
> >> so it won't work for pg_permissions if we want to expose
> >> privilege_type, is_grantable and grantor.
> >
> > Ah, of course -- the only way to obtain the acl columns is by going
> > through the catalogs individually, so it won't be possible.  I think
> > this could be fixed with some very simple, quick function pg_get_acl()
> > that takes a catalog OID and object OID and returns the ACL; then
> > use aclexplode() to obtain all those details.
>
> The pg_get_acl() function has been implemented by following
> the guidance from Alvaro in the related discussion:
>
> On Fri, Mar 26, 2021, at 13:43, Alvaro Herrera wrote:
> > AFAICS the way to do it is like AlterObjectOwner_internal obtains data
> > -- first do get_catalog_object_by_oid (gives you the HeapTuple that
> > represents the object), then
> > heap_getattr( ..., get_object_attnum_acl(), ..), and there you have the
> > ACL which you can "explode" (or maybe just return as-is).
> >
> > AFAICS if you do this, it's just one cache lookups per object, or
> > one indexscan for the cases with no by-OID syscache.  It should be much
> > cheaper than the UNION ALL query.  And you use pg_shdepend to guide
> > this, so you only do it for the objects that you already know are
> > interesting.
>
> Many thanks Alvaro for the very helpful instructions.
>
> This function would then allow users to e.g. create a view to show the
> privileges
> for all database objects, like the pg_privileges system view suggested in
> the
> related discussion.
>
> Tests and docs are added.
>
Hi,
For some reason, the function pg_get_acl, does not exist in generated
fmgrtab.c

So, when install postgres, the function does not work.

postgres=# SELECT pg_get_acl('pg_class'::regclass, 'atest2'::regclass::oid);
ERROR:  function pg_get_acl(regclass, oid) does not exist
LINE 1: SELECT pg_get_acl('pg_class'::regclass, 'atest2'::regclass::...
   ^
HINT:  No function matches the given name and argument types. You might
need to add explicit type casts.

best regards,
Ranier Vilela


Re: Avoid orphaned objects dependencies, take 3

2024-06-19 Thread Robert Haas
On Wed, Jun 19, 2024 at 7:49 AM Ashutosh Sharma  wrote:
> If the dependency is more, this can hit max_locks_per_transaction
> limit very fast.

Your experiment doesn't support this conclusion. Very few users would
have 15 separate user-defined types in the same table, and even if
they did, and dropped the table, using 23 locks is no big deal. By
default, max_locks_per_transaction is 64, so the user would need to
have more like 45 separate user-defined types in the same table in
order to use more than 64 locks. So, yes, it is possible that if every
backend in the system were simultaneously trying to drop a table and
all of those tables had an average of at least 45 or so user-defined
types, all different from each other, you might run out of lock table
space.

But probably nobody will ever do that in real life, and if they did,
they could just raise max_locks_per_transaction.
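
For completeness, bumping the limit is a one-liner, though it only takes effect
after a server restart since max_locks_per_transaction sizes the shared lock
table at startup:

```
ALTER SYSTEM SET max_locks_per_transaction = 128;
-- then restart the server (e.g. pg_ctl restart) for the new value to apply
```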

When posting about potential problems like this, it is a good idea to
first do a careful thought experiment to assess how realistic the
problem is. I would consider an issue like this serious if there were
a realistic scenario under which a small number of backends could
exhaust the lock table for the whole system, but I think you can see
that this isn't the case here. Even your original scenario is more
extreme than what most people are likely to hit in real life, and it
only uses 23 locks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: State of pg_createsubscriber

2024-06-19 Thread Robert Haas
On Tue, Jun 18, 2024 at 11:55 PM Amit Kapila  wrote:
> We can close the open item pointing to this thread.

Done, and for the record I also asked for the thread to be split, back
on May 22.

IMHO, we shouldn't add open items pointing to general complaints like
the one that started this thread. Open items need to be specific and
actionable.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pgoutput not capturing the generated columns

2024-06-19 Thread Shubham Khanna
On Mon, Jun 17, 2024 at 1:57 PM Peter Smith  wrote:
>
> Hi, here are my review comments for patch v7-0001.
>
> ==
> 1. GENERAL - \dRs+
>
> Shouldn't the new SUBSCRIPTION parameter be exposed via "describe"
> (e.g. \dRs+ mysub) the same as the other boolean parameters?
>
> ==
> Commit message
>
> 2.
> When 'include_generated_columns' is false then the PUBLICATION
> col-list will ignore any generated cols even when they are present in
> a PUBLICATION col-list
>
> ~
>
> Maybe you don't need to mention "PUBLICATION col-list" twice.
>
> SUGGESTION
> When 'include_generated_columns' is false, generated columns are not
> replicated, even when present in a PUBLICATION col-list.
>
> ~~~
>
> 2.
> CREATE SUBSCRIPTION test1 connection 'dbname=postgres host=localhost port=
> 'publication pub1;
>
> ~
>
> 2a.
> (I've questioned this one in previous reviews)
>
> What exactly is the purpose of this statement in the commit message?
> Was this supposed to demonstrate the usage of the
> 'include_generated_columns' parameter?
>
> ~
>
> 2b.
> /publication/ PUBLICATION/
>
>
> ~~~
>
> 3.
> If the subscriber-side column is also a generated column then
> thisoption has no effect; the replicated data will be ignored and the
> subscriber column will be filled as normal with the subscriber-side
> computed or default data.
>
> ~
>
> Missing space: /thisoption/this option/
>
> ==
> .../expected/decoding_into_rel.out
>
> 4.
> +-- When 'include-generated-columns' is not set
> +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> +data
> +-
> + BEGIN
> + table public.gencoltable: INSERT: a[integer]:1 b[integer]:2
> + table public.gencoltable: INSERT: a[integer]:2 b[integer]:4
> + table public.gencoltable: INSERT: a[integer]:3 b[integer]:6
> + COMMIT
> +(5 rows)
>
> Why is the default value here equivalent to
> 'include-generated-columns' = '1' here instead of '0'? The default for
> the CREATE SUBSCRIPTION parameter 'include_generated_columns' is
> false, and IMO it seems confusing for these 2 defaults to be
> different. Here I think it should default to '0' *regardless* of what
> the previous functionality might have done -- e.g. this is a "test
> decoder" so the parameter should behave sensibly.
>
> ==
> .../test_decoding/sql/decoding_into_rel.sql
>
> NITPICK - wrong comments.
>
> ==
> doc/src/sgml/protocol.sgml
>
> 5.
> +
> + include_generated_columns
> +  
> +   
> +Boolean option to enable generated columns. This option controls
> +whether generated columns should be included in the string
> +representation of tuples during logical decoding in PostgreSQL.
> +The default is false.
> +   
> +  
> +
> +
>
> Does the protocol version need to be bumped to support this new option
> and should that be mentioned on this page similar to how all other
> version values are mentioned?

I already did the backward-compatibility test earlier and decided that
a protocol bump is not needed.

> doc/src/sgml/ref/create_subscription.sgml
>
> NITPICK - some missing words/sentence.
> NITPICK - some missing  tags.
> NITPICK - remove duplicated sentence.
> NITPICK - add another .
>
> ==
> src/backend/commands/subscriptioncmds.c
>
> 6.
>  #define SUBOPT_ORIGIN 0x8000
> +#define SUBOPT_include_generated_columns 0x0001
>
> Please use UPPERCASE for consistency with other macros.
>
> ==
> .../libpqwalreceiver/libpqwalreceiver.c
>
> 7.
> + if (options->proto.logical.include_generated_columns &&
> + PQserverVersion(conn->streamConn) >= 17)
> + appendStringInfoString(, ", include_generated_columns 'on'");
> +
>
> IMO it makes more sense to say 'true' here instead of 'on'. It seems
> like this was just cut/paste from the above code (where 'on' was
> sensible).
>
> ==
> src/include/catalog/pg_subscription.h
>
> 8.
> @@ -98,6 +98,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId)
> BKI_SHARED_RELATION BKI_ROW
>   * slots) in the upstream database are enabled
>   * to be synchronized to the standbys. */
>
> + bool subincludegencol; /* True if generated columns must be published */
> +
>
> Not fixed as claimed. This field name ought to be plural.
>
> /subincludegencol/subincludegencols/
>
> ~~~
>
> 9.
>   char*origin; /* Only publish data originating from the
>   * specified origin */
> + bool includegencol; /* publish generated column data */
>  } Subscription;
>
> Not fixed as claimed. This field name ought to be plural.
>
> /includegencol/includegencols/
>
> ==
> src/test/subscription/t/031_column_list.pl
>
> 10.
> +$node_publisher->safe_psql('postgres',
> + "CREATE TABLE tab2 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a
> * 2) STORED)"
> +);
> +
> +$node_publisher->safe_psql('postgres',
> + "CREATE TABLE tab3 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a
> + 10) STORED)"
> +);
> +
> 

Re: Avoid orphaned objects dependencies, take 3

2024-06-19 Thread Ashutosh Sharma
Hi,

If the dependency is more, this can hit max_locks_per_transaction
limit very fast. Won't it? I just tried this little experiment with
and without patch.

1) created some UDTs (I have just chosen some random number, 15)
do $$
declare
i int := 1;
type_name text;
begin
while i <= 15 loop
type_name := format('ct_%s', i);

-- check if the type already exists
if not exists (
select 1
from pg_type
where typname = type_name
) then
execute format('create type %I as (f1 INT, f2 TEXT);', type_name);
end if;

i := i + 1;
end loop;
end $$;

2) started a transaction and tried creating a table that uses all udts
created above:
begin;
create table dep_tab(a ct_1, b ct_2, c ct_3, d ct_4, e ct_5, f ct_6, g
ct_7, h ct_8, i ct_9, j ct_10, k ct_11, l ct_12, m ct_13, n ct_14, o
ct_15);

3) checked the pg_locks entries inside the transaction both with and
without patch:

-- with patch:
select count(*) from pg_locks;
 count
---
23
(1 row)

-- without patch:
select count(*) from pg_locks;
 count
---
 7
(1 row)

With patch, it increased by 3 times. Won't that create a problem if
many concurrent sessions engage in similar activity?
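
In case it helps the discussion, one way to see what the extra entries actually
are, run in the same open transaction (the locks taken on the UDT dependencies
show up with locktype 'object'):

```
SELECT locktype, classid::regclass AS catalog, objid, mode
FROM pg_locks
WHERE pid = pg_backend_pid();
```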

--
With Regards,
Ashutosh Sharma.




Re: Pgoutput not capturing the generated columns

2024-06-19 Thread Shubham Khanna
> Few comments:
> 1) Here tab1 and tab2 are exactly the same tables, just check if the
> table tab1 itself can be used for your tests.
> @@ -24,20 +24,50 @@ $node_publisher->safe_psql('postgres',
> "CREATE TABLE tab1 (a int PRIMARY KEY, b int GENERATED ALWAYS
> AS (a * 2) STORED)"
>  );
> +$node_publisher->safe_psql('postgres',
> +   "CREATE TABLE tab2 (a int PRIMARY KEY, b int GENERATED ALWAYS
> AS (a * 2) STORED)"
> +);

On the subscription side the tables have different descriptions, so we
need to have different tables on the publisher side.

> 2) We can document  that the include_generate_columns option cannot be 
> altered.
>
> 3) You can mention that include-generated-columns is true by default
> and generated column data will be selected
> +-- When 'include-generated-columns' is not set
> +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> +data
> +-
> + BEGIN
> + table public.gencoltable: INSERT: a[integer]:1 b[integer]:2
> + table public.gencoltable: INSERT: a[integer]:2 b[integer]:4
> + table public.gencoltable: INSERT: a[integer]:3 b[integer]:6
> + COMMIT
> +(5 rows)
>
> 4)  The comment seems to be wrong here, the comment says b will not be
> replicated but b is being selected:
> -- When 'include-generated-columns' = '1' the generated column 'b'
> values will not be replicated
> INSERT INTO gencoltable (a) VALUES (1), (2), (3);
> SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1',
> 'include-generated-columns', '1');
> data
> -
>  BEGIN
>  table public.gencoltable: INSERT: a[integer]:1 b[integer]:2
>  table public.gencoltable: INSERT: a[integer]:2 b[integer]:4
>  table public.gencoltable: INSERT: a[integer]:3 b[integer]:6
>  COMMIT
> (5 rows)
>
> 5)  Similarly here too the comment seems to be wrong, the comment says
> b will not be replicated but b is not being selected:
> INSERT INTO gencoltable (a) VALUES (4), (5), (6);
> -- When 'include-generated-columns' = '0' the generated column 'b'
> values will be replicated
> SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1',
> 'include-generated-columns', '0');
>   data
> 
>  BEGIN
>  table public.gencoltable: INSERT: a[integer]:4
>  table public.gencoltable: INSERT: a[integer]:5
>  table public.gencoltable: INSERT: a[integer]:6
>  COMMIT
> (5 rows)
>
> 6) SUBOPT_include_generated_columns change it to SUBOPT_GENERATED to
> keep the name consistent:
> --- a/src/backend/commands/subscriptioncmds.c
> +++ b/src/backend/commands/subscriptioncmds.c
> @@ -72,6 +72,7 @@
>  #define SUBOPT_FAILOVER0x2000
>  #define SUBOPT_LSN 0x4000
>  #define SUBOPT_ORIGIN  0x8000
> +#define SUBOPT_include_generated_columns   0x0001
>
> 7) The comment style seems to be inconsistent, both of them can start
> in lower case
> +-- check include-generated-columns option with generated column
> +CREATE TABLE gencoltable (a int, b int GENERATED ALWAYS AS (a * 2) STORED);
> +INSERT INTO gencoltable (a) VALUES (1), (2), (3);
> +-- When 'include-generated-columns' is not set
> +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> +data
> +-
> + BEGIN
> + table public.gencoltable: INSERT: a[integer]:1 b[integer]:2
> + table public.gencoltable: INSERT: a[integer]:2 b[integer]:4
> + table public.gencoltable: INSERT: a[integer]:3 b[integer]:6
> + COMMIT
> +(5 rows)
> +
> +-- When 'include-generated-columns' = '1' the generated column 'b'
> values will not be replicated
>
> 8) This could be changed to remove the insert statements by using
> pg_logical_slot_peek_changes:
> -- When 'include-generated-columns' is not set
> SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> -- When 'include-generated-columns' = '1' the generated column 'b'
> values will not be replicated
> INSERT INTO gencoltable (a) VALUES (1), (2), (3);
> SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1',
> 'include-generated-columns', '1');
> INSERT INTO gencoltable (a) VALUES (4), (5), (6);
> -- When 'include-generated-columns' = '0' the generated column 'b'
> values will be replicated
> SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1',
> 

Add pg_get_acl() function to get the ACL for a database object

2024-06-19 Thread Joel Jacobson
Hello hackers,

Currently, obtaining the Access Control List (ACL) for a database object
requires querying specific pg_catalog tables directly, where the user
needs to know the name of the ACL column for the object.

Consider:

```
CREATE USER test_user;
CREATE USER test_owner;
CREATE SCHEMA test_schema AUTHORIZATION test_owner;
SET ROLE TO test_owner;
CREATE TABLE test_schema.test_table ();
GRANT SELECT ON TABLE test_schema.test_table TO test_user;
```

To get the ACL we can do:

```
SELECT relacl FROM pg_class WHERE oid = 'test_schema.test_table'::regclass::oid;

 relacl
-
 {test_owner=arwdDxtm/test_owner,test_user=r/test_owner}
```

The attached patch adds a new SQL-callable function `pg_get_acl()`, so we can do:

```
SELECT pg_get_acl('pg_class'::regclass, 
'test_schema.test_table'::regclass::oid);
   pg_get_acl
-
 {test_owner=arwdDxtm/test_owner,test_user=r/test_owner}
```

The original idea for this function came from Alvaro Herrera,
in this related discussion:
https://postgr.es/m/261def6b-226e-4238-b7eb-ff240cb9c...@www.fastmail.com

On Thu, Mar 25, 2021, at 16:16, Alvaro Herrera wrote:
> On 2021-Mar-25, Joel Jacobson wrote:
>
>> pg_shdepend doesn't contain the aclitem info though,
>> so it won't work for pg_permissions if we want to expose
>> privilege_type, is_grantable and grantor.
>
> Ah, of course -- the only way to obtain the acl columns is by going
> through the catalogs individually, so it won't be possible.  I think
> this could be fixed with some very simple, quick function pg_get_acl()
> that takes a catalog OID and object OID and returns the ACL; then
> use aclexplode() to obtain all those details.

The pg_get_acl() function has been implemented by following
the guidance from Alvaro in the related discussion:

On Fri, Mar 26, 2021, at 13:43, Alvaro Herrera wrote:
> AFAICS the way to do it is like AlterObjectOwner_internal obtains data
> -- first do get_catalog_object_by_oid (gives you the HeapTuple that
> represents the object), then
> heap_getattr( ..., get_object_attnum_acl(), ..), and there you have the
> ACL which you can "explode" (or maybe just return as-is).
>
> AFAICS if you do this, it's just one cache lookups per object, or
> one indexscan for the cases with no by-OID syscache.  It should be much
> cheaper than the UNION ALL query.  And you use pg_shdepend to guide
> this, so you only do it for the objects that you already know are
> interesting.

Many thanks Alvaro for the very helpful instructions.

This function would then allow users to e.g. create a view to show the 
privileges
for all database objects, like the pg_privileges system view suggested in the
related discussion.
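
For example, combining it with aclexplode() (assuming the two-argument
signature shown above) gives one row per granted privilege:

SELECT grantor::regrole, grantee::regrole, privilege_type, is_grantable
FROM aclexplode(pg_get_acl('pg_class'::regclass,
                           'test_schema.test_table'::regclass::oid));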

Tests and docs are added.

Best regards,
Joel Jakobsson

0001-Add-pg_get_acl.patch
Description: Binary data


Re: Pgoutput not capturing the generated columns

2024-06-19 Thread Shubham Khanna
> Hi Shubham, thanks for providing a patch.
> I have some comments for v6-0001.
>
> 1. create_subscription.sgml
> There is repetition of the same line.
>
> + 
> +  Specifies whether the generated columns present in the tables
> +  associated with the subscription should be replicated. If the
> +  subscriber-side column is also a generated column then this option
> +  has no effect; the replicated data will be ignored and the 
> subscriber
> +  column will be filled as normal with the subscriber-side computed 
> or
> +  default data.
> +  false.
> + 
> +
> + 
> +  This parameter can only be set true if
> copy_data is
> +  set to false. If the subscriber-side
> column is also a
> +  generated column then this option has no effect; the
> replicated data will
> +  be ignored and the subscriber column will be filled as
> normal with the
> +  subscriber-side computed or default data.
> + 
>
> ==
> 2. subscriptioncmds.c
>
> 2a. The macro name should be in uppercase. We can use a short name
> like 'SUBOPT_INCLUDE_GEN_COL'. Thought?
> +#define SUBOPT_include_generated_columns 0x0001
>
> 2b.Update macro name accordingly
> + if (IsSet(supported_opts, SUBOPT_include_generated_columns))
> + opts->include_generated_columns = false;
>
> 2c. Update macro name accordingly
> + else if (IsSet(supported_opts, SUBOPT_include_generated_columns) &&
> + strcmp(defel->defname, "include_generated_columns") == 0)
> + {
> + if (IsSet(opts->specified_opts, SUBOPT_include_generated_columns))
> + errorConflictingDefElem(defel, pstate);
> +
> + opts->specified_opts |= SUBOPT_include_generated_columns;
> + opts->include_generated_columns = defGetBoolean(defel);
> + }
>
> 2d. Update macro name accordingly
> +   SUBOPT_RUN_AS_OWNER | SUBOPT_FAILOVER | SUBOPT_ORIGIN |
> +   SUBOPT_include_generated_columns);
>
>
> ==
>
> 3. decoding_into_rel.out
>
> 3a. In comment, I think it should be "When 'include-generated-columns'
> = '1' the generated column 'b' values will be replicated"
> +-- When 'include-generated-columns' = '1' the generated column 'b'
> values will not be replicated
> +INSERT INTO gencoltable (a) VALUES (1), (2), (3);
> +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1',
> 'include-generated-columns', '1');
> +data
> +-
> + BEGIN
> + table public.gencoltable: INSERT: a[integer]:1 b[integer]:2
> + table public.gencoltable: INSERT: a[integer]:2 b[integer]:4
> + table public.gencoltable: INSERT: a[integer]:3 b[integer]:6
> + COMMIT
>
> 3b. In comment, I think it should be "When 'include-generated-columns'
> = '1' the generated column 'b' values will not be replicated"
> +-- When 'include-generated-columns' = '0' the generated column 'b'
> values will be replicated
> +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1',
> 'include-generated-columns', '0');
> +  data
> +
> + BEGIN
> + table public.gencoltable: INSERT: a[integer]:4
> + table public.gencoltable: INSERT: a[integer]:5
> + table public.gencoltable: INSERT: a[integer]:6
> + COMMIT
> +(5 rows)
>
> =
>
> 4. Here names for both the tests are the same. I think we should use
> different names.
>
> +$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab2");
> +is( $result, qq(4|8
> +5|10), 'generated columns replicated to non-generated column on subscriber');
> +
> +$node_publisher->safe_psql('postgres', "INSERT INTO tab3 VALUES (4), (5)");
> +
> +$node_publisher->wait_for_catchup('sub3');
> +
> +$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab3");
> +is( $result, qq(4|24
> +5|25), 'generated columns replicated to non-generated column on subscriber');

All the comments are handled.

The attached Patch contains all the suggested changes.

Thanks and Regards,
Shubham Khanna.


v8-0001-Currently-generated-column-values-are-not-replica.patch
Description: Binary data


Re: What is a typical precision of gettimeofday()?

2024-06-19 Thread Hannu Krosing
Here is the output of nanosecond-precision pg_test_timing on M1 Macbook Air

/work/postgres/src/bin/pg_test_timing % ./pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 21.54 ns
Histogram of timing durations:
   <= ns   % of total  running %      count
        0      49.1655    49.1655   68481688
        1       0.0000    49.1655          0
        3       0.0000    49.1655          0
        7       0.0000    49.1655          0
       15       0.0000    49.1655          0
       31       0.0000    49.1655          0
       63      50.6890    99.8545   70603742
      127       0.1432    99.9976     199411
      255       0.0015    99.9991       2065
      511       0.0001    99.9992         98
     1023       0.0001    99.9993        140
     2047       0.0002    99.9995        284
     4095       0.0000    99.9995         50
     8191       0.0000    99.9996         65
    16383       0.0002    99.9997        240
    32767       0.0001    99.9998        128
    65535       0.0001    99.9999         97
   131071       0.0000    99.9999         58
   262143       0.0000   100.0000         44
   524287       0.0000   100.0000         22
  1048575       0.0000   100.0000          7
  2097151       0.0000   100.0000          2
First 128 exact nanoseconds:
        0      49.1655    49.1655   68481688
       41      16.8964    66.0619   23534708
       42      33.7926    99.8545   47069034
       83       0.0835    99.9380     116362
       84       0.0419    99.9799      58349
      125       0.0177    99.9976      24700

As you can see, the 40 ns internal tick gets somewhat blurred into a
not-quite-40-ns timing step.

On Linux / ARM Ampere, where __builtin_readcyclecounter() works (it
compiles but crashes on macOS/M1; I have not yet tested Linux on M1),
the tick is exactly 40 ns, and I'd expect it to be the same on the M1.


On Tue, Jun 18, 2024 at 5:08 PM Hannu Krosing  wrote:
>
> I plan to send patch to pg_test_timing in a day or two
>
> the underlying time precision on modern linux seems to be
>
> 2 ns for some Intel CPUs
> 10 ns for Zen4
> 40 ns for ARM (Ampere)
>
> ---
> Hannu
>
>
>
> |
>
>
>
>
> On Tue, Jun 18, 2024 at 7:48 AM Andrey M. Borodin  
> wrote:
>>
>>
>>
>> > On 19 Mar 2024, at 13:28, Peter Eisentraut  wrote:
>> >
>> > I feel that we don't actually have any information about this portability 
>> > concern.  Does anyone know what precision we can expect from 
>> > gettimeofday()?  Can we expect the full microsecond precision usually?
>>
>> At PGConf.dev Hannu Krosing drew attention to the pg_test_timing module. I’ve
>> tried this module (slightly modified to measure nanoseconds) on some systems,
>> and everywhere I found ~100ns resolution (95% of ticks fall into the 64ns and
>> 128ns buckets).
>>
>> I’ll cc Hannu, and also the pg_test_timing module authors Ants and Greg.
>> Maybe they can add some context.
>>
>>
>> Best regards, Andrey Borodin.




Re: Proposal: Document ABI Compatibility

2024-06-19 Thread Peter Eisentraut

On 18.06.24 00:40, David E. Wheeler wrote:

On Jun 12, 2024, at 11:57, Peter Geoghegan  wrote:


That having been said, it would be useful if there was a community web
resource for this -- something akin to coverage.postgresql.org, but
with differential ABI breakage reports. You can see an example report
here:

https://postgr.es/m/CAH2-Wzm-W6hSn71sUkz0Rem=qdeu7tnfmc7_jg2djrlfef_...@mail.gmail.com

Theoretically anybody can do this themselves. In practice they don't.
So something as simple as providing automated reports about ABI
changes might well move the needle here.


What would be required to make such a thing? Maybe it’d make a good Summer of 
Code project.


The above thread contains a lengthy discussion about what one could do.




Re: Proposal: Document ABI Compatibility

2024-06-19 Thread Peter Eisentraut

On 18.06.24 00:37, David E. Wheeler wrote:

ABI Policy
==

The PostgreSQL core team maintains two application binary interface (ABI) 
guarantees: one for major releases and one for minor releases.

Major Releases
--

Applications that use the PostgreSQL APIs


This is probably a bit confusing.  This might as well mean client 
application code against libpq.  Better something like "server plugin 
code that uses the PostgreSQL server APIs".



must be compiled for each major release supported by the application. The 
inclusion of `PG_MODULE_MAGIC` ensures that code compiled for one major version 
will be rejected by other major versions.


ok so far


Furthermore, new releases may make API changes that require code changes. Use 
the `PG_VERSION_NUM` constant to adjust code in a backwards compatible way:

``` c
#if PG_VERSION_NUM >= 160000
#include "varatt.h"
#endif
```


But now we're talking about API.  That might be subject of another 
document or another section in this one, but it seems confusing to mix 
this with the ABI discussion.



PostgreSQL avoids unnecessary API changes in major releases, but usually ships 
a few necessary API changes, including deprecation, renaming, and argument 
variation.


Obviously, as a practical matter, there won't be random pointless 
changes.  But I wouldn't go as far as writing anything down about how 
these APIs are developed.



In such cases the incompatible changes will be listed in the Release Notes.


I don't think anyone is signing up to do that.



Minor Releases
--

PostgreSQL makes an effort to avoid both API and ABI breaks in minor releases. 
In general, an application compiled against any minor release will work with 
any other minor release, past or future.

When a change *is* required, PostgreSQL will choose the least invasive way 
possible, for example by squeezing a new field into padding space or appending 
it to the end of a struct. This sort of change should not impact dependent 
applications unless they use `sizeof(the struct)` on a struct with an appended 
field, or create their own instances of such structs --- patterns best avoided.

In rare cases, however, even such non-invasive changes may be impractical or 
impossible. In such an event, the change will be documented in the Release 
Notes, and details on the issue will also be posted to [TBD; mail list? Blog 
post? News item?].


I think one major problem besides actively avoiding or managing such 
minor-version ABI breaks is some automated detection.  Otherwise, this 
just means "we try, but who knows".






Re: Conflict Detection and Resolution

2024-06-19 Thread Dilip Kumar
On Wed, Jun 19, 2024 at 2:36 PM shveta malik  wrote:
>
> On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar  wrote:
> >
> > On Tue, Jun 18, 2024 at 3:29 PM shveta malik  wrote:
> > > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar  
> > > wrote:
> > >
> > > I tried to work out a few scenarios with this, where the apply worker
> > > will wait until its local clock hits 'remote_commit_tts - max_skew
> > > permitted'. Please have a look.
> > >
> > > Let's say, we have a GUC to configure max_clock_skew permitted.
> > > Resolver is last_update_wins in both cases.
> > > 
> > > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew.
> > >
> > > Remote Update with commit_timestamp = 10.20AM.
> > > Local clock (which is say 5 min behind) shows = 10.15AM.
> > >
> > > When remote update arrives at local node, we see that skew is greater
> > > than max_clock_skew and thus apply worker waits till local clock hits
> > > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the
> > > local clock hits 10.20 AM, the worker applies the remote change with
> > > commit_tts of 10.20AM. In the meantime (during wait period of apply
> > > worker) if some local update on the same row has happened at say 10.18am,
> > > that will be applied first, which will later be overwritten by the above
> > > remote change of 10.20AM, as the remote change's timestamp appears more
> > > recent, even though it actually happened earlier than the local change.

Oops, lots of mistakes in the usage of change-1 and change-2; sorry about that.

> > For the sake of simplicity let's call the change that happened at
> > 10:20 AM change-1 and the change that happened at 10:15 as change-2
> > and assume we are talking about the synchronous commit only.
>
> Do you mean "the change that happened at 10:18 as change-2"

Right

> >
> > I think now from an application perspective the change-1 wouldn't have
> > caused the change-2 because we delayed applying change-2 on the local
> > node
>
> Do you mean "we delayed applying change-1 on the local node."

Right

> >which would have delayed the confirmation of the change-1 to the
> > application that means we have got the change-2 on the local node
> > without the confirmation of change-1 hence change-2 has no causal
> > dependency on the change-1.  So it's fine that we perform change-1
> > before change-2
>
> Do you mean "So it's fine that we perform change-2 before change-1"

Right

> >and the timestamp will also show the same at any other
> > node if they receive these 2 changes.
> >
> > The goal is to ensure that if we define the order where change-2
> > happens before change-1, this same order should be visible on all
> > other nodes. This will hold true because the commit timestamp of
> > change-2 is earlier than that of change-1.
>
> Considering the above corrections as base, I agree with this.

+1

> > > 2)  Case 2: max_clock_skew is set to 2min.
> > >
> > > Remote Update with commit_timestamp=10.20AM
> > > Local clock (which is say 5 min behind) = 10.15AM.
> > >
> > > Now apply worker will notice skew greater than 2min and thus will wait
> > > till local clock hits 'remote's commit_tts - max_clock_skew' i.e.
> > > 10.18 and will apply the change with commit_tts of 10.20 ( as we
> > > always save the origin's commit timestamp into local commit_tts, see
> > > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say
> > > another local update is triggered at 10.19am, it will be applied
> > > locally but it will be ignored on remote node. On the remote node ,
> > > the existing change with a timestamp of 10.20 am will win resulting in
> > > data divergence.
> >
> > Let's call the 10:20 AM change as a change-1 and the change that
> > happened at 10:19 as change-2
> >
> > IIUC, although we apply the change-1 at 10:18 AM, the commit_ts of
> > that change is 10:20, and the same will be visible to all
> > other nodes.  So in conflict resolution still the change-1 happened
> > after the change-2 because change-2's commit_ts is 10:19 AM.   Now
> > there could be a problem with the causal order because we applied the
> > change-1 at 10:18 AM so the application might have gotten confirmation
> > at 10:18 AM and the change-2 of the local node may be triggered as a
> > result of confirmation of the change-1 that means now change-2 has a
> > causal dependency on the change-1 but commit_ts shows change-2
> > happened before the change-1 on all the nodes.
> >
> > So, is this acceptable? I think yes because the user has configured a
> > maximum clock skew of 2 minutes, which means the detected order might
> > not always align with the causal order for transactions occurring
> > within that time frame.
>
> Agree. I had the same thoughts, and wanted to confirm my understanding.

Okay

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Conflict Detection and Resolution

2024-06-19 Thread Ashutosh Bapat
On Wed, Jun 19, 2024 at 12:03 PM Amit Kapila 
wrote:

> > I doubt that it would be that simple. The application will have to
> intervene and tell one of the employees that their reservation has failed.
> It looks natural that the first one to reserve the room should get the
> reservation, but implementing that is more complex than resolving a
> conflict in the database. In fact, mostly it will be handled outside
> database.
> >
>
> Sure, the application needs some handling but I have tried to explain
> with a simple way that comes to my mind and how it can be realized
> with db involved. This is a known conflict detection method but note
> that I am not insisting to have "earliest_timestamp_wins". Even, if we
> want this we can have a separate discussion on this and add it later.
>
>
It will be good to add a minimal set of conflict resolution strategies to
begin with, while designing the feature for extensibility. I imagine the
first version might just detect the conflict and throw an error or do nothing.
That's already two simple conflict resolution strategies with minimal
effort. We can add more complicated ones incrementally.


> >
> > The inconsistency will arise irrespective of conflict resolution method.
> On a single system effects of whichever transaction runs last will be
> visible entirely. But in the example above the node where T1, T2, and T3
> (from *different*) origins) are applied, we might end up with a situation
> where some changes from T1 are applied whereas some changes from T3 are
> applied.
> >
>
> I still think it will lead to the same result if all three T1, T2, T3
> happen on the same node in the same order as you mentioned. Say, we
> have a pre-existing table with rows r1, r2, r3, r4. Now, if we use the
> order of transactions to be applied on the same node based on t2 < t1
> < t3. First T2 will be applied, so for now, r1 is a pre-existing
> version and r2 is from T2. Next, when T1 is performed, both r1 and r2
> are from T1. Lastly, when T3 is applied, r1 will be from T3 and r2
> will be from T1. This is what you mentioned will happen after conflict
> resolution in the above example.
>
>
You are right. It won't affect the consistency. The contents of a transaction
on each node might vary after application, depending upon the changes that
the conflict resolver makes, but the end result will be the same.

--
Best Wishes,
Ashutosh Bapat


Re: tiny step toward threading: reduce dependence on setlocale()

2024-06-19 Thread Peter Eisentraut

On 15.06.24 01:35, Jeff Davis wrote:

On Thu, 2024-06-06 at 11:37 -0700, Jeff Davis wrote:


I think this patch series is a nice cleanup, as well, making libc
more
like the other providers and not dependent on global state.


New rebased series attached with additional cleanup. Now that
pg_locale_t is never NULL, we can simplify the way the collation cache
works, eliminating ~100 lines.


Overall, this is great.  I have wished for a simplification like this
for a long time.  It is the logical continuation of relying on
locale_t stuff rather than process-global settings.


* v2-0001-Make-database-default-collation-internal-to-pg_lo.patch

+void
+pg_init_database_collation()

The function argument should be (void).

The prefix pg_ makes it sound like it's a user-facing function of some
sort.  Is there a reason for it?

Maybe add a small comment before the function.


* v2-0002-Make-database-collation-pg_locale_t-always-non-NU.patch

There is quite a bit of code duplication from
pg_newlocale_from_collation().  Can we do this better?


* v2-0003-ts_locale.c-do-not-use-NULL-to-mean-the-database-.patch

The "TODO" markers should be left, because what they refer to is that
these functions just use the default collation rather than something
passed in from the expression machinery.  This is not addressed by
this change.  (Obviously, the comments could have been clearer about
this.)


* v2-0004-Remove-support-for-null-pg_locale_t.patch

I found a few more places that should be adjusted in a similar way.

- In src/backend/regex/regc_pg_locale.c, the whole business with
  pg_regex_locale, in particular in pg_set_regex_collation().

- In src/backend/utils/adt/formatting.c, various places such as
  str_tolower().

- In src/backend/utils/adt/pg_locale.c, wchar2char() and char2wchar().
  (Also, for wchar2char() there is one caller that explicitly passes
  NULL.)

The changes at the call sites of pg_locale_deterministic() are
unfortunate, because they kind of go into the opposite direction: They
add checks for NULL locales instead of removing them.  (For a minute I
was thinking I was reading your patch backwards.)  We should think of
a way to write this clearer.

Looking for example at hashtext() after 0001..0004 applied, it is

if (!lc_collate_is_c(collid))
mylocale = pg_newlocale_from_collation(collid);

if (!mylocale || pg_locale_deterministic(mylocale))
{

But then after your 0006 patch, lc_collate_is_c() internally also calls
pg_newlocale_from_collation(), so this code just becomes redundant.
Better might be if pg_locale_deterministic() itself checked whether the
collation is C, and then hashtext() would just need to write:

mylocale = pg_newlocale_from_collation(collid);

if (pg_locale_deterministic(mylocale))
{

The patch sequencing might be a bit tricky here.  Maybe it's ok if
patch 0004 stays as is in this respect if 0006 were to fix it back.


* v2-0005-Avoid-setlocale-in-lc_collate_is_c-and-lc_ctype_i.patch

Nothing uses this, AFAICT, so why?

Also, this creates more duplication between
pg_init_database_collation() and pg_newlocale_from_collation(), as
mentioned at patch 0002.


* v2-0006-Simplify-collation-cache.patch

ok





Re: Conflict Detection and Resolution

2024-06-19 Thread shveta malik
On Wed, Jun 19, 2024 at 1:52 PM Dilip Kumar  wrote:
>
> On Tue, Jun 18, 2024 at 3:29 PM shveta malik  wrote:
> > On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar  wrote:
> >
> > I tried to work out a few scenarios with this, where the apply worker
> > will wait until its local clock hits 'remote_commit_tts - max_skew
> > permitted'. Please have a look.
> >
> > Let's say, we have a GUC to configure max_clock_skew permitted.
> > Resolver is last_update_wins in both cases.
> > 
> > 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew.
> >
> > Remote Update with commit_timestamp = 10.20AM.
> > Local clock (which is say 5 min behind) shows = 10.15AM.
> >
> > When remote update arrives at local node, we see that skew is greater
> > than max_clock_skew and thus apply worker waits till local clock hits
> > 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the
> > local clock hits 10.20 AM, the worker applies the remote change with
> > commit_tts of 10.20AM. In the meantime (during wait period of apply
> > worker) if some local update on the same row has happened at say 10.18am,
> > that will be applied first, which will later be overwritten by the above
> > remote change of 10.20AM, as the remote change's timestamp appears more
> > recent, even though it actually happened earlier than the local change.
>
> For the sake of simplicity let's call the change that happened at
> 10:20 AM change-1 and the change that happened at 10:15 as change-2
> and assume we are talking about the synchronous commit only.

Do you mean "the change that happened at 10:18 as change-2"

>
> I think now from an application perspective the change-1 wouldn't have
> caused the change-2 because we delayed applying change-2 on the local
> node

Do you mean "we delayed applying change-1 on the local node."

>which would have delayed the confirmation of the change-1 to the
> application that means we have got the change-2 on the local node
> without the confirmation of change-1 hence change-2 has no causal
> dependency on the change-1.  So it's fine that we perform change-1
> before change-2

Do you mean "So it's fine that we perform change-2 before change-1"

>and the timestamp will also show the same at any other
> node if they receive these 2 changes.
>
> The goal is to ensure that if we define the order where change-2
> happens before change-1, this same order should be visible on all
> other nodes. This will hold true because the commit timestamp of
> change-2 is earlier than that of change-1.

Considering the above corrections as base, I agree with this.

> > 2)  Case 2: max_clock_skew is set to 2min.
> >
> > Remote Update with commit_timestamp=10.20AM
> > Local clock (which is say 5 min behind) = 10.15AM.
> >
> > Now apply worker will notice skew greater than 2min and thus will wait
> > till local clock hits 'remote's commit_tts - max_clock_skew' i.e.
> > 10.18 and will apply the change with commit_tts of 10.20 ( as we
> > always save the origin's commit timestamp into local commit_tts, see
> > RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say
> > another local update is triggered at 10.19am, it will be applied
> > locally but it will be ignored on remote node. On the remote node ,
> > the existing change with a timestamp of 10.20 am will win resulting in
> > data divergence.
>
> Let's call the 10:20 AM change as a change-1 and the change that
> happened at 10:19 as change-2
>
> IIUC, although we apply the change-1 at 10:18 AM, the commit_ts of
> that change is 10:20, and the same will be visible to all
> other nodes.  So in conflict resolution still the change-1 happened
> after the change-2 because change-2's commit_ts is 10:19 AM.   Now
> there could be a problem with the causal order because we applied the
> change-1 at 10:18 AM so the application might have gotten confirmation
> at 10:18 AM and the change-2 of the local node may be triggered as a
> result of confirmation of the change-1 that means now change-2 has a
> causal dependency on the change-1 but commit_ts shows change-2
> happened before the change-1 on all the nodes.
>
> So, is this acceptable? I think yes because the user has configured a
> maximum clock skew of 2 minutes, which means the detected order might
> not always align with the causal order for transactions occurring
> within that time frame.

Agree. I had the same thoughts, and wanted to confirm my understanding.

>Generally, the ideal configuration for
> max_clock_skew should be in multiple of the network round trip time.
> Assuming this configuration, we wouldn’t encounter this problem
> because for change-2 to be caused by change-1, the client would need
> to get confirmation of change-1 and then trigger change-2, which would
> take at least 2-3 network round trips.


thanks
Shveta




Re: [PATCH] Add CANONICAL option to xmlserialize

2024-06-19 Thread Jim Jones


On 09.02.24 14:19, Jim Jones wrote:
> v9 attached with rebase due to changes done to primnodes.h in 615f5f6
>
v10 attached with rebase due to changes in primnodes, parsenodes.h, and
gram.y
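
For anyone who wants to try it out, usage follows the synopsis in the patch,
roughly like this (the output is only sketched in the comment, since it
depends on the C14N rules applied):

SELECT xmlserialize(DOCUMENT '<a><b/><!-- note --></a>'::xml AS text
                    CANONICAL WITH NO COMMENTS);
-- expected to return the canonical form without the comment,
-- e.g. <a><b></b></a>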

-- 
Jim
From fbd98149d50fe19b886b30ed49b9d553a18f30b4 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Wed, 19 Jun 2024 10:22:10 +0200
Subject: [PATCH v10] Add CANONICAL output format to xmlserialize

This patch introduces the CANONICAL option to xmlserialize, which
serializes xml documents in their canonical form - as described in
the W3C Canonical XML Version 1.1 specification. This option can
be used with the additional parameter WITH [NO] COMMENTS to keep
or remove xml comments from the canonical xml output. In case no
parameter is provided, WITH COMMENTS will be used as default. This
feature is based on the function xmlC14NDocDumpMemory from the C14N
module of libxml2.

This patch also includes regression tests and documentation.
---
 doc/src/sgml/datatype.sgml|  41 +++-
 src/backend/executor/execExprInterp.c |   2 +-
 src/backend/parser/gram.y |  21 +-
 src/backend/parser/parse_expr.c   |   2 +-
 src/backend/utils/adt/xml.c   | 272 --
 src/include/nodes/parsenodes.h|   1 +
 src/include/nodes/primnodes.h |  10 +
 src/include/parser/kwlist.h   |   1 +
 src/include/utils/xml.h   |   2 +-
 src/test/regress/expected/xml.out | 114 +++
 src/test/regress/expected/xml_1.out   | 108 ++
 src/test/regress/expected/xml_2.out   | 114 +++
 src/test/regress/sql/xml.sql  |  63 ++
 src/tools/pgindent/typedefs.list  |   1 +
 14 files changed, 642 insertions(+), 110 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 6646820d6a..7c28d34866 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4483,7 +4483,7 @@ xml 'bar'
 xml, uses the function
 xmlserialize:xmlserialize
 
-XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type [ [ NO ] INDENT ] )
+XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type [ { [ NO ] INDENT ] | CANONICAL [ WITH [NO] COMMENTS ]})
 
 type can be
 character, character varying, or
@@ -4500,6 +4500,45 @@ XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS 
 
+   
+The option CANONICAL converts a given
+XML document to its https://www.w3.org/TR/xml-c14n11/#Terminology;>canonical form
+based on the https://www.w3.org/TR/xml-c14n11/;>W3C Canonical XML 1.1 Specification.
+It is basically designed to provide applications the ability to compare xml documents or test if they
+have been changed. The optional parameters WITH COMMENTS (which is the default) or
+WITH NO COMMENTS, respectively, keep or remove XML comments from the given document.
+
+
+ 
+ Example:
+
+
+   

 When a character string value is cast to or from type
 xml without going through XMLPARSE or
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 852186312c..f14d7464ef 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4053,7 +4053,7 @@ ExecEvalXmlExpr(ExprState *state, ExprEvalStep *op)
 *op->resvalue =
 	PointerGetDatum(xmltotext_with_options(DatumGetXmlP(value),
 		   xexpr->xmloption,
-		   xexpr->indent));
+		   xexpr->format));
 *op->resnull = false;
 			}
 			break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 4d582950b7..ff1caa21f2 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -619,12 +619,13 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	xml_root_version opt_xml_root_standalone
 %type 	xmlexists_argument
 %type 	document_or_content
-%type 	xml_indent_option xml_whitespace_option
+%type 	xml_whitespace_option
 %type 	xmltable_column_list xmltable_column_option_list
 %type 	xmltable_column_el
 %type 	xmltable_column_option_el
 %type 	xml_namespace_list
 %type 	xml_namespace_el
+%type  	opt_xml_serialize_format
 
 %type 	func_application func_expr_common_subexpr
 %type 	func_expr func_expr_windowless
@@ -712,7 +713,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	BACKWARD BEFORE BEGIN_P BETWEEN BIGINT BINARY BIT
 	BOOLEAN_P BOTH BREADTH BY
 
-	CACHE CALL CALLED CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
+	CACHE CALL CALLED CANONICAL CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
 	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
 	CLUSTER COALESCE COLLATE COLLATION COLUMN COLUMNS COMMENT COMMENTS COMMIT
 	COMMITTED COMPRESSION CONCURRENTLY CONDITIONAL CONFIGURATION CONFLICT
@@ -15965,14 +15966,14 @@ func_expr_common_subexpr:
 	$$ = makeXmlExpr(IS_XMLROOT, NULL, NIL,
 	 list_make3($3, $5, $6), @1);
 }
-			| XMLSERIALIZE '(' document_or_content a_expr AS SimpleTypename xml_indent_option ')'

Re: replace strtok()

2024-06-19 Thread Kyotaro Horiguchi
At Tue, 18 Jun 2024 09:18:28 +0200, Peter Eisentraut  
wrote in 
> Under the topic of getting rid of thread-unsafe functions in the
> backend [0], here is a patch series to deal with strtok().
> 
> Of course, strtok() is famously not thread-safe and can be replaced by
> strtok_r().  But it also has the wrong semantics in some cases,
> because it considers adjacent delimiters to be one delimiter.  So if
> you parse
> 
> SCRAM-SHA-256$:$:
> 
> with strtok(), then
> 
> SCRAM-SHA-256$$::$$::
> 
> parses just the same.  In many cases, this is arguably wrong and could
> hide mistakes.
> 
> So I'm suggesting to use strsep() in those places.  strsep() is
> nonstandard but widely available.
> 
> There are a few places where strtok() has the right semantics, such as
> parsing tokens separated by whitespace.  For those, I'm using
> strtok_r().

I agree with the distinction.

> A reviewer job here would be to check whether I made that distinction
> correctly in each case.

0001 and 0002 look correct to me regarding that distinction. They
applied correctly to the master HEAD and all tests passed on Linux.

> On the portability side, I'm including a port/ replacement for
> strsep() and some workaround to get strtok_r() for Windows.  I have
> included these here as separate patches for clarity.

0003 looks fine; it built successfully and seems to work on an MSVC build.

About 0004, Cygwin seems to have its own strtok_r, but I haven't
checked how that fact affects the build.

> [0]:
> https://www.postgresql.org/message-id/856e5ec3-879f-42ee-8258-8bcc6ec9b...@eisentraut.org

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Conflict Detection and Resolution

2024-06-19 Thread Dilip Kumar
On Tue, Jun 18, 2024 at 3:29 PM shveta malik  wrote:
> On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar  wrote:
>
> I tried to work out a few scenarios with this, where the apply worker
> will wait until its local clock hits 'remote_commit_tts - max_skew
> permitted'. Please have a look.
>
> Let's say, we have a GUC to configure max_clock_skew permitted.
> Resolver is last_update_wins in both cases.
> 
> 1) Case 1: max_clock_skew set to 0 i.e. no tolerance for clock skew.
>
> Remote Update with commit_timestamp = 10.20AM.
> Local clock (which is say 5 min behind) shows = 10.15AM.
>
> When remote update arrives at local node, we see that skew is greater
> than max_clock_skew and thus apply worker waits till local clock hits
> 'remote's commit_tts - max_clock_skew' i.e. till 10.20 AM. Once the
> local clock hits 10.20 AM, the worker applies the remote change with
> commit_tts of 10.20AM. In the meantime (during wait period of apply
> worker) if some local update on the same row has happened at say 10.18am,
> that will be applied first, which will later be overwritten by the above
> remote change of 10.20AM, as the remote change's timestamp appears more
> recent, even though it actually happened earlier than the local change.

For the sake of simplicity let's call the change that happened at
10:20 AM change-1 and the change that happened at 10:15 as change-2
and assume we are talking about the synchronous commit only.

I think now from an application perspective the change-1 wouldn't have
caused the change-2 because we delayed applying change-2 on the local
node which would have delayed the confirmation of the change-1 to the
application that means we have got the change-2 on the local node
without the confirmation of change-1 hence change-2 has no causal
dependency on the change-1.  So it's fine that we perform change-1
before change-2 and the timestamp will also show the same at any other
node if they receive these 2 changes.

The goal is to ensure that if we define the order where change-2
happens before change-1, this same order should be visible on all
other nodes. This will hold true because the commit timestamp of
change-2 is earlier than that of change-1.

> 2)  Case 2: max_clock_skew is set to 2min.
>
> Remote Update with commit_timestamp=10.20AM
> Local clock (which is say 5 min behind) = 10.15AM.
>
> Now apply worker will notice skew greater than 2min and thus will wait
> till local clock hits 'remote's commit_tts - max_clock_skew' i.e.
> 10.18 and will apply the change with commit_tts of 10.20 ( as we
> always save the origin's commit timestamp into local commit_tts, see
> RecordTransactionCommit->TransactionTreeSetCommitTsData). Now lets say
> another local update is triggered at 10.19am, it will be applied
> locally but it will be ignored on remote node. On the remote node ,
> the existing change with a timestamp of 10.20 am will win resulting in
> data divergence.

Let's call the 10:20 AM change as a change-1 and the change that
happened at 10:19 as change-2

IIUC, although we apply the change-1 at 10:18 AM, the commit_ts of
that change is 10:20, and the same will be visible to all
other nodes.  So in conflict resolution still the change-1 happened
after the change-2 because change-2's commit_ts is 10:19 AM.   Now
there could be a problem with the causal order because we applied the
change-1 at 10:18 AM so the application might have gotten confirmation
at 10:18 AM and the change-2 of the local node may be triggered as a
result of confirmation of the change-1 that means now change-2 has a
causal dependency on the change-1 but commit_ts shows change-2
happened before the change-1 on all the nodes.

So, is this acceptable? I think yes because the user has configured a
maximum clock skew of 2 minutes, which means the detected order might
not always align with the causal order for transactions occurring
within that time frame. Generally, the ideal configuration for
max_clock_skew should be in multiple of the network round trip time.
Assuming this configuration, we wouldn’t encounter this problem
because for change-2 to be caused by change-1, the client would need
to get confirmation of change-1 and then trigger change-2, which would
take at least 2-3 network round trips.
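
As a side note, the commit timestamps this whole scheme compares can be
inspected directly once track_commit_timestamp is enabled; for example
(the table name here is only a placeholder):

-- requires track_commit_timestamp = on
SELECT pg_xact_commit_timestamp(xmin) AS commit_ts, t.*
FROM some_replicated_table AS t;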

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Speed up collation cache

2024-06-19 Thread Peter Eisentraut

On 15.06.24 01:46, Jeff Davis wrote:

   * instead of a hardwired set of special collation IDs, have a single-
element "last collation ID" to check before doing the hash lookup?


I'd imagine that method could be very effective.




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-19 Thread Peter Smith
FYI - I applied this latest patch and re-ran the original failing test
script 10 times (i.e., 10 x 100 test iterations; it took 4+ hours).

There were zero failures observed in my environment.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: ON ERROR in json_query and the like

2024-06-19 Thread Amit Langote
On Mon, Jun 17, 2024 at 10:07 PM Markus Winand  wrote:
> > On 17.06.2024, at 08:20, Amit Langote  wrote:
> >>> 2. EMPTY [ARRAY|OBJECT] ON ERROR implies ERROR ON EMPTY
> >>>
> >>>   17beta1=# SELECT JSON_QUERY('[]', '$[*]' EMPTY ARRAY ON ERROR) a;
> >>>a
> >>>   
> >>>[]
> >>>   (1 row)
> >>>
> >>>   As NULL ON EMPTY is implied, it should give the same result as
> >>>   explicitly adding NULL ON EMPTY:
> >>>
> >>>   17beta1=# SELECT JSON_QUERY('[]', '$[*]' NULL ON EMPTY EMPTY ARRAY ON 
> >>> ERROR) a;
> >>>a
> >>>   ---
> >>>
> >>>   (1 row)
> >>>
> >>>   Interestingly, Oracle DB gives the same (wrong) results. Db2 (LUW)
> >>>   on the other hand returns NULL for both queries.
> >>>
> >>>   I don’t think that PostgreSQL should follow Oracle DB's suit here
> >>>   but again, in case this is intentional it should be made explicit
> >>>   in the docs.
> >
> > This behavior is a bug and result of an unintentional change that I
> > made at some point after getting involved with this patch set.  So I'm
> > going to fix this so that the empty results of jsonpath evaluation use
> > NULL ON EMPTY by default, ie, when the ON EMPTY clause is not present.
> > Attached a patch to do so.
> >
>
> Tested: works.

Pushed, thanks for testing.

I'll work on the documentation updates that may be needed based on
this and nearby discussion(s).

-- 
Thanks, Amit Langote




Re: Incorrect matching of sql/json PASSING variable names

2024-06-19 Thread Amit Langote
On Thu, Jun 13, 2024 at 5:04 PM Amit Langote  wrote:
> On Thu, Jun 6, 2024 at 6:20 PM Amit Langote  wrote:
> >
> > Hi,
> >
> > Alvaro reported off-list that the following should really fail,
> > because the jsonpath expression refers to a PASSING variable that
> > doesn't exist:
> >
> > select json_query('"1"', jsonpath '$xy' passing 2 AS xyz);
> >  json_query
> > 
> >  2
> > (1 row)
> >
> > This works because of a bug in GetJsonPathVar() whereby it allows a
> > jsonpath expression to reference any prefix of the PASSING variable
> > names.
> >
> > Attached is a patch to fix that.
>
> Here's an updated version that I'll push tomorrow.

Pushed.

(Seems like pgsql-committers notification has stalled.)
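
For the archives: after the fix, a jsonpath expression has to spell out the
full PASSING variable name to match, so something like the following (unlike
the '$xy' example above) is still expected to return 2:

select json_query('"1"', jsonpath '$xyz' passing 2 AS xyz);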

-- 
Thanks, Amit Langote




Re: Separate HEAP WAL replay logic into its own file

2024-06-19 Thread Li, Yong


> On Jun 18, 2024, at 20:42, Melanie Plageman  wrote:
> 
> On Mon, Jun 17, 2024 at 9:12 PM Li, Yong  wrote:
>> 
>> As a newcomer, when I was walking through the code looking for WAL replay 
>> related code, it was relatively easy for me to find them for the B-Tree 
>> access method because of the “xlog” hint in the file names. It took me a 
>> while to find the same for the heap access method. When I finally found them 
>> (via text search), it was a small surprise. Having different file 
>> organizations for different access methods gives me this urge to make 
>> everything consistent. I think it will make it easier for newcomers, and it 
>> will reduce the mental load for everyone to remember that heap replay is 
>> inside the heapam.c not some “???xlog.c”.
> 
> That makes sense. The branch for PG18 has not been cut yet, so I
> recommend registering this patch for the July commitfest [1] so it
> doesn't get lost.
> 
> - Melanie
> 

Thanks for the positive feedback. I’ve added the patch to the July CF.

Yong



Re: Conflict Detection and Resolution

2024-06-19 Thread Amit Kapila
On Tue, Jun 11, 2024 at 3:12 PM Ashutosh Bapat
 wrote:
>
> On Sat, Jun 8, 2024 at 3:52 PM Amit Kapila  wrote:
>>
>> On Fri, Jun 7, 2024 at 5:39 PM Ashutosh Bapat
>>  wrote:
>> >
>> > On Thu, Jun 6, 2024 at 5:16 PM Nisha Moond  
>> > wrote:
>> >>
>> >> >
>> >>
>> >> Here are more use cases of the "earliest_timestamp_wins" resolution 
>> >> method:
>> >> 1) Applications where the record of first occurrence of an event is
>> >> important. For example, sensor based applications like earthquake
>> >> detection systems, capturing the first seismic wave's time is crucial.
>> >> 2) Scheduling systems, like appointment booking, prioritize the
>> >> earliest request when handling concurrent ones.
>> >> 3) In contexts where maintaining chronological order is important -
>> >>   a) Social media platforms display comments ensuring that the
>> >> earliest ones are visible first.
>> >>   b) Finance transaction processing systems rely on timestamps to
>> >> prioritize the processing of transactions, ensuring that the earliest
>> >> transaction is handled first
>> >
>> >
>> > Thanks for sharing examples. However, these scenarios would be handled by 
>> > the application and not during replication. What we are discussing here is 
>> > the timestamp when a row was updated/inserted/deleted (or rather when the 
>> > transaction that updated row committed/became visible) and not a DML on 
>> > column which is of type timestamp. Some implementations use a hidden 
>> > timestamp column but that's different from a user column which captures 
>> > timestamp of (say) an event. The conflict resolution will be based on the 
>> > timestamp when that column's value was recorded in the database which may 
>> > be different from the value of the column itself.
>> >
>>
>> It depends on how these operations are performed. For example, the
>> appointment booking system could be prioritized via a transaction
>> updating a row with columns emp_name, emp_id, reserved, time_slot.
>> Now, if two employees at different geographical locations try to book
>> the calendar, the earlier transaction will win.
>
>
> I doubt that it would be that simple. The application will have to intervene 
> and tell one of the employees that their reservation has failed. It looks 
> natural that the first one to reserve the room should get the reservation, 
> but implementing that is more complex than resolving a conflict in the 
> database. In fact, mostly it will be handled outside database.
>

Sure, the application needs some handling but I have tried to explain
with a simple way that comes to my mind and how it can be realized
with db involved. This is a known conflict detection method but note
that I am not insisting to have "earliest_timestamp_wins". Even, if we
want this we can have a separate discussion on this and add it later.
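
To make the booking example a bit more concrete, the kind of transaction I
had in mind is roughly the following (table and column names are only
illustrative):

CREATE TABLE room_calendar (
    time_slot timestamptz PRIMARY KEY,
    emp_name  text,
    emp_id    int,
    reserved  boolean DEFAULT false
);

-- each node runs the same kind of update; with "earliest_timestamp_wins"
-- the update whose transaction committed first would survive on all nodes
UPDATE room_calendar
   SET reserved = true, emp_name = 'employee-1', emp_id = 101
 WHERE time_slot = '2024-06-20 10:00+00' AND NOT reserved;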

>>
>>
>> > If we use the transaction commit timestamp as basis for resolution, a 
>> > transaction where multiple rows conflict may end up with different rows 
>> > affected by that transaction being resolved differently. Say three 
>> > transactions T1, T2 and T3 on separate origins with timestamps t1, t2, and 
>> > t3 respectively changed rows r1, r2 and r2, r3 and r1, r4 respectively. 
>> > Changes to r1 and r2 will conflict. Let's say T2 and T3 are applied first 
>> > and then T1 is applied. If t2 < t1 < t3, r1 will end up with version of T3 
>> > and r2 will end up with version of T1 after applying all the three 
>> > transactions.
>> >
>>
>> Are you telling the results based on latest_timestamp_wins? If so,
>> then it is correct. OTOH, if the user has configured
>> "earliest_timestamp_wins" resolution method, then we should end up
>> with a version of r1 from T1 because t1 < t3. Also, due to the same
>> reason, we should have version r2 from T2.
>>
>> >
>>  Would that introduce an inconsistency between r1 and r2?
>> >
>>
>> As per my understanding, this shouldn't be an inconsistency. Won't it
>> be true even when the transactions are performed on a single node with
>> the same timing?
>>
>
> The inconsistency will arise irrespective of conflict resolution method. On a 
> single system effects of whichever transaction runs last will be visible 
> entirely. But in the example above the node where T1, T2, and T3 (from 
> *different*) origins) are applied, we might end up with a situation where 
> some changes from T1 are applied whereas some changes from T3 are applied.
>

I still think it will lead to the same result if all three T1, T2, T3
happen on the same node in the same order as you mentioned. Say, we
have a pre-existing table with rows r1, r2, r3, r4. Now, if we use the
order of transactions to be applied on the same node based on t2 < t1
< t3. First T2 will be applied, so for now, r1 is a pre-existing
version and r2 is from T2. Next, when T1 is performed, both r1 and r2
are from T1. Lastly, when T3 is applied, r1 will be from T3 and r2
will be from T1. This is what you mentioned will happen after conflict
resolution in the above example.
