Re: A tidyup of pathkeys.c

2025-10-14 Thread Chao Li


> On Oct 14, 2025, at 14:02, David Rowley  wrote:
> 
> When working on a5a68dd6d I noticed that truncate_useless_pathkeys()
> uses a bunch of different static helper functions that are mostly the
> same as each other.  Most of them differ only in which field of
> PlannerInfo they look at.
> 
> The attached aims to clean that up by making 2 reusable functions.
> I've also fixed a few other things I noticed along the way.
> 
> Here's a list of what I've changed:
> 
> 1. Add count_common_leading_pathkeys_ordered() function to check for
> leading common pathkeys and use that for sort_pathkeys,
> window_pathkeys and window_pathkeys.
> 2. Add count_common_leading_pathkeys_unordered() to check for leading
> common pathkeys that exist in any portion of the other list of
> pathkeys. Use this for group_pathkeys and distinct_pathkeys.
> 3. Add some short-circuiting to truncate_useless_pathkeys() as there's
> no point in trying to trim down the list when some other operation has
> already figured out that it needs all of the pathkeys.
> 4. Remove the stray " if (root->group_pathkeys != NIL) return true"
> from has_useful_pathkeys().
> 
> 


I like 1 and 2, which make truncate_useless_pathkeys() easier to read. I also 
agree 3 is an optimization, though I am not sure how much it will help.

For 4, I traced has_useful_pathkeys(), and it showed me that 
root->group_pathkeys and root->query_pathkeys actually point to the same list, 
so deleting the root->group_pathkeys check is reasonable.

I have only a trivial comment. Since you are pulling the shared code out into 
count_common_leading_pathkeys_ordered()/unordered(), it would be reasonable to 
make them inline, which would ensure the new code performs the same as before. 
However, truncate_useless_pathkeys() is not on a critical execution path, so 
not making them inline won't hurt much either. So it's up to you whether or not 
to add “inline” to the two new functions.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/






memory context to complement ASAN

2025-10-14 Thread John Naylor
Some bugs have been found in ASAN builds using a compiled-in
malloc-only memory context (see commits 54ab74865, 1fe5a347e,
ccc8194e4 for the bugs fixed). The patch used was first mentioned here
(attachment: malloc_allocator.patch):

https://www.postgresql.org/message-id/[email protected]

Do we want something like this in core? I don't think we'd need the
ENABLE_ASAN symbol, since that seems orthogonal -- we can configure
ASAN separately how we wish, and I don't see why we'd need to tie the
GUC default for the special context to whether we use ASAN. Aside from
that, the basic idea seems useful.

-- 
John Naylor
Amazon Web Services




Re: Improved TAP tests by replacing sub-optimal uses of ok() with better Test::More functions

2025-10-14 Thread Arseniy Mukhin
Hi,

Thank you for the patch! I think it can make life easier in the case of
failures of the affected tests. It has conflicts with the current master,
so a rebase is needed.


Best regards,
Arseniy Mukhin




Re: Panic during xlog building with big values

2025-10-14 Thread Maksim.Melnikov

Hi, Andy, thanks for your review!

I've checked RecordTransactionCommit too, but I don't think it can fire a 
similar error. The problem described above occurred because we used 
external column storage without compression and with REPLICA 
IDENTITY FULL.
To be honest, it's a degenerate case that can only occur on tuple 
update/delete, when we need the full row to identify the updated/deleted 
value; more info can be found in the docs [1].


I've fixed the comments per your remarks, thanks. Patch is attached.

Also rebased the patch on commit d3ba50db48e66be8804b9edf093b0f921d625425.

[1] 
https://www.postgresql.org/docs/current/logical-replication-publication.html


Best regards,
Maksim Melnikov
From 125b8b75f8db12cfb076709a914d7717060e2e51 Mon Sep 17 00:00:00 2001
From: Maksim Melnikov 
Date: Fri, 15 Aug 2025 12:15:09 +0300
Subject: [PATCH v3] Pre-check potential XLogRecord oversize for heap_update
 and heap_insert.

Pre-check potential XLogRecord oversize.  The XLogRecord will be created
later, and its size will be checked then.  However, that happens within a
critical section, and in the event of failure, a core dump will be
generated.
That doesn't seem good, so to avoid it, we can calculate the approximate
xlog record size here and check it.

Size prediction is based on the xlog update and xlog delete logic, and can
be revised if that changes.  For now, the buf size is limited by
UINT16_MAX (Assert(regbuf->rdata_len <= UINT16_MAX) in xloginsert).

Anyway, to accommodate some overhead, 1MB is added to the predicted
value before the check.  That seems to be enough for now.
---
 src/backend/access/heap/heapam.c| 31 +
 src/backend/access/transam/xloginsert.c | 20 
 src/include/access/xloginsert.h |  1 +
 3 files changed, 52 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 568696333c2..bc9b115d279 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -61,6 +61,7 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
   Buffer newbuf, HeapTuple oldtup,
   HeapTuple newtup, HeapTuple old_key_tuple,
   bool all_visible_cleared, bool new_all_visible_cleared);
+static void log_heap_precheck(Relation reln, HeapTuple tp);
 #ifdef USE_ASSERT_CHECKING
 static void check_lock_if_inplace_updateable_rel(Relation relation,
  ItemPointer otid,
@@ -9061,6 +9062,34 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	return recptr;
 }
 
+/*
+ * Pre-check potential XLogRecord oversize.  The XLogRecord will be
+ * created later, and its size will be checked then.  However, that
+ * happens within a critical section, and in the event of failure a
+ * core dump will be generated.  That doesn't seem good, so to avoid
+ * it we can calculate the approximate xlog record size here and check it.
+ *
+ * Size prediction is based on the xlog update and xlog delete logic,
+ * and can be revised if that changes.  For now, the buf size is
+ * limited by UINT16_MAX (Assert(regbuf->rdata_len <= UINT16_MAX) in
+ * xloginsert).
+ *
+ * Anyway, to accommodate some overhead, 1MB is added to the predicted
+ * value before the check.  That seems to be enough for now.
+ */
+static void
+log_heap_precheck(Relation reln, HeapTuple tp)
+{
+#define XLogRecordMaxOverhead ((uint32) (1024 * 1024))
+
+	if (tp && RelationIsLogicallyLogged(reln))
+	{
+		uint32		data_len = tp->t_len - SizeofHeapTupleHeader;
+
+		XLogPreCheckSize(data_len + XLogRecordMaxOverhead);
+	}
+}
+
 /*
  * Perform XLogInsert of an XLOG_HEAP2_NEW_CID record
  *
@@ -9178,6 +9207,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
 			*copy = true;
 			tp = toast_flatten_tuple(tp, desc);
 		}
+		log_heap_precheck(relation, tp);
 		return tp;
 	}
 
@@ -9234,6 +9264,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
 		heap_freetuple(oldtup);
 	}
 
+	log_heap_precheck(relation, key_tuple);
 	return key_tuple;
 }
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 496e0fa4ac6..bb770294c2d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1053,6 +1053,26 @@ XLogCheckBufferNeedsBackup(Buffer buffer)
 	return false;/* buffer does not need to be backed up */
 }
 
+/*
+ * Ensure that the XLogRecord is not too large.
+ *
+ * XLogReader machinery is only able to handle records up to a certain
+ * size (ignoring machine resource limitations), so make sure that we will
+ * not emit records larger than the sizes advertised to be supported.
+ */
+void
+XLogPreCheckSize(uint32 data_len)
+{
+	if (data_len > XLogRecordMaxSize)
+	{
+		ereport(ERROR,
+(errmsg_internal("oversized WAL record"),
+ errdetail_internal("WAL record would be %u bytes (of maximum "
+	"%u bytes).",
+	data_len, XLogRecordMaxSize)));
+	}
+}
+
 /*
  * Write a backup block if needed when we are 

Re: add function argument name to substring and substr

2025-10-14 Thread jian he
On Tue, Oct 14, 2025 at 12:11 AM Tom Lane  wrote:
>
> Andrew Dunstan  writes:
> > I'm late to the party on this, but I wonder if it wouldn't be better to
> > use a type-neutral parameter name here, like "source", which could cover
> > all these cases, instead of "string", "bytes", etc.
>
> +1 for that idea.  As Jian notes, we'd need to make the docs match,
> but I think that this would be an improvement across the board.
> Parameter names like "string" don't convey much information.
>

please check the attached v8.

\df substr
                                List of functions
   Schema   |  Name  | Result data type |            Argument data types             | Type
------------+--------+------------------+--------------------------------------------+------
 pg_catalog | substr | bytea            | source bytea, start integer                | func
 pg_catalog | substr | bytea            | source bytea, start integer, count integer | func
 pg_catalog | substr | text             | source text, start integer                 | func
 pg_catalog | substr | text             | source text, start integer, count integer  | func
(4 rows)

\df substring
                                  List of functions
   Schema   |   Name    | Result data type |            Argument data types             | Type
------------+-----------+------------------+--------------------------------------------+------
 pg_catalog | substring | bit              | source bit, start integer                  | func
 pg_catalog | substring | bit              | source bit, start integer, count integer   | func
 pg_catalog | substring | bytea            | source bytea, start integer                | func
 pg_catalog | substring | bytea            | source bytea, start integer, count integer | func
 pg_catalog | substring | text             | source text, pattern text                  | func
 pg_catalog | substring | text             | source text, pattern text, escape text     | func
 pg_catalog | substring | text             | source text, start integer                 | func
 pg_catalog | substring | text             | source text, start integer, count integer  | func
(8 rows)

For function substr, the doc change is not that big a deal.

For function substring, in doc/src/sgml/func/func-matching.sgml, we need
to change some occurrences of
string
to
source.

We also need to change many occurrences of
string
to
source.
That's why v8-0002 is big.
From d1b3d6c7eb3ebb4456ea2f6bb2223fd6c535fedd Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 14 Oct 2025 15:42:57 +0800
Subject: [PATCH v8 2/2] add function argument name to function substring

with patch applied
\df substring
   List of functions
   Schema   |   Name| Result data type |Argument data types | Type
+---+--++--
 pg_catalog | substring | bit  | source bit, start integer  | func
 pg_catalog | substring | bit  | source bit, start integer, count integer   | func
 pg_catalog | substring | bytea| source bytea, start integer| func
 pg_catalog | substring | bytea| source bytea, start integer, count integer | func
 pg_catalog | substring | text | source text, pattern text  | func
 pg_catalog | substring | text | source text, pattern text, escape text | func
 pg_catalog | substring | text | source text, start integer | func
 pg_catalog | substring | text | source text, start integer, count integer  | func
(8 rows)

discussion: https://postgr.es/m/CACJufxHTBkymh06D4mGKNe1YfRNFN+gFBybmygWk=ptmqu0...@mail.gmail.com
---
 doc/src/sgml/func/func-binarystring.sgml | 26 --
 doc/src/sgml/func/func-bitstring.sgml| 24 -
 doc/src/sgml/func/func-matching.sgml | 32 +++-
 doc/src/sgml/func/func-string.sgml   | 63 +---
 src/backend/catalog/system_functions.sql |  2 +-
 src/include/catalog/pg_proc.dat  |  8 +++
 6 files changed, 130 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/func/func-binarystring.sgml b/doc/src/sgml/func/func-binarystring.sgml
index 294d1d97233..98f12c05bfa 100644
--- a/doc/src/sgml/func/func-binarystring.sgml
+++ b/doc/src/sgml/func/func-binarystring.sgml
@@ -200,11 +200,11 @@
 
  substring
 
-substring ( bytes bytea  FROM start integer   FOR count integer  )
+substring ( source bytea  FROM start integer   FOR count integer  )
 bytea


-Extracts the substring of bytes starting at
+Extracts the substring of source starting at
 the start'th byte if that is specified,
 and stopping after count bytes if that is
 specified.  Provide at least one of start
@@ -216,6 +216,26 @@

   
 
+  
+   
+
+ 

Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-10-14 Thread Peter Smith
Hi Sawada-San,

I started to look again at this thread. Here are some comments for v18
(the documentation only).

==
doc/src/sgml/config.sgml

1.
+   
+It is important to note that when
wal_level is set to
+replica, the effective WAL level can
automatically change
+based on the presence of 
+logical replication slots. The system automatically
increases the
+effective WAL level to logical when
creating the first
+logical replication slot, and decreases it back to
replica
+when dropping the last logical replication slot. The current
effective WAL
+level can be monitored through 
+parameter.
+   


As you say "It is important to note...". So, should this all be using
 SGML markup?

~~~

2.
+  
+   effective_wal_level configuration
parameter
+  

Should this even be called a "configuration parameter", given that you
can't configure it?

==
doc/src/sgml/logicaldecoding.sgml

3.
+
+ When wal_level is set to replica,
+ logical decoding is automatically activated upon creation of the first
+ logical replication slot. This activation process involves several steps
+ and requires waiting for any concurrent transactions to finish, ensuring
+ system-wide consistency. Conversely, when the last logical
replication slot
+ is dropped from a system with wal_level set to
+ replica, logical decoding is automatically disabled.
+ Note that the deactivation of logical decoding might take some time as
+ it is performed asynchronously by the checkpointer process.
+

Here it seems to be describing again the concept of the
"effective_wal_level" as in the config.sgml, so should this paragraph
also make mention of that terminology, and link to that concept like
was done before?

~~~

4.
  Existing logical slots on standby also get invalidated if
- wal_level on the primary is reduced to less than
- logical.
+ logical decoding is disabled on the primary.

Would it be easier just to leave the text as it was, but say
"effective_wal_level" instead of "wal_level"?

==
doc/src/sgml/system-views.sgml

5.
  
   wal_level_insufficient means that the
-  primary doesn't have a  sufficient to
-  perform logical decoding.  It is set only for logical slots.
+  logical decoding is disabled on primary (See
+  . It is set
+  only for logical slots.
  

5a.
Should this also make use of the "effective_wal_level" terminology?

e.g
"...means that logical decoding is disabled on the primary (i.e. the
effective_wal_level is not logical)"

~

5b.
Would it be better to list all these reasons in alphabetical order?

~

5c.
I felt the "It is set only for logical slots" to be a bit vague. IMO
it is clearer to say "This reason value is only used for logical
slots".

Alternatively, maybe the whole list should be restructured like below

SUGGESTION
The reason for the slot's invalidation. NULL if the slot is not invalidated.

Possible reason values for logical or physical slots are:
- wal_removed means ...
- idle_timeout means ...

Possible reason values, only for logical slots are:
- rows_removed means ...
- wal_level_insufficient means ...

~

5d.
A missing closing parenthesis. "(See xxx"

==
Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Executing pg_createsubscriber with a non-compatible control file

2025-10-14 Thread Michael Paquier
On Mon, Oct 13, 2025 at 03:35:06PM -0700, Masahiko Sawada wrote:
> v1-0001-Introduce-API-able-to-retrieve-contents-of-PG_VER.patch:
> 
> v1-0002-pg_upgrade-Use-PG_VERSION-generic-routine.patch:
> v1-0003-pg_createsubscriber-Use-PG_VERSION-generic-routin.patch:

Applied both of these.

> The new log detail message uses the same message as what pg_resetwal
> uses, but pg_createsubscriber shows an integer major version whereas
> pg_resetwal shows the raw version string. I guess it's better to unify
> the usage for better consistency.

OK, done as suggested to limit the translation work.

> v1-0004-pg_combinebackup-use-PG_VERSION-generic-routine.patch:
> 
> +   pg_log_debug("read server version %u from file \"%s\"",
> +GET_PG_MAJORVERSION_NUM(version), "PG_VERSION");
> 
> We used to show the full path of PG_VERSION file in the debug message.
> While probably we can live with such a small difference, do you think
> it's a good idea to move the debug message to get_pg_version()?

I cannot get into doing that, leaving that up to the caller with the
error message they want.  That's a minor point, I guess, I can see
both sides of the coin.

Switched this one to report the full path, like previously.

> v1-0005-pg_resetwal-use-PG_VERSION-generic-routine.patch:
> 
> /* Check that data directory matches our server version */
> -   CheckDataVersion();
> +   CheckDataVersion(DataDir);
> 
> With the patch, pg_resetwal fails to handle a relative path of PGDATA
> as follows:
> 
> $ bin/pg_resetwal data
> pg_resetwal: error: could not open version file "data/PG_VERSION": No
> such file or directory
> 
> This is because pg_resetwal does chdir() to the given path of the data
> directory before checking the version.

Right.  I've tested with absolute paths and forgot relative paths.
For this one, I would just use a ".", as the chdir is before the
version check.  Or we could reverse the chdir() and the version check,
but there is no real benefit in doing so.

Updated patch set attached.
--
Michael
From 1446b30bea9786432dcd9029ce61958ae334e46b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 14 Oct 2025 16:41:19 +0900
Subject: [PATCH v2 1/3] pg_createsubscriber: Use PG_VERSION generic routine

---
 src/bin/pg_basebackup/pg_createsubscriber.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index d29407413d96..9fa5caaf91d6 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -26,6 +26,7 @@
 #include "fe_utils/recovery_gen.h"
 #include "fe_utils/simple_list.h"
 #include "fe_utils/string_utils.h"
+#include "fe_utils/version.h"
 #include "getopt_long.h"
 
 #define	DEFAULT_SUB_PORT	"50432"
@@ -407,7 +408,8 @@ static void
 check_data_directory(const char *datadir)
 {
 	struct stat statbuf;
-	char		versionfile[MAXPGPATH];
+	uint32		major_version;
+	char	   *version_str;
 
 	pg_log_info("checking if directory \"%s\" is a cluster data directory",
 datadir);
@@ -420,11 +422,18 @@ check_data_directory(const char *datadir)
 			pg_fatal("could not access directory \"%s\": %m", datadir);
 	}
 
-	snprintf(versionfile, MAXPGPATH, "%s/PG_VERSION", datadir);
-	if (stat(versionfile, &statbuf) != 0 && errno == ENOENT)
+	/*
+	 * Retrieve the contents of this cluster's PG_VERSION.  We require
+	 * compatibility with the same major version as the one this tool is
+	 * compiled with.
+	 */
+	major_version = GET_PG_MAJORVERSION_NUM(get_pg_version(datadir, &version_str));
+	if (major_version != PG_MAJORVERSION_NUM)
 	{
-		pg_fatal("directory \"%s\" is not a database cluster directory",
- datadir);
+		pg_log_error("data directory is of wrong version");
+		pg_log_error_detail("File \"%s\" contains \"%s\", which is not compatible with this program's version \"%s\".",
+			"PG_VERSION", version_str, PG_MAJORVERSION);
+		exit(1);
 	}
 }
 
-- 
2.51.0

From 7185c5d6eb3ce0afb22d7be197e6901f2cab55c4 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 14 Oct 2025 16:47:03 +0900
Subject: [PATCH v2 2/3] pg_combinebackup: use PG_VERSION generic routine

---
 src/bin/pg_combinebackup/pg_combinebackup.c | 65 +++--
 1 file changed, 9 insertions(+), 56 deletions(-)

diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index a330c09d939d..3a3251272092 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -34,6 +34,7 @@
 #include "common/relpath.h"
 #include "copy_file.h"
 #include "fe_utils/option_utils.h"
+#include "fe_utils/version.h"
 #include "getopt_long.h"
 #include "lib/stringinfo.h"
 #include "load_manifest.h"
@@ -117,7 +118,6 @@ static void process_directory_recursively(Oid tsoid,
 		  manifest_data **manifests,
 		  manifest_writer *mwriter,
 		  cb_options *opt);
-static int	read_pg_vers

Re: A tidyup of pathkeys.c

2025-10-14 Thread Richard Guo
On Tue, Oct 14, 2025 at 3:03 PM David Rowley  wrote:
> Here's a list of what I've changed:
>
> 1. Add count_common_leading_pathkeys_ordered() function to check for
> leading common pathkeys and use that for sort_pathkeys,
> window_pathkeys and window_pathkeys.
> 2. Add count_common_leading_pathkeys_unordered() to check for leading
> common pathkeys that exist in any portion of the other list of
> pathkeys. Use this for group_pathkeys and distinct_pathkeys.
> 3. Add some short-circuiting to truncate_useless_pathkeys() as there's
> no point in trying to trim down the list when some other operation has
> already figured out that it needs all of the pathkeys.
> 4. Remove the stray " if (root->group_pathkeys != NIL) return true"
> from has_useful_pathkeys().

+1.  I think this is a nice tidy-up.

FWIW, I complained about the stray check in has_useful_pathkeys() in
[1] last week, but you were quicker than me in making the code change
to remove it.

[1] 
https://postgr.es/m/CAMbWs4_zW5QU=zk32s17p8qwy+ga-3zutons+y+wopguiop...@mail.gmail.com

- Richard




Re: Implement waiting for wal lsn replay: reloaded

2025-10-14 Thread Xuneng Zhou
Hi,

On Sat, Oct 4, 2025 at 9:35 AM Xuneng Zhou  wrote:
>
> Hi,
>
> On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou  wrote:
> >
> > Hi,
> >
> > On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou  wrote:
> > >
> > > Hi Álvaro,
> > >
> > > Thanks for your review.
> > >
> > > On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera  
> > > wrote:
> > > >
> > > > On 2025-Sep-15, Alexander Korotkov wrote:
> > > >
> > > > > > It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and 
> > > > > > CREATE
> > > > > > PUBLICATION - all use minimal grammar rules that produce generic
> > > > > > option lists, with the actual interpretation done in their 
> > > > > > respective
> > > > > > implementation files. The moderate complexity in wait.c seems
> > > > > > acceptable.
> > > >
> > > > Actually I find the code in ExecWaitStmt pretty unusual.  We tend to use
> > > > lists of DefElem (a name optionally followed by a value) instead of
> > > > individual scattered elements that must later be matched up.  Why not
> > > > use utility_option_list instead and then loop on the list of DefElems?
> > > > It'd be a lot simpler.
> > >
> > > I took a look at commands like VACUUM and EXPLAIN and they do follow
> > > this pattern. v11 will make use of utility_option_list.
> > >
> > > > Also, we've found that failing to surround the options by parens leads
> > > > to pain down the road, so maybe add that.  Given that the LSN seems to
> > > > be mandatory, maybe make it something like
> > > >
> > > > WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
> > > >
> > > > This requires that you make LSN a keyword, albeit unreserved.  Or you
> > > > could make it
> > > > WAIT FOR Ident [the rest]
> > > > and then ensure in C that the identifier matches the word LSN, such as
> > > > we do for "permissive" and "restrictive" in
> > > > RowSecurityDefaultPermissive.
> > >
> > > Shall make LSN an unreserved keyword as well.
> >
> > Here's the updated v11.  Many thanks Jian for off-list discussions and 
> > review.
>
> v12 removed unused
> +WaitStmt
> +WaitStmtParam in pgindent/typedefs.list.
>

Hi, I’ve split the patch into multiple patch sets for easier review,
per Michael’s advice [1].

[1] https://www.postgresql.org/message-id/aOMsv9TszlB1n-W7%40paquier.xyz

Best,
Xuneng
From c3dd9972d8043c07247bb3e2b476026268ee1bad Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Tue, 14 Oct 2025 20:50:04 +0800
Subject: [PATCH v13 3/3] Implement WAIT FOR command

WAIT FOR is to be used on a standby and specifies waiting for a specific
WAL location to be replayed.  This command is useful when the user makes
some data changes on the primary and needs a guarantee that these changes
are visible on the standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust way
to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this aspect.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan 
Author: Alexander Korotkov 
Co-authored-by: Xuneng Zhou 

Reviewed-by: Michael Paquier 
Reviewed-by: Peter Eisentraut 
Reviewed-by: Dilip Kumar 
Reviewed-by: Amit Kapila 
Reviewed-by: Alexander Lakhin 
Reviewed-by: Bharath Rupireddy 
Reviewed-by: Euler Taveira 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: jian he 
---
 doc/src/sgml/high-availability.sgml   |  54 
 doc/src/sgml/ref/allfiles.sgml|   1 +
 doc/src/sgml/ref/wait_for.sgml| 234 +
 doc/src/sgml/reference.sgml   |   1 +
 src/backend/access/transam/xact.c |   6 +
 src/backend/access/transam/xlog.c |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/access/transam/xlogwait.c |  27 +-
 src/backend/commands/Makefile |   3 +-
 src/backend/commands/meson.build  |   1 +
 src/backend/commands/wait.c   | 212 
 src/backend/parser/gram.y |  33 ++-
 src/backend/storage/lmgr/proc.c   |   5 +
 src/backend/tcop/pquery.c |  12 +-
 src/backend/tcop/utility.c|  22 ++
 src/include/access/xlogwait.h |   3 +-
 src/include/commands/wait.h   |  22 ++
 src/include/nodes/parsenodes.h|   8 +
 src/include/parser/kwlist.h   |   2 +
 src/include/tcop/cmdtaglist.h |   1 +
 src/test/recovery/meson.build |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 293 ++
 src/tools/pgindent/typedefs.list  |   3 +
 23 files changed, 951 insertions(+), 13 deletio

RE: Logical Replication of sequences

2025-10-14 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for updating the patch. Here are comments for the recent 0002;
the others are still being reviewed.

01. pg_subscription_rel.h
```
+#include "nodes/primnodes.h"
```

The inclusion is not needed because the 


02. typedefs.list
```
+SubscriptionRelKind
```

Missing update.

03. subscritioncmds.c
```
+#include "catalog/pg_sequence.h"
```

I could build without the inclusion. Can you remove it?

04. check_publications_origin
```
+
+   query = "SELECT DISTINCT P.pubname AS pubname\n"
+   "FROM pg_publication P,\n"
+   " LATERAL %s GPR\n"
...
```

pgindent does not like this notation. How about breaking the line after the "="?
I.e.,

```
query =
"SELECT DISTINCT P.pubname AS pubname\n"
"FROM pg_publication P,\n"
" LATERAL %s GPR\n"
...
```

05. AddSubscriptionRelState

```
if (HeapTupleIsValid(tup))
    elog(ERROR, "subscription table %u in subscription %u already exists",
         relid, subid);
```

Theoretically relid might refer to a sequence, right? Should we say "relation"
instead of "table" as well?

06. AlterSubscription_refresh_seq
```
+   /* Get local relation list. */
```

In contrast can we say "sequence"?


07. check_publications_origin
```
if (res->status != WALRCV_OK_TUPLES)
    ereport(ERROR,
            (errcode(ERRCODE_CONNECTION_FAILURE),
             errmsg("could not receive list of replicated tables from the publisher: %s",
                    res->err)));
```

Should we say "relations" instead of "tables"? Similar lines are:

```
/* Process tables. */
...
 * Log a warning if the publisher has subscribed to the same table from
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED
 


Re: pageinspect some function no need superuser priv

2025-10-14 Thread Nathan Bossart
On Tue, Oct 14, 2025 at 10:29:39AM -0400, Tom Lane wrote:
> Yeah, I do not think it follows that being table owner should
> entitle you to such low-level access.  I'm inclined to reject
> this proposal.

-1 here, too.  IMHO all of pageinspect should remain superuser-only since
it is meant for development/debugging.  The proposal doesn't describe a
use-case for the relaxed privileges, either.

-- 
nathan




Re: Patch for migration of the pg_commit_ts directory

2025-10-14 Thread ls7777
Hi,

I applied the diff file. Deleted the $ts variable.

On 14.10.2025, 09:27, "Hayato Kuroda (Fujitsu)" wrote:
> Hi,
>
> Thanks for updating the patch.
>
> ```
> +my ($xid,$ts) = $resold =~ /\s*(\d+)\s*\|(.*)/;
> ```
>
> Per my understanding $ts is not used here. Can you remove it?
>
> Apart from above, I found some cosmetic issues. Please see attached
> my fix which can be applied atop HEAD. Can you check and include if
> it is OK?
>
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED

From bc582227cdd51d252db00fccc4d572524f9a4b91 Mon Sep 17 00:00:00 2001
From: Sergey Levin 
Date: Sat, 4 Oct 2025 20:22:46 +0500
Subject: [PATCH] Migration of the pg_commit_ts directory

---
 src/bin/pg_upgrade/check.c| 24 +++
 src/bin/pg_upgrade/controldata.c  | 20 ++
 src/bin/pg_upgrade/meson.build|  1 +
 src/bin/pg_upgrade/pg_upgrade.c   | 28 +++-
 src/bin/pg_upgrade/pg_upgrade.h   |  2 +
 .../pg_upgrade/t/007_transfer_commit_ts.pl| 67 +++
 6 files changed, 139 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_upgrade/t/007_transfer_commit_ts.pl

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1e17d64b3ec..382da0485a7 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -32,6 +32,7 @@ static void check_new_cluster_replication_slots(void);
 static void check_new_cluster_subscription_configuration(void);
 static void check_old_cluster_for_valid_slots(void);
 static void check_old_cluster_subscription_state(void);
+static void check_new_cluster_pg_commit_ts(void);
 
 /*
  * DataTypesUsageChecks - definitions of data type checks for the old cluster
@@ -767,9 +768,32 @@ check_new_cluster(void)
 	check_new_cluster_replication_slots();
 
 	check_new_cluster_subscription_configuration();
+
+	check_new_cluster_pg_commit_ts();
+
 }
 
+static void
+check_new_cluster_pg_commit_ts(void)
+{
+	PGconn	   *conn;
+	PGresult   *res;
+	bool		track_commit_timestamp_on;
+
+	prep_status("Checking for new cluster configuration for commit timestamp");
+	conn = connectToServer(&new_cluster, "template1");
+	res = executeQueryOrDie(conn, "SELECT setting FROM pg_settings "
+  "WHERE name = 'track_commit_timestamp'");
+	track_commit_timestamp_on = strcmp(PQgetvalue(res, 0, 0), "on") == 0;
+	PQclear(res);
+	PQfinish(conn);
 
+	if (!track_commit_timestamp_on &&
+		old_cluster.controldata.chkpnt_newstCommitTsxid > 0)
+		pg_fatal("\"track_commit_timestamp\" must be \"on\" but is set to \"off\"");
+
+	check_ok();
+}
 void
 report_clusters_compatible(void)
 {
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index 90cef0864de..594dc1c0a9f 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -321,6 +321,26 @@ get_control_data(ClusterInfo *cluster)
 			cluster->controldata.chkpnt_nxtmulti = str2uint(p);
 			got_multi = true;
 		}
+		else if ((p = strstr(bufin, "Latest checkpoint's oldestCommitTsXid:")) != NULL)
+		{
+			p = strchr(p, ':');
+
+			if (p == NULL || strlen(p) <= 1)
+pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+			p++;/* remove ':' char */
+			cluster->controldata.chkpnt_oldstCommitTsxid = str2uint(p);
+		}
+		else if ((p = strstr(bufin, "Latest checkpoint's newestCommitTsXid:")) != NULL)
+		{
+			p = strchr(p, ':');
+
+			if (p == NULL || strlen(p) <= 1)
+				pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+			p++;				/* remove ':' char */
+			cluster->controldata.chkpnt_newstCommitTsxid = str2uint(p);
+		}
 		else if ((p = strstr(bufin, "Latest checkpoint's oldestXID:")) != NULL)
 		{
 			p = strchr(p, ':');
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..030618152d0 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -47,6 +47,7 @@ tests += {
   't/004_subscription.pl',
   't/005_char_signedness.pl',
   't/006_transfer_modes.pl',
+  't/007_transfer_commit_ts.pl',
 ],
 'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..70297f38dca 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -772,6 +772,9 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 static void
 copy_xact_xlog_xid(void)
 {
+	bool		is_copy_commit_ts;
+	uint32		oldest_xid, newest_xid;
+
 	/*
 	 * Copy old commit logs to new data dir. pg_clog has been renamed to
 	 * pg_xact in post-10 clusters.
@@ -781,6 +784,22 @@ copy_xact_xlog_xid(void)
 	  GET_MAJOR_VERSION(new_cluster.major_version) <= 906 ?
 	  "pg_clog" : "pg_xact");
 
+	/*
+	 * Copy old commit_timestamp data to new, if available.
+	 */
+	is

Re: Improved TAP tests by replacing sub-optimal uses of ok() with better Test::More functions

2025-10-14 Thread Andrew Dunstan



On 2025-10-14 Tu 4:01 AM, Michael Paquier wrote:

On Fri, Oct 10, 2025 at 09:33:10AM -0400, Andrew Dunstan wrote:

Great, I think this is a definite improvement. I saw someone recently
complaining about this overuse of ok(), so thanks for doing the work to
improve it.

Yeah, it's really cool to see someone step up and do all this leg
work for the existing code.  I have not checked the patch in details
or if there are missing spots.  Andrew, is that something you are
planning to do?




I believe Sadhuprasad used this recipe to find these:


  find src contrib -type f -name '*.p[lm]' -print | \
  xargs grep -P '\bok[(].*[~=]'


Maybe that would miss a few, but I bet not too many.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: A tidyup of pathkeys.c

2025-10-14 Thread Chao Li


> On Oct 14, 2025, at 19:22, David Rowley  wrote:
> 
> What makes you think making them inline would make the performance the
> same as before? The previous functions were not inlined, and I've not
> made any changes that should affect the compiler's ability to choose
> to inline these functions or not.


Ah… You are right. The old code:

static int
pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
int n_common_pathkeys;

(void) pathkeys_count_contained_in(root->sort_pathkeys, pathkeys,
   &n_common_pathkeys);

return n_common_pathkeys;
}

Your patch’s code:

static int
count_common_leading_pathkeys_ordered(List *keys1, List *keys2)
{
int ncommon;

(void) pathkeys_count_contained_in(keys1, keys2, &ncommon);

return ncommon;
}

They both call pathkeys_count_contained_in(); you are NOT adding an extra 
wrapper. So, I withdraw the “inline” comment.
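For readers following along, the prefix-counting logic under discussion is 
small enough to sketch outside the backend. Here is an illustrative 
stand-in using plain int arrays instead of PathKey lists — the names mirror 
the patch's new helpers, but this is a hypothetical sketch, not the patch's 
actual code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * "Ordered" variant: count how long the common leading prefix of the
 * two key lists is, comparing element-by-element.  This is the shape
 * used for sort_pathkeys and window_pathkeys.
 */
static int
count_common_leading_ordered(const int *keys1, size_t n1,
							 const int *keys2, size_t n2)
{
	size_t		i;

	for (i = 0; i < n1 && i < n2; i++)
		if (keys1[i] != keys2[i])
			break;
	return (int) i;
}

/*
 * "Unordered" variant: count how many leading elements of keys2 appear
 * anywhere in keys1.  This is the shape used for group_pathkeys and
 * distinct_pathkeys, where any permutation of the keys gives the same
 * result.
 */
static int
count_common_leading_unordered(const int *keys1, size_t n1,
							   const int *keys2, size_t n2)
{
	size_t		i,
				j;

	for (i = 0; i < n2; i++)
	{
		int			found = 0;

		for (j = 0; j < n1; j++)
			if (keys1[j] == keys2[i])
			{
				found = 1;
				break;
			}
		if (!found)
			break;
	}
	return (int) i;
}
```

In the backend the real work is done by pathkeys_count_contained_in() and 
list membership checks on canonical PathKey pointers; the sketch only shows 
why the two counting rules differ.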

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/






Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

2025-10-14 Thread Xuneng Zhou
Hi,

On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou  wrote:
>
> Hi,
>
> The following is the split patch set. There are certain limitations to
> this simplification effort, particularly in patch 2. The
> read_local_xlog_page_guts callback demands more functionality from the
> facility than the WAIT FOR patch — specifically, it must wait for WAL
> flush events, though it does not require timeout handling. In some
> sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
> patch, since it installs wake-up hooks in more locations. Unlike the
> WAIT FOR patch, which only needs wake-ups triggered by replay,
> read_local_xlog_page_guts must also handle wake-ups triggered by WAL
> flushes.
>
> Workload characteristics play a key role here. A sorted dlist performs
> well when insertions and removals occur in order, achieving O(1)
> complexity in the best case. In synchronous replication, insertion
> patterns seem generally monotonic with commit LSNs, though not
> strictly ordered due to timing variations and contention. When most
> insertions remain ordered, a dlist can be efficient. However, as the
> number of elements grows and out-of-order insertions become more
> frequent, the insertion cost can degrade to O(n) more often.
>
> By contrast, a pairing heap maintains stable O(1) insertion for both
> ordered and disordered inputs, with amortized O(log n) removals. Since
> LSNs in the WAIT FOR command are likely to arrive in a non-sequential
> fashion, the pairing heap introduced in v6 provides more predictable
> performance under such workloads.
>
> At this stage (v7), no consolidation between syncrep and xlogwait has
> been implemented. This is mainly because the dlist and pairing heap
> each works well under different workloads — neither is likely to be
> universally optimal. Introducing the facility with a pairing heap
> first seems reasonable, as it offers flexibility for future
> refactoring: we could later replace dlist with a heap or adopt a
> modular design depending on observed workload characteristics.
>

v8-0002 removed the early fast check before addLSNWaiter in WaitForLSNReplay,
as the likelihood of a server state change is small compared to the
branching cost and added code complexity.

Best,
Xuneng
From 32dab7ed64eecb62adce6b1d124b1fa389515e74 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Fri, 10 Oct 2025 16:35:38 +0800
Subject: [PATCH v8 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan 
Author: Alexander Korotkov 
Author: Xuneng Zhou 

Reviewed-by: Michael Paquier 
Reviewed-by: Peter Eisentraut 
Reviewed-by: Dilip Kumar 
Reviewed-by: Amit Kapila 
Reviewed-by: Alexander Lakhin 
Reviewed-by: Bharath Rupireddy 
Reviewed-by: Euler Taveira 
Reviewed-by: Heikki Linnakangas 
Reviewed-by: Kyotaro Horiguchi 
---
 src/backend/access/transam/Makefile   |   3 +-
 src/backend/access/transam/meson.build|   1 +
 src/backend/access/transam/xlogwait.c | 525 ++
 src/backend/storage/ipc/ipci.c|   3 +
 .../utils/activity/wait_event_names.txt   |   3 +
 src/include/access/xlogwait.h | 112 
 src/include/storage/lwlocklist.h  |   1 +
 7 files changed, 627 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/trans

Re: pg_createsubscriber - more logging to say if there are no pubs to drop

2025-10-14 Thread Masahiko Sawada
On Tue, Oct 14, 2025 at 5:26 PM Peter Smith  wrote:
>
> On Wed, Oct 15, 2025 at 11:11 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Oct 14, 2025 at 4:29 PM Peter Smith  wrote:
> > >
> > > On Wed, Oct 15, 2025 at 4:42 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Wed, Oct 8, 2025 at 6:34 PM Peter Smith  
> > > > wrote:
> > > > >
> > > > > Hi hackers,
> > > > >
> > > > > While reviewing pg_createsubscriber in another thread, I found some of
> > > > > the current logging to be confusing. Specifically, there is one part
> > > > > that drops all existing publications. Sometimes it might look like
> > > > > this:
> > > > >
> > > > > --
> > > > > pg_createsubscriber: dropping all existing publications in database 
> > > > > "db2"
> > > > > pg_createsubscriber: dropping publication "pub_exists1" in database 
> > > > > "db2"
> > > > > pg_createsubscriber: dropping publication "pub_exists2" in database 
> > > > > "db2"
> > > > > pg_createsubscriber: dropping publication "pub_exists3" in database 
> > > > > "db2"
> > > > > --
> > > > >
> > > > > ~~~
> > > > >
> > > > > OTOH, if there is nothing found to be dropped, then the logging just 
> > > > > says:
> > > > >
> > > > > --
> > > > > pg_createsubscriber: dropping all existing publications in database 
> > > > > "db2"
> > > > > --
> > > > >
> > > > > That's the scenario that I found ambiguous. You can't be sure from the
> > > > > logs what happened:
> > > > > - Were there publications found, and were they dropped silently?
> > > > > - Did it not find anything to drop?
> > > > >
> > > > > ~~~
> > > > >
> > > > > Here is a small patch to remove that doubt. Now, if there is nothing
> > > > > found, the logging would look like:
> > > > >
> > > > > --
> > > > > pg_createsubscriber: dropping all existing publications in database 
> > > > > "db2"
> > > > > pg_createsubscriber: no publications found
> > > > > --
> > > > >
> > > > > Thoughts?
> > > >
> > > > Thank you for the patch!
> > > >
> > > > It sounds like a reasonable improvement. I'll push the patch, barring
> > > > any objections.
> > > >
> > >
> > > Hm.. I'm wondering now if this patch was correct :-(
> > >
> > > I had encountered this problem and made this patch while reviewing
> > > another thread [1], but that other patch is still in development flux
> > > and so the logging was changing, and may have been broken at the time
> > > I wrote this one.
> > >
> > > In hindsight, I am not sure if this "no publication found" log is
> > > working as intended for the master code.
> > >
> > > ~~~
> > >
> > > For example, during a '--dry-run' the create/drop publication in
> > > pg_createsubscriber are NOP but they still do the appropriate logging.
> > >
> > > So, it is true that PQntuples(res) can be 0, because for dry run
> > > create_publication is NOP.
> > > But the dry run will still show logging for the create and for the drop.
> > >
> > > I hacked one of the test cases:
> > >
> > > # PS HACK
> > > command_ok(
> > > [
> > > 'pg_createsubscriber',
> > > '--verbose',
> > > '--dry-run',
> > > '--recovery-timeout' => $PostgreSQL::Test::Utils::timeout_default,
> > > '--pgdata' => $node_s->data_dir,
> > > '--publisher-server' => $node_p->connstr($db1),
> > > '--socketdir' => $node_s->host,
> > > '--subscriber-port' => $node_s->port,
> > > '--publication' => 'pub1',
> > > '--publication' => 'pub2',
> > > '--subscription' => 'sub1',
> > > '--subscription' => 'sub2',
> > > '--database' => $db1,
> > > '--database' => $db2,
> > > '--clean' => 'publications',
> > > ],
> > > 'PS HACK');
> > >
> > > This gives a logging result like:
> > >
> > > pg_createsubscriber: dropping all existing publications in database "xxx"
> > > 2025-10-15 10:18:53.583 AEDT client backend[31402]
> > > 040_pg_createsubscriber.pl LOG:  statement: SELECT pubname FROM
> > > pg_catalog.pg_publication;
> > > pg_createsubscriber: no publications found
> > > pg_createsubscriber: dropping publication "pub2" in database "xxx"
> > >
> > > ~~~
> > >
> > > First saying "no publications found", and then logging a drop anyway
> > > may seem contrary to the user.
> > >
> > > That was not the intended outcome for this patch.
> >
> > Hmm. So the subscriber should have at least one publication in
> > non-dry-run cases, and in dry-run cases it has no subscription but
> > shows the "dropping publication" log, is that right? If the newly
> > added log could rather confuse users, we should revert the change.
> >
>
> IIUC the created publication (on publisher) ends up on the subscriber
> because of the physical replication that occurred before the tool
> switches over to use logical replication.
>
> So the created publication has ended up on subscriber, and now the
> --clean will remove it.
>
> But for dry-run that create_publication is NOP so it never *actually*
> propagated to the subscribe

Re: pg_createsubscriber - more logging to say if there are no pubs to drop

2025-10-14 Thread Masahiko Sawada
On Wed, Oct 8, 2025 at 6:34 PM Peter Smith  wrote:
>
> Hi hackers,
>
> While reviewing pg_createsubscriber in another thread, I found some of
> the current logging to be confusing. Specifically, there is one part
> that drops all existing publications. Sometimes it might look like
> this:
>
> --
> pg_createsubscriber: dropping all existing publications in database "db2"
> pg_createsubscriber: dropping publication "pub_exists1" in database "db2"
> pg_createsubscriber: dropping publication "pub_exists2" in database "db2"
> pg_createsubscriber: dropping publication "pub_exists3" in database "db2"
> --
>
> ~~~
>
> OTOH, if there is nothing found to be dropped, then the logging just says:
>
> --
> pg_createsubscriber: dropping all existing publications in database "db2"
> --
>
> That's the scenario that I found ambiguous. You can't be sure from the
> logs what happened:
> - Were there publications found, and were they dropped silently?
> - Did it not find anything to drop?
>
> ~~~
>
> Here is a small patch to remove that doubt. Now, if there is nothing
> found, the logging would look like:
>
> --
> pg_createsubscriber: dropping all existing publications in database "db2"
> pg_createsubscriber: no publications found
> --
>
> Thoughts?

Thank you for the patch!

It sounds like a reasonable improvement. I'll push the patch, barring
any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Making pg_rewind faster

2025-10-14 Thread John H
Hi,

On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla
 wrote:
>
> ...
> XLogFilePath , then validate this new path with the given path
> ,this helps to catch invalid xlog files like pg_wal/0001FF10.
> ...

Are you concerned that these files, which are named like XLog files but
actually aren't, are somehow created, and therefore we should sync them
just in case? I'm trying to understand how these files would be
generated in the first place.

>
> instead of passing last_common_segno down the call stack directly,
> we can have a struct maybe called "rewindState" which will have the common
> information related to both clusters involved in rewind ,so that in future
> if we want to pass more such information down we won't be increasing
> the number of arguments that need to be passed down to all functions in stack.
>

I don't feel too strongly about this. I'm not sure how we view
future-proofing: do we proactively wrap arguments in structs, or wait
until there's an actual need?

This is the first time it's been touched since it was introduced in 14 for what
it's worth.

>
> we can improve the TAP test file header comment as
> # Test the situation where we skip copying the same WAL files from source to 
> target
> # except if WAL file size differs.
>
> let me put it this way
> 1) WALs are written to 00010002 in primary.
> ...
> 7) so the last common WAL segment was 00010003.

Updated comments.

>
> ...
> where entry->target_size != entry->source_size and we do copy,
> like if stat->mtime is changed (which means file has been copied)
> and target_stat->size != soucrce_stat->size here no error expected.
>

Updated the patch so that one additional segment has a byte appended to
it on the target but not the source.


Thanks,

John H


v9-0001-Avoid-copying-WAL-segments-before-divergence-to-s.patch
Description: Binary data


Re: Non-blocking archiver process

2025-10-14 Thread Robert Haas
Here are a few comments from me:

I think that by calling system(), anything that is printed out by the child
process ends up in the PostgreSQL log. With the patch, the output of the
command seems like it will be read into a buffer and discarded. That's a
significant behavior difference.

This patch has a 10ms polling loop after which it checks for interrupts,
and then repeats. I think our normal convention in these kinds of cases is
to use WaitLatchOrSocket(). That allows for a longer sleep time (like 1s)
which results in fewer processor wakeups, while actually still being more
responsive because the arrival of an interrupt should set the process latch
and terminate the wait.

In general, I agree with Artem's comments about writing code that fits the
PostgreSQL style. We don't want to invent new ways of solving problems for
which we already have infrastructure, nor do we want to solve problems in
this case in a way that is different from what we do in other cases. Using
ereport appropriately is part of that, but there's a lot more to it. Artem
mentioned malloc and free, but notice, for example, that we also have our
own wrapper around popen() in OpenPipeStream(). Maybe we shouldn't use that
here, but we shouldn't *accidentally* not use that here.

I wonder whether we should really be using popen() in any form, actually.
If what we want is for the output of the command to go to our standard
output, as it does with system(), that's exactly what popen() is designed
not to do, so it doesn't seem like a natural fit. Perhaps we should be
using fork() + exec()? Or maybe we already have some good infrastructure
for this we can reuse somewhere?
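To make the fork()+exec() alternative concrete, here is a minimal POSIX
sketch of the shape described above — hypothetical helper name, no
PostgreSQL infrastructure, ereport, or latch handling shown. Unlike
popen(), the child simply inherits the parent's stdout/stderr, so its
output lands wherever the parent's does (for the archiver, the server
log):

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/wait.h>

/*
 * Run a shell command in a child process.  The child inherits our
 * stdout/stderr rather than writing into a pipe we'd have to drain.
 * Returns the child's exit status, or -1 on failure.
 */
static int
run_command(const char *command)
{
	pid_t		pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0)
	{
		/* Child: run the command via the shell, as system() would. */
		execl("/bin/sh", "sh", "-c", command, (char *) NULL);
		_exit(127);				/* only reached if exec failed */
	}

	/*
	 * Parent: a real backend would wrap this in a WaitLatchOrSocket()
	 * loop with waitpid(..., WNOHANG) so an interrupt can set the
	 * process latch and end the wait; here we just block for
	 * simplicity.
	 */
	for (;;)
	{
		int			status;

		if (waitpid(pid, &status, 0) == pid)
			return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
		if (errno != EINTR)
			return -1;
	}
}
```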

-- 
Robert Haas
EDB: http://www.enterprisedb.com


Re: plpython: Remove support for major version conflict detection

2025-10-14 Thread Peter Eisentraut

On 08.10.25 18:41, Nathan Bossart wrote:

On Wed, Oct 08, 2025 at 12:28:45PM -0300, Mario González Troncoso wrote:

I'm attaching the same patch after rebasing from master.


LGTM


Committed, thanks.





Re: get rid of RM_HEAP2_ID

2025-10-14 Thread John Naylor
On Wed, Oct 15, 2025 at 6:55 AM Michael Paquier  wrote:
>
> On Tue, Oct 14, 2025 at 03:20:16PM +0300, Heikki Linnakangas wrote:
> > I'm not sure I agree with the premise that we should try to get rid of
> > RM_HEAP2_ID. There's nothing wrong with that scheme as such. As an
> > alternative, we could easily teach e.g pg_waldump to treat RM_HEAP_ID and
> > RM_HEAP2_ID the same for statistics purposes.
>
> Yeah, I'd rather keep heap2 as well.  As long as there is more room
> for the record IDs, we're still going to need it in the long run.

Okay, I'll drop that aspect.

> We could move out xl_xid, which should not be required for all
> records, shaving 4 bytes from the base XLogRecord.  I'm afraid of the
> duplication this would create if we push this data to each RMGR, which
> would, I guess, require a new RMGR callback to retrieve this field on
> a per-record basis.  But perhaps it would not be that bad.

I've wondered if it would be possible to make xl_tot_len a varint that
starts in the last byte of the header, with the next bytes being like
XLogRecordDataHeader[Short|Long], but likewise using a varint.
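For illustration, a garden-variety LEB128-style varint — the generic
encoding, not a proposal for the exact on-disk layout — stores small
lengths in one byte and only pays extra bytes for large records:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode v as a little-endian base-128 varint; returns bytes written. */
static size_t
varint_encode(uint64_t v, uint8_t *buf)
{
	size_t		n = 0;

	do
	{
		uint8_t		b = v & 0x7F;

		v >>= 7;
		if (v != 0)
			b |= 0x80;			/* continuation bit */
		buf[n++] = b;
	} while (v != 0);
	return n;
}

/* Decode a varint from buf into *v; returns bytes consumed. */
static size_t
varint_decode(const uint8_t *buf, uint64_t *v)
{
	size_t		n = 0;
	int			shift = 0;

	*v = 0;
	do
	{
		*v |= (uint64_t) (buf[n] & 0x7F) << shift;
		shift += 7;
	} while (buf[n++] & 0x80);
	return n;
}
```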

-- 
John Naylor
Amazon Web Services




Re: get rid of RM_HEAP2_ID

2025-10-14 Thread Michael Paquier
On Wed, Oct 15, 2025 at 12:01:44PM +0700, John Naylor wrote:
> On Wed, Oct 15, 2025 at 6:55 AM Michael Paquier  wrote:
>> We could move out xl_xid, which should not be required for all
>> records, shaving 4 bytes from the base XLogRecord.  I'm afraid of the
>> duplication this would create if we push this data to each RMGR, which
>> would, I guess, require a new RMGR callback to retrieve this field on
>> a per-record basis.  But perhaps it would not be that bad.
> 
> I've wondered if it would be possible to make xl_tot_len a varint that
> starts in the last byte of the header, with the next bytes being like
> XLogRecordDataHeader[Short|Long], but likewise using a varint.

This suggestion gives me some shivers, TBH.  We've had a couple of
issues with the read of records that span multiple pages,
relying now very strongly on xl_tot_len being the first field in
XLogRecord.  Making this field variable may be really tricky while the
code we have now in xlogreader.h is pretty stable.
--
Michael




Re: Optimize LISTEN/NOTIFY

2025-10-14 Thread Chao Li


> On Oct 15, 2025, at 05:19, Tom Lane  wrote:
> 
> "Joel Jacobson"  writes:
>> Having investigated this, the "direct advancement" approach seems
>> correct to me.
> 
>> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
>> is of course needed because there are other operations that don't
>> acquire the heavyweight-lock, that take shared/exclusive lock on
>> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
>> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
>> QUEUE_HEAD.)
> 
> Right.  What the heavyweight lock buys for us in this context is that
> we can be sure no other would-be notifier can insert any messages
> in between ours, even though we may take and release NotifyQueueLock
> several times to allow readers to sneak in.  That in turn means that
> it's safe to advance readers over that whole set of messages if we
> know we didn't wake them up for any of those messages.
> 
> There is a false-positive possibility if a reader was previously
> signaled but hasn't yet awoken: we will think that maybe we signaled
> it and hence not advance its pointer.  This is an error in the safe
> direction however, and it will advance its pointer when it does
> wake up.
> 
> A potential complaint is that we are doubling down on the need for
> that heavyweight lock, despite the upthread discussion about maybe
> getting rid of it for better scalability.  However, this patch
> only requires holding a lock across all the insertions, not holding
> it through commit which I think is the true scalability blockage.
> If we did want to get rid of that lock, we'd only need to stop
> releasing NotifyQueueLock at insertion page boundary crossings,
> which I suspect isn't really that useful anyway.  (In connection
> with that though, I think you ought to capture both the "before" and
> "after" pointers within that lock interval, not expend another lock
> acquisition later.)
> 
> It would be good if the patch's comments made these points ...
> also, the comments above struct AsyncQueueControl need to be
> updated, because changing some other backend's queue pos is
> not legal under any of the stated rules.
> 

I used to think “direct advancement” was a good idea. After reading Tom’s 
explanation and reading v16 again carefully, I now also think it adds 
complexity and could be fragile.

I just composed an example of race condition, please see if it is valid.

Because recording queueHeadBeforeWrite and queueHeadAfterWrite happens in 
PreCommit_Notify() and checking them happens in AtCommit_Notify(), there is an 
interval in between during which something may happen.

Say there is a listener A, with its head pointing to 1.

And current QueueHead is 1.

Now two notifiers B and C are committing:
 * B enters PreCommit_Notify(), it gets the NotifyQueueLock first, it records 
headBeforeWrite = 1 and writes to 3, and records headAfterWrite = 3.
 * Now QueueHead is 3.
 * C enters PreCommit_Notify(),  it records headBeforeWrite = 3 and writes to 
5, and records headAfterWrite = 5.
 * Now QueueHead is 5
 * C starts to run AtCommit_Notify(); as A’s head is 1, which doesn’t equal C’s 
headBeforeWrite, C won’t advance A’s head.
 * B starts to run AtCommit_Notify(); A’s head equals B’s headBeforeWrite, so B 
will advance A’s head to 3.
 * At this time, QueueHead is 5 and A’s head is 3, so “direct advancement” 
will never work for A until A wakes up next time.

I am brainstorming. Maybe we can use a simpler strategy: if a backend’s queue 
lag exceeds a threshold, then wake it up. This solution is simpler and more 
reliable, and it also reduces the total wake-up count.

> 
>> Given all the experiments since my earlier message, here is a fresh,
>> self-contained write-up:
> 
> I'm getting itchy about removing the local listenChannels list,
> because what you've done is to replace it with a shared data
> structure that can't be accessed without a good deal of locking
> overhead.  That seems like it could easily be a net loss.
> 
> Also, I really do not like this implementation of
> GetPendingNotifyChannels, as it looks like O(N^2) effort.
> The potentially large length of the list it builds is scary too,
> considering the comments that SignalBackends had better not fail.
> If we have to do it that way it'd be better to collect the list
> during PreCommit_Notify.
> 

I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.

In PreCommit_Notify(), we can maintain a local hash table to record the 
pending notifications’ channel names. dynahash also supports hash tables in 
local memory.

Then in SignalBackends(), we no longer need GetPendingNotifyChannels(); we can 
just iterate over all keys of the local channel-name hash.

And the local static numChannelsListeningOn is also not needed, as we can get 
the count from the local hash.

WRT v6, I have a few new comments:

1 - 0002
```
  *   After commit we are called another time (AtCommit_Notify()). Here we
- *   make any actual updates to the effective listen st

Re: new environment variable INITDB_LOCALE_PROVIDER

2025-10-14 Thread Peter Eisentraut

On 10.10.25 20:09, Jeff Davis wrote:

On Fri, 2025-10-10 at 11:32 +0200, Peter Eisentraut wrote:

* Use environment variable name PG_LOCALE_PROVIDER, which seems
more
consistent.


Is this not something that could already be done using
PG_TEST_INITDB_EXTRA_OPTS ?


1. PG_LOCALE_PROVIDER is a documented user-facing option, which will
make it easier for users to set their preferred provider in scripts,
etc.

2. This change also creates default locales for the builtin and ICU
providers, so that initdb without any other locale options will succeed
regardless of the provider.

I broke these up into two patches as v3 to make it easier to
understand.

These patches are independently useful, but also important if we ever
want to change the initdb default to builtin or ICU.


I'm skeptical that we want user-facing environment variables to provide 
initdb defaults.  The use for that hasn't really been explained.


For example, I don't recall anyone asking for an environment variable to 
determine the checksum default.


If we did that, then we might end up with an environment variable per 
option, which would be a lot.


The locale options are already complicated enough; adding more ways to 
set them with new ways that they interact with other options, this adds 
a lot more complications.


I think in practice initdb is mostly run through packager-provided 
infrastructure, so this facility would probably have very little impact 
in practice.






Re: Clarification on Role Access Rights to Table Indexes

2025-10-14 Thread Tom Lane
Jeff Davis  writes:
> On Mon, 2025-10-13 at 17:21 -0400, Tom Lane wrote:
>> I don't think so.  Even AccessShareLock is enough to block another
>> session trying to acquire AccessExclusiveLock, and then not only
>> have you DoS'd that session, but everything else trying to access
>> the table will queue up behind the AccessExclusiveLock request.
>> So it's only not-a-problem if nothing anywhere in the system wants
>> non-sharable locks.

> I tried imagining how that could be a problem, but couldn't come up
> with anything. If the privilege check is right after the lock, then
> either:
> (a) The malicious AccessShareLock is granted, then is quickly released
> when the privilege check fails and the transaction aborts; or
> (b) The malicious AccessShareLock is queued behind a legitimate
> AccessExclusiveLock, in which case every other lock would be queued up
> as well. As soon as the AccessExclusiveLock is released, the
> AccessShareLock would be granted, but quickly released when the
> privilege check fails.
> For it to be a problem, the malicious lock needs to be strong enough to
> conflict with a lock level weaker than itself, i.e. ShareLock or
> stronger.

Robert might remember better, but I think the assumption behind
the current design of RangeVarGetRelidExtended is that it's not
okay to take a lock in the first place if you don't have the
privilege to do so.  Your analysis here supposes that it's okay
to take a lock without privileges so long as you can't block someone
else for very long, where "very long" is not tightly defined but
hopefully isn't controllable by the malicious user.  So that's
moving the goalposts somewhat, but you might get people to sign
onto it with more careful analysis of the worst-case delay.
(The thing I'd worry about is whether it's possible to block
execution of the privilege check, or even just make it slow.)

Given that definition, I think you're right that it's possible to
identify cases where lock-then-check can't cause meaningful DoS.
RangeVarGetRelidExtended has to cope with the general case, so
that's not an argument for simplifying it, but we might not need
equivalent complexity everywhere.

regards, tom lane




Re: Optimize LISTEN/NOTIFY

2025-10-14 Thread Joel Jacobson
On Sat, Oct 11, 2025, at 09:43, Joel Jacobson wrote:
> On Sat, Oct 11, 2025, at 08:43, Joel Jacobson wrote:
>> In addition to previously suggested optimization, there is another major
...
>> I'm not entirely sure this approach is correct though

Having investigated this, the "direct advancement" approach seems
correct to me.

(I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
is of course needed because there are other operations that don't
acquire the heavyweight-lock, that take shared/exclusive lock on
NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
QUEUE_HEAD.)

Given all the experiments since my earlier message, here is a fresh,
self-contained write-up:

This series has two patches:

* 0001-optimize_listen_notify-v16.patch:
Improve test coverage of async.c. Adds isolation specs covering
previously untested paths (subxact LISTEN reparenting/merge/abort,
simple NOTIFY reparenting, notification_match dedup, and an array-growth
case used by the follow-on patch).

* 0002-optimize_listen_notify-v16.patch:
Optimize LISTEN/NOTIFY by maintaining a shared channel map and using
direct advancement to avoid useless wakeups.

Problem
---

Today SignalBackends wakes all listeners in the same database, with no
knowledge of which backends listen on which channels. When some backends
are listening on different channels, each NOTIFY causes unnecessary
wakeups and context switches, which can become the bottleneck in some
workloads.

Overview of the solution (patch 0002)
-

* Introduce a lazily-created DSA+dshash map (dboid, channel) ->
  [ProcNumber] (channelHash). AtCommit_Notify maintains it for
  LISTEN/UNLISTEN, and SignalBackends consults it to signal only
  listeners on the channels notified within the transaction.
* Add a per-backend wakeupPending flag to suppress duplicate signals.
* Direct advancement: while queuing, PreCommit_Notify records the queue
  head before and after our writes. Writers are globally serialized, so
  the interval [oldHead, newHead) contains only our entries.
  SignalBackends advances any backend still at oldHead directly to
  newHead, avoiding a pointless wakeup.
* Keep the queue healthy by signaling backends that have fallen too far
  behind (lag >= QUEUE_CLEANUP_DELAY) so the global tail can advance.
* pg_listening_channels and IsListeningOn now read from channelHash.
* Add LWLock tranche NOTIFY_CHANNEL_HASH and wait event
  NotifyChannelHash.
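
The direct-advancement rule above can be sketched as follows
(standalone toy code, not the patch itself; queue positions are
simplified to integers and all names are invented):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy listener: just a queue read position and the pending-wakeup flag. */
typedef struct
{
	int			pos;			/* next queue position this backend reads */
	bool		wakeupPending;
} ToyListener;

/*
 * Writers are serialized, so [oldHead, newHead) holds only the committing
 * transaction's entries.  A backend that is not listening on any notified
 * channel and is parked exactly at oldHead has nothing of interest in
 * that interval, so it can be advanced directly.  Returns true if the
 * backend still needs a signal.
 */
static bool
advance_or_signal(ToyListener *l, int oldHead, int newHead, bool interested)
{
	if (!interested && l->pos == oldHead)
	{
		l->pos = newHead;		/* skip our entries, no wakeup needed */
		return false;
	}
	if (l->wakeupPending)
		return false;			/* a signal is already on its way */
	l->wakeupPending = true;
	return true;
}
```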

No user-visible semantic changes are intended; this is an internal
performance improvement.
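
The queue-health check likewise reduces to a one-line predicate
(QUEUE_CLEANUP_DELAY below is an assumed illustrative value, and
positions are simplified to page numbers):

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CLEANUP_DELAY 4	/* assumed value, for illustration only */

/*
 * A backend whose read position has fallen QUEUE_CLEANUP_DELAY or more
 * pages behind the queue head pins the global tail; signal it even if
 * it is not listening on any notified channel.
 */
static bool
needs_cleanup_signal(int headPage, int backendPage)
{
	return (headPage - backendPage) >= QUEUE_CLEANUP_DELAY;
}
```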

Benchmark
---------

Using a patched pgbench (adds --listen-notify-benchmark; attached as
.txt to avoid confusing cfbot). Each run performs 10 000 round trips and
adds 100 idle listeners per iteration.

master (HEAD):

% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=1 \
    --notify-idle-step=100

idle_listeners  round_trips_per_sec  max_latency_usec
             0              32123.7               893
           100               1952.5              1465
           200                991.4              3438
           300                663.5              2454
           400                494.6              2950
           500                398.6              3394
           600                332.8              4272
           700                287.1              4692
           800                252.6              5208
           900                225.4              5614
          1000                202.5              6212

0002-optimize_listen_notify-v16.patch:

% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=1 \
    --notify-idle-step=100

idle_listeners  round_trips_per_sec  max_latency_usec
             0              31832.6              1067
           100              32341.0              1035
           200              31562.5              1054
           300              30040.1              1057
           400              29287.1              1023
           500              28191.9              1201
           600              28166.5              1019
           700              26994.3              1094
           800              26501.0              1043
           900              25974.2              1005
          1000              25720.6              1008

Benchmarked on MacBook Pro Apple M3 Max.

Files
-----

* 0001-optimize_listen_notify-v16.patch - tests only.
* 0002-optimize_listen_notify-v16.patch - implementation.
* pgbench-listen-notify-benchmark-patch.txt - adds --listen-notify-benchmark.

Feedback and review are much appreciated.

/Joel

0001-optimize_listen_notify-v16.patch
Description: Binary data


0002-optimize_listen_notify-v16.patch
Description: Binary data