Re: A tidyup of pathkeys.c
> On Oct 14, 2025, at 14:02, David Rowley wrote:
>
> When working on a5a68dd6d I noticed that truncate_useless_pathkeys()
> uses a bunch of different static helper functions that are mostly the
> same as each other. Most of them differ only in which field of
> PlannerInfo they look at.
>
> The attached aims to clean that up by making 2 reusable functions.
> I've also fixed a few other things I noticed along the way.
>
> Here's a list of what I've changed:
>
> 1. Add count_common_leading_pathkeys_ordered() function to check for
> leading common pathkeys and use that for sort_pathkeys,
> window_pathkeys and window_pathkeys.
> 2. Add count_common_leading_pathkeys_unordered() to check for leading
> common pathkeys that exist in any portion of the other list of
> pathkeys. Use this for group_pathkeys and distinct_pathkeys.
> 3. Add some short-circuiting to truncate_useless_pathkeys() as there's
> no point in trying to trim down the list when some other operation has
> already figured out that it needs all of the pathkeys.
> 4. Remove the stray "if (root->group_pathkeys != NIL) return true"
> from has_useful_pathkeys().

I like 1 and 2, which make truncate_useless_pathkeys() easier to read.
And I agree 3 is an optimization, though I am not sure how much it will
help.

For 4, I traced has_useful_pathkeys(), and it showed me that
root->group_pathkeys and root->query_pathkeys actually point to the same
list. So deleting the root->group_pathkeys check is reasonable.

I have only a trivial comment. As you pull the shared code out into
count_common_leading_pathkeys_ordered()/unordered(), it's reasonable to
make them inline, which would ensure the new code has the same
performance as before. However, I realize that
truncate_useless_pathkeys() is not on a critical execution path, so not
making them inline won't hurt very much. So, it's up to you whether or
not to add "inline" to the two new functions.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
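For reference, the unordered variant being discussed might look
something like this (a hypothetical sketch, not the actual patch; the
ordered variant is quoted later in the thread, and since PathKeys are
canonical, a pointer membership test suffices):

static int
count_common_leading_pathkeys_unordered(List *keys1, List *keys2)
{
	int			n = 0;
	ListCell   *lc;

	/* Count leading pathkeys of keys1 that appear anywhere in keys2. */
	foreach(lc, keys1)
	{
		PathKey    *pathkey = (PathKey *) lfirst(lc);

		if (!list_member_ptr(keys2, pathkey))
			break;
		n++;
	}
	return n;
}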
memory context to complement ASAN
Some bugs have been found in ASAN builds using a compiled-in malloc-only memory context (see commits 54ab74865, 1fe5a347e, ccc8194e4 for the bugs fixed). The patch used was first mentioned here (attachment: malloc_allocator.patch):

https://www.postgresql.org/message-id/[email protected]

Do we want something like this in core?

I don't think we'd need the ENABLE_ASAN symbol, since that seems orthogonal -- we can configure ASAN separately how we wish, and I don't see why we'd need to tie the GUC default for the special context to whether we use ASAN. Aside from that, the basic idea seems useful.

--
John Naylor
Amazon Web Services
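To illustrate why a malloc-only context helps ASAN (a conceptual,
standalone sketch, not the malloc_allocator.patch itself): each palloc
becomes its own malloc chunk, so ASAN's redzones and use-after-free
detection apply per allocation instead of per recycled block.

#include <stdlib.h>

/*
 * Conceptual pass-through allocator (hypothetical): because every
 * request becomes a distinct malloc() chunk, ASAN poisons freed memory
 * immediately and places redzones around each chunk, catching overruns
 * and use-after-free that aset.c's block recycling can hide.
 */
static void *
passthrough_alloc(size_t size)
{
	return malloc(size);		/* one ASAN-tracked chunk per request */
}

static void
passthrough_free(void *pointer)
{
	free(pointer);				/* freed memory is poisoned right away */
}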
Re: Improved TAP tests by replacing sub-optimal uses of ok() with better Test::More functions
Hi,

Thank you for the patch! I think it will make life easier when the affected tests fail. It has conflicts with the current master, so a rebase is needed.

Best regards,
Arseniy Mukhin
Re: Panic during xlog building with big values
Hi, Andy, thanks for your review!
I've checked RecordTransactionCommit too, but I don't think it can fire
a similar error. The problem described above occurred because we used
external column storage without compression and with REPLICA IDENTITY
FULL.
To be honest, it's a degenerate case that can only occur on tuple
update/delete, when we need the full row to identify the updated/deleted
value; more info can be found in the docs [1].
I've fixed the comments per your remarks, thanks. Patch is attached.
Also rebased the patch on commit d3ba50db48e66be8804b9edf093b0f921d625425.
[1]
https://www.postgresql.org/docs/current/logical-replication-publication.html
Best regards,
Maksim Melnikov
From 125b8b75f8db12cfb076709a914d7717060e2e51 Mon Sep 17 00:00:00 2001
From: Maksim Melnikov
Date: Fri, 15 Aug 2025 12:15:09 +0300
Subject: [PATCH v3] Pre-check potential XLogRecord oversize for heap_update
and heap_insert.
Pre-check potential XLogRecord oversize. The XLogRecord will be created
later, and its size will be checked. However, this operation will
occur within a critical section, and in the event of failure, a core
dump will be generated.
That doesn't seem good, so to avoid it, we can calculate the approximate
xlog record size here and check it.
Size prediction is based on the xlog update and xlog delete logic, and can
be revised if that changes. For now, the buf size is limited by
UINT16_MAX (Assert(regbuf->rdata_len <= UINT16_MAX) in xloginsert).
Anyway, to accommodate some overhead, 1MB is subtracted from the predicted
value. It seems like that's enough for now.
---
 src/backend/access/heap/heapam.c        | 31 +
 src/backend/access/transam/xloginsert.c | 20
 src/include/access/xloginsert.h         |  1 +
 3 files changed, 52 insertions(+)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 568696333c2..bc9b115d279 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -61,6 +61,7 @@ static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
Buffer newbuf, HeapTuple oldtup,
HeapTuple newtup, HeapTuple old_key_tuple,
bool all_visible_cleared, bool new_all_visible_cleared);
+static void log_heap_precheck(Relation reln, HeapTuple tp);
#ifdef USE_ASSERT_CHECKING
static void check_lock_if_inplace_updateable_rel(Relation relation,
ItemPointer otid,
@@ -9061,6 +9062,34 @@ log_heap_update(Relation reln, Buffer oldbuf,
return recptr;
}
+/*
+ * Pre-check potential XLogRecord oversize. The XLogRecord will be created
+ * later, and its size will be checked. However, this operation will
+ * occur within a critical section, and in the event of failure, a core
+ * dump will be generated.
+ * That doesn't seem good, so to avoid it, we can calculate the approximate
+ * xlog record size here and check it.
+ *
+ * Size prediction is based on the xlog update and xlog delete logic, and can
+ * be revised if that changes. For now, the buf size is limited by
+ * UINT16_MAX (Assert(regbuf->rdata_len <= UINT16_MAX) in xloginsert).
+ *
+ * Anyway, to accommodate some overhead, 1MB is subtracted from the predicted
+ * value. It seems like that's enough for now.
+ */
+static void
+log_heap_precheck(Relation reln, HeapTuple tp)
+{
+#define XLogRecordMaxOverhead ((uint32) (1024 * 1024))
+
+ if (tp && RelationIsLogicallyLogged(reln))
+ {
+ uint32 data_len = tp->t_len - SizeofHeapTupleHeader;
+
+ XLogPreCheckSize(data_len + XLogRecordMaxOverhead);
+ }
+}
+
/*
* Perform XLogInsert of an XLOG_HEAP2_NEW_CID record
*
@@ -9178,6 +9207,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
*copy = true;
tp = toast_flatten_tuple(tp, desc);
}
+ log_heap_precheck(relation, tp);
return tp;
}
@@ -9234,6 +9264,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
heap_freetuple(oldtup);
}
+ log_heap_precheck(relation, key_tuple);
return key_tuple;
}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 496e0fa4ac6..bb770294c2d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1053,6 +1053,26 @@ XLogCheckBufferNeedsBackup(Buffer buffer)
return false;			/* buffer does not need to be backed up */
}
+/*
+ * Ensure that the XLogRecord is not too large.
+ *
+ * XLogReader machinery is only able to handle records up to a certain
+ * size (ignoring machine resource limitations), so make sure that we will
+ * not emit records larger than the sizes advertised to be supported.
+ */
+void
+XLogPreCheckSize(uint32 data_len)
+{
+ if (data_len > XLogRecordMaxSize)
+ {
+ ereport(ERROR,
+ 		(errmsg_internal("oversized WAL record"),
+ 		 errdetail_internal("WAL record would be %u bytes (of maximum "
+ 							"%u bytes).",
+ 							data_len, XLogRecordMaxSize)));
+ }
+}
+
/*
* Write a backup block if needed when we are
Re: add function argument name to substring and substr
On Tue, Oct 14, 2025 at 12:11 AM Tom Lane wrote:
>
> Andrew Dunstan writes:
> > I'm late to the party on this, but I wonder if it wouldn't be better to
> > use a type-neutral parameter name here, like "source", which could cover
> > all these cases, instead of "string", "bytes", etc.
>
> +1 for that idea. As Jian notes, we'd need to make the docs match,
> but I think that this would be an improvement across the board.
> Parameter names like "string" don't convey much information.
>

please check the attached v8.

\df substr
                                      List of functions
   Schema   |  Name  | Result data type |            Argument data types             | Type
------------+--------+------------------+--------------------------------------------+------
 pg_catalog | substr | bytea            | source bytea, start integer                | func
 pg_catalog | substr | bytea            | source bytea, start integer, count integer | func
 pg_catalog | substr | text             | source text, start integer                 | func
 pg_catalog | substr | text             | source text, start integer, count integer  | func
(4 rows)

\df substring
                                         List of functions
   Schema   |   Name    | Result data type |            Argument data types             | Type
------------+-----------+------------------+--------------------------------------------+------
 pg_catalog | substring | bit              | source bit, start integer                  | func
 pg_catalog | substring | bit              | source bit, start integer, count integer   | func
 pg_catalog | substring | bytea            | source bytea, start integer                | func
 pg_catalog | substring | bytea            | source bytea, start integer, count integer | func
 pg_catalog | substring | text             | source text, pattern text                  | func
 pg_catalog | substring | text             | source text, pattern text, escape text     | func
 pg_catalog | substring | text             | source text, start integer                 | func
 pg_catalog | substring | text             | source text, start integer, count integer  | func
(8 rows)

For function substr, the doc change is not that big a deal.
For function substring, in doc/src/sgml/func/func-matching.sgml we need
to change some occurrences of "string" to "source", and we also need to
change many more elsewhere. That's why v8-0002 is big.
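As a quick illustration of what the added argument names enable
(assuming v8 is applied), named notation becomes possible for these
calls:

-- Named-notation calls, possible once proargnames are set
-- (a sketch, assuming the v8 patch is applied):
SELECT substr(source => 'PostgreSQL', start => 5, count => 3);  -- 'gre'
SELECT substr(source => 'PostgreSQL', start => 5);              -- 'greSQL'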
From d1b3d6c7eb3ebb4456ea2f6bb2223fd6c535fedd Mon Sep 17 00:00:00 2001
From: jian he
Date: Tue, 14 Oct 2025 15:42:57 +0800
Subject: [PATCH v8 2/2] add function argument name to function substring

with patch applied
\df substring
                                         List of functions
   Schema   |   Name    | Result data type |            Argument data types             | Type
------------+-----------+------------------+--------------------------------------------+------
 pg_catalog | substring | bit              | source bit, start integer                  | func
 pg_catalog | substring | bit              | source bit, start integer, count integer   | func
 pg_catalog | substring | bytea            | source bytea, start integer                | func
 pg_catalog | substring | bytea            | source bytea, start integer, count integer | func
 pg_catalog | substring | text             | source text, pattern text                  | func
 pg_catalog | substring | text             | source text, pattern text, escape text     | func
 pg_catalog | substring | text             | source text, start integer                 | func
 pg_catalog | substring | text             | source text, start integer, count integer  | func
(8 rows)

discussion: https://postgr.es/m/CACJufxHTBkymh06D4mGKNe1YfRNFN+gFBybmygWk=ptmqu0...@mail.gmail.com
---
 doc/src/sgml/func/func-binarystring.sgml | 26 --
 doc/src/sgml/func/func-bitstring.sgml    | 24 -
 doc/src/sgml/func/func-matching.sgml     | 32 +++-
 doc/src/sgml/func/func-string.sgml       | 63 +---
 src/backend/catalog/system_functions.sql |  2 +-
 src/include/catalog/pg_proc.dat          |  8 +++
 6 files changed, 130 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/func/func-binarystring.sgml b/doc/src/sgml/func/func-binarystring.sgml
index 294d1d97233..98f12c05bfa 100644
--- a/doc/src/sgml/func/func-binarystring.sgml
+++ b/doc/src/sgml/func/func-binarystring.sgml
@@ -200,11 +200,11 @@
 substring
-substring ( bytes bytea FROM start integer FOR count integer )
+substring ( source bytea FROM start integer FOR count integer )
 bytea
 
-Extracts the substring of bytes starting at
+Extracts the substring of source starting at
 the start'th byte if that is specified,
 and stopping after count bytes if that is
 specified. Provide at least one of start
@@ -216,6 +216,26 @@
+
+
+
+
Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Hi Sawada-San,

I started to look again at this thread. Here are some comments for v18
(the documentation only).

==
doc/src/sgml/config.sgml

1.
+
+It is important to note that when wal_level is set to
+replica, the effective WAL level can automatically change
+based on the presence of
+logical replication slots. The system automatically increases the
+effective WAL level to logical when creating the first
+logical replication slot, and decreases it back to replica
+when dropping the last logical replication slot. The current effective WAL
+level can be monitored through
+parameter.
+

As you say "It is important to note...". So, should this all be using
SGML markup?

~~~

2.
+
+ effective_wal_level configuration parameter
+

Should this even be called a "configuration parameter", given that you
can't configure it?

==
doc/src/sgml/logicaldecoding.sgml

3.
+
+ When wal_level is set to replica,
+ logical decoding is automatically activated upon creation of the first
+ logical replication slot. This activation process involves several steps
+ and requires waiting for any concurrent transactions to finish, ensuring
+ system-wide consistency. Conversely, when the last logical replication slot
+ is dropped from a system with wal_level set to
+ replica, logical decoding is automatically disabled.
+ Note that the deactivation of logical decoding might take some time as
+ it is performed asynchronously by the checkpointer process.
+

Here it seems to be describing again the concept of the
"effective_wal_level" as in the config.sgml, so should this paragraph
also make mention of that terminology, and link to that concept like was
done before?

~~~

4.
 Existing logical slots on standby also get invalidated if
- wal_level on the primary is reduced to less than
- logical.
+ logical decoding is disabled on the primary.

Would it be easier just to leave the text as it was, but say
"effective_wal_level" instead of "wal_level"?

==
doc/src/sgml/system-views.sgml

5.
 wal_level_insufficient means that the
- primary doesn't have a sufficient to
- perform logical decoding. It is set only for logical slots.
+ logical decoding is disabled on primary (See
+ . It is set
+ only for logical slots.

5a.
Should this also make use of the "effective_wal_level" terminology? e.g.
"...means that the logical decoding is disabled on primary (i.e. the
effective_wal_level is not logical)"

~

5b.
Would it be better to list all these reasons in alphabetical order?

~

5c.
I felt the "It is set only for logical slots" to be a bit vague. IMO it
is clearer to say "This reason value is only used for logical slots".

Alternatively, maybe the whole list should be restructured like below

SUGGESTION
The reason for the slot's invalidation. NULL if the slot is not
invalidated. Possible reason values for logical or physical slots are:
- wal_removed means ...
- idle_timeout means ...
Possible reason values, only for logical slots are:
- rows_removed means ...
- wal_level_insufficient means ...

~

5d.
A missing closing parenthesis. "(See xxx"

==
Kind Regards,
Peter Smith.
Fujitsu Australia.
Re: Executing pg_createsubscriber with a non-compatible control file
On Mon, Oct 13, 2025 at 03:35:06PM -0700, Masahiko Sawada wrote:
> v1-0001-Introduce-API-able-to-retrieve-contents-of-PG_VER.patch:
>
> v1-0002-pg_upgrade-Use-PG_VERSION-generic-routine.patch:
> v1-0003-pg_createsubscriber-Use-PG_VERSION-generic-routin.patch:
Applied both of these.
> The new log detail message uses the same message as what pg_resetwal
> uses, but pg_createsubscriber shows an integer major version whereas
> pg_resetwal shows the raw version string. I guess it's better to unify
> the usage for better consistency.
OK, done as suggested to limit the translation work.
> v1-0004-pg_combinebackup-use-PG_VERSION-generic-routine.patch:
>
> + pg_log_debug("read server version %u from file \"%s\"",
> +GET_PG_MAJORVERSION_NUM(version), "PG_VERSION");
>
> We used to show the full path of PG_VERSION file in the debug message.
> While probably we can live with such a small difference, do you think
> it's a good idea to move the debug message to get_pg_version()?
I didn't get into doing that, leaving it up to the caller with the
error message they want. That's a minor point, I guess; I can see
both sides of the coin.
Switched this one to report the full path, like previously.
> v1-0005-pg_resetwal-use-PG_VERSION-generic-routine.patch:
>
> /* Check that data directory matches our server version */
> - CheckDataVersion();
> + CheckDataVersion(DataDir);
>
> With the patch, pg_resetwal fails to handle a relative path of PGDATA
> as follows:
>
> $ bin/pg_resetwal data
> pg_resetwal: error: could not open version file "data/PG_VERSION": No
> such file or directory
>
> This is because pg_resetwal does chdir() to the given path of the data
> directory before checking the version.
Right. I've tested with absolute paths and forgot relative paths.
For this one, I would just use a ".", as the chdir is before the
version check. Or we could reverse the chdir() and the version check,
but there is no real benefit in doing so.
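For what it's worth, the "." approach would look roughly like this (a
sketch; pg_resetwal's actual code may differ slightly):

	/*
	 * Sketch: chdir() into the data directory first, then run the
	 * version check against ".", which works whether the user passed
	 * an absolute or a relative -D path.
	 */
	if (chdir(DataDir) < 0)
		pg_fatal("could not change directory to \"%s\": %m", DataDir);

	CheckDataVersion(".");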
Updated patch set attached.
--
Michael
From 1446b30bea9786432dcd9029ce61958ae334e46b Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Tue, 14 Oct 2025 16:41:19 +0900
Subject: [PATCH v2 1/3] pg_createsubscriber: Use PG_VERSION generic routine
---
src/bin/pg_basebackup/pg_createsubscriber.c | 19 ++-
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index d29407413d96..9fa5caaf91d6 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -26,6 +26,7 @@
#include "fe_utils/recovery_gen.h"
#include "fe_utils/simple_list.h"
#include "fe_utils/string_utils.h"
+#include "fe_utils/version.h"
#include "getopt_long.h"
#define DEFAULT_SUB_PORT "50432"
@@ -407,7 +408,8 @@ static void
check_data_directory(const char *datadir)
{
struct stat statbuf;
- char versionfile[MAXPGPATH];
+ uint32 major_version;
+ char *version_str;
pg_log_info("checking if directory \"%s\" is a cluster data directory",
datadir);
@@ -420,11 +422,18 @@ check_data_directory(const char *datadir)
pg_fatal("could not access directory \"%s\": %m", datadir);
}
- snprintf(versionfile, MAXPGPATH, "%s/PG_VERSION", datadir);
- if (stat(versionfile, &statbuf) != 0 && errno == ENOENT)
+ /*
+ * Retrieve the contents of this cluster's PG_VERSION. We require
+ * compatibility with the same major version as the one this tool is
+ * compiled with.
+ */
+ major_version = GET_PG_MAJORVERSION_NUM(get_pg_version(datadir, &version_str));
+ if (major_version != PG_MAJORVERSION_NUM)
{
- pg_fatal("directory \"%s\" is not a database cluster directory",
- datadir);
+ pg_log_error("data directory is of wrong version");
+ pg_log_error_detail("File \"%s\" contains \"%s\", which is not compatible with this program's version \"%s\".",
+ "PG_VERSION", version_str, PG_MAJORVERSION);
+ exit(1);
}
}
--
2.51.0
From 7185c5d6eb3ce0afb22d7be197e6901f2cab55c4 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Tue, 14 Oct 2025 16:47:03 +0900
Subject: [PATCH v2 2/3] pg_combinebackup: use PG_VERSION generic routine
---
src/bin/pg_combinebackup/pg_combinebackup.c | 65 +++--
1 file changed, 9 insertions(+), 56 deletions(-)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index a330c09d939d..3a3251272092 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -34,6 +34,7 @@
#include "common/relpath.h"
#include "copy_file.h"
#include "fe_utils/option_utils.h"
+#include "fe_utils/version.h"
#include "getopt_long.h"
#include "lib/stringinfo.h"
#include "load_manifest.h"
@@ -117,7 +118,6 @@ static void process_directory_recursively(Oid tsoid,
manifest_data **manifests,
manifest_writer *mwriter,
cb_options *opt);
-static int read_pg_vers
Re: A tidyup of pathkeys.c
On Tue, Oct 14, 2025 at 3:03 PM David Rowley wrote:
> Here's a list of what I've changed:
>
> 1. Add count_common_leading_pathkeys_ordered() function to check for
> leading common pathkeys and use that for sort_pathkeys,
> window_pathkeys and window_pathkeys.
> 2. Add count_common_leading_pathkeys_unordered() to check for leading
> common pathkeys that exist in any portion of the other list of
> pathkeys. Use this for group_pathkeys and distinct_pathkeys.
> 3. Add some short-circuiting to truncate_useless_pathkeys() as there's
> no point in trying to trim down the list when some other operation has
> already figured out that it needs all of the pathkeys.
> 4. Remove the stray " if (root->group_pathkeys != NIL) return true"
> from has_useful_pathkeys().

+1. I think this is a nice tidy-up.

FWIW, I complained about the stray check in has_useful_pathkeys() in [1]
last week, but you were quicker than me in making the code change to
remove it.

[1] https://postgr.es/m/CAMbWs4_zW5QU=zk32s17p8qwy+ga-3zutons+y+wopguiop...@mail.gmail.com

- Richard
Re: Implement waiting for wal lsn replay: reloaded
Hi,

On Sat, Oct 4, 2025 at 9:35 AM Xuneng Zhou wrote:
>
> Hi,
>
> On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou wrote:
> >
> > Hi,
> >
> > On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou wrote:
> > >
> > > Hi Álvaro,
> > >
> > > Thanks for your review.
> > >
> > > On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera wrote:
> > > >
> > > > On 2025-Sep-15, Alexander Korotkov wrote:
> > > >
> > > > > > It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and
> > > > > > CREATE PUBLICATION - all use minimal grammar rules that produce
> > > > > > generic option lists, with the actual interpretation done in
> > > > > > their respective implementation files. The moderate complexity
> > > > > > in wait.c seems acceptable.
> > > >
> > > > Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
> > > > lists of DefElem (a name optionally followed by a value) instead of
> > > > individual scattered elements that must later be matched up. Why not
> > > > use utility_option_list instead and then loop on the list of DefElems?
> > > > It'd be a lot simpler.
> > >
> > > I took a look at commands like VACUUM and EXPLAIN and they do follow
> > > this pattern. v11 will make use of utility_option_list.
> > >
> > > > Also, we've found that failing to surround the options by parens leads
> > > > to pain down the road, so maybe add that. Given that the LSN seems to
> > > > be mandatory, maybe make it something like
> > > >
> > > > WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
> > > >
> > > > This requires that you make LSN a keyword, albeit unreserved. Or you
> > > > could make it
> > > > WAIT FOR Ident [the rest]
> > > > and then ensure in C that the identifier matches the word LSN, such as
> > > > we do for "permissive" and "restrictive" in
> > > > RowSecurityDefaultPermissive.
> > >
> > > Shall make LSN an unreserved keyword as well.
> >
> > Here's the updated v11. Many thanks Jian for off-list discussions and
> > review.
>
> v12 removed unused
> +WaitStmt
> +WaitStmtParam
> in pgindent/typedefs.list.

Hi,

I've split the patch into multiple patch sets for easier review, per
Michael's advice [1].

[1] https://www.postgresql.org/message-id/aOMsv9TszlB1n-W7%40paquier.xyz

Best,
Xuneng

From c3dd9972d8043c07247bb3e2b476026268ee1bad Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Tue, 14 Oct 2025 20:50:04 +0800
Subject: [PATCH v13 3/3] Implement WAIT FOR command

WAIT FOR is to be used on standby and specifies waiting for the specific
WAL location to be replayed. This option is useful when the user makes
some data changes on primary and needs a guarantee to see these changes
on standby.

WAIT FOR needs to wait without any snapshot held. Otherwise, the
snapshot could prevent the replay of WAL records, implying a kind of
self-deadlock. This is why a separate utility command appears to be the
most robust way to implement this functionality. It's not possible to
implement this as a function. Previous experience shows that stored
procedures also have limitations in this respect.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan
Author: Alexander Korotkov
Co-authored-by: Xuneng Zhou
Reviewed-by: Michael Paquier
Reviewed-by: Peter Eisentraut
Reviewed-by: Dilip Kumar
Reviewed-by: Amit Kapila
Reviewed-by: Alexander Lakhin
Reviewed-by: Bharath Rupireddy
Reviewed-by: Euler Taveira
Reviewed-by: Heikki Linnakangas
Reviewed-by: Kyotaro Horiguchi
Reviewed-by: jian he
---
 doc/src/sgml/high-availability.sgml       |  54
 doc/src/sgml/ref/allfiles.sgml            |   1 +
 doc/src/sgml/ref/wait_for.sgml            | 234 +
 doc/src/sgml/reference.sgml               |   1 +
 src/backend/access/transam/xact.c         |   6 +
 src/backend/access/transam/xlog.c         |   7 +
 src/backend/access/transam/xlogrecovery.c |  11 +
 src/backend/access/transam/xlogwait.c     |  27 +-
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/wait.c               | 212
 src/backend/parser/gram.y                 |  33 ++-
 src/backend/storage/lmgr/proc.c           |   5 +
 src/backend/tcop/pquery.c                 |  12 +-
 src/backend/tcop/utility.c                |  22 ++
 src/include/access/xlogwait.h             |   3 +-
 src/include/commands/wait.h               |  22 ++
 src/include/nodes/parsenodes.h            |   8 +
 src/include/parser/kwlist.h               |   2 +
 src/include/tcop/cmdtaglist.h             |   1 +
 src/test/recovery/meson.build             |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl   | 293 ++
 src/tools/pgindent/typedefs.list          |   3 +
 23 files changed, 951 insertions(+), 13 deletio
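For context, usage under the grammar discussed upthread would look
something like this (a sketch; the option names inside WITH are
illustrative, not final):

-- Run on a standby; wait until the given LSN has been replayed.
WAIT FOR LSN '0/306EE20';
WAIT FOR LSN '0/306EE20' WITH (timeout 1000);  -- hypothetical option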
RE: Logical Replication of sequences
Dear Hou,
Thanks for updating the patch. Here are comments for recent 0002.
Others are still being reviewed
01. pg_subscription_rel.h
```
+#include "nodes/primnodes.h"
```
The inclusion is not needed because the
02. typedefs.list
```
+SubscriptionRelKind
```
Missing update.
03. subscriptioncmds.c
```
+#include "catalog/pg_sequence.h"
```
I could build without the inclusion. Can you remove?
04. check_publications_origin
```
+
+ query = "SELECT DISTINCT P.pubname AS pubname\n"
+ "FROM pg_publication P,\n"
+ " LATERAL %s GPR\n"
...
```
pgindent does not like the notation. How about changing it to break the line after the "="?
I.e.,
```
query =
"SELECT DISTINCT P.pubname AS pubname\n"
"FROM pg_publication P,\n"
" LATERAL %s GPR\n"
...
```
05. AddSubscriptionRelState
```
if (HeapTupleIsValid(tup))
elog(ERROR, "subscription table %u in subscription %u already
exists",
relid, subid);
```
Theoretically relid might be a sequence, right? Should we say "relation"
instead of "table" as well?
06. AlterSubscription_refresh_seq
```
+ /* Get local relation list. */
```
In contrast can we say "sequence"?
07. check_publications_origin
```
if (res->status != WALRCV_OK_TUPLES)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
errmsg("could not receive list of replicated
tables from the publisher: %s",
res->err)));
```
Should we say "relations" instead of "tables"? Similar lines are:
```
/* Process tables. */
...
* Log a warning if the publisher has subscribed to the same table from
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: pageinspect some function no need superuser priv
On Tue, Oct 14, 2025 at 10:29:39AM -0400, Tom Lane wrote:
> Yeah, I do not think it follows that being table owner should
> entitle you to such low-level access. I'm inclined to reject
> this proposal.

-1 here, too. IMHO all of pageinspect should remain superuser-only
since it is meant for development/debugging. The proposal doesn't
describe a use-case for the relaxed privileges, either.

--
nathan
Re: Patch for migration of the pg_commit_ts directory
Hi,

I applied the diff file. Deleted the $ts variable.

To: 'ls' ([email protected])
Cc: [email protected], [email protected], [email protected]
Subject: Patch for migration of the pg_commit_ts directory

14.10.2025, 09:27, "Hayato Kuroda (Fujitsu)":

> Hi,
> Thanks for updating the patch.
> ```
> +my ($xid,$ts) = $resold =~ /\s*(\d+)\s*\|(.*)/;
> ```
> Per my understanding $ts is not used here. Can you remove it?
> Apart from above, I found some cosmetic issues. Please see attached
> my fix which can be applied atop HEAD. Can you check and include if
> it is OK?
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED

From bc582227cdd51d252db00fccc4d572524f9a4b91 Mon Sep 17 00:00:00 2001
From: Sergey Levin
Date: Sat, 4 Oct 2025 20:22:46 +0500
Subject: [PATCH] Migration of the pg_commit_ts directory
---
 src/bin/pg_upgrade/check.c                     | 24 +++
 src/bin/pg_upgrade/controldata.c               | 20 ++
 src/bin/pg_upgrade/meson.build                 |  1 +
 src/bin/pg_upgrade/pg_upgrade.c                | 28 +++-
 src/bin/pg_upgrade/pg_upgrade.h                |  2 +
 .../pg_upgrade/t/007_transfer_commit_ts.pl     | 67 +++
 6 files changed, 139 insertions(+), 3 deletions(-)
 create mode 100644 src/bin/pg_upgrade/t/007_transfer_commit_ts.pl

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 1e17d64b3ec..382da0485a7 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -32,6 +32,7 @@ static void check_new_cluster_replication_slots(void);
 static void check_new_cluster_subscription_configuration(void);
 static void check_old_cluster_for_valid_slots(void);
 static void check_old_cluster_subscription_state(void);
+static void check_new_cluster_pg_commit_ts(void);

 /*
  * DataTypesUsageChecks - definitions of data type checks for the old cluster
@@ -767,9 +768,32 @@ check_new_cluster(void)

 	check_new_cluster_replication_slots();
 	check_new_cluster_subscription_configuration();
+
+	check_new_cluster_pg_commit_ts();
+
 }

+void
+check_new_cluster_pg_commit_ts(void)
+{
+	PGconn	   *conn;
+	PGresult   *res;
+	bool		track_commit_timestamp_on;
+
+	prep_status("Checking for new cluster configuration for commit timestamp");
+	conn = connectToServer(&new_cluster, "template1");
+	res = executeQueryOrDie(conn, "SELECT setting FROM pg_settings "
+							"WHERE name = 'track_commit_timestamp'");
+	track_commit_timestamp_on = strcmp(PQgetvalue(res, 0, 0), "on") == 0;
+	PQclear(res);
+	PQfinish(conn);
+	if (!track_commit_timestamp_on &&
+		old_cluster.controldata.chkpnt_newstCommitTsxid > 0)
+		pg_fatal("\"track_commit_timestamp\" must be \"on\" but is set to \"off\"");
+
+	check_ok();
+}

 void
 report_clusters_compatible(void)
 {
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index 90cef0864de..594dc1c0a9f 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -321,6 +321,26 @@ get_control_data(ClusterInfo *cluster)
 			cluster->controldata.chkpnt_nxtmulti = str2uint(p);
 			got_multi = true;
 		}
+		else if ((p = strstr(bufin, "Latest checkpoint's oldestCommitTsXid:")) != NULL)
+		{
+			p = strchr(p, ':');
+
+			if (p == NULL || strlen(p) <= 1)
+				pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+			p++;				/* remove ':' char */
+			cluster->controldata.chkpnt_oldstCommitTsxid = str2uint(p);
+		}
+		else if ((p = strstr(bufin, "Latest checkpoint's newestCommitTsXid:")) != NULL)
+		{
+			p = strchr(p, ':');
+
+			if (p == NULL || strlen(p) <= 1)
+				pg_fatal("%d: controldata retrieval problem", __LINE__);
+
+			p++;				/* remove ':' char */
+			cluster->controldata.chkpnt_newstCommitTsxid = str2uint(p);
+		}
 		else if ((p = strstr(bufin, "Latest checkpoint's oldestXID:")) != NULL)
 		{
 			p = strchr(p, ':');
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index ac992f0d14b..030618152d0 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -47,6 +47,7 @@ tests += {
       't/004_subscription.pl',
       't/005_char_signedness.pl',
       't/006_transfer_modes.pl',
+      't/007_transfer_commit_ts.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 490e98fa26f..70297f38dca 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -772,6 +772,9 @@ copy_subdir_files(const char *old_subdir, const char *new_subdir)
 static void
 copy_xact_xlog_xid(void)
 {
+	bool		is_copy_commit_ts;
+	uint32		oldest_xid, newest_xid;
+
 	/*
 	 * Copy old commit logs to new data dir. pg_clog has been renamed to
 	 * pg_xact in post-10 clusters.
@@ -781,6 +784,22 @@ copy_xact_xlog_xid(void)
 					GET_MAJOR_VERSION(new_cluster.major_version) <= 906 ?
 					"pg_clog" : "pg_xact");

+	/*
+	 * Copy old commit_timestamp data to new, if available.
+	 */
+	is
Re: Improved TAP tests by replacing sub-optimal uses of ok() with better Test::More functions
On 2025-10-14 Tu 4:01 AM, Michael Paquier wrote:
> On Fri, Oct 10, 2025 at 09:33:10AM -0400, Andrew Dunstan wrote:
>> Great, I think this is a definite improvement. I saw someone recently
>> complaining about this overuse of ok(), so thanks for doing the work
>> to improve it.
>
> Yeah, it's really cool to see someone step up and do all this leg work
> for the existing code. I have not checked the patch in details or if
> there are missing spots. Andrew, is that something you are planning
> to do?

I believe Sadhuprasad used this recipe to find these:

find src contrib -type f -name '*.p[lm]' -print | \
xargs grep -P '\bok[(].*[~=]'

Maybe that would miss a few, but I bet not too many.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
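For anyone skimming, the kind of rewrite at issue looks like this
(illustrative; is() and like() report the got/expected values on
failure, where a bare ok() only prints "not ok"):

# Illustrative before/after of the rewrites this patch makes:
ok($count == 3, 'expected three rows');                # before
is($count, 3, 'expected three rows');                  # after: shows got vs expected

ok($stderr =~ /permission denied/, 'error raised');    # before
like($stderr, qr/permission denied/, 'error raised');  # after: shows the string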
Re: A tidyup of pathkeys.c
> On Oct 14, 2025, at 19:22, David Rowley wrote:
>
> What makes you think making them inline would make the performance the
> same as before? The previous functions were not inlined, and I've not
> made any changes that should affect the compiler's ability to choose
> to inline these functions or not.
Ah… You are right. The old code:
static int
pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
{
int n_common_pathkeys;
(void) pathkeys_count_contained_in(root->sort_pathkeys, pathkeys,
&n_common_pathkeys);
return n_common_pathkeys;
}
Your patch’s code:
static int
count_common_leading_pathkeys_ordered(List *keys1, List *keys2)
{
int ncommon;
(void) pathkeys_count_contained_in(keys1, keys2, &ncommon);
return ncommon;
}
They both call pathkeys_count_contained_in(), you are NOT adding an extra
wrapper. So, I withdraw the “inline” comment.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
Hi,

On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou wrote:
>
> Hi,
>
> The following is the split patch set. There are certain limitations to
> this simplification effort, particularly in patch 2. The
> read_local_xlog_page_guts callback demands more functionality from the
> facility than the WAIT FOR patch — specifically, it must wait for WAL
> flush events, though it does not require timeout handling. In some
> sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
> patch, since it installs wake-up hooks in more locations. Unlike the
> WAIT FOR patch, which only needs wake-ups triggered by replay,
> read_local_xlog_page_guts must also handle wake-ups triggered by WAL
> flushes.
>
> Workload characteristics play a key role here. A sorted dlist performs
> well when insertions and removals occur in order, achieving O(1)
> complexity in the best case. In synchronous replication, insertion
> patterns seem generally monotonic with commit LSNs, though not
> strictly ordered due to timing variations and contention. When most
> insertions remain ordered, a dlist can be efficient. However, as the
> number of elements grows and out-of-order insertions become more
> frequent, the insertion cost can degrade to O(n) more often.
>
> By contrast, a pairing heap maintains stable O(1) insertion for both
> ordered and disordered inputs, with amortized O(log n) removals. Since
> LSNs in the WAIT FOR command are likely to arrive in a non-sequential
> fashion, the pairing heap introduced in v6 provides more predictable
> performance under such workloads.
>
> At this stage (v7), no consolidation between syncrep and xlogwait has
> been implemented. This is mainly because the dlist and pairing heap
> each works well under different workloads — neither is likely to be
> universally optimal. Introducing the facility with a pairing heap
> first seems reasonable, as it offers flexibility for future
> refactoring: we could later replace dlist with a heap or adopt a
> modular design depending on observed workload characteristics.

v8-0002 removed the early fast check before addLSNWaiter in
WaitForLSNReplay, as the likelihood of a server state change is small
compared to the branching cost and added code complexity.

Best,
Xuneng

From 32dab7ed64eecb62adce6b1d124b1fa389515e74 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Fri, 10 Oct 2025 16:35:38 +0800
Subject: [PATCH v8 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum waited
LSN. This enables fast-path checks: after replaying or flushing WAL, the
startup process or WAL writer can quickly determine if any waiters need
to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan
Author: Alexander Korotkov
Author: Xuneng Zhou
Reviewed-by: Michael Paquier
Reviewed-by: Peter Eisentraut
Reviewed-by: Dilip Kumar
Reviewed-by: Amit Kapila
Reviewed-by: Alexander Lakhin
Reviewed-by: Bharath Rupireddy
Reviewed-by: Euler Taveira
Reviewed-by: Heikki Linnakangas
Reviewed-by: Kyotaro Horiguchi
---
 src/backend/access/transam/Makefile       |   3 +-
 src/backend/access/transam/meson.build    |   1 +
 src/backend/access/transam/xlogwait.c     | 525 ++
 src/backend/storage/ipc/ipci.c            |   3 +
 .../utils/activity/wait_event_names.txt   |   3 +
 src/include/access/xlogwait.h             | 112
 src/include/storage/lwlocklist.h          |   1 +
 7 files changed, 627 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o

 include $(top_srcdir)/src/backend/common.mk

diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/trans
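As a reference for the pairing-heap discussion above, PostgreSQL's
lib/pairingheap.h API would be used roughly like this (the entry layout
and names are illustrative, not the patch's actual code):

#include "access/xlogdefs.h"
#include "lib/pairingheap.h"

/* Hypothetical waiter entry, keyed by the LSN being waited for. */
typedef struct WaitLSNEntry
{
	pairingheap_node ph_node;	/* must be embedded in the entry */
	XLogRecPtr	waited_lsn;
} WaitLSNEntry;

/*
 * Comparator: invert the usual ordering so that the entry with the
 * smallest waited LSN sits at the heap's root, giving O(1) access to
 * the minimum for the fast-path check after replay/flush.
 */
static int
waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
{
	const WaitLSNEntry *ea = pairingheap_const_container(WaitLSNEntry, ph_node, a);
	const WaitLSNEntry *eb = pairingheap_const_container(WaitLSNEntry, ph_node, b);

	if (ea->waited_lsn < eb->waited_lsn)
		return 1;				/* smaller LSN = higher priority */
	if (ea->waited_lsn > eb->waited_lsn)
		return -1;
	return 0;
}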
Re: pg_createsubscriber - more logging to say if there are no pubs to drop
On Tue, Oct 14, 2025 at 5:26 PM Peter Smith wrote:
>
> On Wed, Oct 15, 2025 at 11:11 AM Masahiko Sawada wrote:
> >
> > On Tue, Oct 14, 2025 at 4:29 PM Peter Smith wrote:
> > >
> > > On Wed, Oct 15, 2025 at 4:42 AM Masahiko Sawada wrote:
> > > >
> > > > On Wed, Oct 8, 2025 at 6:34 PM Peter Smith wrote:
> > > > >
> > > > > Hi hackers,
> > > > >
> > > > > While reviewing pg_createsubscriber in another thread, I found some of
> > > > > the current logging to be confusing. Specifically, there is one part
> > > > > that drops all existing publications. Sometimes it might look like
> > > > > this:
> > > > >
> > > > > --
> > > > > pg_createsubscriber: dropping all existing publications in database "db2"
> > > > > pg_createsubscriber: dropping publication "pub_exists1" in database "db2"
> > > > > pg_createsubscriber: dropping publication "pub_exists2" in database "db2"
> > > > > pg_createsubscriber: dropping publication "pub_exists3" in database "db2"
> > > > > --
> > > > >
> > > > > ~~~
> > > > >
> > > > > OTOH, if there is nothing found to be dropped, then the logging just
> > > > > says:
> > > > >
> > > > > --
> > > > > pg_createsubscriber: dropping all existing publications in database "db2"
> > > > > --
> > > > >
> > > > > That's the scenario that I found ambiguous. You can't be sure from the
> > > > > logs what happened:
> > > > > - Were there publications found, and were they dropped silently?
> > > > > - Did it not find anything to drop?
> > > > >
> > > > > ~~~
> > > > >
> > > > > Here is a small patch to remove that doubt. Now, if there is nothing
> > > > > found, the logging would look like:
> > > > >
> > > > > --
> > > > > pg_createsubscriber: dropping all existing publications in database "db2"
> > > > > pg_createsubscriber: no publications found
> > > > > --
> > > > >
> > > > > Thoughts?
> > > >
> > > > Thank you for the patch!
> > > >
> > > > It sounds like a reasonable improvement. I'll push the patch, barring
> > > > any objections.
> > > >
> > >
> > > Hm.. I'm wondering now if this patch was correct :-(
> > >
> > > I had encountered this problem and made this patch while reviewing
> > > another thread [1], but that other patch is still in development flux
> > > and so the logging was changing, and may have been broken at the time
> > > I wrote this one.
> > >
> > > In hindsight, I am not sure if this "no publication found" log is
> > > working as intended for the master code.
> > >
> > > ~~~
> > >
> > > For example, during a '--dry-run' the create/drop publication in
> > > pg_createsubscriber are NOP but they still do the appropriate logging.
> > >
> > > So, it is true that PQntuples(res) can be 0, because for dry run
> > > create_publication is NOP.
> > > But the dry run will still show logging for the create and for the drop.
> > >
> > > I hacked one of the test cases:
> > >
> > > # PS HACK
> > > command_ok(
> > > [
> > > 'pg_createsubscriber',
> > > '--verbose',
> > > '--dry-run',
> > > '--recovery-timeout' => $PostgreSQL::Test::Utils::timeout_default,
> > > '--pgdata' => $node_s->data_dir,
> > > '--publisher-server' => $node_p->connstr($db1),
> > > '--socketdir' => $node_s->host,
> > > '--subscriber-port' => $node_s->port,
> > > '--publication' => 'pub1',
> > > '--publication' => 'pub2',
> > > '--subscription' => 'sub1',
> > > '--subscription' => 'sub2',
> > > '--database' => $db1,
> > > '--database' => $db2,
> > > '--clean' => 'publications',
> > > ],
> > > 'PS HACK');
> > >
> > > This gives a logging result like:
> > >
> > > pg_createsubscriber: dropping all existing publications in database "xxx"
> > > 2025-10-15 10:18:53.583 AEDT client backend[31402] 040_pg_createsubscriber.pl LOG: statement: SELECT pubname FROM pg_catalog.pg_publication;
> > > pg_createsubscriber: no publications found
> > > pg_createsubscriber: dropping publication "pub2" in database "xxx"
> > >
> > > ~~~
> > >
> > > First saying "no publications found", and then logging a drop anyway
> > > may seem contrary to the user.
> > >
> > > That was not the intended outcome for this patch.
> >
> > Hmm. So the subscriber should have at least one publication in
> > non-dry-run cases, and in dry-run cases it has no subscription but
> > shows the "dropping publication" log, is that right? If the newly
> > added log could rather confuse users, we should revert the change.
> >
>
> IIUC the created publication (on publisher) ends up on the subscriber
> because of the physical replication that occurred before the tool
> switches over to use logical replication.
>
> So the created publication has ended up on subscriber, and now the
> --clean will remove it.
>
> But for dry-run that create_publication is NOP so it never *actually*
> propagated to the subscribe
Re: pg_createsubscriber - more logging to say if there are no pubs to drop
On Wed, Oct 8, 2025 at 6:34 PM Peter Smith wrote:
>
> Hi hackers,
>
> While reviewing pg_createsubscriber in another thread, I found some of
> the current logging to be confusing. Specifically, there is one part
> that drops all existing publications. Sometimes it might look like
> this:
>
> --
> pg_createsubscriber: dropping all existing publications in database "db2"
> pg_createsubscriber: dropping publication "pub_exists1" in database "db2"
> pg_createsubscriber: dropping publication "pub_exists2" in database "db2"
> pg_createsubscriber: dropping publication "pub_exists3" in database "db2"
> --
>
> ~~~
>
> OTOH, if there is nothing found to be dropped, then the logging just says:
>
> --
> pg_createsubscriber: dropping all existing publications in database "db2"
> --
>
> That's the scenario that I found ambiguous. You can't be sure from the
> logs what happened:
> - Were there publications found, and were they dropped silently?
> - Did it not find anything to drop?
>
> ~~~
>
> Here is a small patch to remove that doubt. Now, if there is nothing
> found, the logging would look like:
>
> --
> pg_createsubscriber: dropping all existing publications in database "db2"
> pg_createsubscriber: no publications found
> --
>
> Thoughts?

Thank you for the patch!

It sounds like a reasonable improvement. I'll push the patch, barring
any objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Making pg_rewind faster
Hi,

On Fri, Oct 10, 2025 at 12:45 AM Srinath Reddy Sadipiralla wrote:
>
> ...
> XLogFilePath , then validate this new path with the given path
> ,this helps to catch invalid xlog files like pg_wal/0001FF10.
> ...

Are you concerned that these files, which are named like XLog files but
actually aren't, are somehow created, and that we would therefore sync
them just in case? I'm trying to understand how these files would be
generated in the first place.

> instead of passing last_common_segno down the call stack directly,
> we can have a struct maybe called "rewindState" which will have the common
> information related to both clusters involved in rewind ,so that in future
> if we want to pass more such information down we won't be increasing
> the number of arguments that need to be passed down to all functions in stack.

I don't feel too strongly about this. Not sure how we view
future-proofing things, as in do we proactively wrap around structs or
wait until there's an actual need. This is the first time it's been
touched since it was introduced in 14, for what it's worth.

> we can improve the TAP test file header comment as
> # Test the situation where we skip copying the same WAL files from source to
> # target
> # except if WAL file size differs.
>
> let me put it this way
> 1) WALs are written to 00010002 in primary.
> ...
> 7) so the last common WAL segment was 00010003.

Updated comments.

> ...
> where entry->target_size != entry->source_size and we do copy,
> like if stat->mtime is changed (which means file has been copied)
> and target_stat->size != soucrce_stat->size here no error expected.

Updated patch with one additional segment that has a byte appended to it
on target but not source.

Thanks,

John H

v9-0001-Avoid-copying-WAL-segments-before-divergence-to-s.patch
Description: Binary data
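If the concern is simply rejecting names that only look hex-like, one
way to implement the suggested validation is to round-trip the name (a
sketch using the existing xlog_internal.h helpers; the actual patch may
differ):

#include <string.h>

#include "access/xlog_internal.h"

/*
 * Sketch: accept a file name only if parsing it and rebuilding the
 * canonical name for the same TLI/segment reproduces it exactly.
 * This rejects stray entries that merely look like WAL segment names.
 */
static bool
is_canonical_xlog_name(const char *fname, int wal_segsz_bytes)
{
	TimeLineID	tli;
	XLogSegNo	segno;
	char		expected[MAXFNAMELEN];

	if (!IsXLogFileName(fname))
		return false;			/* wrong length or non-hex characters */

	XLogFromFileName(fname, &tli, &segno, wal_segsz_bytes);
	XLogFileName(expected, tli, segno, wal_segsz_bytes);

	return strcmp(fname, expected) == 0;
}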
Re: Non-blocking archiver process
Here are a few comments from me:

I think that by calling system(), anything that is printed out by the
child process ends up in the PostgreSQL log. With the patch, the output
of the command seems like it will be read into a buffer and discarded.
That's a significant behavior difference.

This patch has a 10ms polling loop after which it checks for interrupts,
and then repeats. I think our normal convention in these kinds of cases
is to use WaitLatchOrSocket(). That allows for a longer sleep time (like
1s) which results in fewer processor wakeups, while actually still being
more responsive because the arrival of an interrupt should set the
process latch and terminate the wait.

In general, I agree with Artem's comments about writing code that fits
the PostgreSQL style. We don't want to invent new ways of solving
problems for which we already have infrastructure, nor do we want to
solve problems in this case in a way that is different from what we do
in other cases. Using ereport appropriately is part of that, but there's
a lot more to it. Artem mentioned malloc and free, but notice, for
example, that we also have our own wrapper around popen() in
OpenPipeStream(). Maybe we shouldn't use that here, but we shouldn't
*accidentally* not use that here.

I wonder whether we should really be using popen() in any form,
actually. If what we want is for the output of the command to go to our
standard output, as it does with system(), that's exactly what popen()
is designed not to do, so it doesn't seem like a natural fit. Perhaps we
should be using fork() + exec()? Or maybe we already have some good
infrastructure for this we can reuse somewhere?

--
Robert Haas
EDB: http://www.enterprisedb.com
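The conventional pattern being referred to looks roughly like this (a
sketch; child_is_done() is a hypothetical stand-in for however the patch
detects command completion):

	/*
	 * Latch-based wait sketch: sleep up to 1s at a time, but wake
	 * immediately if an interrupt sets our process latch.
	 */
	for (;;)
	{
		int			rc;

		if (child_is_done())	/* hypothetical completion check */
			break;

		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   1000L,
					   WAIT_EVENT_ARCHIVE_COMMAND);

		if (rc & WL_LATCH_SET)
		{
			ResetLatch(MyLatch);
			CHECK_FOR_INTERRUPTS();
		}
	}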
Re: plpython: Remove support for major version conflict detection
On 08.10.25 18:41, Nathan Bossart wrote:
> On Wed, Oct 08, 2025 at 12:28:45PM -0300, Mario González Troncoso wrote:
>> I'm attaching the same patch after rebasing from master.
>
> LGTM

Committed, thanks.
Re: get rid of RM_HEAP2_ID
On Wed, Oct 15, 2025 at 6:55 AM Michael Paquier wrote:
>
> On Tue, Oct 14, 2025 at 03:20:16PM +0300, Heikki Linnakangas wrote:
> > I'm not sure I agree with the premise that we should try to get rid of
> > RM_HEAP2_ID. There's nothing wrong with that scheme as such. As an
> > alternative, we could easily teach e.g pg_waldump to treat RM_HEAP_ID and
> > RM_HEAP2_ID the same for statistics purposes.
>
> Yeah, I'd rather keep heap2 as well. As long as there is more room
> for the record IDs, we're still going to need it in the long run.

Okay, I'll drop that aspect.

> We could move out xl_xid, which should not be required for all
> records, shaving 4 bytes from the base XLogRecord. I'm afraid of the
> duplication this would create if we push this data to each RMGR, which
> would, I guess, require a new RMGR callback to retrieve this field on
> a per-record basis. But perhaps it would not be that bad.

I've wondered if it would be possible to make xl_tot_len a varint that
starts in the last byte of the header, with the next bytes being like
XLogRecordDataHeader[Short|Long], but likewise using a varint.

--
John Naylor
Amazon Web Services
Re: get rid of RM_HEAP2_ID
On Wed, Oct 15, 2025 at 12:01:44PM +0700, John Naylor wrote:
> On Wed, Oct 15, 2025 at 6:55 AM Michael Paquier wrote:
>> We could move out xl_xid, which should not be required for all
>> records, shaving 4 bytes from the base XLogRecord. I'm afraid of the
>> duplication this would create if we push this data to each RMGR, which
>> would, I guess, require a new RMGR callback to retrieve this field on
>> a per-record basis. But perhaps it would not be that bad.
>
> I've wondered if it would be possible to make xl_tot_len a varint that
> starts in the last byte of the header, with the next bytes being like
> XLogRecordDataHeader[Short|Long], but likewise using a varint.

This suggestion gives me some shivers, TBH. We've had a couple of
issues with the read of records that span multiple pages, relying now
very strongly on xl_tot_len being the first field in XLogRecord.
Making this field variable may be really tricky while the code we have
now in xlogreader.h is pretty stable.
--
Michael

signature.asc
Description: PGP signature
Re: Optimize LISTEN/NOTIFY
> On Oct 15, 2025, at 05:19, Tom Lane wrote:
>
> "Joel Jacobson" writes:
>> Having investigated this, the "direct advancement" approach seems
>> correct to me.
>
>> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
>> is of course needed because there are other operations that don't
>> acquire the heavyweight-lock, that take shared/exclusive lock on
>> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
>> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
>> QUEUE_HEAD.)
>
> Right. What the heavyweight lock buys for us in this context is that
> we can be sure no other would-be notifier can insert any messages
> in between ours, even though we may take and release NotifyQueueLock
> several times to allow readers to sneak in. That in turn means that
> it's safe to advance readers over that whole set of messages if we
> know we didn't wake them up for any of those messages.
>
> There is a false-positive possibility if a reader was previously
> signaled but hasn't yet awoken: we will think that maybe we signaled
> it and hence not advance its pointer. This is an error in the safe
> direction however, and it will advance its pointer when it does
> wake up.
>
> A potential complaint is that we are doubling down on the need for
> that heavyweight lock, despite the upthread discussion about maybe
> getting rid of it for better scalability. However, this patch
> only requires holding a lock across all the insertions, not holding
> it through commit which I think is the true scalability blockage.
> If we did want to get rid of that lock, we'd only need to stop
> releasing NotifyQueueLock at insertion page boundary crossings,
> which I suspect isn't really that useful anyway. (In connection
> with that though, I think you ought to capture both the "before" and
> "after" pointers within that lock interval, not expend another lock
> acquisition later.)
>
> It would be good if the patch's comments made these points ...
> also, the comments above struct AsyncQueueControl need to be
> updated, because changing some other backend's queue pos is
> not legal under any of the stated rules.
>

I used to think "direct advancement" was a good idea. After reading
Tom's explanation, and reading v16 again carefully, now I also consider
that it adds complexity and could be fragile.

I just composed an example of a race condition; please see if it is
valid. Because recording queueHeadBeforeWrite and queueHeadAfterWrite
happens in PreCommit_Notify() and checking them happens in
AtCommit_Notify(), there is an interval in between where something may
happen.

Say there is a listener A, with its head pointing to 1, and the current
QueueHead is 1. Now two notifiers B and C are committing:

* B enters PreCommit_Notify(); it gets the NotifyQueueLock first, it
  records headBeforeWrite = 1 and writes to 3, and records
  headAfterWrite = 3.
* Now QueueHead is 3.
* C enters PreCommit_Notify(); it records headBeforeWrite = 3 and
  writes to 5, and records headAfterWrite = 5.
* Now QueueHead is 5.
* C starts to run AtCommit_Notify(); as A's head is 1, which doesn't
  equal C's headBeforeWrite, C won't advance A's head.
* B starts to run AtCommit_Notify(); A's head equals B's
  headBeforeWrite, so B advances A's head to 3.
* At this time, QueueHead is 5 and A's head is 3, so "direct
  advancement" will never work for A until A wakes up next time.

I am brainstorming. Maybe we can use a simpler strategy: if a backend's
queue lag exceeds a threshold, then wake it up. This solution is simpler
and reliable, also reducing the total wake-up count.

>> Given all the experiments since my earlier message, here is a fresh,
>> self-contained write-up:
>
> I'm getting itchy about removing the local listenChannels list,
> because what you've done is to replace it with a shared data
> structure that can't be accessed without a good deal of locking
> overhead. That seems like it could easily be a net loss.
>
> Also, I really do not like this implementation of
> GetPendingNotifyChannels, as it looks like O(N^2) effort.
> The potentially large length of the list it builds is scary too,
> considering the comments that SignalBackends had better not fail.
> If we have to do it that way it'd be better to collect the list
> during PreCommit_Notify.
>

I agree with Tom that GetPendingNotifyChannels() is too heavy and
unnecessary. In PreCommit_Notify(), we can maintain a local hash table
to record the pending notifications' channel names; dynahash also
supports hash tables in local memory (see the sketch below). Then in
SignalBackends(), we no longer need GetPendingNotifyChannels(); we can
just iterate over all keys of the local channel-name hash. And the local
static numChannelsListeningOn is also not needed; we can get the count
from the local hash.

WRT v16, I got a few new comments:

1 - 0002
```
 * After commit we are called another time (AtCommit_Notify()). Here we
-* make any actual updates to the effective listen st
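A minimal sketch of the local hash suggested above (hypothetical names;
key-only entries are enough, since we only need the set of channels):

#include "utils/hsearch.h"

/* Sketch of the suggested backend-local channel set (hypothetical). */
static HTAB *pending_channels = NULL;

static void
remember_pending_channel(const char *channel)
{
	if (pending_channels == NULL)
	{
		HASHCTL		ctl;

		ctl.keysize = NAMEDATALEN;
		ctl.entrysize = NAMEDATALEN;	/* key-only entries */
		pending_channels = hash_create("pending notify channels",
									   16, &ctl,
									   HASH_ELEM | HASH_STRINGS);
	}

	/* HASH_ENTER deduplicates: one entry per distinct channel name. */
	(void) hash_search(pending_channels, channel, HASH_ENTER, NULL);
}

SignalBackends() could then walk the set with hash_seq_init() /
hash_seq_search().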
Re: new environment variable INITDB_LOCALE_PROVIDER
On 10.10.25 20:09, Jeff Davis wrote:
> On Fri, 2025-10-10 at 11:32 +0200, Peter Eisentraut wrote:
>> * Use environment variable name PG_LOCALE_PROVIDER, which seems more
>> consistent.
>>
>> Is this not something that could already be done using
>> PG_TEST_INITDB_EXTRA_OPTS ?
>
> 1. PG_LOCALE_PROVIDER is a documented user-facing option, which will
> make it easier for users to set their preferred provider in scripts,
> etc.
>
> 2. This change also creates default locales for the builtin and ICU
> providers, so that initdb without any other locale options will succeed
> regardless of the provider.
>
> I broke these up into two patches as v3 to make it easier to
> understand. These patches are independently useful, but also important
> if we ever want to change the initdb default to builtin or ICU.

I'm skeptical that we want user-facing environment variables to provide
initdb defaults. The use for that hasn't really been explained. For
example, I don't recall anyone asking for an environment variable to
determine the checksum default. If we did that, then we might end up
with an environment variable per option, which would be a lot. The
locale options are already complicated enough; adding more ways to set
them, with new ways that they interact with other options, adds a lot
more complications.

I think in practice initdb is mostly run through packager-provided
infrastructure, so this facility would probably have very little impact
in practice.
Re: Clarification on Role Access Rights to Table Indexes
Jeff Davis writes: > On Mon, 2025-10-13 at 17:21 -0400, Tom Lane wrote: >> I don't think so. Even AccessShareLock is enough to block another >> session trying to acquire AccessExclusiveLock, and then not only >> have you DoS'd that session, but everything else trying to access >> the table will queue up behind the AccessExclusiveLock request. >> So it's only not-a-problem if nothing anywhere in the system wants >> non-sharable locks. > I tried imagining how that could be a problem, but couldn't come up > with anything. If the privilege check is right after the lock, then > either: > (a) The malicious AccessShareLock is granted, then is quickly released > when the privilege check fails and the transaction aborts; or > (b) The malicious AccessShareLock is queued behind a legitimate > AccessExclusiveLock, in which case every other lock would be queued up > as well. As soon as the AccessExclusiveLock is released, the > AccessShareLock would be granted, but quickly released when the > privilege check fails. > For it to be a problem, the malicious lock needs to be strong enough to > conflict with a lock level weaker than itself, i.e. ShareLock or > stronger. Robert might remember better, but I think the assumption behind the current design of RangeVarGetRelidExtended is that it's not okay to take a lock in the first place if you don't have the privilege to do so. Your analysis here supposes that it's okay to take a lock without privileges so long as you can't block someone else for very long, where "very long" is not tightly defined but hopefully isn't controllable by the malicious user. So that's moving the goalposts somewhat, but you might get people to sign onto it with more careful analysis of the worst-case delay. (The thing I'd worry about is whether it's possible to block execution of the privilege check, or even just make it slow.) Given that definition, I think you're right that it's possible to identify cases where lock-then-check can't cause meaningful DoS. RangeVarGetRelidExtended has to cope with the general case, so that's not an argument for simplifying it, but we might not need equivalent complexity everywhere. regards, tom lane
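For concreteness, a hedged sketch of the "lock-then-check" ordering that Jeff's (a)/(b) analysis assumes. The wrapper function here is hypothetical; LockRelationOid, pg_class_aclcheck, and aclcheck_error are existing server functions, and the actual RangeVarGetRelidExtended instead runs its permission callback before taking the lock:

```c
#include "postgres.h"

#include "miscadmin.h"
#include "storage/lmgr.h"
#include "utils/acl.h"
#include "utils/lsyscache.h"

/* Hypothetical helper illustrating lock-then-check; not PostgreSQL code. */
static void
lock_then_check(Oid relid)
{
	AclResult	aclresult;

	/* Take the weak lock first: an unprivileged caller gets this far... */
	LockRelationOid(relid, AccessShareLock);

	/*
	 * ...but the privilege check follows immediately.  On failure,
	 * aclcheck_error() raises ERROR and transaction abort releases the
	 * lock, so the window in which the unprivileged lock is held stays
	 * short -- the property Jeff's (a)/(b) cases rely on.
	 */
	aclresult = pg_class_aclcheck(relid, GetUserId(), ACL_SELECT);
	if (aclresult != ACLCHECK_OK)
		aclcheck_error(aclresult, OBJECT_TABLE, get_rel_name(relid));
}
```

Tom's worry then reduces to whether anything can delay execution between those two steps, or slow down the check itself.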
Re: Optimize LISTEN/NOTIFY
On Sat, Oct 11, 2025, at 09:43, Joel Jacobson wrote: > On Sat, Oct 11, 2025, at 08:43, Joel Jacobson wrote: >> In addition to previously suggested optimization, there is another major ... >> I'm not entirely sure this approach is correct though

Having investigated this, the "direct advancement" approach seems correct to me. (I understand the exclusive lock on NotifyQueueLock in PreCommit_Notify is of course needed because other operations that don't acquire the heavyweight lock still take NotifyQueueLock in shared or exclusive mode to read or modify QUEUE_HEAD, and PreCommit_Notify modifies QUEUE_HEAD.)

Given all the experiments since my earlier message, here is a fresh, self-contained write-up:

This series has two patches:

* 0001-optimize_listen_notify-v16.patch: Improve test coverage of async.c. Adds isolation specs covering previously untested paths (subxact LISTEN reparenting/merge/abort, simple NOTIFY reparenting, notification_match dedup, and an array-growth case used by the follow-on patch).
* 0002-optimize_listen_notify-v16.patch: Optimize LISTEN/NOTIFY by maintaining a shared channel map and using direct advancement to avoid useless wakeups.

Problem
-------

Today SignalBackends wakes all listeners in the same database, with no knowledge of which backends listen on which channels. When backends are listening on different channels, each NOTIFY causes unnecessary wakeups and context switches, which can become the bottleneck in such workloads.

Overview of the solution (patch 0002)
-------------------------------------

* Introduce a lazily-created DSA+dshash map (dboid, channel) -> [ProcNumber] (channelHash). AtCommit_Notify maintains it for LISTEN/UNLISTEN, and SignalBackends consults it to signal only listeners on the channels notified within the transaction. (A sketch of what such a map could look like follows below.)
* Add a per-backend wakeupPending flag to suppress duplicate signals.
* Direct advancement: while queuing, PreCommit_Notify records the queue head before and after our writes. Writers are globally serialized, so the interval [oldHead, newHead) contains only our entries. SignalBackends advances any backend still at oldHead directly to newHead, avoiding a pointless wakeup.
* Keep the queue healthy by signaling backends that have fallen too far behind (lag >= QUEUE_CLEANUP_DELAY) so the global tail can advance.
* pg_listening_channels and IsListeningOn now read from channelHash.
* Add LWLock tranche NOTIFY_CHANNEL_HASH and wait event NotifyChannelHash.

No user-visible semantic changes are intended; this is an internal performance improvement.

Benchmark
---------

Using a patched pgbench (adds --listen-notify-benchmark; attached as .txt to avoid confusing cfbot). Each run performs 10,000 round trips and adds 100 idle listeners per iteration.

master (HEAD):

% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=1 --notify-idle-step=100
idle_listeners  round_trips_per_sec  max_latency_usec
             0              32123.7               893
           100               1952.5              1465
           200                991.4              3438
           300                663.5              2454
           400                494.6              2950
           500                398.6              3394
           600                332.8              4272
           700                287.1              4692
           800                252.6              5208
           900                225.4              5614
          1000                202.5              6212

0002-optimize_listen_notify-v16.patch:

% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=1 --notify-idle-step=100
idle_listeners  round_trips_per_sec  max_latency_usec
             0              31832.6              1067
           100              32341.0              1035
           200              31562.5              1054
           300              30040.1              1057
           400              29287.1              1023
           500              28191.9              1201
           600              28166.5              1019
           700              26994.3              1094
           800              26501.0              1043
           900              25974.2              1005
          1000              25720.6              1008

Benchmarked on a MacBook Pro (Apple M3 Max).
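As a point of reference for the channelHash idea above, here is a minimal sketch of a (dboid, channel) dshash table. Every struct, field, and tranche name below is an illustrative assumption rather than the patch's actual definition; only the dshash_* calls are the real API:

```c
#include "postgres.h"

#include "lib/dshash.h"
#include "utils/dsa.h"

/* Hypothetical key/entry layout for channelHash; not the patch's code. */
typedef struct NotifyChannelKey
{
	Oid			dboid;
	char		channel[NAMEDATALEN];
} NotifyChannelKey;

typedef struct NotifyChannelEntry
{
	NotifyChannelKey key;		/* hash key; must be first */
	dsa_pointer	listeners;		/* dsa-allocated array of ProcNumber */
	int			nlisteners;
} NotifyChannelEntry;

/* Keys are compared/hashed bytewise; the tranche id is assumed. */
static const dshash_parameters channel_hash_params = {
	sizeof(NotifyChannelKey),
	sizeof(NotifyChannelEntry),
	dshash_memcmp,
	dshash_memhash,
	dshash_memcpy,
	LWTRANCHE_NOTIFY_CHANNEL_HASH
};

static void
channel_hash_add_listener(dshash_table *channelHash, Oid dboid,
						  const char *channel)
{
	NotifyChannelKey key;
	NotifyChannelEntry *entry;
	bool		found;

	/* Zero-fill so bytewise compare/hash see deterministic padding. */
	memset(&key, 0, sizeof(key));
	key.dboid = dboid;
	strlcpy(key.channel, channel, NAMEDATALEN);

	/* Returns with the entry's partition lock held exclusively. */
	entry = dshash_find_or_insert(channelHash, &key, &found);
	if (!found)
	{
		entry->listeners = InvalidDsaPointer;
		entry->nlisteners = 0;
	}
	/* ...grow the listeners array in DSA and append our ProcNumber... */
	dshash_release_lock(channelHash, entry);
}
```

The table itself would be created lazily with dshash_create() by the first LISTEN and attached to by later backends via dshash_attach(), which matches the "lazily-created" wording above.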
Files
-----

* 0001-optimize_listen_notify-v16.patch - tests only.
* 0002-optimize_listen_notify-v16.patch - implementation.
* pgbench-listen-notify-benchmark-patch.txt - adds --listen-notify-benchmark.

Feedback and review are much welcomed.

/Joel
