Re: Can can I make an injection point wait occur no more than once?

2025-07-08 Thread Noah Misch
On Tue, Jul 08, 2025 at 11:21:20AM -0400, Peter Geoghegan wrote:
> On Mon, Jul 7, 2025 at 9:53 PM Noah Misch  wrote:
> > If it continues to be a problem, consider sharing the patch that's behaving
> > this way for you.
> 
> Attached patch shows my current progress with the isolation test.

Nothing looks suspicious in that code.

> I also attach diff output of the FreeBSD failures. Notice that the
> line "backwards_scan_session: NOTICE:  notice triggered for injection
> point lock-and-validate-new-lastcurrblkno" is completely absent from
> the test output. This absence indicates that the desired test coverage
> is totally missing on FreeBSD -- so the test is completely broken on
> FreeBSD.
> 
> I ran "meson test --suite setup --suite nbtree -q --print-errorlogs"
> in a loop 500 times on my Debian workstation without seeing any
> failures. Seems stable there. Whereas the FreeBSD target hasn't even
> passed once out of more than a dozen attempts. Seems to be reliably
> broken on FreeBSD.

> -backwards_scan_session: NOTICE:  notice triggered for injection point 
> lock-and-validate-new-lastcurrblkno
> +ERROR:  could not find injection point lock-and-validate-left to wake up

Agreed.  Perhaps it's getting a different plan type on FreeBSD, so it's not
even reaching the INJECTION_POINT() calls?  That would be consistent with
these output diffs having no ERROR from attach/detach.  Some things I'd try:

- Add a plain elog(WARNING) before each INJECTION_POINT()
- Use debug_print_plan or similar to confirm the plan type




Re: A assert failure when initdb with track_commit_timestamp=on

2025-07-08 Thread Michael Paquier
On Wed, Jul 09, 2025 at 02:39:22AM +, Hayato Kuroda (Fujitsu) wrote:
> +# Some GUCs like track_commit_timestamp cannot be set to a non-default value in
> +# bootstrap mode because they may cause a crash. Ensure such settings are safely
> +# ignored.
> +command_ok(
> +   [
> +   'initdb', "$tempdir/dataXX",
> +   '-c' => 'track_commit_timestamp=on',
> +   '-c' => 'transaction_timeout=200s'
> +   ],
> +   'successful creation with ignored settings');
> +
> ```

I'd suggest keeping them separate across multiple scripts, where they
hold meaning, as one failure may get hidden by the other.
track_commit_timestamp makes sense in the test module commit_ts; we
should select a second location for transaction_timeout if we keep it
in the end.

> But both Andy's patch and mine assume that post-bootstrap transactions can be
> finished within the specified time. An extremely long value is set above, but I
> cannot say that no machine will spend 200s here...

A fresh initdb can be longer than this threshold under
CLOBBER_CACHE_ALWAYS, if my memory serves me well.  There are some
machines with a valgrind setup, additionally, that can take some time,
but I am not sure about their timings when it comes to a bootstrap
setup.
--
Michael


signature.asc
Description: PGP signature


Should TRUNCATE fire DDL triggers

2025-07-08 Thread Hari Krishna Sunder
First of all, is TRUNCATE a DDL or a DML? One doc page refers to it as a
DDL, whereas other doc pages treat it as a DML, so which one is it?

A lot of other SQL databases treat TRUNCATE as a DDL, so assuming that is
true, can we add it to the command tags supported by "ddl_command_start"
and "ddl_command_end" triggers?
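
To make the current split concrete, here is a minimal sketch (hypothetical
object names, nothing from the thread): a regular statement-level trigger
already fires on TRUNCATE today, while the event-trigger command-tag list is
what currently leaves TRUNCATE out.

```sql
-- Regular triggers already support TRUNCATE at statement level.
CREATE TABLE demo_tbl (id int);

CREATE FUNCTION note_truncate() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE 'TRUNCATE ran on %', TG_TABLE_NAME;
  RETURN NULL;
END;
$$;

CREATE TRIGGER demo_truncate_trig
  BEFORE TRUNCATE ON demo_tbl
  FOR EACH STATEMENT EXECUTE FUNCTION note_truncate();

TRUNCATE demo_tbl;  -- raises the NOTICE; an event trigger on
                    -- ddl_command_end would not fire here today
```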

---
Hari Krishna Sunder


Re: Should TRUNCATE fire DDL triggers

2025-07-08 Thread David G. Johnston
On Tuesday, July 8, 2025, Hari Krishna Sunder  wrote:

> First of all, is TRUNCATE a DDL or a DML? One doc page refers to it as a
> DDL, whereas other doc pages treat it as a DML, so which one is it?
>

Neither…classification systems are often imperfect…but it sure quacks like
DML to my ears.  I'd probably remove the term "DDL" from that first doc page
and avoid the grey area.  Listing the two commands suffices.


>
> A lot of other SQL databases treat TRUNCATE as a DDL, so assuming that is
> true, can we add it to the command tags supported by "ddl_command_start"
> and "ddl_command_end" triggers?
>
>
Seems worthy of consideration regardless of how one answers the prior
question; for much the same reason.

David J.


Re: [PATCH] Add support for displaying database service in psql prompt

2025-07-08 Thread Michael Paquier
On Tue, Jul 08, 2025 at 05:41:32PM -0700, Noah Misch wrote:
> I'd prefer not to get involved in decisions affecting only psql efficiency and
> psql code cosmetics.  Please make that decision without my input.

Okay, I have used this more extensible routine then, planning to use
it for the other patch.  The prompt shortcut retrieves the value using
a GetVariable(), rather than looking at the connection options all the
time.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Prevent replacement of a function if it's used in an index expression and is not IMMUTABLE

2025-07-08 Thread jian he
On Mon, Jun 30, 2025 at 5:34 PM sundayjiang(蒋浩天)
 wrote:
>
> The purpose of this patch is to prevent replacing a function via `CREATE OR 
> REPLACE FUNCTION` with a new definition that is not marked as `IMMUTABLE`, if 
> the existing function is referenced by an index expression.
>
> Replacing such functions may lead to index corruption or runtime semantic 
> inconsistencies, especially when the function’s output is not stable for the 
> same input.
>
> If a function is used in an index, it can only be replaced if it is declared 
> as `IMMUTABLE`.
>

Looking at the code right above, ``if (oldproc->prokind != prokind)``:

+ if(oldproc->prokind == PROKIND_FUNCTION && volatility != PROVOLATILE_IMMUTABLE){

we can change it to

+if (prokind == PROKIND_FUNCTION && oldproc->provolatile == PROVOLATILE_IMMUTABLE &&
+   volatility != PROVOLATILE_IMMUTABLE)
+ {

The curly brace generally begins on a new line.

if (index_found)
+ ereport(ERROR,
+ (errcode(ERRCODE_DEPENDENT_OBJECTS_STILL_EXIST),
+ errmsg("cannot replace function \"%s\" with a non-IMMUTABLE function because it is used by an index",
+ procedureName)));

Here, errcode ERRCODE_FEATURE_NOT_SUPPORTED would be more appropriate.

you can add a simple test in src/test/regress/sql/create_function_sql.sql
for example:
CREATE OR REPLACE FUNCTION get_a_int(int default 2) RETURNS int
IMMUTABLE AS 'select $1' LANGUAGE sql;
create table t1(a int);
create index on t1((get_a_int(a)));
CREATE OR REPLACE FUNCTION get_a_int(int default 2) RETURNS int AS
'select $1' LANGUAGE sql;




Re: Using failover slots for PG-non_PG logical replication

2025-07-08 Thread shveta malik
On Tue, Jul 8, 2025 at 7:29 PM Ashutosh Bapat
 wrote:
>
> On Tue, Jul 8, 2025 at 4:03 PM shveta malik  wrote:
> >
> > On Mon, Jul 7, 2025 at 7:00 PM Ashutosh Bapat
> >  wrote:
> > >
> > > On Mon, Jul 7, 2025 at 9:53 AM shveta malik  
> > > wrote:
> > > >
> > > > Thanks for the patch.
> > > >
> > > > > I couldn't figure out whether the query on primary to fetch all the
> > > > > slots to be synchronized should filter based on invalidation_reason
> > > > > and conflicting or not. According to synchronize_slots(), it seems
> > > > > that we retain invalidated slots on standby when failover = true and
> > > > > they would remain with synced = true on standby. Is that right?
> > > > >
> > > >
> > > > Yes, that’s correct. We sync the invalidation status of replication
> > > > slots from the primary to the standby and then stop synchronizing any
> > > > slots that have been marked as invalidated, retaining synced flag as
> > > > true. IMO, there's no need to filter out conflicting slots on the
> > > > primary, because instead of excluding them there and showing
> > > > everything as failover-ready on the standby, the correct approach is
> > > > to reflect the actual state on standby. This means conflicting slots
> > > > will appear as non-failover-ready on the standby. That’s why Step 3
> > > > also considers conflicting flag in its evaluation.
> > >
> > > Thanks for the explanation. WFM.
> > >
> > > If a slot is invalidated because RS_INVAL_WAL_REMOVED, conflicting =
> > > false, but the slot is not useful standby. But then it's not useful on
> > > primary as well. Is that why we are not including (invalidation_reason
> > > IS NOT NULL) condition along in (synced AND NOT temporary AND NOT
> > > conflicting)?
> >
> > Thanks for bringing it up. Even if the slot is not useful on the
> > primary node as well, we shall still show failover-ready as false on
> > standby. We should modify the query of step3 to check
> > 'invalidation_reason IS NULL' instead of  'NOT conflicting'. That will
> > cover all the cases where the slot  is invalidated and thus not
> > failover ready.
>
> Thanks for the confirmation.
>
> >
> > > >
> > > > 1)
> > > > +/* primary # */ SELECT array_agg(quote_literal(r.slot_name)) AS slots
> > > > +   FROM pg_replication_slots r
> > > > +   WHERE r.failover AND NOT r.temporary;
> > > >
> > > > On primary, there is no  need to check temporary-status. We do not
> > > > allow setting failover as true for temporary slots.
> > >
> > > Why does synchronize_slots() has it in its query?
> > >
> >
> > It is not needed but no harm in maintaining it.
> > If we want documents to be in sync with code, we can have 'not
> > temporary' check in doc as well.
> >
>
> I think it's better to keep the code and the doc in sync otherwise we
> developers would get confused.
>
> > > >
> > > > 2)
> > > > Although not directly related to the concerns addressed in the given
> > > > patch, I think it would be helpful to add a note in the original doc
> > > > stating that Steps 1 and 2 should be executed on each subscriber node
> > > > that will be served by the standby after failover.
> > >
> > > There's a bit of semantic repetition here since an earlier paragraph
> > > mentions a given subscriber. But I think overall it's better to be
> > > more clear than being less clear.
> > > >
> > > > I have attached a top-up patch with the above changes and a few more
> > > > trivial changes. Please include it if you find it okay.
> > >
> > > Thanks. Included. I have made a few edits and included them in the
> > > attached patch.
> > >
> >
> > Thanks. The existing changes (originally targeted in this patch) look
> > good to me.
> >
> > I have attached a top-up patch for step-3 correction. Please include
> > if you find it okay to be fixed in the same patch, otherwise we can
> > handle it separately.
>
> I have split your top up patch into 2 - one related to the document
> change being the subject of this thread and the other for fixing the
> query. Committer may squash the patch, if they think so.
>

The changes look good to me.
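
To spell out the corrected step 3, the standby-side check would then look
roughly like this (a sketch only; the slot names are hypothetical and the
exact query in the doc patch may differ):

```sql
/* standby # */ SELECT slot_name,
       (synced AND NOT temporary AND invalidation_reason IS NULL) AS failover_ready
  FROM pg_replication_slots
 WHERE slot_name IN ('slot1', 'slot2');
```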

thanks
Shveta




RE: A assert failure when initdb with track_commit_timestamp=on

2025-07-08 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san, Andy,

> Shouldn't we also add a TAP test to verify that initdb works correctly
> with GUCs marked as GUC_NOT_IN_BOOTSTRAP?

Another place we can put the test is 001_initdb.pl, like:

```
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -331,4 +331,15 @@ command_fails(
[ 'pg_checksums', '--pgdata' => $datadir_nochecksums ],
"pg_checksums fails with data checksum disabled");
 
+# Some GUCs like track_commit_timestamp cannot be set to a non-default value in
+# bootstrap mode because they may cause a crash. Ensure such settings are safely
+# ignored.
+command_ok(
+   [
+   'initdb', "$tempdir/dataXX",
+   '-c' => 'track_commit_timestamp=on',
+   '-c' => 'transaction_timeout=200s'
+   ],
+   'successful creation with ignored settings');
+
```

But both Andy's patch and mine assume that post-bootstrap transactions can be
finished within the specified time. An extremely long value is set above, but I
cannot say that no machine will spend 200s here...

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: A recent message added to pg_upgade

2025-07-08 Thread Dilip Kumar
On Tue, Jul 8, 2025 at 5:24 PM Amit Kapila  wrote:
>
> On Tue, Jul 8, 2025 at 4:49 PM Dilip Kumar  wrote:
> >
> > On Tue, Jul 8, 2025 at 2:29 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jul 8, 2025 at 11:32 AM Dilip Kumar  wrote:
> > > >
> > > > On Mon, Jul 7, 2025 at 11:22 PM Tom Lane  wrote:
> > > > >
> > > >
> > > > > There's a bigger picture here, though.  The fundamental thing that
> > > > > I find wrong with the current code is that knowledge of and
> > > > > responsibility for this max_slot_wal_keep_size hack is spread across
> > > > > both pg_upgrade and the server.  It would be better if it were on
> > > > > just one side.  Now, unless we want to change that Assert that
> > > > > 8bfb231b4 put into InvalidatePossiblyObsoleteSlot(), the server side
> > > > > is going to be aware of this decision.  So I'm inclined to think
> > > > > that we should silently enforce max_slot_wal_keep_size = -1 in
> > > > > binary-upgrade mode in the server's GUC check hook, and then remove
> > > > > knowledge of it from pg_upgrade altogether.  Maybe the same for
> > > > > idle_replication_slot_timeout, which really has got the same issue
> > > > > that we don't want users overriding that choice.
> > > >
> > > > Yeah this change makes sense,
> > > >
> > >
> > > Agreed.
> > >
> > > One other idea to achieve similar functionality is that during
> > > BinaryUpgrade, avoid removing WAL due to max_slot_wal_keep_size, and
> > > also skip InvalidateObsoleteReplicationSlots. The one advantage of
> > > such a change is that after this, we can remove Assert in
> > > InvalidatePossiblyObsoleteSlot, remove check_hook functions for GUCs
> > > max_slot_wal_keep_size and idle_replication_slot_timeout, and remove
> > > special settings for these GUCs in pg_upgrade.
> >
> > Yeah that could also be possible, not sure which one is better though,
> > with this idea we will have to put BinaryUpgrade check in KeepLogSeg()
> > as well as in InvalidateObsoleteReplicationSlots() whereas forcing the
> > GUC to be -1 during binary upgrade we just need check at one place.
> >
>
> But OTOH, as mentioned, we can remove all other codes like check_hooks
> and probably assert as well.
>

After further consideration, I believe your proposed method is
superior to forcing the max_slot_wal_keep_size to -1 via a check hook.
The ultimate goal is to prevent WAL removal during a binary upgrade,
and your approach directly addresses this issue rather than
controlling it by forcing the GUC value.  I am planning to send a
patch using this approach for both max_slot_wal_keep_size as well as
for idle_replication_slot_timeout.

-- 
Regards,
Dilip Kumar
Google




Re: Can can I make an injection point wait occur no more than once?

2025-07-08 Thread Peter Geoghegan
On Tue, Jul 8, 2025 at 11:04 PM Noah Misch  wrote:
> > -backwards_scan_session: NOTICE:  notice triggered for injection point 
> > lock-and-validate-new-lastcurrblkno
> > +ERROR:  could not find injection point lock-and-validate-left to wake up
>
> Agreed.  Perhaps it's getting a different plan type on FreeBSD, so it's not
> even reaching the INJECTION_POINT() calls?  That would be consistent with
> these output diffs having no ERROR from attach/detach.  Some things I'd try:
>
> - Add a plain elog(WARNING) before each INJECTION_POINT()
> - Use debug_print_plan or similar to confirm the plan type

I added a pair of elog(WARNING) traces before each of the new
INJECTION_POINT() calls.

When I run the test against the FreeBSD CI target with this new
instrumentation, I see a WARNING that indicates that we've reached the
top of _bt_lock_and_validate_left as expected. I don't see any second
WARNING indicating that we've taken _bt_lock_and_validate_left's
unhappy path, though (and the test still fails). This doesn't look
like an issue with the planner.

I attach the relevant regression test output, that shows all this.

Thanks
-- 
Peter Geoghegan
diff -U3 /tmp/cirrus-ci-build/src/test/modules/nbtree/expected/backwards.out 
/tmp/cirrus-ci-build/build/testrun/nbtree/isolation/results/backwards.out
--- /tmp/cirrus-ci-build/src/test/modules/nbtree/expected/backwards.out 
2025-07-09 03:22:42.701999000 +
+++ /tmp/cirrus-ci-build/build/testrun/nbtree/isolation/results/backwards.out   
2025-07-09 03:26:13.80234 +
@@ -1,24 +1,8 @@
 Parsed test spec with 2 sessions
 
 starting permutation: b_scan i_insert i_detach
-step b_scan: SELECT * FROM backwards_scan_tbl WHERE col % 100 = 1 ORDER BY col 
DESC; 
-step i_insert: INSERT INTO backwards_scan_tbl SELECT i FROM 
generate_series(-2000, 700) i;
-step i_detach: 
-  SELECT injection_points_detach('lock-and-validate-left');
-  SELECT injection_points_wakeup('lock-and-validate-left');
-
-injection_points_detach

-   
-(1 row)
-
-injection_points_wakeup

-   
-(1 row)
-
-backwards_scan_session: NOTICE:  notice triggered for injection point 
lock-and-validate-new-lastcurrblkno
-step b_scan: <... completed>
+backwards_scan_session: WARNING:  INJECTION_POINT lock-and-validate-left
+step b_scan: SELECT * FROM backwards_scan_tbl WHERE col % 100 = 1 ORDER BY col 
DESC;
 col
 ---
 601
@@ -30,3 +14,14 @@
   1
 (7 rows)
 
+step i_insert: INSERT INTO backwards_scan_tbl SELECT i FROM 
generate_series(-2000, 700) i;
+step i_detach: 
+  SELECT injection_points_detach('lock-and-validate-left');
+  SELECT injection_points_wakeup('lock-and-validate-left');
+
+injection_points_detach
+---
+   
+(1 row)
+
+ERROR:  could not find injection point lock-and-validate-left to wake up


RE: A assert failure when initdb with track_commit_timestamp=on

2025-07-08 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> I'd suggest keeping them separate across multiple scripts, where they
> hold meaning, as one failure may get hidden by the other.
> track_commit_timestamp makes sense in the test module commit_ts; we
> should select a second location for transaction_timeout if we keep it
> in the end.

OK, so track_commit_timestamp can be tested like I initially did:

```
--- a/src/test/modules/commit_ts/t/001_base.pl
+++ b/src/test/modules/commit_ts/t/001_base.pl
@@ -11,8 +11,7 @@ use Test::More;
 use PostgreSQL::Test::Cluster;
 
 my $node = PostgreSQL::Test::Cluster->new('foxtrot');
-$node->init;
-$node->append_conf('postgresql.conf', 'track_commit_timestamp = on');
+$node->init(extra => [ '-c', 'track_commit_timestamp=on' ]);
 $node->start;
```

> A fresh initdb can be longer than this threshold under
> CLOBBER_CACHE_ALWAYS, if my memory serves me well.  There are some
> machines with a valgrind setup, additionally, that can take some time,
> but I am not sure about their timings when it comes to a bootstrap
> setup.

Hmm. So I feel that we should not add a test for transaction_timeout for such
slow environments. Thoughts?

Best regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Improve pg_sync_replication_slots() to wait for primary to advance

2025-07-08 Thread shveta malik
Please find a few more comments:

1)
In  pg_sync_replication_slots() doc, we have this:

"Note that this function is primarily intended for testing and
debugging purposes and should be used with caution. Additionally, this
function cannot be executed if "

We can get rid of this info as well and change to:

"Note that this function cannot be executed if"

2)
We got rid of the NOTE in logicaldecoding.sgml, but now the page does not
mention pg_sync_replication_slots() at all. We need to bring back the
change removed by [1] (or something along similar lines), which is this:

- CREATE SUBSCRIPTION during slot creation, and
then calling
- 
- pg_sync_replication_slots
- on the standby. By setting 
+ CREATE SUBSCRIPTION during slot creation.
+ Additionally, enabling 
+ sync_replication_slots on the standby
+ is required. By enabling 

3)
wait_for_primary_slot_catchup():
+ /*
+ * It is possible to get null values for confirmed_lsn and
+ * catalog_xmin if on the primary server the slot is just created with
+ * a valid restart_lsn and slot-sync worker has fetched the slot
+ * before the primary server could set valid confirmed_lsn and
+ * catalog_xmin.
+ */

Do we need this special handling? We already have similar handling in
synchronize_slots(). Please see:
/*
 * If restart_lsn, confirmed_lsn or catalog_xmin is invalid but the
 * slot is valid, that means we have fetched the remote_slot in its
 * RS_EPHEMERAL state. In such a case, don't sync it; we can always
 * sync it in the next sync cycle when the remote_slot is persisted
 * and has valid lsn(s) and xmin values.
 */
if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) ||
     XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) ||
     !TransactionIdIsValid(remote_slot->catalog_xmin)) &&
    remote_slot->invalidated == RS_INVAL_NONE)
    pfree(remote_slot);


Due to the above check in synchronize_slots(), we will not reach
wait_for_primary_slot_catchup() when any of confirmed_lsn or
catalog_xmin is not initialized.


[1]: 
https://www.postgresql.org/message-id/CAJpy0uAD_La2vi%2BB%2BiSBbCYTMayMstvbF9ndrAJysL9t5fHtbQ%40mail.gmail.com


thanks
Shveta




Re: POC: Parallel processing of indexes in autovacuum

2025-07-08 Thread Daniil Davydov
Hi,

On Tue, Jul 8, 2025 at 10:20 PM Matheus Alcantara
 wrote:
>
> On Sun Jul 6, 2025 at 5:00 AM -03, Daniil Davydov wrote:
> > I will keep the 'max_worker_processes' limit, so autovacuum will not
> > waste time initializing a parallel context if there is no chance that
> > the request will succeed.
> > But it's worth remembering that actually the
> > 'autovacuum_max_parallel_workers' parameter will always be implicitly
> > capped by 'max_parallel_workers'.
> >
> > What do you think about it?
> >
>
> It makes sense to me. The main benefit that I see in capping the
> autovacuum_max_parallel_workers parameter is that users will see an
> "invalid value for parameter "autovacuum_max_parallel_workers"" error in the
> logs instead of needing to search for "planned vs. launched", which can be
> tricky if log_min_messages is not set to at least the info level (the
> default warning level will not show this log message).
>

I think I can refer to (for example) the 'max_parallel_workers_per_gather'
parameter, which allows setting values higher than 'max_parallel_workers'
without throwing an error or warning.

'autovacuum_max_parallel_workers' will behave the same way.
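
A minimal illustration of that existing behavior (the values are arbitrary):

```sql
SET max_parallel_workers = 8;
SET max_parallel_workers_per_gather = 64;  -- accepted silently, no warning
SHOW max_parallel_workers_per_gather;      -- 64, but the effective number of
                                           -- workers is still capped at run time
```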

> If we decide to not cap this in code, I think that at least it would be good
> to mention this in the documentation.

Sure, it is worth noting in the documentation.

--
Best regards,
Daniil Davydov




Re: amcheck support for BRIN indexes

2025-07-08 Thread Arseniy Mukhin
On Mon, Jul 7, 2025 at 3:21 PM Álvaro Herrera  wrote:
>
> On 2025-Jul-07, Tomas Vondra wrote:
>
> > Alvaro, what's your opinion on the introduction of the new WITHIN_RANGE?
> > I'd probably try to do this using the regular consistent function:
> >
> > (a) we don't need to add stuff to all BRIN opclasses to support this
> >
> > (b) it gives us additional testing of the consistent function
> >
> > (c) building a scan key for equality seems pretty trivial
> >
> > What do you think?
>
> Oh yeah, if we can build this on top of the existing primitives, then
> I'm all for that.

Thank you for the feedback! I agree with the benefits. Speaking of
(c), it seems to be really trivial to build such a ScanKey most of the
time, but not every opclass supports the '=' operator, and amcheck
should then handle those cases somehow. I see two options here. The
first is to not provide the 'heap all indexed' check for such opclasses,
which is sad because even one core opclass (box_inclusion_ops) doesn't
support the '=' operator, and the postgis BRIN opclasses don't support
it either, AFAICS. The second option is to let the user define which
operator to use during the check, which, I think, makes the user
experience much worse in this case. So neither option looks good from
the user's POV to me, and I don't really know. What do you think about it?

And should I revert the patchset to the consistent function version then?


Best regards,
Arseniy Mukhin




Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach

2025-07-08 Thread Tomas Vondra
On 7/8/25 03:55, Cédric Villemain wrote:
> Hi Andres,
> 
>> Hi,
>>
>> On 2025-07-05 07:09:00 +, Cédric Villemain wrote:
>>> In my work on more careful PostgreSQL resource management, I've come
>>> to the
>>> conclusion that we should avoid pushing policy too deeply into the
>>> PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
>>> NUMA-specific management directly into core PostgreSQL in such a way.
>>
>> I think it's actually the opposite - whenever we pushed stuff like this
>> outside of core it has hurt postgres substantially. Not having
>> replication in
>> core was a huge mistake. Not having HA management in core is probably the
>> biggest current adoption hurdle for postgres.
>>
>> To deal better with NUMA we need to improve memory placement and various
>> algorithms, in an interrelated way - that's pretty much impossible to do
>> outside of core.
> 
> Except the backend pinning which is easy to achieve, thus my comment on
> the related patch.
> I'm not claiming NUMA memory and all should be managed outside of core
> (though I didn't read other patches yet).
> 

But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory, which I suspect an external
module will have trouble following.

I still don't have any idea what exactly the external module would do,
or how it would decide where to place the backend. Can you describe some
use case with an example?

Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.


regards

-- 
Tomas Vondra





Re: amcheck support for BRIN indexes

2025-07-08 Thread Tomas Vondra



On 7/8/25 14:40, Arseniy Mukhin wrote:
> On Mon, Jul 7, 2025 at 3:21 PM Álvaro Herrera  wrote:
>>
>> On 2025-Jul-07, Tomas Vondra wrote:
>>
>>> Alvaro, what's your opinion on the introduction of the new WITHIN_RANGE?
>>> I'd probably try to do this using the regular consistent function:
>>>
>>> (a) we don't need to add stuff to all BRIN opclasses to support this
>>>
>>> (b) it gives us additional testing of the consistent function
>>>
>>> (c) building a scan key for equality seems pretty trivial
>>>
>>> What do you think?
>>
>> Oh yeah, if we can build this on top of the existing primitives, then
>> I'm all for that.
> 
> Thank you for the feedback! I agree with the benefits. Speaking of
> (c), it seems to be really trivial to build such a ScanKey most of the
> time, but not every opclass supports the '=' operator, and amcheck
> should then handle those cases somehow. I see two options here. The
> first is to not provide the 'heap all indexed' check for such opclasses,
> which is sad because even one core opclass (box_inclusion_ops) doesn't
> support the '=' operator, and the postgis BRIN opclasses don't support
> it either, AFAICS. The second option is to let the user define which
> operator to use during the check, which, I think, makes the user
> experience much worse in this case. So neither option looks good from
> the user's POV to me, and I don't really know. What do you think about it?
> 
> And should I revert the patchset to the consistent function version then?
> 

Yeah, that's a good point. The various opclasses may support different
operators, and we don't know which "strategy" to fill into the scan key.
Minmax needs BTEqualStrategyNumber, inclusion RTContainsStrategyNumber,
and so on.

I wonder if there's a way to figure this out from the catalogs, but I
can't think of anything. Maybe requiring the user to supply the operator
(and not checking heap if it's not provided)

  SELECT brin_index_check(..., '@>');

would not be too bad ...


Alvaro, any ideas?


-- 
Tomas Vondra





Re: mention unused_oids script in pg_proc.dat

2025-07-08 Thread Florents Tselai
On Tue, Jul 1, 2025 at 12:25 PM Peter Eisentraut 
wrote:

> On 24.05.25 12:34, Florents Tselai wrote:
> >
> >
> >> On 24 May 2025, at 12:24 PM, Álvaro Herrera 
> wrote:
> >>
> >> On 2025-May-24, Florents Tselai wrote:
> >>
> >>> I aggree with having more READMEs around the src tree.
> >>> It’s true that a lot of doc content found in VII. Internals is
> dev-oriented,
> >>> but imho it should be closer to the src (and more succinct/less
> verbose too).
> >>
> >> Maybe these READMEs can simply contain little more than links to the
> >> built HTML docs, to avoid having multiple sources of info.
> >>
> >>> Looking at some of the text in 67.2.2. OID Assignment for example,
> >>> a few things look outdated wrt choosing OIDs.
> >>>
> >>> One can look through the history of similar commits to see what
> >>> needs to be changed; So it’s not THAT bad.
> >>
> >> Sure, outdated docs is something that happens.  Patches welcome.
> >>
> >
> > Something like this ?
>
>  > diff --git a/src/include/catalog/README b/src/include/catalog/README
>  > new file mode 100644
>  > index 000..5bd81c6d86c
>  > --- /dev/null
>  > +++ b/src/include/catalog/README
>  > @@ -0,0 +1,11 @@
>  > +# Catalog
>  > +
>  > +For more details see https://www.postgresql.org/docs/current/bki.html
>
> I can see that having a README in src/include/catalog/ would be useful.
> That directory contains some "unusual" files, and some hints about them
> would be good.  Before commit 372728b0d49, we used to have a README in
> src/backend/catalog/README, but then this was moved to the SGML
> documentation.  Maybe you can read up there why this was chosen.  But I
> think src/include/catalog/ would be a better location in any case.
>

In practice, people go to the git history to see the files touched by
similar commits and devise a to-do list for their case, so I think it
helps them a lot to see some high-level READMEs waiting for them there.

In this particular case, unused_oids isn't included in any commit, so you
can't even be aware of its existence.


> Maybe for now just a link like you show would be ok.  (But it should
> probably be s/current/devel/.)
>

Done


>
>  > +
>  > +Below are some checklists for common scenarios.
>  > +
>  > +## Adding a New Built-in SQL Command
>  > +
>  > +1. Implement the function in C and place it in the appropriate
> directory under `src` (typically in `src/utils/adt`)
>  > +2. Each function should have a unique integer OID. Run the script
> `src/include/catalog/unused_oids` to find available ranges.
>  > +3. Add the entry to `pg_proc.dat` following the surrounding
> conventions.
>
> Presumably, if you're developing this kind of thing, you are already
> reading some other part of the documentation, maybe xfunc.sgml?  Do we
> need to add more information there, or some links?
>

I agree. Added a link reference to xfunc-c.

Some basic stuff that was missing in xfunc I've already added in 86e4efc.
For now I think it's good enough.


v2-0001-Add-catalog-README.patch
Description: Binary data


Re: Using failover slots for PG-non_PG logical replication

2025-07-08 Thread Ashutosh Bapat
On Tue, Jul 8, 2025 at 4:03 PM shveta malik  wrote:
>
> On Mon, Jul 7, 2025 at 7:00 PM Ashutosh Bapat
>  wrote:
> >
> > On Mon, Jul 7, 2025 at 9:53 AM shveta malik  wrote:
> > >
> > > Thanks for the patch.
> > >
> > > > I couldn't figure out whether the query on primary to fetch all the
> > > > slots to be synchronized should filter based on invalidation_reason
> > > > and conflicting or not. According to synchronize_slots(), it seems
> > > > that we retain invalidated slots on standby when failover = true and
> > > > they would remain with synced = true on standby. Is that right?
> > > >
> > >
> > > Yes, that’s correct. We sync the invalidation status of replication
> > > slots from the primary to the standby and then stop synchronizing any
> > > slots that have been marked as invalidated, retaining synced flag as
> > > true. IMO, there's no need to filter out conflicting slots on the
> > > primary, because instead of excluding them there and showing
> > > everything as failover-ready on the standby, the correct approach is
> > > to reflect the actual state on standby. This means conflicting slots
> > > will appear as non-failover-ready on the standby. That’s why Step 3
> > > also considers conflicting flag in its evaluation.
> >
> > Thanks for the explanation. WFM.
> >
> > If a slot is invalidated because RS_INVAL_WAL_REMOVED, conflicting =
> > false, but the slot is not useful standby. But then it's not useful on
> > primary as well. Is that why we are not including (invalidation_reason
> > IS NOT NULL) condition along in (synced AND NOT temporary AND NOT
> > conflicting)?
>
> Thanks for bringing it up. Even if the slot is not useful on the
> primary node as well, we shall still show failover-ready as false on
> standby. We should modify the query of step3 to check
> 'invalidation_reason IS NULL' instead of  'NOT conflicting'. That will
> cover all the cases where the slot  is invalidated and thus not
> failover ready.

Thanks for the confirmation.

>
> > >
> > > 1)
> > > +/* primary # */ SELECT array_agg(quote_literal(r.slot_name)) AS slots
> > > +   FROM pg_replication_slots r
> > > +   WHERE r.failover AND NOT r.temporary;
> > >
> > > On primary, there is no  need to check temporary-status. We do not
> > > allow setting failover as true for temporary slots.
> >
> > Why does synchronize_slots() has it in its query?
> >
>
> It is not needed but no harm in maintaining it.
> If we want documents to be in sync with code, we can have 'not
> temporary' check in doc as well.
>

I think it's better to keep the code and the doc in sync otherwise we
developers would get confused.

> > >
> > > 2)
> > > Although not directly related to the concerns addressed in the given
> > > patch, I think it would be helpful to add a note in the original doc
> > > stating that Steps 1 and 2 should be executed on each subscriber node
> > > that will be served by the standby after failover.
> >
> > There's a bit of semantic repetition here since an earlier paragraph
> > mentions a given subscriber. But I think overall it's better to be
> > more clear than being less clear.
> > >
> > > I have attached a top-up patch with the above changes and a few more
> > > trivial changes. Please include it if you find it okay.
> >
> > Thanks. Included. I have made a few edits and included them in the
> > attached patch.
> >
>
> Thanks. The existing changes (originally targeted in this patch) look
> good to me.
>
> I have attached a top-up patch for step-3 correction. Please include
> if you find it okay to be fixed in the same patch, otherwise we can
> handle it separately.

I have split your top up patch into 2 - one related to the document
change being the subject of this thread and the other for fixing the
query. Committer may squash the patch, if they think so.



--
Best Wishes,
Ashutosh Bapat
From fd0b6dca15b8455597ce4e7ae7e6802335e99b69 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Fri, 4 Jul 2025 19:35:47 +0530
Subject: [PATCH 1/2] Clarify logical replication failover document

Clarify that the existing steps in logical replication failover section
are meant for a given subscriber. Also clarify that the first two steps
are for a PostgreSQL subscriber. Add a separate paragraph to specify the
query to be used to make sure that all the replication slots have been
synchronized to the required standby in case of a planned failover.

Author: Ashutosh Bapat 
Author: Shveta Malik 
Discussion: https://www.postgresql.org/message-id/CAExHW5uiZ-fF159=jwbwpmbjzezdtmctbn+hd4mrurlcg2u...@mail.gmail.com
---
 doc/src/sgml/logical-replication.sgml | 40 ---
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index f317ed9c50e..7b3558ff248 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -709,8 +709,8 @@ HINT:  To initiate replication, you must manually create the

Re: Adding basic NUMA awareness

2025-07-08 Thread Tomas Vondra
On 7/8/25 05:04, Andres Freund wrote:
> Hi,
> 
> On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
>> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra  wrote:
>>> I don't think the splitting would actually make some things simpler, or
>>> maybe more flexible - in particular, it'd allow us to enable huge pages
>>> only for some regions (like shared buffers), and keep the small pages
>>> e.g. for PGPROC. So that'd be good.
>>
>> You have made assumption that this is good, but small pages (4KB) are
>> not hugetlb, and are *swappable* (Transparent HP are swappable too,
>> manually allocated ones as with mmap(MMAP_HUGETLB) are not)[1]. The
>> most frequent problem I see these days are OOMs, and it makes me
>> believe that making certain critical parts of shared memory being
>> swappable just to make pagesize granular is possibly throwing the baby
>> out with the bathwater. I'm thinking about bad situations like: some
>> wrong settings of vm.swapiness that people keep (or distros keep?) and
>> general inability of PG to restrain from allocating more memory in
>> some cases.
> 
> The reason it would be advantageous to put something like the procarray onto
> smaller pages is that otherwise the entire procarray (unless particularly
> large) ends up on a single NUMA node, increasing the latency for backends on
> every other numa node and increasing memory traffic on that node.
> 

That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple places accessing the entries,
but it surprised me how limited the impact was.

If we could selectively use 4KB pages for parts of the shared memory,
maybe this wouldn't be necessary. But it's not too annoying.

The thing I'm not sure about is how much this actually helps with the
traffic between nodes. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves, there'll be traffic. I don't have any estimate of how often this
happens, e.g. for older tasks.

regards

-- 
Tomas Vondra





Re: A recent message added to pg_upgade

2025-07-08 Thread Amit Kapila
On Tue, Jul 8, 2025 at 4:49 PM Dilip Kumar  wrote:
>
> On Tue, Jul 8, 2025 at 2:29 PM Amit Kapila  wrote:
> >
> > On Tue, Jul 8, 2025 at 11:32 AM Dilip Kumar  wrote:
> > >
> > > On Mon, Jul 7, 2025 at 11:22 PM Tom Lane  wrote:
> > > >
> > >
> > > > There's a bigger picture here, though.  The fundamental thing that
> > > > I find wrong with the current code is that knowledge of and
> > > > responsibility for this max_slot_wal_keep_size hack is spread across
> > > > both pg_upgrade and the server.  It would be better if it were on
> > > > just one side.  Now, unless we want to change that Assert that
> > > > 8bfb231b4 put into InvalidatePossiblyObsoleteSlot(), the server side
> > > > is going to be aware of this decision.  So I'm inclined to think
> > > > that we should silently enforce max_slot_wal_keep_size = -1 in
> > > > binary-upgrade mode in the server's GUC check hook, and then remove
> > > > knowledge of it from pg_upgrade altogether.  Maybe the same for
> > > > idle_replication_slot_timeout, which really has got the same issue
> > > > that we don't want users overriding that choice.
> > >
> > > Yeah this change makes sense,
> > >
> >
> > Agreed.
> >
> > One other idea to achieve similar functionality is that during
> > BinaryUpgrade, avoid removing WAL due to max_slot_wal_keep_size, and
> > also skip InvalidateObsoleteReplicationSlots. The one advantage of
> > such a change is that after this, we can remove Assert in
> > InvalidatePossiblyObsoleteSlot, remove check_hook functions for GUCs
> > max_slot_wal_keep_size and idle_replication_slot_timeout, and remove
> > special settings for these GUCs in pg_upgrade.
>
> Yeah that could also be possible, not sure which one is better though,
> with this idea we will have to put BinaryUpgrade check in KeepLogSeg()
> as well as in InvalidateObsoleteReplicationSlots() whereas forcing the
> GUC to be -1 during binary upgrade we just need check at one place.
>

But OTOH, as mentioned, we can remove all other codes like check_hooks
and probably assert as well.

-- 
With Regards,
Amit Kapila.




Re: Remove redundant comment regarding RelationBuildRowSecurity in relcache.c

2025-07-08 Thread Dean Rasheed
On Mon, 28 Apr 2025 at 16:02, Khan, Tanzeel  wrote:
>
> A comment in relcache.c mentions that RelationBuildRowSecurity
> adds a default-deny policy when no policy exists on table. This
> does not seem to be the case. The default deny policy is added
> later on, inside get_row_security_policies(). Also it mentions that
> there can never be zero policies for a relation which is true
> but here it kind of hints that rd_rsdesc->policies can never be
> null.
>
> Added a patch to remove both of these comments.

Hmm, well I agree that the comment is wrong, but I don't think that
simply deleting it is the answer.

I think that the point the comment is trying to make is that it's
perfectly valid for there to be no policies while relrowsecurity is
true, which is why the code block for policies needs to be different
from the preceding code blocks for rules and triggers.

So perhaps it should say something like this:

/*
 * Re-load the row security policies if the relation has them, since
 * they are not preserved in the cache.  Note that relrowsecurity may
 * be true even when there are no policies defined in pg_policy, in
 * which case RelationBuildRowSecurity will create a RowSecurityDesc
 * with an empty policy list, and rd_rsdesc will always be non-NULL.
 * When the table is queried, if there are no policies granting access
 * to rows in the table, a single always-false, default-deny policy
 * will be applied instead, see get_row_security_policies.
 */
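
For what it's worth, a quick SQL illustration of the state that comment
describes (hypothetical names; run as a superuser so SET ROLE is permitted,
and note the table owner bypasses RLS by default, so the default-deny
behavior is visible to another role):

```sql
CREATE TABLE rls_demo (id int);
INSERT INTO rls_demo VALUES (1), (2);
ALTER TABLE rls_demo ENABLE ROW LEVEL SECURITY;
-- relrowsecurity is now true, but pg_policy has no rows for rls_demo,
-- so rd_rsdesc carries an empty policy list.
CREATE ROLE rls_demo_role;
GRANT SELECT ON rls_demo TO rls_demo_role;
SET ROLE rls_demo_role;
SELECT * FROM rls_demo;  -- 0 rows: the default-deny policy kicks in
RESET ROLE;
```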

Regards,
Dean




Re: What is a typical precision of gettimeofday()?

2025-07-08 Thread Hannu Krosing
On Tue, Jul 8, 2025 at 8:07 PM Tom Lane  wrote:
>
> BTW, returning to the original topic of this thread:
>
> The new exact-delays table from pg_test_timing is really quite
> informative.

Maybe we should collect some of it in the PostgreSQL Wiki for easy reference ?

I had some interesting results with a RISC-V SBC which were similar to
ARM, but seemed to indicate that the chosen "CPU CLOCK" used by the
rdtsc-equivalent instruction was counting in exact multiples of a
nanosecond.

And "the rdtsc-equivalent instruction" on both ARM and RISC-V is
reading a special register which exposes the faked "clock counter".


> For example, on my M4 Macbook:
>
> Observed timing durations up to 99.9900%:
>       ns   % of total  running %      count
>        0      62.2124    62.2124  118127987
>       41      12.5826    74.7951   23891661
>       42      25.1653    99.9604   47783489
>       83       0.0194    99.9797      36744
>       84       0.0096    99.9893      18193
>      125       0.0020    99.9913       3784
> ...
>    31042       0.0000   100.0000          1
>
> The fact that the clock tick is about 40ns is extremely
> obvious in this presentation.
>
> Even more interesting is what I got from an ancient PPC Macbook
> (mamba's host, running NetBSD):
>
> Testing timing overhead for 3 seconds.
> Per loop time including overhead: 731.26 ns
> ...
> Observed timing durations up to 99.9900%:
>       ns   % of total  running %      count
>      705      39.9162    39.9162    1637570
>      706      17.6040    57.5203     722208
>      759      18.6797    76.2000     766337
>      760      23.7851    99.9851     975787
>      813       0.0002    99.9853          9
>      814       0.0004    99.9857         17
>      868       0.0001    99.9858          4
>      922       0.0001    99.9859          3

Do we have a fencepost error in the limit code so that it stops before
printing out the 99.9900% limit row ?

The docs in your latest patch indicated that it prints out the row >=  99.9900%

---
Hannu




Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Álvaro Herrera
On 2025-Jul-08, Hannu Krosing wrote:

> I still think we should go with direct toast tid pointers in varlena
> and not some kind of oid.

I think this can be made to work, as long as we stop seeing the toast
table just like a normal heap table containing normal tuples.  A lot to
reimplement though -- vacuum in particular.  Maybe it can be thought of
as a new table AM.  Not an easy project, I reckon.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: What is a typical precision of gettimeofday()?

2025-07-08 Thread Tom Lane
Andrey Borodin  writes:
>> On 8 Jul 2025, at 23:07, Tom Lane  wrote:
>> The fact that the clock tick is about 40ns is extremely 
>> obvious in this presentation.

> FWIW while working on UUID v7 Masahiko found that if we try to read real time 
> with clock_gettime(CLOCK_REALTIME,) measurement is truncated to microseconds.

Yeah.  Bear in mind that instr_time.h uses CLOCK_MONOTONIC_RAW on
macOS, so that's what we're looking at here.

regards, tom lane




Re: C11 / VS 2019

2025-07-08 Thread Tom Lane
Andrew Dunstan  writes:
> It's done and running. Testing before I re-enabled the animal shows
> it was happy.

In the no-good-deed-goes-unpunished department ... drongo is now spewing
a boatload of these warnings:

C:\\Program Files (x86)\\Windows 
Kits\\10\\include\\10.0.18362.0\\um\\winbase.h(9305): warning C5105: macro 
expansion producing 'defined' has undefined behavior

Looks like this comes out once per .c file -- probably it's
in an inclusion from .  Dunno if there's anything
we can do but ignore it.  I wonder though why we have not seen
this on other buildfarm animals.

regards, tom lane




Re: Horribly slow pg_upgrade performance with many Large Objects

2025-07-08 Thread Nathan Bossart
On Sun, Jul 06, 2025 at 02:48:08PM +0200, Hannu Krosing wrote:
> Did a quick check of the patch and it seems to work ok.

Thanks for taking a look.

> What do you think of the idea of not dumping pg_shdepend here, but
> instead adding the required entries after loading
> pg_largeobject_metadata based on the contents of it ?

While not dumping it might save a little space during upgrade, the query
seems to be extremely slow.  So, I don't see any strong advantage.

-- 
nathan




Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Michael Paquier
On Tue, Jul 08, 2025 at 12:58:26PM -0400, Burd, Greg wrote:
> All that aside, I think you're right to tackle this one step at a
> time and try not to boil too much of the ocean at once (the patch
> set is already large enough).  With that in mind I've read once or
> twice over your changes and have a few basic comments/questions. 
> 
> 
> v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid
> 
> This set of changes make sense and as you say are mechanical in
> nature, no real comments other than I think that using uint64 rather
> than Oid is the right call and addresses #2 on Tom's list.
> 
> 
> v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code
> 
> I like this as well, clarifies the code and reduces repetition.

Thanks.  These are independent pieces if you want to link the code
less to TOAST, assuming that an area of 8 bytes would be good enough
for any TOAST "value" concept.  TIDs were mentioned as well on a
different part of the thread, ItemPointerData is 6 bytes.

> v2-0003 Refactor external TOAST pointer code for better pluggability
> 
> + * For now there are only two types, all defined in this file. For now this
> + * is the maximum value of vartag_external, which is a historical choice.
> 
> This provides a bridge for compatibility, but doesn't open the door
> to a truly pluggable API.  I'm guessing the goal is incremental
> change rather than wholesale rewrite.

Nope, it does not introduce a pluggable thing, but it does untangle
the fact that one needs to change 15-ish code paths when they want to
add a new type of external TOAST pointer, without showing an actual
impact AFAIK when we insert a TOAST value or fetch it, as long as we
know that we're dealing with an on-disk thing that requires an
external lookup.

> + * The different kinds of on-disk external TOAST pointers. divided by
> + * vartag_external.
> 
> Extra '.' in "TOAST pointers. divided" I'm guessing.

Indeed, thanks.

> v2-0007 Introduce global 64-bit TOAST ID counter in control file
> 
> Do you have any concern that this might become a bottleneck when
> there are many relations and many backends all contending for a new
> id?  I'd imagine that this would show up in a flame graph, but I
> think your test focused on the read side detoast_attr_slice() rather
> than insert/update and contention on the shared counter.  Would this
> be even worse on NUMA systems?

That may be possible, see below.

> Thanks for the flame graphs examining a heavy detoast_attr_slice()
> workload.  I agree that there is little or no difference between
> them which is nice.

Cool.  Yes.  I was wondering why detoast_attr_slice() does not show up
in the profile on HEAD, perhaps it just got optimized away (I was
under -O2 for these profiles).

> I think the only call out Tom made [4] that isn't addressed was the
> ask for localized ID selection.  That may make sense at some point,
> especially if there is contention on GetNewToastId().  I think that
> case is worth a separate performance test, something with a large
> number of relations and backends all performing a lot of updates
> generating a lot of new IDs.  What do you think?

Yeah, I need to do more benchmark for the int8 part, I was holding on
such evaluations because this part of the patch does not fly if we
don't do the refactoring pieces first.  Anyway, I cannot get excited
about the extra workload that this would require in the catalogs,
because we would need one TOAST sequence tracked in there, linked to
the TOAST relation so it would not be free.  Or we invent a new
facility just for this purpose, meaning that we get even further away
from being able to resolve the original problem with the values and
compression IDs.  We're talking about two instructions.  Well, I guess
that we could optimize it more with some atomics or even cache a range of
values to save on ToastIdGenLock acquisitions in a single backend.  I
suspect that the bottleneck is going to be the insertion of the TOAST
entries in toast_save_datum() anyway with the check for conflicting
values, even if your relation is unlogged or running-on-scissors in
memory.

> [2] "Yes, the idea is to put the tid pointer directly in the varlena
> external header and have a tid array in the toast table as an extra
> column. If all of the TOAST fits in the single record, this will be
> empty, else it will have an array of tids for all the pages for this
> toasted field." - Hannu Krosing in an email to me after
> PGConf.dev/2025

Sure, you could do that as well, but I suspect that we'll need the
steps of at least up to 0003 to be able to handle more easily multiple
external TOAST pointer types, or the code will be messier than it
currently is.  :D
--
Michael


signature.asc
Description: PGP signature


Re: mention unused_oids script in pg_proc.dat

2025-07-08 Thread Michael Paquier
On Tue, Jul 08, 2025 at 05:40:39PM +0300, Florents Tselai wrote:
> On Tue, Jul 1, 2025 at 12:25 PM Peter Eisentraut  wrote:
>>> +
>>> +Below are some checklists for common scenarios.
>>> +
>>> +## Adding a New Built-in SQL Command
>>> +
>>> +1. Implement the function in C and place it in the appropriate
>> directory under `src` (typically in `src/utils/adt`)
>>> +2. Each function should have a unique integer OID. Run the script
>> `src/include/catalog/unused_oids` to find available ranges.
>>> +3. Add the entry to `pg_proc.dat` following the surrounding
>> conventions.
>>
>> Presumably, if you're developing this kind of thing, you are already
>> reading some other part of the documentation, maybe xfunc.sgml?  Do we
>> need to add more information there, or some links?
> 
> I agree. Added a link reference to that xfunc-c

I understand Peter's comment that we need something more general than
what your v2-0001 is proposing, the point being that modifying
pg_proc.dat may apply to some concepts, but it's far from being all of
them.  If you need to add a new in-core type, or something else to a
different catalog, then you need to be aware of a completely different
.dat file.  This set of instructions reads as being a bit misleading,
IMO.
--
Michael


signature.asc
Description: PGP signature


Re: A recent message added to pg_upgade

2025-07-08 Thread Dilip Kumar
On Wed, Jul 9, 2025 at 9:07 AM Dilip Kumar  wrote:
>
> On Tue, Jul 8, 2025 at 5:24 PM Amit Kapila  wrote:
> >
> > On Tue, Jul 8, 2025 at 4:49 PM Dilip Kumar  wrote:
> > >
> > > On Tue, Jul 8, 2025 at 2:29 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Jul 8, 2025 at 11:32 AM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > On Mon, Jul 7, 2025 at 11:22 PM Tom Lane  wrote:
> > > > > >
> > > > >
> > > > > > There's a bigger picture here, though.  The fundamental thing that
> > > > > > I find wrong with the current code is that knowledge of and
> > > > > > responsibility for this max_slot_wal_keep_size hack is spread across
> > > > > > both pg_upgrade and the server.  It would be better if it were on
> > > > > > just one side.  Now, unless we want to change that Assert that
> > > > > > 8bfb231b4 put into InvalidatePossiblyObsoleteSlot(), the server side
> > > > > > is going to be aware of this decision.  So I'm inclined to think
> > > > > > that we should silently enforce max_slot_wal_keep_size = -1 in
> > > > > > binary-upgrade mode in the server's GUC check hook, and then remove
> > > > > > knowledge of it from pg_upgrade altogether.  Maybe the same for
> > > > > > idle_replication_slot_timeout, which really has got the same issue
> > > > > > that we don't want users overriding that choice.
> > > > >
> > > > > Yeah this change makes sense,
> > > > >
> > > >
> > > > Agreed.
> > > >
> > > > One other idea to achieve similar functionality is that during
> > > > BinaryUpgrade, avoid removing WAL due to max_slot_wal_keep_size, and
> > > > also skip InvalidateObsoleteReplicationSlots. The one advantage of
> > > > such a change is that after this, we can remove Assert in
> > > > InvalidatePossiblyObsoleteSlot, remove check_hook functions for GUCs
> > > > max_slot_wal_keep_size and idle_replication_slot_timeout, and remove
> > > > special settings for these GUCs in pg_upgrade.
> > >
> > > Yeah that could also be possible, not sure which one is better though,
> > > with this idea we will have to put BinaryUpgrade check in KeepLogSeg()
> > > as well as in InvalidateObsoleteReplicationSlots() whereas forcing the
> > > GUC to be -1 during binary upgrade we just need check at one place.
> > >
> >
> > But OTOH, as mentioned, we can remove all other codes like check_hooks
> > and probably assert as well.
> >
>
> After further consideration, I believe your proposed method is
> superior to forcing the max_slot_wal_keep_size to -1 via a check hook.
> The ultimate goal is to prevent WAL removal during a binary upgrade,
> and your approach directly addresses this issue rather than
> controlling it by forcing the GUC value.  I am planning to send a
> patch using this approach for both max_slot_wal_keep_size as well as
> for idle_replication_slot_timeout.

PFA, with this approach.

-- 
Regards,
Dilip Kumar
Google


v2-0001-Better-way-to-prevent-wal-removal-and-slot-invali.patch
Description: Binary data


Re: Reduce "Var IS [NOT] NULL" quals during constant folding

2025-07-08 Thread Richard Guo
On Mon, Jun 30, 2025 at 4:26 PM Richard Guo  wrote:
> On Wed, May 28, 2025 at 6:28 PM Richard Guo  wrote:
> > This patchset does not apply anymore due to 2c0ed86d3.  Here is a new
> > rebase.

> This patchset does not apply anymore, due to 5069fef1c this time.
> Here is a new rebase.

Here is a new rebase.  I moved the call to preprocess_relation_rtes to
a later point within convert_EXISTS_sublink_to_join, so we can avoid
the work if it turns out that the EXISTS SubLink cannot be flattened.
Nothing essential has changed.

The NOT-IN pullup work depends on the changes in this patchset (it
also relies on the not-null information), so I'd like to move it
forward.

Hi Tom, Robert -- just to be sure, are you planning to take another
look at it?

Thanks
Richard


v7-0001-Expand-virtual-generated-columns-before-sublink-p.patch
Description: Binary data


v7-0002-Centralize-collection-of-catalog-info-needed-earl.patch
Description: Binary data


v7-0003-Reduce-Var-IS-NOT-NULL-quals-during-constant-fold.patch
Description: Binary data


Re: A recent message added to pg_upgade

2025-07-08 Thread Amit Kapila
On Tue, Jul 8, 2025 at 11:32 AM Dilip Kumar  wrote:
>
> On Mon, Jul 7, 2025 at 11:22 PM Tom Lane  wrote:
> >
>
> > There's a bigger picture here, though.  The fundamental thing that
> > I find wrong with the current code is that knowledge of and
> > responsibility for this max_slot_wal_keep_size hack is spread across
> > both pg_upgrade and the server.  It would be better if it were on
> > just one side.  Now, unless we want to change that Assert that
> > 8bfb231b4 put into InvalidatePossiblyObsoleteSlot(), the server side
> > is going to be aware of this decision.  So I'm inclined to think
> > that we should silently enforce max_slot_wal_keep_size = -1 in
> > binary-upgrade mode in the server's GUC check hook, and then remove
> > knowledge of it from pg_upgrade altogether.  Maybe the same for
> > idle_replication_slot_timeout, which really has got the same issue
> > that we don't want users overriding that choice.
>
> Yeah this change makes sense,
>

Agreed.

One other idea to achieve similar functionality is that during
BinaryUpgrade, avoid removing WAL due to max_slot_wal_keep_size, and
also skip InvalidateObsoleteReplicationSlots. The one advantage of
such a change is that after this, we can remove Assert in
InvalidatePossiblyObsoleteSlot, remove check_hook functions for GUCs
max_slot_wal_keep_size and idle_replication_slot_timeout, and remove
special settings for these GUCs in pg_upgrade.
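
A minimal sketch of that idea, in case it helps to visualize it (only
IsBinaryUpgrade, KeepLogSeg() and InvalidateObsoleteReplicationSlots()
are existing names; the helper and its placement are simplifications,
not the actual patch):

/* In KeepLogSeg(): never let max_slot_wal_keep_size trim WAL that
 * replication slots still need while running under pg_upgrade. */
static void
keep_log_seg_sketch(XLogSegNo *logSegNo)
{
	if (max_slot_wal_keep_size_mb >= 0 && !IsBinaryUpgrade)
		apply_keep_size_cap(logSegNo);	/* placeholder for the existing logic */
}

/* And at the checkpoint call site, skip slot invalidation entirely:
 *
 *	if (!IsBinaryUpgrade)
 *		InvalidateObsoleteReplicationSlots(...);
 */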

-- 
With Regards,
Amit Kapila.




Re: Explicitly enable meson features in CI

2025-07-08 Thread Nazir Bilal Yavuz
Hi,

On Wed, 2 Jul 2025 at 10:22, Nazir Bilal Yavuz  wrote:
>
> Also, libcurl is disabled in the OpenBSD CI task until the fix in this
> thread [1] is committed.

This fix is committed in 7376e60854 so libcurl is enabled for OpenBSD in v4.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 4cbdc65f0f473754d425b3824f22c88b86d78492 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 2 Jul 2025 18:07:51 +0300
Subject: [PATCH v4 1/3] ci: Remove PG_TEST_EXTRA from NetBSD and OpenBSD

PG_TEST_EXTRA environment variable is already set at the top level.
NetBSD and OpenBSD tasks will use this top level PG_TEST_EXTRA as
default.

Discussion: https://www.postgresql.org/message-id/flat/CAN55FZ0aO8d_jkyRijcGP8qO%3DXH09qG%3Dpw0ZZDvB4LMzuXYU1w%40mail.gmail.com
---
 .cirrus.tasks.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 92057006c93..a7f75c5b59f 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -318,7 +318,6 @@ task:
 --pkg-config-path ${PKGCONFIG_PATH} \
 -Dcassert=true -Dinjection_points=true \
 -Dssl=openssl ${UUID} ${TCL} \
--DPG_TEST_EXTRA="$PG_TEST_EXTRA" \
 build
 EOF
 
-- 
2.49.0

From 0e9ee191aab8c5d71569ee00831e9fff6f1225b1 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 25 Jun 2025 11:14:26 +0300
Subject: [PATCH v4 2/3] ci: meson: Explicitly enable meson features

By default, Meson silently disables features it cannot detect, which can
lead to incomplete builds. Explicitly enabling these features causes the
build to fail if they are missing, making such issues visible early.

Suggested-by: Jacob Champion 
Reviewed-by: Peter Eisentraut 
Discussion: https://www.postgresql.org/message-id/flat/CAN55FZ0aO8d_jkyRijcGP8qO%3DXH09qG%3Dpw0ZZDvB4LMzuXYU1w%40mail.gmail.com
---
 .cirrus.tasks.yml | 108 ++
 1 file changed, 90 insertions(+), 18 deletions(-)

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index a7f75c5b59f..e8f5fee2627 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -31,6 +31,25 @@ env:
   TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
   PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance oauth
 
+  # Meson feature flags, shared between all meson tasks except the
+  # 'SanityCheck' and 'Windows - VS' tasks
+  MESON_COMMON_FEATURES: >-
+-Dauto_features=disabled
+-Dldap=enabled
+-Dssl=openssl
+-Dtap_tests=enabled
+-Dplperl=enabled
+-Dplpython=enabled
+-Ddocs=enabled
+-Dicu=enabled
+-Dlibxml=enabled
+-Dlibxslt=enabled
+-Dlz4=enabled
+-Dpltcl=enabled
+-Dreadline=enabled
+-Dzlib=enabled
+-Dzstd=enabled
+
 
 # What files to preserve in case tests fail
 on_failure_ac: &on_failure_ac
@@ -164,6 +183,15 @@ task:
   -c debug_parallel_query=regress
 PG_TEST_PG_UPGRADE_MODE: --link
 
+MESON_FEATURES: >-
+  -Ddtrace=enabled
+  -Dgssapi=enabled
+  -Dlibcurl=enabled
+  -Dnls=enabled
+  -Dpam=enabled
+  -Dtcl_version=tcl86
+  -Duuid=bsd
+
   <<: *freebsd_task_template
 
   depends_on: SanityCheck
@@ -198,8 +226,8 @@ task:
   meson setup \
 --buildtype=debug \
 -Dcassert=true -Dinjection_points=true \
--Duuid=bsd -Dtcl_version=tcl86 -Ddtrace=auto \
 -Dextra_lib_dirs=/usr/local/lib -Dextra_include_dirs=/usr/local/include/ \
+${MESON_COMMON_FEATURES} ${MESON_FEATURES} \
 build
 EOF
   build_script: su postgres -c 'ninja -C build -j${BUILD_JOBS} ${MBUILD_TARGET}'
@@ -269,6 +297,12 @@ task:
 LC_ALL: "C"
 # -Duuid is not set for the NetBSD, see the comment below, above
 # configure_script, for more information.
+MESON_FEATURES: >-
+  -Dgssapi=enabled
+  -Dlibcurl=enabled
+  -Dnls=enabled
+  -Dpam=enabled
+
   setup_additional_packages_script: |
 #pkgin -y install ...
   <<: *netbsd_task_template
@@ -279,8 +313,13 @@ task:
 OS_NAME: openbsd
 IMAGE_FAMILY: pg-ci-openbsd-postgres
 PKGCONFIG_PATH: '/usr/lib/pkgconfig:/usr/local/lib/pkgconfig'
-UUID: -Duuid=e2fs
-TCL: -Dtcl_version=tcl86
+
+MESON_FEATURES: >-
+  -Dbsd_auth=enabled
+  -Dlibcurl=enabled
+  -Dtcl_version=tcl86
+  -Duuid=e2fs
+
   setup_additional_packages_script: |
 #pkg_add -I ...
   # Always core dump to ${CORE_DUMP_DIR}
@@ -317,7 +356,7 @@ task:
 --buildtype=debugoptimized \
 --pkg-config-path ${PKGCONFIG_PATH} \
 -Dcassert=true -Dinjection_points=true \
--Dssl=openssl ${UUID} ${TCL} \
+${MESON_COMMON_FEATURES} ${MESON_FEATURES} \
 build
 EOF
 
@@ -364,10 +403,6 @@ LINUX_CONFIGURE_FEATURES: &LINUX_CONFIGURE_FEATURES >-
   --with-uuid=ossp
   --with-zstd
 
-LINUX_MESON_FEATURES: &LINUX_MESON_FEATURES >-
-  -Dllvm=enabled
-  -Duuid=e2fs
-
 
 # Check SPECIAL in the matrix: below
 task:
@

Re: C11 / VS 2019

2025-07-08 Thread Andrew Dunstan



On 2025-07-08 Tu 3:45 PM, Tom Lane wrote:

Andrew Dunstan  writes:

It's done and running. Testing before I re-enabled the animal it shows
it was happy.

In the no-good-deed-goes-unpunished department ... drongo is now spewing
a boatload of these warnings:

C:\\Program Files (x86)\\Windows 
Kits\\10\\include\\10.0.18362.0\\um\\winbase.h(9305): warning C5105: macro 
expansion producing 'defined' has undefined behavior

Looks like this comes out once per .c file -- probably it's
in an inclusion from .  Dunno if there's anything
we can do but ignore it.  I wonder though why we have not seen
this on other buildfarm animals.





*sigh*


will investigate.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: like pg_shmem_allocations, but fine-grained for DSM registry ?

2025-07-08 Thread Nathan Bossart
Here is what I have staged for commit, which I'm planning to do tomorrow.

-- 
nathan
>From eae91e016509605630616a314a8ee2eaa1113b15 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 8 Jul 2025 14:52:36 -0500
Subject: [PATCH v6 1/1] Introduce pg_dsm_registry_allocations view.

This commit adds a new system view that provides information about
entries in the dynamic shared memory (DSM) registry.  Specifically,
it returns the name, type, and size of each entry.  Note that since
we cannot discover the size of dynamic shared memory areas (DSAs)
and hash tables backed by DSAs (dshashes) without first attaching
to them, the size column is left as NULL for those.

XXX: NEEDS CATVERSION BUMP

Author: Florents Tselai 
Reviewed-by: Sungwoo Chang 
Discussion: https://postgr.es/m/4D445D3E-81C5-4135-95BB-D414204A0AB4%40gmail.com
---
 doc/src/sgml/system-views.sgml| 74 +++
 src/backend/catalog/system_views.sql  |  8 ++
 src/backend/storage/ipc/dsm_registry.c| 49 
 src/include/catalog/pg_proc.dat   |  7 ++
 .../expected/test_dsm_registry.out| 18 +
 .../sql/test_dsm_registry.sql |  7 ++
 src/test/regress/expected/privileges.out  | 15 +++-
 src/test/regress/expected/rules.out   |  4 +
 src/test/regress/sql/privileges.sql   |  5 +-
 9 files changed, 185 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index e1ac544ee40..d3ff8c35738 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -81,6 +81,11 @@
   open cursors
  
 
+ 
+  pg_dsm_registry_allocations
+  shared memory allocations tracked in the DSM registry
+ 
+
  
   pg_file_settings
   summary of configuration file contents
@@ -1086,6 +1091,75 @@ AND c1.path[c2.level] = c2.path[c2.level];
 
  
 
+ 
+  pg_dsm_registry_allocations
+
+  
+   pg_dsm_registry_allocations
+  
+
+  
+   The pg_dsm_registry_allocations view shows shared
+   memory allocations tracked in the dynamic shared memory (DSM) registry.
+   This includes memory allocated by extensions using the mechanisms detailed
+   in .
+  
+
+  
+   pg_dsm_registry_allocations Columns
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   name text
+  
+  
+   The name of the allocation in the DSM registry.
+  
+ 
+
+ 
+  
+   type text
+  
+  
+   The type of allocation.  Possible values are segment,
+   area, and hash, which correspond
+   to dynamic shared memory segments, areas, and hash tables, respectively.
+  
+ 
+
+ 
+  
+   size int8
+  
+  
+   Size of the allocation in bytes.  NULL for entries of type
+   area and hash.
+  
+ 
+
+   
+  
+
+  
+   By default, the pg_dsm_registry_allocations view
+   can be read only by superusers or roles with privileges of the
+   pg_read_all_stats role.
+  
+ 
+
  
   pg_file_settings
 
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index e5dbbe61b81..b2d5332effc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -666,6 +666,14 @@ GRANT SELECT ON pg_shmem_allocations_numa TO 
pg_read_all_stats;
 REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
 
+CREATE VIEW pg_dsm_registry_allocations AS
+SELECT * FROM pg_get_dsm_registry_allocations();
+
+REVOKE ALL ON pg_dsm_registry_allocations FROM PUBLIC;
+GRANT SELECT ON pg_dsm_registry_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_dsm_registry_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_dsm_registry_allocations() TO 
pg_read_all_stats;
+
 CREATE VIEW pg_backend_memory_contexts AS
 SELECT * FROM pg_get_backend_memory_contexts();
 
diff --git a/src/backend/storage/ipc/dsm_registry.c 
b/src/backend/storage/ipc/dsm_registry.c
index 828c2ff0c7f..1682cc6d34c 100644
--- a/src/backend/storage/ipc/dsm_registry.c
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -40,10 +40,12 @@
 
 #include "postgres.h"
 
+#include "funcapi.h"
 #include "lib/dshash.h"
 #include "storage/dsm_registry.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+#include "utils/builtins.h"
 #include "utils/memutils.h"
 
 #define DSMR_NAME_LEN  128
@@ -88,6 +90,13 @@ typedef enum DSMREntryType
DSMR_ENTRY_TYPE_DSH,
 } DSMREntryType;
 
+static const char *const DSMREntryTypeNames[] =
+{
+   [DSMR_ENTRY_TYPE_DSM] = "segment",
+   [DSMR_ENTRY_TYPE_DSA] = "area",
+   [DSMR_ENTRY_TYPE_DSH] = "hash",
+};
+
 typedef struct DSMRegistryEntry
 {
charname[DSMR_NAME_LEN];
@@ -435,3 +444,43 @@ GetNamedDSHash(const char *name, const dshash_parameter

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Nikita Malakhov
Hi,

Hannu, we have some thoughts on direct tid storage;
it was done some time ago by another developer,
so I have to look. I'll share it as soon as I find it, if you
are interested.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


IO worker crash in test_aio/002_io_workers

2025-07-08 Thread Andres Freund
Hi,

Tomas reported off-list (he hit it with a WIP patch applied) a crash he had been
seeing in test_aio/002_io_workers.  While I can't reproduce it as easily as he
can, I hit it after a few hundred iterations of test_aio/002_io_workers.

The one time I hit it I had forgotten to enable core dumps in the relevant
terminal, so I don't have a core dump yet. Tomas had reported this:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0095e638 in SetLatch (latch=0x0) at latch.c:304
304if (latch->is_set)
(gdb) bt
#0  0x0095e638 in SetLatch (latch=0x0) at latch.c:304
#1  0x0093ed0f in IoWorkerMain (startup_data=0x0, startup_data_len=0) 
at method_worker.c:499
#2  0x008ada1e in postmaster_child_launch (child_type=B_IO_WORKER, 
child_slot=230, startup_data=0x0, startup_data_len=0, client_sock=0x0) at 
launch_backend.c:290
#3  0x008b4011 in StartChildProcess (type=B_IO_WORKER) at 
postmaster.c:3973
#4  0x008b4848 in maybe_adjust_io_workers () at postmaster.c:4404
#5  0x008b0bee in PostmasterMain (argc=4, argv=0x115570a0) at 
postmaster.c:1382
#6  0x00765082 in main (argc=4, argv=0x115570a0) at main.c:227


Staring at the code for a while I think I see the problem:  If a worker was
idle at the time it exited, it is not removed from the set of idle
workers. Which in turn means that pgaio_choose_idle_worker() in

/* Got one.  Clear idle flag. */
io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << 
MyIoWorkerId);

/* See if we can wake up some peers. */
nwakeups = Min(pgaio_worker_submission_queue_depth(),
   IO_WORKER_WAKEUP_FANOUT);
for (int i = 0; i < nwakeups; ++i)
{
if ((worker = pgaio_choose_idle_worker()) < 0)
break;
latches[nlatches++] = 
io_worker_control->workers[worker].latch;
}

can return a worker that's actually not currently running and thus does not
have a latch set.


I suspect the reason that this was hit with Tomas' patch is that it adds use
of streaming reads to index scans, and thus makes it plausible at all to hit
AIO in the path.  Tomas reported a 25% failure rate, but for me, even with his
patch applied, it is more like 0.1% or so. Not sure what explains that.


Maybe 002_io_workers should run AIO while starting / stopping workers to make
it easier to hit problems like that?


The fix seems relatively clear - unset the bit in pgaio_worker_die(). And add
an assertion in pgaio_choose_idle_worker() ensuring that the worker "slot" is
in use.
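
For illustration, a minimal sketch of that fix (idle_worker_mask,
workers[].latch and MyIoWorkerId are from the code quoted above; the
lock name and the "in_use" flag are assumptions, not the actual
method_worker.c code):

static void
pgaio_worker_die(int code, Datum arg)
{
	/* assumes this lock protects idle_worker_mask and the worker slots */
	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);

	/* clear our idle bit, in case we exit while marked idle */
	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);

	/* release the slot and its latch pointer */
	io_worker_control->workers[MyIoWorkerId].latch = NULL;
	io_worker_control->workers[MyIoWorkerId].in_use = false;

	LWLockRelease(AioWorkerControlLock);
}

and in pgaio_choose_idle_worker(), after picking a worker:

	Assert(io_worker_control->workers[worker].in_use);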

Greetings,

Andres Freund




Re: IO worker crash in test_aio/002_io_workers

2025-07-08 Thread Thomas Munro
On Wed, Jul 9, 2025 at 8:45 AM Andres Freund  wrote:
> /* Got one.  Clear idle flag. */
> io_worker_control->idle_worker_mask &= ~(UINT64_C(1) 
> << MyIoWorkerId);
>
> /* See if we can wake up some peers. */
> nwakeups = Min(pgaio_worker_submission_queue_depth(),
>IO_WORKER_WAKEUP_FANOUT);
> for (int i = 0; i < nwakeups; ++i)
> {
> if ((worker = pgaio_choose_idle_worker()) < 0)
> break;
> latches[nlatches++] = 
> io_worker_control->workers[worker].latch;
> }
>
> can return a worker that's actually not currently running and thus does not
> have a latch set.

Ugh, right, thanks.  Annoyingly, I think I had already seen and
understood this while working on the dynamic worker pool sizing
patch[1] which starts and stops workers more often, and that patch of
course had to address that problem, but I somehow failed to spot or
maybe just remember that master needs that change too.  Will fix.

> I suspect the reason that this was hit with Tomas' patch is that it adds use
> of streaming reads to index scans, and thus makes it plausible at all to hit
> AIO in the path.

Cool, been meaning to try that out...

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com




Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Hannu Krosing
chunk_

On Tue, Jul 8, 2025 at 9:37 PM Álvaro Herrera  wrote:
>
> On 2025-Jul-08, Hannu Krosing wrote:
>
> > I still think we should go with direct toast tid pointers in varlena
> > and not some kind of oid.
>
> I think this can be made to work, as long as we stop seeing the toast
> table just like a normal heap table containing normal tuples.  A lot to
> reimplement though -- vacuum in particular.

Non-FULL vacuum should already work. Only commands like VACUUM FULL
and CLUSTER which move tuples around should be disabled on TOAST
tables.

What other parts do you think need re-implementing in addition to
skipping the index lookup part and using the tid directly ?


The fact that per-page chunk_tid arrays also allow tree structures
should give us much more flexibility in implementing in-place-updatable
structured storage in something otherwise very similar to TOAST, but
this is not required for just moving from oid + index ==> tid to using
the tid directly.

I think that having a toast table as a normal table with full MVCC is
actually a good thing, as it can implement the "array element update"
as a real partial update of only the affected parts and not the
current 'copy everything' way of doing this. We already collect the
array element update in the parse tree in a special way; now we just
need to have types that can do the partial update by changing a tid or
two in the chunk_tids array (and adjusting the offsets array if needed).

This should make both
UPDATE t SET theintarray[3] = 5, theintarray[4] = 7 WHERE id  = 1;

and even do partial updates for something like this

hannuk=# select * from jtab;
 id | j
+
  1 | {"a": 3, "b": 2}
  2 | {"c": 1, "d": [10, 20, 3]}
(2 rows)
hannuk=# update jtab SET j['d'][3] = '7' WHERE id = 2;
UPDATE 1
hannuk=# select * from jtab;
 id |   j
+---
  1 | {"a": 3, "b": 2}
  2 | {"c": 1, "d": [10, 20, 3, 7]}
(2 rows)

when the JSON data is so large that the changed part is in its own chunk.

> Maybe it can be thought of
> as a new table AM.  Not an easy project, I reckon.

I would prefer it to be an extension of the current TOAST - just another
varatt_* type - as then you can upgrade to the new storage CONCURRENTLY,
the same way you can currently switch compression methods.
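
Purely as an illustration of the kind of varatt_* variant being talked
about (this struct does not exist; the first three fields mirror the
real varatt_external, the TID array is the hypothetical part):

typedef struct varatt_external_tid
{
	int32		va_rawsize;		/* original data size (includes header) */
	int32		va_extinfo;		/* external size and compression method */
	Oid			va_toastrelid;	/* TOAST relation holding the chunks */
	/* direct pointers to the chunk tuples, replacing valueid + index */
	ItemPointerData va_chunk_tids[FLEXIBLE_ARRAY_MEMBER];
} varatt_external_tid;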

> --
> Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach

2025-07-08 Thread Tomas Vondra



On 7/8/25 18:06, Cédric Villemain wrote:
> 
> 
> 
> 
> 
> 
>> On 7/8/25 03:55, Cédric Villemain wrote:
>>> Hi Andres,
>>>
 Hi,

 On 2025-07-05 07:09:00 +, Cédric Villemain wrote:
> In my work on more careful PostgreSQL resource management, I've come
> to the
> conclusion that we should avoid pushing policy too deeply into the
> PostgreSQL core itself. Therefore, I'm quite skeptical about
> integrating
> NUMA-specific management directly into core PostgreSQL in such a way.

 I think it's actually the opposite - whenever we pushed stuff like this
 outside of core it has hurt postgres substantially. Not having
 replication in
 core was a huge mistake. Not having HA management in core is
 probably the
 biggest current adoption hurdle for postgres.

 To deal better with NUMA we need to improve memory placement and
 various
 algorithms, in an interrelated way - that's pretty much impossible
 to do
 outside of core.
>>>
>>> Except the backend pinning which is easy to achieve, thus my comment on
>>> the related patch.
>>> I'm not claiming NUMA memory and all should be managed outside of core
>>> (though I didn't read other patches yet).
>>>
>>
>> But an "optimal backend placement" seems to very much depend on where we
>> placed the various pieces of shared memory. Which the external module
>> will have trouble following, I suspect.
>>
>> I still don't have any idea what exactly would the external module do,
>> how would it decide where to place the backend. Can you describe some
>> use case with an example?
>>
>> Assuming we want to actually pin tasks from within Postgres, what I
>> think might work is allowing modules to "advise" on where to place the
>> task. But the decision would still be done by core.
> 
> Possibly exactly what you're doing in proc.c when managing allocation of
> process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
> candidates), I didn't get that they require information not available to
> any process executing code from a module.
> 

Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...

> Parts of your code where you assign/define policy could be in one or
> more relevant routines of a "numa profile manager", like in an
> initProcessRoutine(), and registered in pmroutine struct:
> 
> pmroutine = GetPmRoutineForInitProcess();
> if (pmroutine != NULL &&
>     pmroutine->init_process != NULL)
>     pmroutine->init_process(MyProc);
> 
> This way it's easier to manage alternative policies, and also to be able
> to adjust when hardware and linux kernel changes.
> 

I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.

So what would the alternate policy look like? What use case would the
module be supporting?


regards

-- 
Tomas Vondra





Re: Non-text mode for pg_dumpall

2025-07-08 Thread Noah Misch
On Fri, Apr 04, 2025 at 04:11:05PM -0400, Andrew Dunstan wrote:
> Thanks. I have pushed these now with a few further small tweaks.

This drops all databases:

pg_dumpall --clean -Fd -f /tmp/dump
pg_restore -d template1 --globals-only /tmp/dump

That didn't match my expectations given this help text:

$ pg_restore --help|grep global
  -g, --globals-only   restore only global objects, no databases

This happens in dropDBs().  I found that by searching pg_dumpall.c for "OPF",
which finds all the content we can write to globals.dat.

commit 1495eff wrote:
> --- a/src/bin/pg_dump/pg_dumpall.c
> +++ b/src/bin/pg_dump/pg_dumpall.c

> @@ -1612,9 +1683,27 @@ dumpDatabases(PGconn *conn)
>   continue;
>   }
>  
> + /*
> +  * If this is not a plain format dump, then append dboid and 
> dbname to
> +  * the map.dat file.
> +  */
> + if (archDumpFormat != archNull)
> + {
> + if (archDumpFormat == archCustom)
> + snprintf(dbfilepath, MAXPGPATH, 
> "\"%s\"/\"%s\".dmp", db_subdir, oid);
> + else if (archDumpFormat == archTar)
> + snprintf(dbfilepath, MAXPGPATH, 
> "\"%s\"/\"%s\".tar", db_subdir, oid);
> + else
> + snprintf(dbfilepath, MAXPGPATH, 
> "\"%s\"/\"%s\"", db_subdir, oid);

Use appendShellString() instead.  Plain mode already does that for the
"pg_dumpall -f" argument, which is part of db_subdir here.  We don't want
weird filename characters to work out differently for plain vs. non-plain
mode.  Also, it's easier to search for appendShellString() than to search for
open-coded shell quoting.
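
For illustration, a sketch of what that could look like (dbfilepath
handling only; variable names follow the quoted patch, and the rest of
the command assembly is elided):

	char		dbfilepath[MAXPGPATH];
	PQExpBufferData cmd;

	/* build the unquoted path first ... */
	snprintf(dbfilepath, MAXPGPATH, "%s/%s%s", db_subdir, oid,
			 archDumpFormat == archCustom ? ".dmp" :
			 archDumpFormat == archTar ? ".tar" : "");

	/* ... and let appendShellString() do the quoting */
	initPQExpBuffer(&cmd);
	appendPQExpBuffer(&cmd, "\"%s\" -f ", pg_dump_bin);
	appendShellString(&cmd, dbfilepath);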

> @@ -1641,19 +1727,30 @@ dumpDatabases(PGconn *conn)
>   if (filename)
>   fclose(OPF);
>  
> - ret = runPgDump(dbname, create_opts);
> + ret = runPgDump(dbname, create_opts, dbfilepath, 
> archDumpFormat);
>   if (ret != 0)
>   pg_fatal("pg_dump failed on database \"%s\", exiting", 
> dbname);
>  
>   if (filename)
>   {
> - OPF = fopen(filename, PG_BINARY_A);
> + charglobal_path[MAXPGPATH];
> +
> + if (archDumpFormat != archNull)
> + snprintf(global_path, MAXPGPATH, 
> "%s/global.dat", filename);
> + else
> + snprintf(global_path, MAXPGPATH, "%s", 
> filename);
> +
> + OPF = fopen(global_path, PG_BINARY_A);
>   if (!OPF)
>   pg_fatal("could not re-open the output file 
> \"%s\": %m",
> -  filename);
> +  global_path);

Minor item: plain mode benefits from reopening, because pg_dump appended to
the plain output file.  There's no analogous need to reopen global.dat, since
just this one process writes to global.dat.

> @@ -1672,17 +1770,36 @@ runPgDump(const char *dbname, const char *create_opts)
>   initPQExpBuffer(&connstrbuf);
>   initPQExpBuffer(&cmd);
>  
> - printfPQExpBuffer(&cmd, "\"%s\" %s %s", pg_dump_bin,
> -   pgdumpopts->data, create_opts);
> -
>   /*
> -  * If we have a filename, use the undocumented plain-append pg_dump
> -  * format.
> +  * If this is not a plain format dump, then append file name and dump
> +  * format to the pg_dump command to get archive dump.
>*/
> - if (filename)
> - appendPQExpBufferStr(&cmd, " -Fa ");
> + if (archDumpFormat != archNull)
> + {
> + printfPQExpBuffer(&cmd, "\"%s\" -f %s %s", pg_dump_bin,
> +   dbfile, create_opts);
> +
> + if (archDumpFormat == archDirectory)
> + appendPQExpBufferStr(&cmd, "  --format=directory ");
> + else if (archDumpFormat == archCustom)
> + appendPQExpBufferStr(&cmd, "  --format=custom ");
> + else if (archDumpFormat == archTar)
> + appendPQExpBufferStr(&cmd, "  --format=tar ");
> + }
>   else
> - appendPQExpBufferStr(&cmd, " -Fp ");
> + {
> + printfPQExpBuffer(&cmd, "\"%s\" %s %s", pg_dump_bin,
> +   pgdumpopts->data, 
> create_opts);

This uses pgdumpopts for plain mode only, so many pg_dumpall options silently
have no effect in non-plain mode.  Example:

strace -f pg_dumpall --lock-wait-timeout=10 2>&1 >/dev/null | grep exec
strace -f pg_dumpall --lock-wait-timeout=10 -Fd -f /tmp/dump3 2>&1 >/dev/null | 
grep exec

> --- a/src/bin/pg_dump/pg_restore.c
> +++ b/src/bin/pg_dump/pg_restore.c

> +/*
> + * read_one_statement
> + *
> + * This will start reading from passed

Fix comment in btree_gist--1.8--1.9.sql

2025-07-08 Thread Paul Jungwirth

Hello Hackers,

I noticed that the comment at the top of btree_gist--1.8--1.9.sql says it is the 1.7--1.8 file. Here 
is a one-line patch to fix that.


A related question: In the v18 release we are adding two upgrade files: btree_gist--1.7--1.8.sql and 
btree_gist--1.8--1.9.sql. Why not collapse them into one? Technically we briefly had 
btree_gist--1.7--1.8.sql in 17devel, but it was reverted before release. Would you like a patch 
combining these files? Is it too late for that?


Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com
From 518a70382e4628facb86b8f4f09fb73d38881931 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" 
Date: Tue, 8 Jul 2025 16:42:54 -0700
Subject: [PATCH v1] Fix comment in btree_gist--1.8--1.9.sql

The comment said this was the 1.7--1.8 file, but actually it's the file
for 1.8--1.9.
---
 contrib/btree_gist/btree_gist--1.8--1.9.sql | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/btree_gist/btree_gist--1.8--1.9.sql b/contrib/btree_gist/btree_gist--1.8--1.9.sql
index 4b38749bf5f..cc1207ed022 100644
--- a/contrib/btree_gist/btree_gist--1.8--1.9.sql
+++ b/contrib/btree_gist/btree_gist--1.8--1.9.sql
@@ -1,4 +1,4 @@
-/* contrib/btree_gist/btree_gist--1.7--1.8.sql */
+/* contrib/btree_gist/btree_gist--1.8--1.9.sql */
 
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "ALTER EXTENSION btree_gist UPDATE TO '1.9'" to load this file. \quit
-- 
2.39.5



Re: A assert failure when initdb with track_commit_timestamp=on

2025-07-08 Thread Andy Fan
Fujii Masao  writes:

> Shouldn't we also add a TAP test to verify that initdb works correctly
> with GUCs marked as GUC_NOT_IN_BOOTSTRAP?

After reverting 5a6c39b6d, the test case could be as simple as below; I
also tested this change. Just FYI.

modified   src/test/modules/commit_ts/t/001_base.pl
@@ -11,8 +11,7 @@ use Test::More;
 use PostgreSQL::Test::Cluster;
 
 my $node = PostgreSQL::Test::Cluster->new('foxtrot');
-$node->init;
-$node->append_conf('postgresql.conf', 'track_commit_timestamp = on');
+$node->init(extra => ['-c', 'track_commit_timestamp=on', "-c", 
"transaction_timeout=10s" ]);
 $node->start;
 
# Create a table, compare "now()" to the commit TS of its xmin

-- 
Best Regards
Andy Fan





Re: array_random

2025-07-08 Thread jian he
On Sat, Jul 5, 2025 at 3:32 PM Vik Fearing  wrote:
>
> On 30/06/2025 17:04, jian he wrote:
>
> reasons for adding array_random is:
> 1. This is better than array_fill. This can fill random and constant
> values (random, min and max the same).
> 2.  Building a multi-dimensional PL/pgSQL function equivalent to
> array_random is not efficient and is also not easier.
>
>
> I am not against this at all, but what is the actual use case?
>
> --

It seems non-trivial to wrap up all the generated random values into a
specific multi-dimensional array (more than 2 dimensions).
For example, say we generated 24 random values and wanted to arrange
them into a 3-dimensional array with shape [4, 3, 2]. We can easily use:
SELECT array_random(1, 6, array[4,3, 2]);

Of course, we can use plpgsql to do it, but the C function would be
more convenient. Does this make sense?




Re: Small optimization with expanding dynamic hash table

2025-07-08 Thread cca5507
> Will this still work if new_bucket is not equal to hctl->low_mask + 1?

Yes, for example:


low_mask: 0x011, high_mask: 0x111, old_bucket: 0x010, new_bucket: 0x110


Hash values in the old_bucket look like 0x***010 or 0x***110; the latter
are in the old_bucket only because new_bucket did not exist before, so only
hash values like 0x***110 need relocation: hashvalue & (low_mask + 1) != 0
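
For illustration, the relocation test boils down to checking the single
bit that distinguishes new_bucket from old_bucket (a standalone sketch,
not the dynahash.c code):

#include <stdbool.h>
#include <stdint.h>

/*
 * With low_mask = 0x011 and new bucket 0x110 split off from old bucket
 * 0x010, an entry stays in the old bucket unless bit (low_mask + 1) =
 * 0x100 is set in its hash value.
 */
static inline bool
needs_relocation(uint32_t hashvalue, uint32_t low_mask)
{
	return (hashvalue & (low_mask + 1)) != 0;
}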


--
Regards,
ChangAo Chen

Re: Using failover slots for PG-non_PG logical replication

2025-07-08 Thread shveta malik
On Mon, Jul 7, 2025 at 7:00 PM Ashutosh Bapat
 wrote:
>
> On Mon, Jul 7, 2025 at 9:53 AM shveta malik  wrote:
> >
> > Thanks for the patch.
> >
> > > I couldn't figure out whether the query on primary to fetch all the
> > > slots to be synchronized should filter based on invalidation_reason
> > > and conflicting or not. According to synchronize_slots(), it seems
> > > that we retain invalidated slots on standby when failover = true and
> > > they would remain with synced = true on standby. Is that right?
> > >
> >
> > Yes, that’s correct. We sync the invalidation status of replication
> > slots from the primary to the standby and then stop synchronizing any
> > slots that have been marked as invalidated, retaining synced flag as
> > true. IMO, there's no need to filter out conflicting slots on the
> > primary, because instead of excluding them there and showing
> > everything as failover-ready on the standby, the correct approach is
> > to reflect the actual state on standby.This means conflicting slots
> > will appear as non-failover-ready on the standby. That’s why Step 3
> > also considers conflicting flag in its evaluation.
>
> Thanks for the explanation. WFM.
>
> If a slot is invalidated because RS_INVAL_WAL_REMOVED, conflicting =
> false, but the slot is not useful standby. But then it's not useful on
> primary as well. Is that why we are not including (invalidation_reason
> IS NOT NULL) condition along in (synced AND NOT temporary AND NOT
> conflicting)?

Thanks for bringing it up. Even though the slot is not useful on the
primary node either, we should still show failover-ready as false on
the standby. We should modify the query of step 3 to check
'invalidation_reason IS NULL' instead of 'NOT conflicting'. That will
cover all the cases where the slot is invalidated and thus not
failover-ready.

> >
> > 1)
> > +/* primary # */ SELECT array_agg(quote_literal(r.slot_name)) AS slots
> > +   FROM pg_replication_slots r
> > +   WHERE r.failover AND NOT r.temporary;
> >
> > On primary, there is no  need to check temporary-status. We do not
> > allow setting failover as true for temporary slots.
>
> Why does synchronize_slots() has it in its query?
>

It is not needed, but there is no harm in keeping it.
If we want the documentation to be in sync with the code, we can add the
'not temporary' check to the doc as well.

> >
> > 2)
> > Although not directly related to the concerns addressed in the given
> > patch, I think it would be helpful to add a note in the original doc
> > stating that Steps 1 and 2 should be executed on each subscriber node
> > that will be served by the standby after failover.
>
> There's a bit of semantic repeatition here since an earlier paragraph
> mentions a given subscriber. But I think overall it's better to be
> more clear than being less clear.
> >
> > I have attached a top-up patch with the above changes and a few more
> > trivial changes. Please include it if you find it okay.
>
> Thanks. Included. I have made a few edits and included them in the
> attached patch.
>

Thanks. The existing changes (originally targeted in this patch) look
good to me.

I have attached a top-up patch for step-3 correction. Please include
if you find it okay to be fixed in the same patch, otherwise we can
handle it separately.

thanks
Shveta


0001-top-up-patch.patch
Description: Binary data


Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-07-08 Thread shveta malik
On Sun, Jul 6, 2025 at 8:33 PM Masahiko Sawada  wrote:
>
> On Thu, Jul 3, 2025 at 3:32 PM shveta malik  wrote:
> >
> > On Wed, Jul 2, 2025 at 9:46 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Jun 18, 2025 at 1:07 PM shveta malik  
> > > wrote:
> > > >
> > > > On Wed, Jun 18, 2025 at 6:06 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > Thank you for the comments!
> > > > >
> > > > > >
> > > > > > 2)
> > > > > > I see that when primary switches back its effective wal_level to
> > > > > > replica while standby has wal_level=logical in conf file, then 
> > > > > > standby
> > > > > > has this status:
> > > > > >
> > > > > > postgres=# show wal_level;
> > > > > >  wal_level
> > > > > > ---
> > > > > >  logical
> > > > > >
> > > > > > postgres=# show effective_wal_level;
> > > > > >  effective_wal_level
> > > > > > -
> > > > > >  replica
> > > > > >
> > > > > > Is this correct? Can effective_wal_level be < wal_level anytime? I
> > > > > > feel it can be greater but never lesser.
> > > > >
> > > > > Hmm, I think we need to define what value we should show in
> > > > > effective_wal_level on standbys because the standbys actually are not
> > > > > writing any WALs and whether or not the logical decoding is enabled on
> > > > > the standbys depends on the primary.
> > > > >
> > > > > In the previous version patch, the standby's effective_wal_level value
> > > > > depended solely on the standby's wal_level value. However, it was
> > > > > confusing in a sense because it's possible that the logical decoding
> > > > > could be available even though effective_wal_level is 'replica' if the
> > > > > primary already enables it. One idea is that given that the logical
> > > > > decoding availability and effective_wal_level value are independent in
> > > > > principle, it's better to provide a SQL function to get the logical
> > > > > decoding status so that users can check the logical decoding
> > > > > availability without checking effective_wal_level. With that function,
> > > > > it might make sense to revert back the behavior to the previous one.
> > > > > That is, on the primary the effective_wal_level value is always
> > > > > greater than or equal to wal_level whereas on the standbys it's always
> > > > > the same as wal_level, and users would be able to check the logical
> > > > > decoding availability using the SQL function. Or it might also be
> > > > > worth considering to show effective_wal_level as NULL on standbys.
> > > >
> > > > Yes, that is one idea. It will resolve the confusion.
> > > > But I was thinking, instead of having one new GUC + a SQL function,
> > > > can we have a GUC alone, which shows logical_decoding status plus the
> > > > cause of that. The new GUC will be applicable on both primary and
> > > > standby. As an example, let's say we name it as
> > > > logical_decoding_status, then it can have these values (
> > > > _):
> > > >
> > > > enabled_wal_level_logical:  valid both
> > > > for primary, standby
> > > > enabled_effective_wal_level_logical:   valid only for 
> > > > primary
> > > > enabled_cascaded_logical_decoding   valid only for 
> > > > standby
> > > > disabled :
> > > >   valid both for primary, standby
> > > >
> > > > 'enabled_cascaded_logical_decoding'  will indicate that logical
> > > > decoding is enabled on standby (even when its own wal_level=replica)
> > > > as a cascaded effect from primary. It can be possible either due to
> > > > primary's wal_level=logical or logical slot being present on primary.
> > >
> > > I'm not sure it's a good idea to combine two values into one GUC
> > > because the tools would have to parse the string in order to know when
> > > they want to know either information.
> >
> > Okay. Agreed.
> >
> > > As for the effective_wal_level shown on the standby, if it shows the
> > > effective WAL level it might make sense to show as 'replica' even if
> > > the standby's wal_level is 'logical'
> >
> > Alright. It depends on the definition we choose to assign to
> > effective_wal_level.
> >
> > > because the standby cannot write
> > > any WAL and need to follow the primary.
> >
> > When the standby’s wal_level is set to 'logical', the requirement for
> > logical decoding is already fulfilled. Or do you mean that the
> > effective_wal_level on standby should not be shown as logical until
> > both the primary and standby have wal_level set to logical and we also
> > have a logical slot present on standby?
>
> Even when the standby's wal_level is 'logical', the logical decoding
> cannot be used on the standby if the primary doesn't enable it. IOW,
> even if the standby's wal_level is 'replica', the logical decoding is
> available on the standby if the primary enables it.
>
> >
> > > While it might be worth
> > > considering to accept the case of effective_wal_level (replica) <
> > > wal_level (logical) only on the standbys, we need to keep the
> > > princ

Re: 024_add_drop_pub.pl might fail due to deadlock

2025-07-08 Thread Ajin Cherian
On Mon, Jul 7, 2025 at 8:15 PM Ajin Cherian  wrote:
>
> On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin  wrote:
> >
> > --- a/src/backend/replication/logical/origin.c
> > +++ b/src/backend/replication/logical/origin.c
> > @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool 
> > missing_ok, bool nowait)
> >   * the specific origin and then re-check if the origin still 
> > exists.
> >   */
> >  rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
> > +pg_usleep(30);
> >
> > Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
> > 024_add_drop_pub.pl was added in v15).
> >
> > [1] 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58
> >
> > Best regards,
> > Alexander
> >
>
> Hi Alexander,
>
> Yes, the problem can be reproduced by the changes you suggested. I
> will look into what is happening and how we can fix this.

The issue appears to be a deadlock caused by inconsistent lock
acquisition order between two processes:

Process A (executing ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1):
In AlterSubscription_refresh(), it first acquires an
AccessExclusiveLock on SubscriptionRelRelationId (resource 1), then
later tries to acquire an ExclusiveLock on ReplicationOriginRelationId
(resource 2).

Process B (apply worker):
In process_syncing_tables_for_apply(), it first acquires an
ExclusiveLock on ReplicationOriginRelationId (resource 2), then calls
UpdateSubscriptionRelState(), which tries to acquire a AccessShareLock
on SubscriptionRelRelationId (resource 1).

This leads to a deadlock:
Process A holds a lock on resource 1 and waits for resource 2, while
process B holds a lock on resource 2 and waits for resource 1.

Proposed fix:
In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
on SubscriptionRelRelationId before acquiring the lock on
ReplicationOriginRelationId.
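
A rough sketch of that ordering (names are taken from the analysis
above; the surrounding tablesync code and exact calls are simplified
assumptions, not the attached patch itself):

	Relation	subrel;
	Relation	originrel;

	/* resource 1 first, matching AlterSubscription_refresh() */
	subrel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

	/* resource 2 second */
	originrel = table_open(ReplicationOriginRelationId, ExclusiveLock);

	/* ... drop the tablesync origin, then UpdateSubscriptionRelState() ... */

	table_close(originrel, NoLock);
	table_close(subrel, NoLock);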

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

regards,
Ajin Cherian
Fujitsu Australia.


0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION-.-DROP-PUBL.patch
Description: Binary data


Re: What is a typical precision of gettimeofday()?

2025-07-08 Thread Laurenz Albe
On Tue, 2025-07-08 at 18:17 -0400, Tom Lane wrote:
> One other thing that bothers me as I look at the output is
> 
>   Per loop time including overhead: 731.26 ns
> 
> That's stated in a way that makes it sound like that is a
> very solid number, when in fact it's just the average.
> We see from these test cases that there are frequently
> a few outliers that are far from the average.  I'm tempted
> to rephrase as
> 
>   Average loop time including overhead: 731.26 ns
> 
> or some variant of that.  Thoughts?

I think that is a good idea.

Yours,
Laurenz Albe




Re: Improving and extending int128.h to more of numeric.c

2025-07-08 Thread John Naylor
On Mon, Jun 23, 2025 at 3:01 PM Dean Rasheed  wrote:
> 0001 is a trivial bug fix for the test code in src/tools/testint128.c
> -- it was using "union" instead of "struct" for test128.hl, which
> meant that it was only ever setting and checking half of each 128-bit
> integer in the tests.

Hi Dean, I went to take a look at this and got stuck at building the
test file. The usual pointing gcc to the src and build include
directories didn't cut it. How did you get it to work?

--
John Naylor
Amazon Web Services




Re: [V2] Adding new CRC32C implementation for IBM S390X

2025-07-08 Thread John Naylor
On Tue, Jul 8, 2025 at 6:46 PM Eduard Stefes  wrote:
>
> Hi,
>
> here is V3 of the patch. Changes from V2:
>
> - removed IBM copyright
> - removed GETAUXVAL check in favor of the already provided check
> - moved runtime selection code from pg_crc32c_s390x_choose.c to
> pg_crc32c_s390x.c and removed _choose.c file
> - removed the config time compiler check and let the buildsystem fall
> back to sb8
> - changed buffer limit back to 32 bytes before calling s390x specific
> implementation

Hi Eduard, I look a brief look at v3 and it seems mostly okay at a
glance. There is just one major thing that got left out:

On Wed, Jul 2, 2025 at 3:27 PM Eduard Stefes  wrote:
> On Wed, 2025-06-11 at 13:48 +0700, John Naylor wrote:
> > As I alluded to before, I'm not in favor of having both direct-call
> > and runtime-check paths here. The reason x86 and Arm have it is
> > because their hardware support works on any length input. Is there
> > any
> > reason to expect auxv.h to be unavailable?
>
> I tried to find a reason but did not find any. So I'll remove it.

v3 still has direct-call and runtime-check paths. Let's keep only
USE_S390X_CRC32C_WITH_RUNTIME_CHECK and discard the direct call
configure checks. Once that's done I'll take a closer look and test as
well. The rest should be small details.

-- 
John Naylor
Amazon Web Services




Re: Should TRUNCATE fire DDL triggers

2025-07-08 Thread Laurenz Albe
On Tue, 2025-07-08 at 21:23 -0700, David G. Johnston wrote:
> On Tuesday, July 8, 2025, Hari Krishna Sunder  wrote:
> > First of all, is TRUNCATE a DDL or a DML? This doc refers to it as a DDL,
> > whereas other docs like this and this treat it as a DML, so which one is it?
> 
> Neither…classification systems are often imperfect…but it sure quacks like
> DML to my ears.  I’d probably remove the term “DDL” from that first link and
> avoid the grey area.  Listing the two commands suffices.

I agree with that.
 
> > A lot of other SQL databases treat TRUNCATE as a DDL, so assuming that is
> > true, can we add it to the command tags supported by "ddl_command_start"
> > and "ddl_command_end" triggers?
> 
> Seems worthy of consideration regardless of how one answers the prior
> question; for much the same reason.

I disagree here.  There are regular ON TRUNCATE triggers on tables, so I don't
see the need.  You can define a trigger with the same trigger function on
several tables.

Yours,
Laurenz Albe




Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach

2025-07-08 Thread Cédric Villemain

On 7/8/25 18:06, Cédric Villemain wrote:








On 7/8/25 03:55, Cédric Villemain wrote:

Hi Andres,


Hi,

On 2025-07-05 07:09:00 +, Cédric Villemain wrote:

In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about
integrating
NUMA-specific management directly into core PostgreSQL in such a way.


I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having
replication in
core was a huge mistake. Not having HA management in core is
probably the
biggest current adoption hurdle for postgres.

To deal better with NUMA we need to improve memory placement and
various
algorithms, in an interrelated way - that's pretty much impossible
to do
outside of core.


Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).



But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.

I still don't have any idea what exactly would the external module do,
how would it decide where to place the backend. Can you describe some
use case with an example?

Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.


Possibly exactly what you're doing in proc.c when managing allocation of
process, but not hardcoded in postgresql (patches 02, 05 and 06 are good
candidates), I didn't get that they require information not available to
any process executing code from a module.



Well, it needs to understand how some other stuff (especially PGPROC
entries) is distributed between nodes. I'm not sure how much of this
internal information we want to expose outside core ...


Parts of your code where you assign/define policy could be in one or
more relevant routines of a "numa profile manager", like in an
initProcessRoutine(), and registered in pmroutine struct:

pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
     pmroutine->init_process != NULL)
     pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to be able
to adjust when hardware and linux kernel changes.



I'm not against making this extensible, in some way. But I still
struggle to imagine a reasonable alternative policy, where the external
module gets the same information and ends up with a different decision.

So what would the alternate policy look like? What use case would the
module be supporting?



That's the whole point: there are very distinct usages of PostgreSQL in 
the field. And maybe not all of them will require the policy defined by 
PostgreSQL core.


May I ask the reverse: what prevents external modules from making those 
decisions? There are already a lot of areas where external code can take 
over PostgreSQL processing, as Neon is doing.


There is some very early processing for memory setup that I can see as 
a current blocker, and here I'd prefer a more compliant NUMA API as 
proposed by Jakub, so it's possible to arrange things based on workload, 
hardware configuration or other matters. Reworking to get distinct 
segments and so on, as you do, is great, and a combination of both 
approaches is probably of great interest. There is also the weighted 
interleave being discussed, and probably much more to come in this area 
in Linux.


I think some points have already been raised about possible distinct 
policies. I am precisely claiming that it is hard to come up with one 
good policy with limited setup options, hence the requirement to keep 
this flexible enough (hooks, an API, 100 GUCs?).


There is an EPYC story here as well: given that the NUMA setup can vary 
depending on the BIOS configuration, the associated NUMA policy must 
probably take that into account (L3 can be either a real cache or 4 
extra "local" NUMA nodes - with a highly distinct access cost from a RAM 
module).
Does that change how PostgreSQL will place memory and processes? Is it 
important or of interest?



--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D





Re: amcheck support for BRIN indexes

2025-07-08 Thread Arseniy Mukhin
On Tue, Jul 8, 2025 at 4:21 PM Tomas Vondra  wrote:
>
>
>
> On 7/8/25 14:40, Arseniy Mukhin wrote:
> > On Mon, Jul 7, 2025 at 3:21 PM Álvaro Herrera  wrote:
> >>
> >> On 2025-Jul-07, Tomas Vondra wrote:
> >>
> >>> Alvaro, what's your opinion on the introduction of the new WITHIN_RANGE?
> >>> I'd probably try to do this using the regular consistent function:
> >>>
> >>> (a) we don't need to add stuff to all BRIN opclasses to support this
> >>>
> >>> (b) it gives us additional testing of the consistent function
> >>>
> >>> (c) building a scan key for equality seems pretty trivial
> >>>
> >>> What do you think?
> >>
> >> Oh yeah, if we can build this on top of the existing primitives, then
> >> I'm all for that.
> >
> > Thank you for the feedback! I agree with the benefits. Speaking of
> > (с), it seems most of the time to be really trivial to build such a
> > ScanKey, but not every opclass supports '=' operator. amcheck should
> > handle these cases somehow then. I see two options here. The first is
> > to not provide  'heap all indexed' check for such opclasses, which is
> > sad because even one core opclass (box_inclusion_ops) doesn't support
> > '=' operator, postgis brin opclasses don't support it too AFAICS. The
> > second option is to let the user define which operator to use during
> > the check, which, I think, makes user experience much worse in this
> > case. So both options look not good from the user POV as for me, so I
> > don't know. What do you think about it?
> >
> > And should I revert the patchset to the consistent function version then?
> >
>
> Yeah, that's a good point. The various opclasses may support different
> operators, and we don't know which "strategy" to fill into the scan key.
> Minmax needs BTEqualStrategyNumber, inclusion RTContainsStrategyNumber,
> and so on.
>
> I wonder if there's a way to figure this out from the catalogs, but I
> can't think of anything. Maybe requiring the user to supply the operator
> (and not checking heap if it's not provided)
>
>   SELECT brin_index_check(..., '@>');
>
> would not be too bad ...

This doesn't change much, but it seems we need an array of operators
in the case of a multi-column index. And one more thing, it's
theoretically possible that opclass doesn't have the operator we need
at all. I'm not aware of such cases, but perhaps in the future
something like GIN tsvector_ops could be created for BRIN.
tsvector_ops doesn't have an operator that could be used here.
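
For what it's worth, a sketch of what building such a scan key might
look like for a minmax-style opclass (attno, attr_collation,
equality_proc and heap_datum are placeholders; the operator/procedure
lookup from the opfamily is hand-waved here):

	ScanKeyData skey;

	/*
	 * Equality works for minmax; an inclusion opclass such as
	 * box_inclusion_ops would need e.g. RTContainsStrategyNumber, and
	 * some opclasses may offer no usable operator at all.
	 */
	ScanKeyEntryInitialize(&skey,
						   0,						/* flags */
						   attno,					/* indexed column */
						   BTEqualStrategyNumber,
						   InvalidOid,				/* strategy subtype */
						   attr_collation,
						   equality_proc,			/* from the opfamily */
						   heap_datum);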


Best regards,
Arseniy Mukhin




Re: Small optimization with expanding dynamic hash table

2025-07-08 Thread Rahila Syed
Hi,

Yes, for example:
>>
>> low_mask: 0x011, high_mask: 0x111, old_bucket: 0x010, new_bucket: 0x110
>>
>> The old_bucket's hash value like 0x***010 or 0x***110, the later is in
>> the old_bucket is because we didn't have new_bucket before, so only hash
>> value like 0x***110 needs relocation: hashvalue & (low_mask + 1) != 0
>>
>>
> Thanks for explaining, that clarifies things for me.
> It may be worthwhile to check if this change has led to any performance
> improvements.
>
>

One thing to note is that in this scenario, there is no safeguard if the
hashvalue is 0x111 and new_bucket is 0x110.
This means max_bucket is 0x110, but with your change, even 0x111 would meet
the condition for relocation
to the new bucket, which would be incorrect.
Although it’s unlikely that 0x111 would be passed into this check, if it is
passed, there is currently no check to
prevent it from being relocated to the new bucket. In the current code,
however, we do have such a check in
place in calc_bucket, specifically: if (bucket > hctl->max_bucket)
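
For comparison, the calc_bucket()-style computation keeps that clamp by
deriving the bucket from the hash value rather than testing one bit
(sketch only, variable names follow the discussion):

	uint32		bucket = hashvalue & hctl->high_mask;

	if (bucket > hctl->max_bucket)
		bucket = bucket & hctl->low_mask;

	if (bucket != old_bucket)
	{
		/* entry belongs elsewhere (the new bucket, during a split) */
	}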

Thank you,
Rahila Syed


Re: [PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root

2025-07-08 Thread Mikhail Kharitonov
On Tue, Jul 8, 2025 at 11:53 AM Mikhail Kharitonov
 wrote:
>
> Hi all,
>
> I’m sending v2 of the patch. This is a clean rebase onto current master
> (commit a27893df45e) and a squash of the fix together with the TAP
> test into a single patch file.
>
> I would appreciate your thoughts and comments on the current problem.
>
> Thank you!
>
> --
> Best regards,
> Mikhail Kharitonov
>

Sorry, I forgot the attachment in my previous message.
Please find the v2 patch attached.


0002-replica_identity.patch
Description: Binary data


Re: Expand applicability of aggregate's sortop optimization

2025-07-08 Thread Matthias van de Meent
On Tue, 4 Mar 2025 at 15:26, Andrei Lepikhov  wrote:
>
> Rebase on current master.

FYI, as this is a significantly different patch than the one I started
this thread with, I'm closing my CF entry. Feel free to open a new one
for your patch.


Kind regards,

Matthias van de Meent




Re: A assert failure when initdb with track_commit_timestamp=on

2025-07-08 Thread Fujii Masao




On 2025/07/06 3:00, Tom Lane wrote:

I wrote:

Fujii Masao  writes:

Or GUC ignore_system_indexes also should be treated in the same way
as transaction_timeout?



Yes, I'd say we ought to mark that GUC as don't-accept-in-bootstrap
too.  I've not done any research about what other GUCs can break
initdb, but now I'm starting to suspect there are several.


Here's a fleshed-out implementation of Hayato-san's idea.  I've
not done anything about reverting 5a6c39b6d, nor have I done any
checks to see if there are other GUCs we ought to mark similarly.
(But at this point I'd be prepared to bet that there are.)


Thanks for the patch! It looks good to me.

Shouldn't we also add a TAP test to verify that initdb works correctly
with GUCs marked as GUC_NOT_IN_BOOTSTRAP?

Regards,

--
Fujii Masao
NTT DATA Japan Corporation





Re: Fix replica identity checks for MERGE command on published table.

2025-07-08 Thread Dean Rasheed
On Mon, 7 Jul 2025 at 18:17, Dean Rasheed  wrote:
>
> This makes me wonder though, does INSERT ... ON CONFLICT DO UPDATE
> have the same problem as MERGE?
>

Answering my own question, INSERT ... ON CONFLICT DO UPDATE does have
the same problem as MERGE. To reproduce the error, all you need to do
is create the unique index it needs *after* creating the publication,
for example:

CREATE TABLE foo (a int, b text);
INSERT INTO foo VALUES (1, 'xxx');
CREATE PUBLICATION pub1 FOR TABLE foo;
CREATE UNIQUE INDEX foo_a_idx ON foo(a);

Then a plain UPDATE is blocked:

UPDATE foo SET b = 'yyy' WHERE a = 1;

ERROR:  cannot update table "foo" because it does not have a replica
identity and publishes updates
HINT:  To enable updating the table, set REPLICA IDENTITY using ALTER TABLE.

but the equivalent with INSERT ... ON CONFLICT DO UPDATE is allowed on
the publisher:

INSERT INTO foo VALUES (1)
  ON CONFLICT (a) DO UPDATE SET b = 'yyy';

but fails on the subscriber:

ERROR:  logical replication target relation "public.foo" has neither
REPLICA IDENTITY index nor PRIMARY KEY and published relation does not
have REPLICA IDENTITY FULL
CONTEXT:  processing remote data for replication origin "pg_16535"
during message type "UPDATE" for replication target relation
"public.foo" in transaction 872, finished at 0/01E65718

So INSERT ... ON CONFLICT DO UPDATE needs to check that the result
relation is valid for UPDATEs.
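
If it helps, a very rough sketch of where such a check could live
(CheckCmdReplicaIdentity() is the existing publisher-side check that
raises the "cannot update table" error for plain UPDATEs; the exact
call site in the ON CONFLICT path is an assumption):

	/* while setting up ON CONFLICT DO UPDATE for a result relation,
	 * treat it like an UPDATE for publication purposes */
	if (onConflictAction == ONCONFLICT_UPDATE)
		CheckCmdReplicaIdentity(resultRelationDesc, CMD_UPDATE);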

Regards,
Dean




Re: [PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root

2025-07-08 Thread Mikhail Kharitonov
Hi all,

I’m sending v2 of the patch. This is a clean rebase onto current master
(commit a27893df45e) and a squash of the fix together with the TAP
test into a single patch file.

I would appreciate your thoughts and comments on the current problem.

Thank you!

--
Best regards,
Mikhail Kharitonov

On Thu, May 29, 2025 at 9:30 AM Mikhail Kharitonov
 wrote:
>
> Hi,
>
> Thank you for the feedback.
>
> I would like to clarify that the current behavior does not break replication
> between PostgreSQL instances. The logical replication stream is still accepted
> by the subscriber, and the data is applied correctly. However, the protocol
> semantics are violated, which may cause issues for external systems that rely
> on interpreting this stream.
>
> When using publish_via_partition_root = true and setting REPLICA IDENTITY FULL
> only on the parent table (but not on all partitions), logical replication
> generates messages with the tag 'O' (old tuple) for updates and deletes even
> for partitions that do not have full identity configured.
>
> In those cases, only key columns are sent, and the rest of the tuple is 
> omitted.
>  This contradicts the meaning of tag 'O', which, according
>  to the documentation [1], indicates that the full old tuple is included.
>
> This behavior is safe for the standard PostgreSQL subscriber, which does not
> rely on the tag when applying changes. However, third-party tools that consume
> the logical replication stream and follow the protocol strictly can be misled.
> For example, one of our clients uses a custom CDC mechanism that extracts
> changes and sends them to Oracle. Their handler interprets the 'O' tag as a
> signal that the full old row is available. When it is not - the data is
> processed incorrectly.
>
> The attached patch changes the behavior so that the 'O' or 'K' tag is chosen
> based on the REPLICA IDENTITY setting of the actual partition where the row
> ends up, not only that of the parent.
> - If the partition has REPLICA IDENTITY FULL, the full tuple is
> sent and tagged 'O'.
> - Otherwise, only the key columns are sent, and the tag 'K' is used.
>
> This aligns the behavior with the protocol documentation.
> I have also included a TAP test: 036_partition_replica_identity.pl,
> located in src/test/subscription/t/
>
> It demonstrates two cases:
> - An update/delete on a partition with REPLICA IDENTITY FULL correctly
> emits an 'O' tag with the full old row.
> - An update/delete on a partition without REPLICA IDENTITY FULL currently
> also emits an 'O' tag, but only with key fields - this is the problem.
>
> After applying the patch, the second case correctly uses the 'K' tag.
>
> This patch is a minimal change: it does not alter protocol structure
> or introduce new behavior. It only ensures the implementation matches
> the documentation. In the future, we might consider a broader redesign
> of logical replication for partitioned tables (see [2]), but this is
> a narrow fix that solves a real inconsistency.
>
> Looking forward to your comments.
>
> Best regards,
> Mikhail Kharitonov
>
> [1] 
> https://www.postgresql.org/docs/current/protocol-logicalrep-message-formats.html
> [2] 
> https://www.postgresql.org/message-id/201902041630.gpadougzab7v@alvherre.pgsql
>
> On Mon, May 12, 2025 at 5:25 PM Maxim Orlov  wrote:
> >
> > Hi!
> >
> > This is probably not the most familiar part of Postgres to me, but does it 
> > break anything? Or is it just inconsistency in the replication protocol?
> >
> > A test for the described scenario would be a great addition. And, if it is 
> > feasible, provide an example of what would be broken with the way 
> > partitioned tables are replicated now.
> >
> > There is a chance that the replication protocol for partitioned tables 
> > needs to be rewritten, and I sincerely hope that I am wrong about this. It 
> > seems Alvaro Herrera tried this here [0].
> >
> >
> > [0] 
> > https://www.postgresql.org/message-id/201902041630.gpadougzab7v@alvherre.pgsql
> >
> >
> > --
> > Best regards,
> > Maxim Orlov.




Re: Elimination of the repetitive code at the SLRU bootstrap functions.

2025-07-08 Thread Evgeny
Álvaro, thank you for pushing the patch, and for your attention and final
editing! I have noted your recommendations and requirements and will take
them into account in a future patch.


Andrey, Alexander thank you for your proposals, attention, advice, and 
review!


Best regards,
Evgeny Voropaev,
Tantor Labs, LLC.




Re: Small optimization with expanding dynamic hash table

2025-07-08 Thread Rahila Syed
Hi,


> Yes, for example:
>
> low_mask: 0x011, high_mask: 0x111, old_bucket: 0x010, new_bucket: 0x110
>
> Hash values in the old_bucket look like 0x***010 or 0x***110; the latter
> were only in the old_bucket because new_bucket didn't exist before, so only
> hash values like 0x***110 need relocation: hashvalue & (low_mask + 1) != 0
>
>
Thanks for explaining, that clarifies things for me.
It may be worthwhile to check if this change has led to any performance
improvements.
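
For anyone else following the thread, the relocation test being discussed
boils down to something like this sketch (illustrative only, not the actual
dynahash code):

/*
 * When old_bucket is split, an entry must move to new_bucket iff the
 * single bit that distinguishes new_bucket from old_bucket (low_mask + 1)
 * is set in its hash value, so no full bucket recomputation is needed.
 */
static inline bool
needs_relocation(uint32 hashvalue, uint32 low_mask)
{
    return (hashvalue & (low_mask + 1)) != 0;
}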

Thank you,
Rahila syed


Re: A recent message added to pg_upgade

2025-07-08 Thread Dilip Kumar
On Tue, Jul 8, 2025 at 11:32 AM Dilip Kumar  wrote:
>
> On Mon, Jul 7, 2025 at 11:22 PM Tom Lane  wrote:
> >
> > Dilip Kumar  writes:
> > >> IMHO we can just query the 'max_slot_wal_keep_size' after
> > >> start_postmaster() and check what is the final resultant value. So now
> > >> we will only throw an error if the final value is not -1.   And we can
> > >> remove the hook from the server all together.  Thoughts?
> >
> > > I could come up with an attachment patch.
> >
> > I don't love this patch.  It's adding more cycles and more complexity
> > to pg_upgrade, when there is a simpler and more direct solution:
> > re-order the construction of the postmaster command line in
> > start_postmaster() so that our "-c max_slot_wal_keep_size" will
> > override anything the user supplies.
>
> Yeah that's right, one of the purposes of this change was to keep all
> logic at the pg_upgrade itself and remove the server hook altogether.
> But I think it was not a completely successful attempt to do that
> because still there was some awareness of this
> InvalidatePossiblyObsoleteSlot().  And I agree it would add an extra
> call in pg_upgrade.
>
> > There's a bigger picture here, though.  The fundamental thing that
> > I find wrong with the current code is that knowledge of and
> > responsibility for this max_slot_wal_keep_size hack is spread across
> > both pg_upgrade and the server.  It would be better if it were on
> > just one side.  Now, unless we want to change that Assert that
> > 8bfb231b4 put into InvalidatePossiblyObsoleteSlot(), the server side
> > is going to be aware of this decision.  So I'm inclined to think
> > that we should silently enforce max_slot_wal_keep_size = -1 in
> > binary-upgrade mode in the server's GUC check hook, and then remove
> > knowledge of it from pg_upgrade altogether.  Maybe the same for
> > idle_replication_slot_timeout, which really has got the same issue
> > that we don't want users overriding that choice.
>
> Yeah this change makes sense, currently we are anyway trying to force
> this to be -1 from pg_upgrade and server is also trying to validate if
> anything else is set during binary upgrade, so better to keep logic at
> one place.  I will work on this patch, thanks.

For now I have created 2 different patches, maybe we can merge them as well.

-- 
Regards,
Dilip Kumar
Google


v1-0001-Force-max_slot_wal_keep_size-to-1-during-binary-u.patch
Description: Binary data


v1-0002-Force-idle_replication_slot_timeout-to-0-during-b.patch
Description: Binary data


Re: like pg_shmem_allocations, but fine-grained for DSM registry ?

2025-07-08 Thread Florents Tselai


> On 25 Jun 2025, at 4:33 AM, Nathan Bossart  wrote:
> 
> for DSAs and dshash tables, we probably want to use dsa_get_total_size()
> for the "size" column.
> 

The thing is that dsa_get_total_size expects an already attached dsa.

I can't see how it's possible to get the actual size for the dsa and dsh case,
without attaching and then using dsa_get_total_size on the attached dsa.
And I don't think we wanna do that.

Instead maybe we can just report NULL for the dsa and dsh cases? 

Here’s a v5 that does it like that



v5-0001-Update-pg_dsm_registry_allocations-to-include-stu.patch
Description: Binary data


Re: A recent message added to pg_upgade

2025-07-08 Thread Dilip Kumar
On Tue, Jul 8, 2025 at 2:29 PM Amit Kapila  wrote:
>
> On Tue, Jul 8, 2025 at 11:32 AM Dilip Kumar  wrote:
> >
> > On Mon, Jul 7, 2025 at 11:22 PM Tom Lane  wrote:
> > >
> >
> > > There's a bigger picture here, though.  The fundamental thing that
> > > I find wrong with the current code is that knowledge of and
> > > responsibility for this max_slot_wal_keep_size hack is spread across
> > > both pg_upgrade and the server.  It would be better if it were on
> > > just one side.  Now, unless we want to change that Assert that
> > > 8bfb231b4 put into InvalidatePossiblyObsoleteSlot(), the server side
> > > is going to be aware of this decision.  So I'm inclined to think
> > > that we should silently enforce max_slot_wal_keep_size = -1 in
> > > binary-upgrade mode in the server's GUC check hook, and then remove
> > > knowledge of it from pg_upgrade altogether.  Maybe the same for
> > > idle_replication_slot_timeout, which really has got the same issue
> > > that we don't want users overriding that choice.
> >
> > Yeah this change makes sense,
> >
>
> Agreed.
>
> One other idea to achieve similar functionality is that during
> BinaryUpgrade, avoid removing WAL due to max_slot_wal_keep_size, and
> also skip InvalidateObsoleteReplicationSlots. The one advantage of
> such a change is that after this, we can remove Assert in
> InvalidatePossiblyObsoleteSlot, remove check_hook functions for GUCs
> max_slot_wal_keep_size and idle_replication_slot_timeout, and remove
> special settings for these GUCs in pg_upgrade.

Yeah, that could also be possible; I'm not sure which one is better, though.
With this idea we would have to put the BinaryUpgrade check in KeepLogSeg()
as well as in InvalidateObsoleteReplicationSlots(), whereas by forcing the
GUC to -1 during binary upgrade we only need a check in one place.
But either LGTM.
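
Just to illustrate, with the check-hook approach the server side boils down
to something like this sketch (not the actual patch):

/*
 * Sketch only: silently force the GUC to -1 while in binary-upgrade mode,
 * instead of rejecting other values.
 */
bool
check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source)
{
    if (IsBinaryUpgrade && *newval != -1)
        *newval = -1;           /* pg_upgrade must not lose slot WAL */
    return true;
}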

-- 
Regards,
Dilip Kumar
Google




Re: A assert failure when initdb with track_commit_timestamp=on

2025-07-08 Thread Michael Paquier
On Sat, Jul 05, 2025 at 02:00:07PM -0400, Tom Lane wrote:
> Here's a fleshed-out implementation of Hayato-san's idea.  I've
> not done anything about reverting 5a6c39b6d, nor have I done any
> checks to see if there are other GUCs we ought to mark similarly.
> (But at this point I'd be prepared to bet that there are.)

I've been reading through the patch (not tested), and no objections
with the concept here.  I would have kept that a HEAD-only change due
to the proposed location for SetProcessingMode(), but, well, if I'm
in minority that's fine by me.
--
Michael


signature.asc
Description: PGP signature


Re: POC: Parallel processing of indexes in autovacuum

2025-07-08 Thread Matheus Alcantara
On Sun Jul 6, 2025 at 5:00 AM -03, Daniil Davydov wrote:
>> The "autovacuum_max_parallel_workers" declared on guc_tables.c mention
>> that is capped by "max_worker_process":
>> +   {
>> +   {"autovacuum_max_parallel_workers", PGC_SIGHUP, 
>> VACUUM_AUTOVACUUM,
>> +   gettext_noop("Maximum number of parallel autovacuum 
>> workers, that can be taken from bgworkers pool."),
>> +   gettext_noop("This parameter is capped by 
>> \"max_worker_processes\" (not by \"autovacuum_max_workers\"!)."),
>> +   },
>> +   &autovacuum_max_parallel_workers,
>> +   0, 0, MAX_BACKENDS,
>> +   check_autovacuum_max_parallel_workers, NULL, NULL
>> +   },
>>
>> IIUC the code, it cap by "max_worker_process", but Masahiko has mention
>> on [1] that it should be capped by max_parallel_workers.
> To be honest, I don't think that this parameter should be explicitly
> capped at all.
> Other parallel operations (for example parallel index build or VACUUM
> PARALLEL) just request as many workers as they want without looking at
> 'max_parallel_workers'.
> And they will not complain, if not all requested workers were launched.
>
> Thus, even if 'autovacuum_max_parallel_workers' is higher than
> 'max_parallel_workers' the worst that can happen is that not all
> requested workers will be running (which is a common situation).
> Users can handle it by looking for logs like "planned vs. launched"
> and increasing 'max_parallel_workers' if needed.
>
> On the other hand, obviously it doesn't make sense to request more
> workers than 'max_worker_processes' (moreover, this parameter cannot
> be changed as easily as 'max_parallel_workers').
>
> I will keep the 'max_worker_processes' limit, so autovacuum will not
> waste time initializing a parallel context if there is no chance that
> the request will succeed.
> But it's worth remembering that actually the
> 'autovacuum_max_parallel_workers' parameter will always be implicitly
> capped by 'max_parallel_workers'.
>
> What do you think about it?
> 

It makes sense to me. The main benefit that I see in capping the
autovacuum_max_parallel_workers parameter is that users will see an
"invalid value for parameter "autovacuum_max_parallel_workers"" error in the
logs instead of needing to search for "planned vs. launched", which can be
tricky if log_min_messages is not set to at least the info level (the
default warning level will not show this log message). If we decide not to
cap this in code, I think it would at least be good to mention this in the
documentation.

>>
>> I've made some tests and I can confirm that is working correctly for
>> what I can see. I think that would be to start include the documentation
>> changes, what do you think?
>>
>
> It sounds tempting :)
> But perhaps first we should agree on the limitation of the
> 'autovacuum_max_parallel_workers' parameter.
>

Agree

> Please, see v6 patches :
> 1) Fixed typos in autovacuum.c and postgresql.conf.sample
> 2) Removed unused macro 'RelationGetParallelAutovacuumWorkers'
>

Thanks!

--
Matheus Alcantara





Re: Can can I make an injection point wait occur no more than once?

2025-07-08 Thread Peter Geoghegan
On Mon, Jul 7, 2025 at 9:53 PM Noah Misch  wrote:
> If it continues to be a problem, consider sharing the patch that's behaving
> this way for you.

Attached patch shows my current progress with the isolation test.

I also attach diff output of the FreeBSD failures. Notice that the
line "backwards_scan_session: NOTICE:  notice triggered for injection
point lock-and-validate-new-lastcurrblkno" is completely absent from
the test output. This absence indicates that the desired test coverage
is totally missing on FreeBSD -- so the test is completely broken on
FreeBSD.

I ran "meson test --suite setup --suite nbtree -q --print-errorlogs"
in a loop 500 times on my Debian workstation without seeing any
failures. Seems stable there. Whereas the FreeBSD target hasn't even
passed once out of more than a dozen attempts. Seems to be reliably
broken on FreeBSD.

Thanks
-- 
Peter Geoghegan
diff -U3 /tmp/cirrus-ci-build/src/test/modules/nbtree/expected/backwards.out 
/tmp/cirrus-ci-build/build/testrun/nbtree/isolation/results/backwards.out
--- /tmp/cirrus-ci-build/src/test/modules/nbtree/expected/backwards.out 
2025-07-08 14:58:24.118145000 +
+++ /tmp/cirrus-ci-build/build/testrun/nbtree/isolation/results/backwards.out   
2025-07-08 15:01:24.642562000 +
@@ -1,24 +1,7 @@
 Parsed test spec with 2 sessions
 
 starting permutation: b_scan i_insert i_detach
-step b_scan: SELECT * FROM backwards_scan_tbl WHERE col % 100 = 1 ORDER BY col 
DESC; 
-step i_insert: INSERT INTO backwards_scan_tbl SELECT i FROM 
generate_series(-2000, 700) i;
-step i_detach: 
-  SELECT injection_points_detach('lock-and-validate-left');
-  SELECT injection_points_wakeup('lock-and-validate-left');
-
-injection_points_detach

-   
-(1 row)
-
-injection_points_wakeup

-   
-(1 row)
-
-backwards_scan_session: NOTICE:  notice triggered for injection point 
lock-and-validate-new-lastcurrblkno
-step b_scan: <... completed>
+step b_scan: SELECT * FROM backwards_scan_tbl WHERE col % 100 = 1 ORDER BY col 
DESC;
 col
 ---
 601
@@ -30,3 +13,14 @@
   1
 (7 rows)
 
+step i_insert: INSERT INTO backwards_scan_tbl SELECT i FROM 
generate_series(-2000, 700) i;
+step i_detach: 
+  SELECT injection_points_detach('lock-and-validate-left');
+  SELECT injection_points_wakeup('lock-and-validate-left');
+
+injection_points_detach
+---
+   
+(1 row)
+
+ERROR:  could not find injection point lock-and-validate-left to wake up


0001-injection-points.patch
Description: Binary data


Re: like pg_shmem_allocations, but fine-grained for DSM registry ?

2025-07-08 Thread Nathan Bossart
On Tue, Jul 08, 2025 at 07:17:54PM +0800, Florents Tselai wrote:
> I can't see how it's possible to get the actual size for the dsa and dsh case,
> without attaching and then using, dsa_get_total_size on the attached dsa.
> And I don't think we wanna do that.
> 
> Instead maybe we can just report NULL for the dsa and dsh cases? 

Yeah, I'm not sure what else we can do about that without a bunch of
refactoring work on the DSA/dshash side, and it's probably not worth the
effort, anyway.

-- 
nathan




Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Burd, Greg


> On Jul 7, 2025, at 7:38 PM, Michael Paquier  wrote:
> 
> I have also pushed this v2 on this branch, so feel free to grab it if
> that makes your life easier:
> https://github.com/michaelpq/postgres/tree/toast_64bit_v2
> --
> Michael


Thank you for spending time digging into this and for the well structured patch 
set (and GitHub branch which I personally find helpful).  This $subject is 
important on its own, but even more so in the broader context of the zstd/dict 
work [1] and also allowing for innovation when it comes to how externalized 
Datum are stored.   The current model for toasting has served the community 
well for years, but I think that Hannu [2] and Nikita and others have promising 
ideas that should be allowable without forcing core changes.  I've worked a bit 
in this area too, I re-based the Pluggable TOAST work by Nikita [3] onto 18devel 
earlier this year as I was looking for a way to implement a toaster for a 
custom type.

All that aside, I think you're right to tackle this one step at a time and try 
not to boil too much of the ocean at once (the patch set is already large 
enough).  With that in mind I've read once or twice over your changes and have 
a few basic comments/questions.


v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid

This set of changes make sense and as you say are mechanical in nature, no real 
comments other than I think that using uint64 rather than Oid is the right call 
and addresses #2 on Tom's list.


v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code

I like this as well, clarifies the code and reduces repetition.


v2-0003 Refactor external TOAST pointer code for better pluggability

+ * For now there are only two types, all defined in this file. For now this
+ * is the maximum value of vartag_external, which is a historical choice.

This provides a bridge for compatibility, but doesn't open the door to a truly 
pluggable API.  I'm guessing the goal is incremental change rather than 
wholesale rewrite.

+ * The different kinds of on-disk external TOAST pointers. divided by
+ * vartag_external.

Extra '.' in "TOAST pointers. divided" I'm guessing.


v2-0004 Introduce new callback to get fresh TOAST values
v2-0005 Add catcache support for INT8OID
v2-0006 Add GUC default_toast_type

Easily understood, good progression of supporting changes.


v2-0007 Introduce global 64-bit TOAST ID counter in control file

Do you have any concern that this might become a bottleneck when there are many 
relations and many backends all contending for a new id?  I'd imagine that this 
would show up in a flame graph, but I think your test focused on the read side 
detoast_attr_slice() rather than insert/update and contention on the shared 
counter.  Would this be even worse on NUMA systems?


v2-0008 Switch pg_column_toast_chunk_id() return value from oid to bigint
v2-0009 Add support for bigint TOAST values
v2-0010 Add tests for TOAST relations with bigint as value type
v2-0011 Add support for TOAST table types in pg_dump and pg_restore
v2-0012 Add new vartag_external for 8-byte TOAST values
V2-0013 amcheck: Add test cases for 8-byte TOAST values

I read through each of these patches, I like the break down and the attention 
to detail.  The inclusion of good documentation at each step is helpful.  Thank 
you.


Thanks for the flame graphs examining a heavy detoast_attr_slice() workload.  I 
agree that there is little or no difference between them which is nice.

I think the only call out Tom made [4] that isn't addressed was the ask for 
localized ID selection.  That may make sense at some point, especially if there 
is contention on GetNewToastId().  I think that case is worth a separate 
performance test, something with a large number of relations and backends all 
performing a lot of updates generating a lot of new IDs.  What do you think?

As for adding even more flexibility, I see the potential to move in that 
direction over time with this as a good focused incremental set of changes that 
address a few important issues now.


Really excited by this work, thank you.

-greg


[1] https://commitfest.postgresql.org/patch/5702/
[2] "Yes, the idea is to put the tid pointer directly in the varlena
external header and have a tid array in the toast table as an extra
column. If all of the TOAST fits in the single record, this will be
empty, else it will have an array of tids for all the pages for this
toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025
[3] https://postgr.es/m/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru
[4] https://postgr.es/m/764273.1669674269%40sss.pgh.pa.us





Re: Adding basic NUMA awareness

2025-07-08 Thread Andres Freund
Hi,

On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
> On 7/8/25 05:04, Andres Freund wrote:
> > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> > The reason it would be advantageous to put something like the procarray onto
> > smaller pages is that otherwise the entire procarray (unless particularly
> > large) ends up on a single NUMA node, increasing the latency for backends on
> > every other numa node and increasing memory traffic on that node.
> > 
> 
> That's why the patch series splits the procarray into multiple pieces,
> so that it can be properly distributed on multiple NUMA nodes even with
> huge pages. It requires adjusting a couple places accessing the entries,
> but it surprised me how limited the impact was.

Sure, you can do that, but it does mean that iterations over the procarray now
have an added level of indirection...


> The thing I'm not sure about is how much this actually helps with the
> traffic between node. Sure, if we pick a PGPROC from the same node, and
> the task does not get moved, it'll be local traffic. But if the task
> moves, there'll be traffic. I don't have any estimates how often this
> happens, e.g. for older tasks.

I think the most important bit is to not put everything onto one numa node,
otherwise the chance of increased latency for *everyone* due to the increased
memory contention is more likely to hurt.

Greetings,

Andres Freund




Re: Move the injection_points extension to contrib?

2025-07-08 Thread Antonin Houska
Hayato Kuroda (Fujitsu)  wrote:

> Dear Antonin,
> 
> > Having the extension in contrib would be useful in cases a 3rd party 
> > extension
> > uses injection points in its regression tests. In particular, it's more
> > practical for CI to install the "injection_points" extension by installing 
> > the
> > "contrib" binary package than by building the whole server from source. 
> > (AFAIK
> > the src/modules/injection_points directory is currently not included in any
> > package.)
> 
> IIRC, one of the motivations for putting it under src/test/modules is to
> ensure flexibility.
> Even contrib modules must follow the rules [1]. E.g., ABI must be kept and API
> changes should be considered carefully - commits like [2] may be restricted.
> Also, we may even have to consider whether the names of injection points
> should be kept or not.
> 
> How do you feel?
> 
> [1]: 
> https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-GUIDANCE-API-COMPATIBILITY
> [2]: 
> https://github.com/postgres/postgres/commit/f4af4515bb5f3591d49bc16b6cb8ddbf66f98374

ok, I assume you mean that the requirement for ABI/API stability would make it
hard to include tests for fixes like [2] in minor releases. Thanks for the
explanation.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach

2025-07-08 Thread Cédric Villemain


On 7/8/25 03:55, Cédric Villemain wrote:

Hi Andres,


Hi,

On 2025-07-05 07:09:00 +, Cédric Villemain wrote:

In my work on more careful PostgreSQL resource management, I've come
to the
conclusion that we should avoid pushing policy too deeply into the
PostgreSQL core itself. Therefore, I'm quite skeptical about integrating
NUMA-specific management directly into core PostgreSQL in such a way.


I think it's actually the opposite - whenever we pushed stuff like this
outside of core it has hurt postgres substantially. Not having
replication in
core was a huge mistake. Not having HA management in core is probably the
biggest current adoption hurdle for postgres.

To deal better with NUMA we need to improve memory placement and various
algorithms, in an interrelated way - that's pretty much impossible to do
outside of core.


Except the backend pinning which is easy to achieve, thus my comment on
the related patch.
I'm not claiming NUMA memory and all should be managed outside of core
(though I didn't read other patches yet).



But an "optimal backend placement" seems to very much depend on where we
placed the various pieces of shared memory. Which the external module
will have trouble following, I suspect.

I still don't have any idea what exactly would the external module do,
how would it decide where to place the backend. Can you describe some
use case with an example?

Assuming we want to actually pin tasks from within Postgres, what I
think might work is allowing modules to "advise" on where to place the
task. But the decision would still be done by core.


Possibly exactly what you're doing in proc.c when managing the allocation of
processes, but not hardcoded in PostgreSQL (patches 02, 05 and 06 are good
candidates); I didn't get the impression that they require information that
is unavailable to a process executing code from a module.


Parts of your code where you assign/define policy could be in one or 
more relevant routines of a "numa profile manager", like in an 
initProcessRoutine(), and registered in pmroutine struct:


pmroutine = GetPmRoutineForInitProcess();
if (pmroutine != NULL &&
    pmroutine->init_process != NULL)
    pmroutine->init_process(MyProc);

This way it's easier to manage alternative policies, and also to be able 
to adjust when hardware and linux kernel changes.
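
For concreteness, the hook table itself could start as small as this (names
purely illustrative, matching the snippet above):

/* Purely illustrative - a minimal "numa profile manager" hook table */
typedef struct PmRoutine
{
    /* called once per backend; may pin the process or pick a NUMA node */
    void        (*init_process) (PGPROC *proc);
} PmRoutine;

extern PmRoutine *GetPmRoutineForInitProcess(void);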



--
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D





Re: like pg_shmem_allocations, but fine-grained for DSM registry ?

2025-07-08 Thread Florents Tselai



> On 8 Jul 2025, at 11:21 PM, Nathan Bossart  wrote:
> 
> On Tue, Jul 08, 2025 at 07:17:54PM +0800, Florents Tselai wrote:
>> I can't see how it's possible to get the actual size for the dsa and dsh 
>> case,
>> without attaching and then using, dsa_get_total_size on the attached dsa.
>> And I don't think we wanna do that.
>> 
>> Instead maybe we can just report NULL for the dsa and dsh cases?
> 
> Yeah, I'm not sure what else we can do about that without a bunch of
> refactoring work on the DSA/dshash side, and it's probably not worth the
> effort, anyway.


I agree. In that case I think v5 should be enough. 



Re: Logical replication prefetch

2025-07-08 Thread Amit Kapila
On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik  wrote:
>
> There is well known Postgres problem that logical replication subscriber
> can not caught-up with publisher just because LR changes are applied by
> single worker and at publisher changes are made by
> multiple concurrent backends. The problem is not logical replication
> specific: physical replication stream is also handled by single
> walreceiver. But for physical replication Postgres now implements
> prefetch: looking at WAL record blocks it is quite easy to predict which
> pages will be required for redo and prefetch them. With logical
> replication situation is much more complicated.
>
> My first idea was to implement parallel apply of transactions. But to do
> it we need to track dependencies between transactions. Right now
> Postgres can apply transactions in parallel, but only if they are
> streamed  (which is done only for large transactions) and serialize them
> by commits. It is possible to enforce parallel apply of short
> transactions using `debug_logical_replication_streaming` but then
> performance is ~2x times slower than in case of sequential apply by
> single worker.
>

What is the reason of such a large slow down? Is it because the amount
of network transfer has increased without giving any significant
advantage because of the serialization of commits?

> By removing serialization by commits, it is possible to
> speedup apply 3x times and make subscriber apply changes faster then
> producer can produce them even with multiple clients. But it is possible
> only if transactions are independent and it can be enforced only by
> tracking dependencies which seems to be very non-trivial and invasive.
>
> I still do not completely give up with tracking dependencies approach,
> but decided first to try more simple solution - prefetching.
>

Sounds reasonable, but in the long run, we should track transaction
dependencies and allow parallel apply of all the transactions.

 It is
> already used for physical replication. Certainly in case of physical
> replication it is much simpler, because each WAL record contains list of
> accessed blocks.
>
> In case of logical replication prefetching can be done either by
> prefetching access to replica identity index (usually primary key),
> either by executing replication command by some background worker
> Certainly first case is much more easy.
>

It seems there is only one case described, so what exactly are you
referring to first and second?

 We just perform index lookup in
> prefetch worker and it loads accessed index and heap pages in shared
> buffer, so main apply worker does not need to read something from disk.
> But it works well only for DELETE and HOT UPDATE operations.
>
> In the second case we normally execute the LR command in background
> worker and then abort transaction. Certainly in this case we are doing
> the same work twice. But assumption is the same: parallel prefetch
> workers should load affected pages, speeding up work of the main apply
> worker.
>
> I have implemented some PoC (see attached patch). And get first results
> of efficiency of such prefetching.
>
> *** First scenario (update-only).
>
> Publisher:
> ```
> create table t(pk integer primary key, counter integer, filler text
> default repeat('x', 1000)) with (fillfactor=10);
> insert into t values (generate_series(1,10), 0);
> create publication pub1 for table t;
> ```
>
> Subscriber:
> ```
> create table t(pk integer primary key, counter integer, filler text
> default repeat('x', 1000)) with (fillfactor=10);
> create subscription sub1 connection 'port=54321 dbname=postgres'
> publication pub1;
> ```
>
> Then I wait until replication is synced, stop subscriber and do random
> dot updates in 10 sessions at publisher:
>
> ```
> pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres
> ```
>
> where update.sql is:
>
> ```
> \set pk random(1, 10)
> update t set counter=counter+1 where pk=:pk;
> ```
>
> Then I start subscriber and measure how much time is needed for it to
> caught up.
> Results:
>
> no prefetch: 2:00 min
> prefetch (replica identity only): 0:55 min
> prefetch (all): 1:10 min
>
> This is definitely the best case for replica-identity index only
> prefetch (update-only and no other indexes).
> How to interpret this results?
>
> Without prefetch applying updates takes about two times  more at
> subscriber than performing this updates at publisher.
> It means that under huge workload subscriber has no chances to caught up.
>
> With prefetching replica identity index, apply time is even smaller than
> time needed to perform updates at publisher.
> Performing the whole operation and transaction abort certainly adds more
> overhead. But still improvement is quite significant.
>
> Please also notice that this results were obtains at the system with
> larger amount of RAM (64Gb) and fast SSD.
> With data set not fitting in RAM and much slower disks, the difference
> is expected to be mo

Re: [V2] Adding new CRC32C implementation for IBM S390X

2025-07-08 Thread Eduard Stefes
Hi,

here is V3 of the patch. Changes from V2:

- removed IBM copyright
- removed GETAUXVAL check in favor of the already provided check
- moved runtime selection code from pg_crc32c_s390x_choose.c to
pg_crc32c_s390x.c and removed _choose.c file
- removed the config time compiler check and let the buildsystem fall
back to sb8
- changed buffer limit back to 32 bytes before calling s390x specific
implementation 


-- 
Eduard Stefes 





From 2cd3866c4f9b5c3e5b08f36df2eeb0f2928cfbbf Mon Sep 17 00:00:00 2001
From: "Eddy (Eduard) Stefes" 
Date: Tue, 15 Apr 2025 10:22:05 +0200
Subject: [PATCH v3] Added crc32c extension for ibm s390x based on VX
 intrinsics

---
 config/c-compiler.m4 |  39 +
 configure| 221 +-
 configure.ac |  53 ++-
 meson.build  |  34 
 src/include/pg_config.h.in   |   6 +
 src/include/port/pg_crc32c.h |  25 +++
 src/port/Makefile|   5 +-
 src/port/meson.build |   6 +
 src/port/pg_crc32c_s390x.c   | 297 +++
 9 files changed, 674 insertions(+), 12 deletions(-)
 create mode 100644 src/port/pg_crc32c_s390x.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5f3e1d1faf9..73fbbf6362e 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -684,6 +684,45 @@ fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
 
+# PGAC_S390X_VECTOR_VX_INTRINSICS
+# 
+# Check if the compiler supports the S390X vector intrinsics, using
+# __attribute__((vector_size(16))), vec_gfmsum_accum_128
+#
+# These instructions where introduced with -march=z13.
+# the test arg1 is mandatory and should be either:
+# '-fzvector' for clang
+# '-mzarch' for gcc
+#
+# compilation may fail with GCC 14.0.0 <= gcc_version < 14.3.0 and Clang 18.0.0 <= clang_version < 19.1.2
+#
+# If the intrinsics are supported, sets
+# pgac_s390x_vector_intrinsics, and CFLAGS_CRC.
+AC_DEFUN([PGAC_S390X_VECTOR_VX_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_s390x_vector_intrinsics_$1_$2])])dnl
+AC_CACHE_CHECK([for __attribute__((vector_size(16))), vec_gfmsum_accum_128 with CFLAGS=$1 $2], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1 $2"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include 
+unsigned long long a __attribute__((vector_size(16))) = { 0 };
+unsigned long long b __attribute__((vector_size(16))) = { 0 };
+unsigned char c __attribute__((vector_size(16))) = { 0 };
+static void vecint_gfmsum_accum_test(void){
+c = vec_gfmsum_accum_128(a, b, c);
+}],
+  [
+   vecint_gfmsum_accum_test();
+   return 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_CRC="$1 $2"
+  AS_TR_SH([pgac_s390x_vector_intrinsics_$1_$2])=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_S390X_VECTOR_VX_INTRINSICS
+
 # PGAC_XSAVE_INTRINSICS
 # -
 # Check if the compiler supports the XSAVE instructions using the _xgetbv
diff --git a/configure b/configure
index 4f15347cc95..aa06303019b 100755
--- a/configure
+++ b/configure
@@ -18088,6 +18088,187 @@ if test x"$pgac_cv_loongarch_crc32c_intrinsics" = x"yes"; then
 fi
 
 
+# Check for S390X Vector intrinsics to do CRC calculations.
+#
+# First check for the host cpu
+if test x"$host_cpu" = x"s390x" && test x"$ac_cv_func_getauxval" = x"yes"; then
+  # Check for all possible cflag combinations
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __attribute__((vector_size(16))), vec_gfmsum_accum_128 with CFLAGS=-fzvector " >&5
+$as_echo_n "checking for __attribute__((vector_size(16))), vec_gfmsum_accum_128 with CFLAGS=-fzvector ... " >&6; }
+if ${pgac_cv_s390x_vector_intrinsics__fzvector_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -fzvector "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include 
+unsigned long long a __attribute__((vector_size(16))) = { 0 };
+unsigned long long b __attribute__((vector_size(16))) = { 0 };
+unsigned char c __attribute__((vector_size(16))) = { 0 };
+static void vecint_gfmsum_accum_test(void){
+c = vec_gfmsum_accum_128(a, b, c);
+}
+int
+main ()
+{
+
+   vecint_gfmsum_accum_test();
+   return 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_s390x_vector_intrinsics__fzvector_=yes
+else
+  pgac_cv_s390x_vector_intrinsics__fzvector_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_s390x_vector_intrinsics__fzvector_" >&5
+$as_echo "$pgac_cv_s390x_vector_intrinsics__fzvector_" >&6; }
+if test x"$pgac_cv_s390x_vector_intrinsics__fzvector_" = x"yes"; then
+  CFLAGS_CRC="-fzvector "
+  pgac_s390x_vector_intrinsics__fzvector_=yes
+fi
+
+  if test x"$pgac_s390x_vecto

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2025-07-08 Thread Ilia Evdokimov

On 07.07.2025 16:49, Andrei Lepikhov wrote:

Exposing internal information about the estimation of the number of 
groups in the Memoise node, shouldn't we do the same in even more 
vague cases, such as IncrementalSort? For example, in [1] (see its 
attachment), I observe that IncrementalSort is considerably better 
than Sort, but has a larger cost. It would be helpful to understand if 
an incorrect ngroups estimation causes this.


LGTM.

If we still don't expose any information for nodes like IncrementalSort 
that would help explain why the planner chose them, then I fully agree - 
we definitely should add it.


I suggest discussing that in a separate thread.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.





Re: Inconsistent LSN format in pg_waldump output

2025-07-08 Thread Álvaro Herrera
On 2025-Jul-07, Japin Li wrote:

> Thank you for pushing this patch.  I noticed some other areas in the
> documentation that could also use an update.

Thanks!  Pushed these fixes.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.




Re: [PATCH] Support for basic ALTER TABLE progress reporting.

2025-07-08 Thread jian he
hi.
within ATRewriteTable we have:
    pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
                                 RelationGetNumberOfBlocks(oldrel));
    pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
                                 heapScan->rs_nblocks);

The PROGRESS_CLUSTER_TOTAL_HEAP_BLKS value is fixed, so we only need to call
pgstat_progress_update_param once here?


another patch [1] is expected to refactor pg_stat_progress_cluster a lot, so I'm
unsure whether it's a good idea to put CLUSTER, VACUUM FULL, or ALTER TABLE into
pg_stat_progress_cluster.
alternatively, we could introduce a separate progress report specifically for
ALTER TABLE, allowing us to distinguish between table rewrite and table
scan.

[1] https://commitfest.postgresql.org/patch/5117




Re: Add os_page_num to pg_buffercache

2025-07-08 Thread Mircea Cadariu
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi Bertrand, 

Just tried out your patch, nice work, thought to leave a review as well.

Patch applied successfully on top of commit a27893df4 in master. 
Ran the tests in pg_buffercache and they pass including the new ones. 

Running "pagesize" on my laptop returns 16384. 

test=# SELECT current_setting('block_size');
 current_setting 
-----------------
 8192
(1 row)

Given the above, the results are as expected: 

test=# select * from pg_buffercache_os_pages;
 bufferid | os_page_num 
----------+-------------
        1 |           0
        2 |           0
        3 |           1
        4 |           1
        5 |           2
        6 |           2

I have noticed that pg_buffercache_os_pages would be the 3rd function 
which follows the same high-level structure (others being pg_buffercache_pages 
and pg_buffercache_numa_pages). I am wondering if this would be let's say 
"strike three" - time to consider extracting out a high-level "skeleton" 
function, 
with a couple of slots which would then be provided by the 3 variants. 

Kind regards,
Mircea

The new status of this patch is: Waiting on Author


Re: array_random

2025-07-08 Thread Aleksander Alekseev
Hi,

> it seems not trivial to wrap up all the generated random values into a 
> specific
> multi-dimensional array (more than 2 dimensions).
> for example, say we generated 24 random values and wanted to arrange them 
> into a
> 3-dimensional array with shape [4, 3, 2].
> we can easily use:
> SELECT array_random(1, 6, array[4,3, 2]);
>
> of course, we can use plpgsql to do it, but the c function would be
> more convenient.
> does this make sense?

The proposed function seems to do two things at a time - generating
random values and transforming them into an array of desired
dimensions. Generally we try to avoid such interfaces. Can you think
of something like array_transform() / array_reshape() that takes an
arbitrary single-dimension array and modifies it?
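
Just to illustrate that the reshaping half is doable on its own today
(array_reshape itself stays hypothetical here): array_agg() over equal-length
arrays builds an array of one higher dimension, so 24 values can be folded
into shape [4,3,2] in two steps with plain SQL:

WITH src AS (
    SELECT array_agg(g) AS a FROM generate_series(1, 24) g
),
pairs AS (                     -- 12 rows of 2 elements each
    SELECT (n - 1) / 6 + 1 AS plane,
           ((n - 1) % 6) / 2 + 1 AS row_no,
           array_agg(a[n] ORDER BY n) AS pair
    FROM src, generate_series(1, 24) n
    GROUP BY 1, 2
),
planes AS (                    -- 4 rows, each a [3][2] array
    SELECT plane, array_agg(pair ORDER BY row_no) AS mat
    FROM pairs
    GROUP BY plane
)
SELECT array_agg(mat ORDER BY plane) AS reshaped   -- final [4][3][2] array
FROM planes;

A dedicated array_reshape() would just hide this folding behind a nicer
interface, independent of where the input values came from.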




Re: Small optimization with expanding dynamic hash table

2025-07-08 Thread cca5507
> One thing to note is that in this scenario, there is no safeguard if
> the hashvalue is 0x111 and new_bucket is 0x110.

But the hash table is already corrupted if a hash value of 0x111 is found in
old_bucket 0x010; every hash value in an old_bucket should satisfy: hashvalue &
low_mask == old_bucket's number.


--
Regards,
ChangAo Chen

Re: Adding wait events statistics

2025-07-08 Thread Bertrand Drouvot
Hi,

On Tue, Jul 08, 2025 at 04:46:28AM +, Bertrand Drouvot wrote:
> On Tue, Jul 08, 2025 at 09:36:53AM +0900, Michael Paquier wrote:
> > Yes, you may need some level of meta-data generated for the wait
> > classes generated when the perl script generating this data is run.
> > It would be nice to the footprint of the code generated minimal if we
> > can.  It's one of these things where I would try both approaches,

I did some testing and experiments, and I think a "hybrid" approach mixing
multiple arrays and one "global" one is the best approach.

So, something like:

"
ClassInfo LWLockClassInfo = {
.numberOfEvents = NB_LWLOCK_WAIT_EVENTS,
.pendingCounts = PendingWaitEventStats.lwlock_counts
};

WaitClassInfo LockClassInfo = {
.numberOfEvents = NB_LOCK_WAIT_EVENTS,
.pendingCounts = PendingWaitEventStats.lock_counts
};
.
.
.
.
"

and then the "global" one:

"
WaitClassInfo *AllWaitClasses[] = {
NULL,  /* gap - no class with ID 0 */
&LWLockClassInfo,
NULL,  /* gap - no class with ID 2 */
&LockClassInfo,
&BufferPinClassInfo,
&ActivityClassInfo,
&ClientClassInfo,
&ExtensionClassInfo,
&IPCClassInfo,
&TimeoutClassInfo,
&IOClassInfo,
};
"

That way:

- we do not need a mapping between ClassID and "Arrays", and the gaps
in the ClassIDs are handled. So we can directly access the class of interest
knowing its classID.
- we can easily iterate (again no mapping needed) over all the pending stats
thanks to AllWaitClasses[]
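
For illustration, the increment path I have in mind would then be as simple
as the following (assuming the usual wait_event_info encoding, class in the
top byte and event ID in the low 16 bits, and that the event ID maps directly
to the array slot):

"
static inline void
waitevent_pending_inc(uint32 wait_event_info)
{
    uint32          classId = wait_event_info >> 24;
    uint32          eventId = wait_event_info & 0x0000FFFF;
    WaitClassInfo  *info;

    if (classId >= lengthof(AllWaitClasses))
        return;

    info = AllWaitClasses[classId];
    if (info != NULL && eventId < info->numberOfEvents)
        info->pendingCounts[eventId]++;
}
"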

What do you think? It looks like the most elegant approach to me.

Bonus point, as you can see above, while working on this approach I've been able
to remove storing the className and the array of eventName per wait class (would
have worked in v1 too).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: What is a typical precision of gettimeofday()?

2025-07-08 Thread Tom Lane
Hannu Krosing  writes:
> On Mon, Jul 7, 2025 at 11:38 PM Tom Lane  wrote:
>> What do you think of instead specifying the limit as the maximum
>> running-percentage to print, with a default of say 99.99%?  That
>> gives me results like

> I agree that percentage covered is a much better metric indeed.
> And I am equally ok with a default of either 99.9% or 99.99%.

OK, pushed after a bit more fooling with the documentation.

regards, tom lane




Re: What is a typical precision of gettimeofday()?

2025-07-08 Thread Tom Lane
BTW, returning to the original topic of this thread:

The new exact-delays table from pg_test_timing is really quite
informative.  For example, on my M4 Macbook:

Observed timing durations up to 99.9900%:
      ns   % of total  running %      count
       0      62.2124    62.2124  118127987
      41      12.5826    74.7951   23891661
      42      25.1653    99.9604   47783489
      83       0.0194    99.9797      36744
      84       0.0096    99.9893      18193
     125       0.0020    99.9913       3784
...
   31042       0.0000   100.0000          1

The fact that the clock tick is about 40ns is extremely 
obvious in this presentation.

Even more interesting is what I got from an ancient PPC Macbook
(mamba's host, running NetBSD):

Testing timing overhead for 3 seconds.
Per loop time including overhead: 731.26 ns
...
Observed timing durations up to 99.9900%:
      ns   % of total  running %    count
     705      39.9162    39.9162  1637570
     706      17.6040    57.5203   722208
     759      18.6797    76.2000   766337
     760      23.7851    99.9851   975787
     813       0.0002    99.9853        9
     814       0.0004    99.9857       17
     868       0.0001    99.9858        4
     922       0.0001    99.9859        3
...
  564950       0.0000   100.0000        1

I had previously reported that that machine had microsecond
timing precision, but this shows that the real problem is that
it takes most of a microsecond to go 'round the timing loop.
It seems clear that the system clock ticks about every 50ns,
even on this decades-old hardware.

regards, tom lane




Re: What is a typical precision of gettimeofday()?

2025-07-08 Thread Andrey Borodin



> On 8 Jul 2025, at 23:07, Tom Lane  wrote:
> 
> The fact that the clock tick is about 40ns is extremely 
> obvious in this presentation.

FWIW, while working on UUID v7, Masahiko found that if we try to read real
time with clock_gettime(CLOCK_REALTIME, ...), the measurement is truncated
to microseconds.


Best regards, Andrey Borodin.



Re: Unnecessary delay in streaming replication due to replay lag

2025-07-08 Thread sunil s
Hello Hackers,

I recently had the opportunity to continue the effort originally led by a
valued contributor.
I’ve addressed most of the previously reported feedback and issues, and
would like to share the updated patch with the community.

IMHO starting the WAL receiver eagerly offers significant advantages, for the
following reasons:

   1. If recovery_min_apply_delay is set high (for various operational
      reasons) and the primary crashes, the mirror can recover quickly,
      thereby improving overall High Availability.

   2. For setups without archive-based recovery, restore and recovery
      operations complete faster.

   3. When synchronous_commit is enabled, faster mirror recovery reduces
      offline time and helps avoid prolonged commit/query wait times during
      failover/recovery.

   4. This approach also improves resilience by limiting the impact of
      network interruptions on replication.


> In common cases, I believe archive recovery is faster than
> replication. If a segment is available from archive, we don't need to
> prefetch it via stream.

I completely agree — restoring from the archive is significantly faster
than streaming.
 Attempting to stream from the last available WAL in the archive would
introduce complexity and risk.
Therefore, we can limit this feature to crash recovery scenarios and skip
it when archiving is enabled.

> The "FATAL: could not open file" message from walreceiver means that
> the walreceiver was operationally prohibited to install a new wal
> segment at the time.

This was caused by an additional fix added in upstream to address a race
condition between the archiver and checkpointer.
It has been resolved in the latest patch, which also includes a TAP test to
verify the fix. Thanks for testing and bringing this to our attention.
For now we will skip the WAL receiver early start, since enabling write
access for the walreceiver would reintroduce the bug that commit
cc2c7d65fc27e877c9f407587b0b92d46cd6dd16 fixed previously.


I've attached the rebased patch with the necessary fix.

Thanks & Regards,
Sunil S (Broadcom)


On Tue, Jul 8, 2025 at 11:01 AM Kyotaro Horiguchi 
wrote:

> At Wed, 15 Dec 2021 17:01:24 -0800, Soumyadeep Chakraborty <
> soumyadeep2...@gmail.com> wrote in
> > Sure, that makes more sense. Fixed.
>
> As I played with this briefly.  I started a standby from a backup that
> has an access to archive.  I had the following log lines steadily.
>
>
> [139535:postmaster] LOG:  database system is ready to accept read-only
> connections
> [139542:walreceiver] LOG:  started streaming WAL from primary at 0/200
> on timeline 1
> cp: cannot stat '/home/horiguti/data/arc_work/00010003':
> No such file or directory
> [139542:walreceiver] FATAL:  could not open file
> "pg_wal/00010003": No such file or directory
> cp: cannot stat '/home/horiguti/data/arc_work/0002.history': No such
> file or directory
> cp: cannot stat '/home/horiguti/data/arc_work/00010003':
> No such file or directory
> [139548:walreceiver] LOG:  started streaming WAL from primary at 0/300
> on timeline 1
>
> The "FATAL:  could not open file" message from walreceiver means that
> the walreceiver was operationally prohibited to install a new wal
> segment at the time.  Thus the walreceiver ended as soon as started.
> In short, the eager replication is not working at all.
>
>
> I have a comment on the behavior and objective of this feature.
>
> In the case where archive recovery is started from a backup, this
> feature lets walreceiver start while the archive recovery is ongoing.
> If walreceiver (or the eager replication) worked as expected, it would
> write wal files while archive recovery writes the same set of WAL
> segments to the same directory. I don't think that is a sane behavior.
> Or, if putting more modestly, an unintended behavior.
>
> In common cases, I believe archive recovery is faster than
> replication.  If a segment is available from archive, we don't need to
> prefetch it via stream.
>
> If this feature is intended to use only for crash recovery of a
> standby, it should fire only when it is needed.
>
> If not, that is, if it is intended to work also for archive recovery,
> I think the eager replication should start from the next segment of
> the last WAL in archive but that would invite more complex problems.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
>
>
>
>


v6-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patch
Description: Binary data


v6-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patch
Description: Binary data


v6-0003-Test-archive-recovery-takes-precedence-over-strea.patch
Description: Binary data


Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Nikita Malakhov
Hi!

Greg, thanks for the interest in our work!

Michael, one more thing I forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18,
so here we get an array of 19 members with only 2 valid ones.

What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Logical replication prefetch

2025-07-08 Thread Konstantin Knizhnik


On 08/07/2025 2:51 pm, Amit Kapila wrote:

On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik wrote:

It is possible to enforce parallel apply of short
transactions using `debug_logical_replication_streaming` but then
performance is ~2x times slower than in case of sequential apply by
single worker.


What is the reason of such a large slow down? Is it because the amount
of network transfer has increased without giving any significant
advantage because of the serialization of commits?


No, I do not think that network traffic is somehow increased.
If I remove the locks (just by commenting out the bodies of the
`pa_lock_stream` and `pa_unlock_stream` functions and the call of
`pa_wait_for_xact_finish`), I get a 3x speed improvement (with 4 parallel
apply workers) compared with normal mode (when transactions are applied by
the main logical replication worker). So the main reason is lock
overhead/contention and de-facto serialization of transactions (in `top` I
see that only one worker is active most of the time).


Even with simulated 0.1msec read delay, results of update tests are the 
following:


normal mode: 7:40
forced parallel mode: 8:30
forced parallel mode (no locks): 1:45


By removing serialization by commits, it is possible to
speedup apply 3x times and make subscriber apply changes faster then
producer can produce them even with multiple clients. But it is possible
only if transactions are independent and it can be enforced only by
tracking dependencies which seems to be very non-trivial and invasive.

I still do not completely give up with tracking dependencies approach,
but decided first to try more simple solution - prefetching.


Sounds reasonable, but in the long run, we should track transaction
dependencies and allow parallel apply of all the transactions.


I agree.
I see two different approaches:

1. Build a dependency graph: track dependencies between xids when the
transactions are executed at the publisher and then include this graph in
the commit record.
2. Calculate a hash of the replica identity key and check that the data sets
of the transactions do not intersect (certainly will not work if there are
some triggers).
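
For (2), the check itself could be as small as something like this (just a
sketch with invented names, keeping a dynahash of replica identity key hashes
for the in-flight transactions):

/*
 * Sketch only: decide whether a new transaction can be handed to a parallel
 * apply worker, by checking that none of the replica identity key hashes it
 * touches belong to a still in-flight transaction.
 */
static bool
xact_conflicts_with_inflight(HTAB *inflight_keys,
                             const uint32 *new_keys, int nkeys)
{
    for (int i = 0; i < nkeys; i++)
    {
        bool        found;

        (void) hash_search(inflight_keys, &new_keys[i], HASH_FIND, &found);
        if (found)
            return true;        /* touches the same row, must serialize */
    }
    return false;
}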





  It is

already used for physical replication. Certainly in case of physical
replication it is much simpler, because each WAL record contains list of
accessed blocks.

In case of logical replication prefetching can be done either by
prefetching access to replica identity index (usually primary key),
either by executing replication command by some background worker
Certainly first case is much more easy.


It seems there is only one case described, so what exactly are you
referring to first and second?


First: perform a lookup in the replica identity index (primary key). It will
prefetch the index pages and the referenced heap page.
Second: execute the LR operation (insert/update) in a prefetch worker and
then roll back the transaction.




But what about worst cases where these additional pre-fetches could
lead to removing some pages from shared_buffers, which are required by
the workload on the subscriber? I think you try such workloads as
well.

It is the common problem of all prefetch algorithms: if the size of the cache
where prefetch results are stored (shared buffers, OS file cache, ...) is not
large enough to keep the prefetch result until it is used,
then prefetch will not provide any performance improvement and may even
cause some degradation.
So it is a really challenging task to choose the optimal time for the
prefetch operation: too early - and its results will be thrown away before
they are requested; too late - and the executor has to wait for prefetch
completion or load the page itself. Certainly there is some kind of
autotuning: a worker performing prefetch has to wait for IO completion, while
the executor, which picks up pages from the cache, processes requests faster
and so should catch up with the prefetch workers. Then it has to perform IO
itself and starts to fall behind the prefetch workers again.





I understand that it is just a POC, so you haven't figured out all the
details, but it would be good to know the reason of these deadlocks.


Will investigate it.



I wonder if such LR prefetching approach is considered to be useful?
Or it is better to investigate other ways to improve LR apply speed
(parallel apply)?


I think it could be a good intermediate step till we are able to find
a solution for tracking the dependencies. Do you think this work will
be useful once we have parallel apply, and if so how?



I think that once we have parallel apply, prefetch will not be needed.
At least that is what I see now in Neon (open-source serverless Postgres
which separates compute and storage).
We have implemented prefetch for seqscan and indexscan because of the
relatively large access latency with the page server. And it can really
significantly improve performance - 4 or even more times.
But the same effect can be achieved by forcing a parallel plan with a larger
number of parallel workers. Unfortunately the effect of these two
optimizations is not multiplied, so parallel plan + prefetch shows 
al

Tab completion for large objects

2025-07-08 Thread Dagfinn Ilmari Mannsåker
Hi hackers,

I noticed that psql's tab completion suggested TO immediately after
GRANT ... ON LARGE OBJECT, and not after ON LARGE OBJECT <oid>.  This is
because LARGE OBJECT is the only two-word object type, so it thinks
LARGE is the object type and OBJECT is the name.  Attached are three
patches that address this and other LO-related tab completion issues:

 1. Tab complete OBJECT after GRANT|REVOKE ... ON LARGE, and TO/FROM
after GRANT|REVOKE ... ON LARGE OBJECT <oid>

 2. Tab complete filenames after \lo_export <oid>

 3. Tab complete large object OIDs where relevant.  This is less useful
than completing names like for other objects, but it might still
be handy.  Separate patch in case it proves controversial or is
deemed useless.

- ilmari

>From b657b3d1a31af927231a7d51002d00d830f473b7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 8 Jul 2025 19:21:59 +0100
Subject: [PATCH 1/3] Improve tab completion around GRANT/REVOKE ... ON LARGE
 OBJECT

Because large objects are the only two-word object type, it was
completing with TO or FROM immediately after LARGE OBJECT, not after
having entered the OID.
---
 src/bin/psql/tab-complete.in.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 53e7d35fe98..fe97179b979 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4531,6 +4531,8 @@ match_previous_words(int pattern_id,
 			COMPLETE_WITH_VERSIONED_SCHEMA_QUERY(Query_for_list_of_functions);
 		else if (TailMatches("LANGUAGE"))
 			COMPLETE_WITH_QUERY(Query_for_list_of_languages);
+		else if (TailMatches("LARGE"))
+			COMPLETE_WITH("OBJECT");
 		else if (TailMatches("PROCEDURE"))
 			COMPLETE_WITH_VERSIONED_SCHEMA_QUERY(Query_for_list_of_procedures);
 		else if (TailMatches("ROUTINE"))
@@ -4589,9 +4591,11 @@ match_previous_words(int pattern_id,
 	else if (Matches("ALTER", "DEFAULT", "PRIVILEGES", MatchAnyN, "TO", MatchAny))
 		COMPLETE_WITH("WITH GRANT OPTION");
 	/* Complete "GRANT/REVOKE ... ON * *" with TO/FROM */
-	else if (Matches("GRANT", MatchAnyN, "ON", MatchAny, MatchAny))
+	else if (Matches("GRANT", MatchAnyN, "ON", MatchAny, MatchAny) ||
+			 Matches("GRANT", MatchAnyN, "ON", "LARGE", "OBJECT", MatchAny))
 		COMPLETE_WITH("TO");
-	else if (Matches("REVOKE", MatchAnyN, "ON", MatchAny, MatchAny))
+	else if (Matches("REVOKE", MatchAnyN, "ON", MatchAny, MatchAny) ||
+			 Matches("REVOKE", MatchAnyN, "ON", "LARGE", "OBJECT", MatchAny))
 		COMPLETE_WITH("FROM");
 
 	/* Complete "GRANT/REVOKE * ON ALL * IN SCHEMA *" with TO/FROM */
-- 
2.50.0

>From 20715236941ecb7884545fa70fadbd49a026dd3c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 8 Jul 2025 19:22:13 +0100
Subject: [PATCH 2/3] Tab complete filenames after \lo_export 

---
 src/bin/psql/tab-complete.in.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index fe97179b979..922895d0d0f 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5468,7 +5468,8 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_views);
 	else if (TailMatchesCS("\\cd|\\e|\\edit|\\g|\\gx|\\i|\\include|"
 		   "\\ir|\\include_relative|\\o|\\out|"
-		   "\\s|\\w|\\write|\\lo_import"))
+		   "\\s|\\w|\\write|\\lo_import") ||
+			 TailMatchesCS("\\lo_export", MatchAny))
 	{
 		completion_charp = "\\";
 		completion_force_quote = false;
-- 
2.50.0

>From a16c467898f5579f4961bdfe6db5dd3c52e5820a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 8 Jul 2025 19:23:00 +0100
Subject: [PATCH 3/3] Add tab completion for large object OIDs

For GRANT, REVOKE, \lo_unlink and \lo_export.

This is less useful than names, but might still be handy.
---
 src/bin/psql/tab-complete.in.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 922895d0d0f..227747d9897 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -978,6 +978,11 @@ static const SchemaQuery Query_for_trigger_of_table = {
 	.refnamespace = "c1.relnamespace",
 };
 
+static SchemaQuery Query_for_list_of_large_objects = {
+	.catname = "pg_catalog.pg_largeobject_metadata lom",
+	.result = "lom.oid::pg_catalog.text",
+};
+
 
 /*
  * Queries to get lists of names of various kinds of things, possibly
@@ -4552,6 +4557,9 @@ match_previous_words(int pattern_id,
 		else
 			COMPLETE_WITH("FROM");
 	}
+	else if (TailMatches("GRANT|REVOKE", MatchAny, "ON", "LARGE", "OBJECT") ||
+			 TailMatches("REVOKE", "GRANT", "OPTION", "FOR", MatchAny, "ON", "LARGE", "OBJECT"))
+		COMPLETE_WITH_SCHEMA_QUERY_VERBATIM(Query_for_list_of_large_objects);
 
 	/*
 	 * Complete "GRANT/REVOKE ... TO/FROM" with username, PUBLIC,
@@ -5406,6 +5414,8 @@ match_p

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Hannu Krosing
I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.

It will remove the need for any oid management and will also be
many, many orders of magnitude faster for large tables (just 2x faster
for in-memory small tables).

I plan to go over Michael's patch set here and see how much change is
needed to add the "direct toast" approach.

My goals are:
1. fast lookup by skipping the index lookup
2. making the toast pointer in the main heap as small as possible -
hopefully just the 6 bytes of the tid pointer - so that scans that do not
need toasted values get more tuples from each page
3. adding all the (optional) extra data to the toast chunk record, as
there we are free to add whatever is needed

Currently I plan to introduce something like this for the toast chunk record:

   Column    |  Type   | Storage | Notes
-------------+---------+---------+---------------------------------------------
 chunk_id    | oid     | plain   | 0 when not using the toast index; 0xfffe =
                                   non-deletable, for example when used as a
                                   dictionary for multiple toasted values
 chunk_seq   | integer | plain   | if not 0 when referenced from the toast
                                   pointer, then the toasted data starts at
                                   toast_pages[0] (or below it in that tree),
                                   which *must* have chunk_id = 0
 chunk_data  | bytea   | plain   |

-- added fields

 toast_pages | tid[]   | plain   | can be chained or make up a tree
 offsets     | int[]   | plain   | starting offsets of the toast_pages (octets
                                   or type-specific units); the upper bit
                                   indicates that a new compressed span starts
                                   at that offset, the 2nd highest bit
                                   indicates that the page is another tree page
 comp_method | int     | plain   | compression method used; maybe this should
                                   be an enum?
 dict_pages  | tid[]   | plain   | pages to use as the compression dictionary,
                                   up to N pages, one level

This seems to be flexible enough to allow for both compression and
efficient partial updates.
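
For what it's worth, here is a hypothetical sketch of the 6-byte on-heap
pointer this implies (the field layout just mirrors ItemPointerData; the
struct and helper names are mine, not from any posted patch):

#include <stdint.h>

/* Hypothetical "direct" TOAST pointer: just the TID of the first toast
 * chunk, 6 bytes, no value oid and no toast-index lookup needed. */
typedef struct DirectToastPointer
{
    uint16_t    bi_hi;      /* block number, high 16 bits */
    uint16_t    bi_lo;      /* block number, low 16 bits */
    uint16_t    posid;      /* line pointer offset within that block */
} DirectToastPointer;       /* same shape as ItemPointerData */

static inline uint32_t
direct_toast_block(const DirectToastPointer *p)
{
    return ((uint32_t) p->bi_hi << 16) | p->bi_lo;
}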

---
Hannu


On Tue, Jul 8, 2025 at 8:31 PM Nikita Malakhov  wrote:
>
> Hi!
>
> Greg, thanks for the interest in our work!
>
> Michael, one more thing forgot to mention yesterday -
> #define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
> static const toast_external_info 
> toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
> VARTAG_ONDISK_OID historically has a value of 18
> and here we got an array of 19 members with only 2 valid ones.
>
> What do you think about having an individual
> TOAST value id counter per relation instead of using
> a common one? I think this is a very promising approach,
> but a decision must be made where it should be stored.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/




Re: What is a typical precision of gettimeofday()?

2025-07-08 Thread Tom Lane
Hannu Krosing  writes:
> On Tue, Jul 8, 2025 at 8:07 PM Tom Lane  wrote:
>> Even more interesting is what I got from an ancient PPC Macbook
>> (mamba's host, running NetBSD):
>> 
>> Testing timing overhead for 3 seconds.
>> Per loop time including overhead: 731.26 ns
>> ...
>> Observed timing durations up to 99.9900%:
>> ns   % of total  running %    count
>> 705     39.9162    39.9162  1637570
>> 706     17.6040    57.5203   722208
>> 759     18.6797    76.2000   766337
>> 760     23.7851    99.9851   975787
>> 813      0.0002    99.9853        9
>> 814      0.0004    99.9857       17
>> 868      0.0001    99.9858        4
>> 922      0.0001    99.9859        3

> Do we have a fencepost error in the limit code so that it stops before
> printing out the 99.9900% limit row ?

No, I think what's happening there is that we get to NUM_DIRECT before
reaching the 99.99% mark.  Running the test a bit longer, I do get a
hit at the next plausible 50ns step:

$ ./pg_test_timing -d 10
Testing timing overhead for 10 seconds.
Per loop time including overhead: 729.79 ns
Histogram of timing durations:
   <= ns   % of total  running %      count
       0       0.0000     0.0000          0
       1       0.0000     0.0000          0
       3       0.0000     0.0000          0
       7       0.0000     0.0000          0
      15       0.0000     0.0000          0
      31       0.0000     0.0000          0
      63       0.0000     0.0000          0
     127       0.0000     0.0000          0
     255       0.0000     0.0000          0
     511       0.0000     0.0000          0
    1023      99.9879    99.9879   13700887
    2047       0.0000    99.9880          2
    4095       0.0063    99.9942        859
    8191       0.0019    99.9962        267
   16383       0.0017    99.9978        227
   32767       0.0012    99.9990        166
   65535       0.0001    99.9992         16
  131071       0.0007    99.9998         90
  262143       0.0000    99.9998          5
  524287       0.0001    99.9999         11
 1048575       0.0001   100.0000         10

Observed timing durations up to 99.9900%:
  ns   % of total  running %    count
 705      40.7623    40.7623  5585475
 706      17.9732    58.7355  2462787
 759      18.1392    76.8747  2485525
 760      23.1129    99.9876  3167060
 813       0.0000    99.9877        5
 814       0.0002    99.9878       23
 868       0.0000    99.9879        5
 869       0.0000    99.9879        1
 922       0.0000    99.9879        3
 923       0.0000    99.9879        2
 976       0.0000    99.9879        1
...
  625444   0.0000   100.0000        1

and the next step after that would be 1026 ns, which is past
the NUM_DIRECT array size.
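
To spell that out (a simplified sketch with made-up identifiers, not the
actual pg_test_timing code): the listing loop has two ways to stop, and on
this machine the array bound is reached before the percentage target.

#include <stdio.h>
#include <stdint.h>

#define NUM_DIRECT 1024                     /* assumed direct-histogram bound */

static uint64_t direct_count[NUM_DIRECT];   /* per-nanosecond counts */
static uint64_t total_count;                /* total number of samples */

static void
print_direct_durations(void)
{
    double      running_pct = 0.0;

    for (int ns = 0; ns < NUM_DIRECT; ns++)
    {
        double      pct;

        if (direct_count[ns] == 0)
            continue;
        pct = 100.0 * direct_count[ns] / total_count;
        running_pct += pct;
        printf("%8d %12.4f %10.4f %10llu\n",
               ns, pct, running_pct,
               (unsigned long long) direct_count[ns]);
        if (running_pct >= 99.99)
            return;             /* normal exit: percentage target reached */
    }
    /* Falling out here (ns == NUM_DIRECT) ends the table before 99.99%. */
}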

I considered raising NUM_DIRECT some more, but I think it'd be
overkill.  This machine is surely an order of magnitude slower
than anything anyone would consider of practical interest today.
Just for fun, though, I tried a run with NUM_DIRECT = 10240:

$ ./pg_test_timing -d 10
Testing timing overhead for 10 seconds.
Per loop time including overhead: 729.23 ns
Histogram of timing durations:
   <= ns   % of total  running %      count
       0       0.0000     0.0000          0
       1       0.0000     0.0000          0
       3       0.0000     0.0000          0
       7       0.0000     0.0000          0
      15       0.0000     0.0000          0
      31       0.0000     0.0000          0
      63       0.0000     0.0000          0
     127       0.0000     0.0000          0
     255       0.0000     0.0000          0
     511       0.0000     0.0000          0
    1023      99.9878    99.9878   13711494
    2047       0.0000    99.9878          5
    4095       0.0062    99.9941        854
    8191       0.0021    99.9962        289
   16383       0.0017    99.9979        236
   32767       0.0011    99.9990        153
   65535       0.0002    99.9992         24
  131071       0.0006    99.9998         85
  262143       0.0001    99.9999          8
  524287       0.0001    99.9999          9
 1048575       0.0001   100.0000          7
 2097151       0.0000   100.0000          0
 4194303       0.0000   100.0000          0
 8388607       0.0000   100.0000          0
16777215       0.0000   100.0000          0
33554431       0.0000   100.0000          1
67108863       0.0000   100.0000          1

Observed timing durations up to 99.9900%:
  ns   % of total  running %    count
 705      50.3534    50.3534  6905051
 706      22.1988    72.5522  3044153
 759      12.0613    84.6135  1653990
 760      15.3732    99.9867  2108150
 813       0.0000    99.9867        2
 814       0.0002    99.9869       27
 868       0.0006    99.9875       85
 869       0.0000    99.9876        2
 922       0.0001    99.9877       20
 923       0.0001    99.9878        9
 976       0.0000    99.9878 

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Michael Paquier
On Tue, Jul 08, 2025 at 08:54:33PM +0200, Hannu Krosing wrote:
> I still think we should go with direct toast tid pointers in varlena
> and not some kind of oid.
>
> It will remove the need for any oid management and also will be
> many-many orders of magnitude faster for large tables (just 2x faster
> for in-memory small tables)

There is also the point of backward compatibility.  We cannot just
replace things, and we have to provide a way for users to be able to
rely on the system so that upgrades are painless.  So we need to think
about the correct layer to use to maintain the legacy code behavior
while considering improvements.

> I plan to go over Michael's patch set here and see how much change is
> needed to add the "direct toast"

If you do not have a lot of time looking at the full patch set, I'd
recommend looking at 0003, files toast_external.h and
toast_external.c which include the key idea.  Adding a new external
TOAST pointer is then a two-step process:
- Add a new vartag_external.
- Add some callbacks to let the backend understand what it should do
with this new vartag_external.
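
To make the shape of that concrete, here is a hand-wavy sketch of such a
per-vartag callback table (the member names and signatures are invented for
illustration; the real toast_external_info in 0003 differs):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToastValueId;          /* wide enough for the 8-byte case */

typedef struct toast_external_info_sketch
{
    size_t      pointer_size;           /* on-disk size of this pointer format */

    /* decode the value identifier out of an external pointer */
    ToastValueId (*get_value_id) (const void *ptr);

    /* build a new external pointer for a freshly toasted value */
    void        (*make_pointer) (void *out, ToastValueId valueid,
                                 int32_t rawsize, int32_t extsize);
} toast_external_info_sketch;

/* One slot per vartag_external value; dispatch is a simple array lookup. */
extern const toast_external_info_sketch sketch_infos[];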

> My goals are:
> 1. fast lookup from skipping index lookup
> 2. making the toast pointer in main heap as small as possible -
> hopefully just the 6 bytes of tid pointer - so that scans that do not
> need toasted values get more tuples from each page
> 3. adding all (optional) the extra data into toast chunk record as
> there we are free to add whatever is needed
> Currently I plan to introduces something like this for toast chunk record

Points 2. and 3. are things that the refactoring should allow.  About
1., I have no idea how much you want to store in the external TOAST
pointers and how it affects the backend, but you could surely implement
an option that lets the backend know that it should still do index
lookups based on what the external TOAST pointer says, if this stuff
has benefits.

> This seems to be flexible enough to allow for both compressin and
> efficient partial updates

I don't really disagree with all that.  Now the history of the TOAST
threads points out that we've been good at proposing complex things, but
these had a high footprint.  What I'm proposing is lighter than that,
I think, tackling my core issue with the infra supporting backward
compatibility and the addition of more modes on top of it.
--
Michael


signature.asc
Description: PGP signature


Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

2025-07-08 Thread Michael Paquier
On Tue, Jul 08, 2025 at 09:31:29PM +0300, Nikita Malakhov wrote:
> Michael, one more thing forgot to mention yesterday -
> #define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
> static const toast_external_info
> toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
> VARTAG_ONDISK_OID historically has a value of 18
> and here we got an array of 19 members with only 2 valid ones.

Yeah, I'm aware of that.  The code is mostly to make it easier to read
while dealing with this historical behavior, even if it costs a bit
more in memory.  I don't think that it's a big deal, and we could
always have one more level of redirection to reduce its size.  Now
there's the extra complexity...

> What do you think about having an individual
> TOAST value id counter per relation instead of using
> a common one? I think this is a very promising approach,
> but a decision must be made where it should be stored.

I've thought about that, and decided to discard this idea for now to
keep the whole proposal simpler.  This has benefits if you have many
relations with few OIDs consumed, but it has a cost in itself as you
need to maintain the data for each TOAST relation.  When I looked at
the problems a couple of weeks ago, I came to the conclusion that all
the checkbox properties of "local" TOAST values would be fulfilled by a
sequence: WAL logging to ensure uniqueness, etc.  So I was even
considering adding some code to create sequences on-the-fly,
but in the end that was just more complexity, given how we define
sequences currently, compared to a unique 8-byte counter in the
control file that's good enough for a veeery long time.
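
As a rough sketch of what that 8-byte counter amounts to (identifiers are
invented; a real implementation would hold the relevant lock and WAL-log
the reserved ranges the same way GetNewObjectId() handles OIDs):

#include <stdint.h>

#define TOAST_VALUE_ID_PREFETCH 8192    /* reserve ids in batches this big */

static uint64_t next_toast_value_id;    /* would live in shared memory */
static uint64_t logged_upto;            /* highest id already covered by WAL */

static uint64_t
GetNewToastValueIdSketch(void)
{
    uint64_t    id;

    /* real code: acquire the lock protecting the counter here */
    id = next_toast_value_id++;
    if (id >= logged_upto)
    {
        /*
         * Real code: emit a WAL record reserving the next batch, so that
         * after a crash we restart beyond any id that may have been used.
         */
        logged_upto = id + TOAST_VALUE_ID_PREFETCH;
    }
    return id;
}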

I've also noticed that this sort of links to a piece I've implemented
last year and is still sitting in the CF app without much interest
from others: sequence AMs.  You could implement a "TOAST" sequence
method, for example, optimized for this purpose.  As a whole, I
propose to limit the scope of the proposal to the pluggability of the
external TOAST pointers.  The approach I've taken should allow such
improvements, these can be added later if really needed.
--
Michael


signature.asc
Description: PGP signature


Re: Move the injection_points extension to contrib?

2025-07-08 Thread Michael Paquier
On Tue, Jul 08, 2025 at 05:50:59PM +0200, Antonin Houska wrote:
> ok, I assume you mean that the requirement for ABI/API stability would make it
> hard to include tests for fixes like [2] in minor releases. Thanks for
> explanation.

You can think about it the same way as we do for regress.so, for
example.
--
Michael


signature.asc
Description: PGP signature


Re: Adding wait events statistics

2025-07-08 Thread Michael Paquier
On Tue, Jul 08, 2025 at 03:25:26PM +, Bertrand Drouvot wrote:
> So, something like:
> 
> ClassInfo LWLockClassInfo = {
> .numberOfEvents = NB_LWLOCK_WAIT_EVENTS,
> .pendingCounts = PendingWaitEventStats.lwlock_counts
> };
> 
> and then the "global" one:
> 
> WaitClassInfo *AllWaitClasses[] = {
> NULL,  /* gap - no class with ID 0 */
> &LWLockClassInfo,
> NULL,  /* gap - no class with ID 2 */
> &LockClassInfo,
> &BufferPinClassInfo,
> &ActivityClassInfo,
> &ClientClassInfo,
> &ExtensionClassInfo,
> &IPCClassInfo,
> &TimeoutClassInfo,
> &IOClassInfo,
> };

Okay.  I'm guessing that all this information is automatically built
based on the contents of wait_event_names.txt (with some tweaks for
the LWLock part?), so why not, even if there are gaps, as long as we
know the length of AllWaitClasses.  The stats code would be simple,
doing an O(n^2) pass across the array when flushing the individual
contents of each wait event.
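
Something along these lines, presumably (a sketch building on the structs
quoted above; pgstat_report_wait_event_count() is a placeholder name for
whatever the flush callback ends up being, and I'm assuming pendingCounts
is a plain array of 64-bit counters):

/* Sketch of the flush side: a dumb double loop over AllWaitClasses. */
for (int classId = 0; classId < lengthof(AllWaitClasses); classId++)
{
    WaitClassInfo *info = AllWaitClasses[classId];

    if (info == NULL)
        continue;               /* gap: no wait class with this ID */

    for (int event = 0; event < info->numberOfEvents; event++)
    {
        uint64      count = info->pendingCounts[event];

        if (count == 0)
            continue;
        pgstat_report_wait_event_count(classId, event, count);
        info->pendingCounts[event] = 0;
    }
}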

> That way:
> 
> - we do not need a mapping between ClassID and "Arrays", and the gaps
> in the ClassID are handled. So we can directly access the class of interest
> knowing its classID.
> - we can easily iterate (again, no mapping needed) over all the pending stats
> thanks to AllWaitClasses[]
> 
> What do you think? It looks like the most elegant approach to me.

I think that's pretty cool.

> Bonus point, as you can see above, while working on this approach I've been 
> able
> to remove storing the className and the array of eventName per wait class 
> (would
> have worked in v1 too).

You mean the extra set of names generated?  Yes, I was getting the
impression while reading your patch that this did not require this
complexity when generating the code for the pending data.  Getting rid
of this part would be a nice outcome.
--
Michael


signature.asc
Description: PGP signature


Re: Fix comment in btree_gist--1.8--1.9.sql

2025-07-08 Thread Michael Paquier
On Tue, Jul 08, 2025 at 04:49:30PM -0700, Paul Jungwirth wrote:
> I noticed that the comment at the top of btree_gist--1.8--1.9.sql says it is
> the 1.7--1.8 file. Here is a one-line patch to fix that.
> 
> A related question: In the v18 release we are adding two upgrade files:
> btree_gist--1.7--1.8.sql and btree_gist--1.8-1.9.sql. Why not collapse them
> into one? Technically we briefly had btree_gist--1.7--1.8.sql in 17devel,
> but it was reverted before release. Would you like a patch combining these
> files? Is it too late for that?

It would be too late once the branch is officially released, but we
are still in a position where it is possible to combine both.  Let's
do so for REL_18_STABLE; having two upgrade paths for a single major
version is just extra work for no benefit.

For PGSS, we've added several things in a single major release, always
limiting ourselves to a single new version bump.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Add support for displaying database service in psql prompt

2025-07-08 Thread Noah Misch
On Tue, Jul 08, 2025 at 08:52:08AM +0900, Michael Paquier wrote:
> On Sun, Jul 06, 2025 at 08:00:09PM -0700, Noah Misch wrote:
> > I think the choice to make there is whether to call PQconninfo() once per
> > prompt emission or to cache the value, invalidating that cache e.g. once per
> > SyncVariables().  My first thought was to cache, but the decision is not too
> > important.  A PQconninfo() call is likely negligible relative to all that
> > happens between prompts.  Even if not negligible, the overhead of not 
> > caching
> > will affect only prompts using the new escape sequence.
> 
> SyncVariables() happens at startup and when re-syncing a connection
> during a check, so that does not really worry me.
> 
> By the way, there is a second change in the CF that's suggesting the
> addition of a SERVICEFILE variable, which also uses a separate libpq
> API (forgot about this one):
> https://commitfest.postgresql.org/patch/5387/
> 
> Changing the patch on the other thread to use a conninfo is
> straightforward.  How about extending the get_service_name()@common.c
> I've proposed upthread so that it takes a string as input and could
> be reused for more connection options than only "service", so it
> could be reused there as well?

I'd prefer not to get involved in decisions affecting only psql efficiency and
psql code cosmetics.  Please make that decision without my input.




Re: What is a typical precision of gettimeofday()?

2025-07-08 Thread Tom Lane
Hannu Krosing  writes:
> On Tue, Jul 8, 2025 at 10:01 PM Tom Lane  wrote:
>> No, I think what's happening there is that we get to NUM_DIRECT before
>> reaching the 99.99% mark.

> Makes sense.
> Should we change the header to something like
> "Showing values covering up to 99.9900% of total observed timing
> durations below 1 us. "

I think that'd just confuse people more not less.

> And maybe 10,001 would be a better value for collection, especially on
> older machines which could in fact have somewhat worse timer resolution?

I don't have any objection to boosting NUM_DIRECT to 10K or so.
It will cause the stats display loop to take microscopically longer,
but that shouldn't matter, even on slow machines.

One other thing that bothers me as I look at the output is

Per loop time including overhead: 731.26 ns

That's stated in a way that makes it sound like that is a
very solid number, when in fact it's just the average.
We see from these test cases that there are frequently
a few outliers that are far from the average.  I'm tempted
to rephrase as

Average loop time including overhead: 731.26 ns

or some variant of that.  Thoughts?

regards, tom lane




Re: Adding wait events statistics

2025-07-08 Thread Andres Freund
Hi,

On 2025-06-30 13:36:12 +, Bertrand Drouvot wrote:
> Wait events are useful to know what backends are waiting for when there is/was
> a performance issue: for this we can sample pg_stat_activity at regular 
> intervals
> and record historical data. That’s how it is commonly used.
>
> It could also be useful to observe the engine/backends behavior over time and
> help answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application’s "A" wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer 
> or
> is it because it is hit more frequently?
> * I observe a peak in some of them (say for example MultiXact%), is it due to 
> a
> workload change?
>
> For the above use cases, we need a way to track the wait events that occur 
> between
> samples: please find attached a proof of concept patch series doing so.
> >
> 0002 -  It adds wait events statistics
>
> It adds a new stat kind PGSTAT_KIND_WAIT_EVENT for the wait event statistics.
>
> This new statistic kind is a fixed one because we know the maximum number of 
> wait
> events. Indeed:
>
>  * it does not take into account custom wait events as extensions have all 
> they need
>  at their disposal to create custom stats on their own wait events should they
>  want to (limited by WAIT_EVENT_CUSTOM_HASH_MAX_SIZE though).
>
>  * it does not take into account LWLock > LWTRANCHE_FIRST_USER_DEFINED for 
> the same
>  reasons as above. That said, there is no maximum limitation in 
> LWLockNewTrancheId().
>
>  * we don't want to allocate memory in some places where the counters are
>  incremented (see 4feba03d8b9). We could still use the same implementation as 
> for
>  backend statistics (i.e, make use of flush_static_cb) if we really want/need 
> to
>  switch to variable stats.

Even without allocating memory, this makes wait events considerably more
expensive. Going from ~4 instructions to something more like ~40.  Including
external function calls.

Today:

start wait:
   0x00a08e8f <+63>:lea0x735662(%rip),%r12# 0x113e4f8 

   0x00a08e96 <+70>:mov(%r12),%rax
   0x00a08e9a <+74>:movl   $0x906,(%rax)

   < do wait operation >


end wait:
   0x00a08eaa <+90>:mov(%r12),%rax


After:

start wait:
   0x00a0b39f <+63>:lea0x73673a(%rip),%r12# 0x1141ae0 

   0x00a0b3a6 <+70>:mov(%r12),%rax
   0x00a0b3aa <+74>:movl   $0x906,(%rax)


   < do wait operation >

end wait:

   0x00a0b3ba <+90>:mov(%r12),%rax
   0x00a0b3be <+94>:mov(%rax),%edi
   0x00a0b3c0 <+96>:rex call 0xa452e0 
   0x00a0b3c6 <+102>:   mov(%r12),%rax

Where waitEventIncrementCounter ends up as:

   0x00a452e0 <+0>: lea0x6fc7f9(%rip),%rax# 0x1141ae0 

   0x00a452e7 <+7>: mov(%rax),%rax
   0x00a452ea <+10>:mov(%rax),%eax
   0x00a452ec <+12>:mov%eax,%edx
   0x00a452ee <+14>:and$0xff00,%edx
   0x00a452f4 <+20>:jne0xa45300 
   0x00a452f6 <+22>:test   %ax,%ax
   0x00a452f9 <+25>:jne0xa45300 
   0x00a452fb <+27>:ret
   0x00a452fc <+28>:nopl   0x0(%rax)
   0x00a45300 <+32>:cmp$0x100,%edx
   0x00a45306 <+38>:jne0xa4530e 
   0x00a45308 <+40>:cmp$0x5e,%ax
   0x00a4530c <+44>:ja 0xa452fb 
   0x00a4530e <+46>:cmp$0x700,%edx
   0x00a45314 <+52>:sete   %cl
   0x00a45317 <+55>:test   %ax,%ax
   0x00a4531a <+58>:setne  %al
   0x00a4531d <+61>:test   %al,%cl
   0x00a4531f <+63>:jne0xa452fb 
   0x00a45321 <+65>:cmp$0xb00,%edx
   0x00a45327 <+71>:je 0xa452fb 
   0x00a45329 <+73>:mov%edi,%eax
   0x00a4532b <+75>:movzwl %di,%edi
   0x00a4532e <+78>:lea0x714e4b(%rip),%rdx# 0x115a180 

   0x00a45335 <+85>:shr$0x18,%eax
   0x00a45338 <+88>:shl$0x5,%rax
   0x00a4533c <+92>:add0x4e1735(%rip),%rax# 0xf26a78
   0x00a45343 <+99>:movslq 0x8(%rax),%rax
   0x00a45347 <+103>:   add%rdi,%rax
   0x00a4534a <+106>:   incq   (%rdx,%rax,8)
   0x00a4534e <+110>:   lea0x714e0b(%rip),%rax# 0x115a160 

   0x00a45355 <+117>:   movb   $0x1,(%rax)
   0x00a45358 <+120>:   ret


That's about an order of magnitude increase in overhead, just based on the
number of instructions, and probably considerably more in runtime (you're
adding external function calls etc).


I think that to make this viable, we would have to remove some wait events
and be more careful about adding them.  And optimize this code a good bit.
This really shouldn't end up with this much code.
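
For comparison, a sketch of roughly what the hot path could be reduced to
(assuming a dense per-backend counter array indexed by an event index that
is resolved once when the wait event is set, not recomputed at wait end;
this is not what the posted patch does):

#include <stdbool.h>
#include <stdint.h>

#define WAIT_EVENT_SLOTS 512        /* assumed dense mapping of all events */

typedef struct PendingWaitCounters
{
    uint64_t    counts[WAIT_EVENT_SLOTS];
    bool        have_pending;
} PendingWaitCounters;

static PendingWaitCounters pending_wait_counters;

/* No function call, no class dispatch: just an indexed increment. */
static inline void
wait_event_bump(uint32_t event_index)
{
    pending_wait_counters.counts[event_index]++;
    pending_wait_counters.have_pending = true;
}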


Greetings,