Re: Batch TIDs lookup in ambulkdelete

2025-05-13 Thread Matheus Alcantara
Hi,

On 30/04/25 19:36, Masahiko Sawada wrote:
> Here are the summary of each attached patch:
>
> 0001: Introduce TidStoreIsMemberMulti() where we can do an IsMember check
> for multiple TIDs in one function call. If the given TIDs are sorted
> (at least in block number), we can save radix tree lookup for the same
> page entry.
>
> 0002: Convert IndexBulkDeleteCallback() to a batched operation.
>
> 0003: Use batch TIDs lookup in btree index bulk-deletion.
>
> In patch 0003, we implement batch TID lookups for both each
> deduplicated index tuple and remaining all regular index tuples, which
> seems to be the most straightforward approach. While further
> optimizations are possible, such as performing batch TID lookups for
> all index tuples on a single page, these could introduce additional
> overhead from sorting and re-sorting TIDs. Moreover, considering that
> radix tree lookups are relatively inexpensive, the benefits of sorting
> TIDs and using TidStoreIsMemberMulti() might be minimal. Nevertheless,
> these potential optimizations warrant further evaluation to determine
> their actual impact on performance.
>
> Also, the patch includes the batch TIDs lookup support only for btree
> indexes but we potentially can improve other index AMs too.
>

The code looks good and also +1 for the idea. I just have some small
points:
- Maybe it would be good to mention somewhere that the
  IndexBulkDeleteCallback() callback returns the number of TIDs found to
  be members of the TidStore?
- The vac_tid_reaped() docs may also need to be updated?

I also executed the meson tests for each patch individually, and the 0002
patch is not passing on my machine (macOS).

Ok:                 39
Expected Fail:       0
Fail:              271
Unexpected Pass:     0
Skipped:            22
Timeout:             0

One behaviour that I found while executing the 0002 tests is that it may be
leaking some shared memory segments. I noticed this because, after
executing the tests, I tried to re-run them based on master and all tests
were failing with the "Failed system call was shmget(key=97530599,
size=56, 03600)" error. I also checked the shared memory segments using
"ipcs -m" and it returned some segments that are not returned when I
execute the tests on master (after cleaning up the leaked memory segments);
this also doesn't occur when executing based on 0001 or 0003.

~/d/p/batch-tids-lookup-ambulkdelete ❯❯❯ ipcs -m
IPC status from  as of Tue May 13 18:19:14 -03 2025
T     ID        KEY         MODE     OWNER    GROUP
Shared Memory:
m     18087936  0x05f873bf  --rw---  matheus  staff
m     15925250  0x05f966fe  --rw---  matheus  staff
m     24248325  0x05f9677e  --rw---  matheus  staff


Note that the 0003 patch doesn't have this issue, so in the end we won't
have a problem with this, I think. But it may be good to mention that
although the patches are separate, there is a dependency between them,
which may cause issues on the buildfarm?

-- 
Matheus Alcantara




Re: Disable parallel query by default

2025-05-13 Thread Scott Mead



On Tue, May 13, 2025, at 5:07 PM, Greg Sabino Mullane wrote:
> On Tue, May 13, 2025 at 4:37 PM Scott Mead  wrote:
>> I'll open by proposing that we prevent the planner from automatically 
>> selecting parallel plans by default
> 
> That seems a pretty heavy hammer, when we have things like 
> parallel_setup_cost that should be tweaked first.

I agree it's a big hammer and I thought through parallel_setup_cost quite a bit 
myself.  The problem with parallel_setup_cost is that it doesn't actually 
represent the overhead of setting up parallel query for a busy system.  It 
does define the cost of setup for a *single* parallel session, but it cannot 
accurately express the cost of CPU and other overhead associated with the 
second, third, fourth, etc... query that is executed as parallel.  The expense 
to the operating system is a function of the _rate_ of parallel query 
executions being issued.  Without new infrastructure, there's no way to define 
something that will give me a true representation of the cost of issuing a 
query with parallelism.


>> The recommendation that I give to users is pretty straightforward: "Disable 
>> automatic parallel query, enable it for queries where you find substantial 
>> savings and can control the rate of execution."  I always tell users that if 
>> they're using parallel query for anything that should execute in less than 5 
>> minutes, they're probably pushing on the wrong tuning strategy as the load 
>> induced by the parallel query infrastructure is likely going to negate the 
>> savings that they're getting.
> 
> Five minutes?! That's not been my experience. Not claiming parallelism is 
> perfect yet, but there are plenty of parallel performance savings under the 
> five minute mark.

Absolutely, I've seen 1 second queries go to 200ms with parallelism of 2.  The 
problem isn't about making that query faster in isolation, the problem is that 
every single one of those means a new connection.

If you have a connection pool from your application and you issue 60 queries 
per minute, each of which can go from 1 second to 200 ms, that means you are 
making 120 connections per minute back to the DB (60 queries times the two 
parallel workers launched for each).

As we know, connection establishment is brutal:

Executing a pgbench with 15 clients for 10 seconds using "SELECT 1;" as the 
workload gives me 315,449 tps.  If I add the -C flag (connect / disconnect for 
each transaction), I get 684 TPS.  The overhead of a connection is more than 
just the cost to the specific query being optimized; it has a far-reaching 
impact even outside of the postgres processes on the machine.


$ pgbench -f select1.sql --no-vacuum -c 15 -T 10
pgbench (16.4)
transaction type: select1.sql
scaling factor: 1
query mode: simple
number of clients: 15
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 3146966
number of failed transactions: 0 (0.000%)
latency average = 0.048 ms
initial connection time = 28.854 ms
tps = 315449.609763 (without initial connection time)



$ pgbench -f select1.sql --no-vacuum -c 15 -T 10 -C
pgbench (16.4)
transaction type: select1.sql
scaling factor: 1
query mode: simple
number of clients: 15
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 6848
number of failed transactions: 0 (0.000%)
latency average = 21.910 ms
average connection time = 1.455 ms
tps = 684.615907 (including reconnection times)


> 
> Cheers,
> Greg
> 
> --
> Crunchy Data - https://www.crunchydata.com
> Enterprise Postgres Software Products & Tech Support
> 

--
Scott Mead
Amazon Web Services
sc...@meads.us




Re: Improve monitoring of shared memory allocations

2025-05-13 Thread Amit Kapila
On Fri, Apr 4, 2025 at 9:15 PM Rahila Syed  wrote:
>
> Please find attached the patch which removes the changes for non-shared hash 
> tables
> and keeps them for shared hash tables.
>

  /* Allocate initial segments */
+ i = 0;
  for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
  {
- *segp = seg_alloc(hashp);
- if (*segp == NULL)
- return false;
+ /* Assign initial segments, which are also pre-allocated */
+ if (hashp->isshared)
+ {
+ *segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+ MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
+ }
+ else
+ {
+ *segp = seg_alloc(hashp);
+ i++;
+ }

In the non-shared hash table case, previously, we used to return false
from here when seg_alloc() fails, but now it seems to be proceeding
without returning false. Is there a reason for the same?

> I tested this by running make-check, make-check world and the reproducer 
> script shared
> by David. I also ran pgbench to test creation and expansion of some of the
> shared hash tables.
>

This covers the basic tests for this patch. I think we should do some
low-level testing of both shared and non-shared hash tables by having
a contrib module or such (we don't need to commit such a contrib
module, but it will give us confidence that the low-level data
structure allocation change is thoroughly tested). We also need to
focus on negative tests where there is insufficient memory in the
system.

-- 
With Regards,
Amit Kapila.




Re: Add an option to skip loading missing publication to avoid logical replication failure

2025-05-13 Thread Amit Kapila
On Sat, May 10, 2025 at 5:15 PM Amit Kapila  wrote:
>
> I am planning to proceed with the test-fix proposed by Vignesh [1]
> early next week unless we want to discuss more on this issue.
>

Pushed.

> [1] - 
> https://www.postgresql.org/message-id/CALDaNm3TH3J8fwj%2BNcdjdN8DZCdrnmm1kzBsHBW2nkN%2Bh6up3A%40mail.gmail.com
>

-- 
With Regards,
Amit Kapila.




Re: Fix incorrect order of params in comment

2025-05-13 Thread Daniel Gustafsson
> On 9 May 2025, at 23:50, Michael Paquier  wrote:
> 
> On Fri, May 09, 2025 at 07:21:17PM +0200, Daniel Gustafsson wrote:
>> -=item $node->log_check($offset, $test_name, %parameters)
>> +=item $node->log_check($test_name, $offset, %parameters)
>> 
>> Check contents of server logs.
> 
> Right, good catch.  The internals of the routine use %params instead
> of %parameters, so perhaps this should be changed as well in the
> description?

Good point, not that documentation must match 1:1 but the other functions use
%params in their docs and matching that has value. Done.

--
Daniel Gustafsson





Re: Allow reading LSN written by walreciever, but not flushed yet

2025-05-13 Thread Fujii Masao




On 2025/05/13 0:47, Andrey Borodin wrote:

> Moved off from "Small fixes needed by high-availability tools"


> On 12 May 2025, at 01:33, Amit Kapila  wrote:
>
> On Fri, May 2, 2025 at 6:30 PM Andrey Borodin  wrote:


> 3. Allow reading LSN written by walreciever, but not flushed yet
>
> Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might
> be ahead of node1 by flush LSN, but behind by written LSN. If we do a failover
> we choose node2 instead of node1 and lose data recently committed with
> synchronous_commit=remote_write.


In this case, doesn't the flush LSN typically catch up to the write LSN on node2
after a few seconds? Even if the walreceiver exits while there's still written
but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed by
calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when
selecting the most advanced node? No?



> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact
> returns the flushed LSN, not the written one. I propose to add a new function
> which returns the LSN actually written. Internals of this function are already
> implemented (GetWalRcvWriteRecPtr()), but unused.


GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
when the walreceiver restarts. Is this behavior OK for your purpose?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Add explicit initialization for all PlannerGlobal fields

2025-05-13 Thread David Rowley
On Tue, 13 May 2025 at 04:03, Richard Guo  wrote:
> While adding a new field to PlannerGlobal in another patch, I noticed
> that although most fields are explicitly initialized, a few are not.
> This doesn't cause any functional issues, since makeNode() zeroes all
> fields by default.  However, the inconsistency stood out to me, so I
> wrote the attached patch to explicitly initialize the remaining fields
> for clarity and consistency.
>
> Does this seem worthwhile?  Or should we simply rely on makeNode() for
> zero-initialization and consider this unnecessary?

These should be zeroed explicitly.

David




Re: using index to speedup add not null constraints to a table

2025-05-13 Thread Andres Freund
Hi,

On 2025-04-28 12:36:14 +0800, jian he wrote:
> On Fri, Apr 18, 2025 at 4:07 PM jian he  wrote:
> >
> > I don't have any good ideas to do the regress tests.
> > I use
> > ereport(NOTICE,
> > errmsg("all not-null constraints on relation
> > \"%s\" are validated by index scan",
> > RelationGetRelationName(oldrel)));
> > to do the tests.
> >
> for tests, just found out i can imitate
> src/test/modules/test_misc/t/001_constraint_validation.pl,
> 
> So I created a file:
> src/test/modules/test_misc/t/008_indexscan_validate_notnull.pl
> for TAP tests.

The tests have not passed in a few weeks:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5444

Greetings,

Andres Freund




Re: Test mail for pgsql-hackers

2025-05-13 Thread Andres Freund
On 2025-03-12 15:56:08 +0530, Ashutosh Bapat wrote:
> Large part of your patch is renaming files which are not related to
> this patch. Can you please clean that up.

Since this has not been fixed, and the patch has not passed CI since at least
mid March, I'll close this CF entry.




Re: Memoize ANTI and SEMI JOIN inner

2025-05-13 Thread Andres Freund
Hi,

On 2025-04-10 19:36:14 +0900, Richard Guo wrote:
> On Wed, Apr 9, 2025 at 6:18 PM David Rowley  wrote:
> > On Wed, 9 Apr 2025 at 18:48, Richard Guo  wrote:
> > > Perhaps we could spend some planner cycles proving inner_unique for
> > > anti joins, so that Memoize nodes can be considered for them?
>
> > Worth a try. It should be pretty easy to enable, as far as I can see.
> > It might just be a case of shuffling the cases around in the switch
> > statement in add_paths_to_joinrel().
>
> Right.  Here is a patch for that.

This is failing on CI:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5636
https://api.cirrus-ci.com/v1/artifact/task/5411026402803712/testrun/build-32/testrun/regress/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/memoize.out 
/tmp/cirrus-ci-build/build-32/testrun/regress/regress/results/memoize.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/memoize.out  2025-04-12 
11:24:10.866868945 +
+++ /tmp/cirrus-ci-build/build-32/testrun/regress/regress/results/memoize.out   
2025-04-12 11:32:44.454864476 +
@@ -525,7 +525,7 @@
  ->  Unique (actual rows=2.67 loops=N)
->  Sort (actual rows=67.33 loops=N)
  Sort Key: t2_1.a
- Sort Method: quicksort  Memory: 27kB
+ Sort Method: quicksort  Memory: 18kB
  ->  Seq Scan on tab_anti t2_1 (actual 
rows=100.00 loops=N)
 (15 rows)


There shouldn't be Memory mentioned in tests added to the tree.

Greetings,

Andres




Re: Serverside SNI support in libpq

2025-05-13 Thread Andres Freund
Hi,

On 2025-02-27 14:38:24 +0100, Daniel Gustafsson wrote:
> The attached v6 rebase contains this as well as your tests as well as
> general cleanup and comment writing etc.

This is not passing CI on windows...
https://cirrus-ci.com/build/4765059278176256

Greetings,

Andres




Re: Add partial :-variable expansion to psql \copy

2025-05-13 Thread Andres Freund
Hi,

On 2025-05-01 16:37:18 +, Akshat Jaimini wrote:
> I am unable to apply the patch on master. Can you please confirm if the patch 
> is rebased?

cfbot seems to be able to apply the patch.

However, the tests have not passed in quite a while:

https://cirrus-ci.com/task/5702107409416192
https://api.cirrus-ci.com/v1/task/5702107409416192/logs/cores.log

Greetings,

Andres Freund




Add explicit initialization for all PlannerGlobal fields

2025-05-13 Thread Richard Guo
While adding a new field to PlannerGlobal in another patch, I noticed
that although most fields are explicitly initialized, a few are not.
This doesn't cause any functional issues, since makeNode() zeroes all
fields by default.  However, the inconsistency stood out to me, so I
wrote the attached patch to explicitly initialize the remaining fields
for clarity and consistency.

Does this seem worthwhile?  Or should we simply rely on makeNode() for
zero-initialization and consider this unnecessary?

Thanks
Richard


v1-0001-Add-explicit-initialization-for-all-PlannerGlobal.patch
Description: Binary data


Re: make VALIDATE domain constraint lock on related relations as ShareUpdateExclusiveLock

2025-05-13 Thread wenhui qiu
Hi,
   Thanks for the patch! Overall this is a valuable improvement in
terms of concurrency and consistency with table-level behavior. The
refactoring to pass the lock mode into validateDomainCheckConstraint() improves
flexibility and clarity, and the patch also looks good to me.


Regards

On Tue, May 13, 2025 at 2:23 PM Dilip Kumar  wrote:

> On Tue, May 13, 2025 at 8:57 AM jian he 
> wrote:
> >
> > hi.
> >
> > We can still perform DML operations on a table while validating its
> > check constraint.
> > Similarly, it should be fine to do DML while validating domain
> constraints?
> > but currently, it's not allowed for domain constraints.
> >
> > The attached patch addresses this problem.
>
> This makes sense, and the patch also looks good to me.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>
>
>


Backward movement of confirmed_flush resulting in data duplication.

2025-05-13 Thread shveta malik
Hi All,

It is a spin-off thread from earlier discussions at [1] and [2].

While analyzing the slot-sync BF failure as stated in [1], it was
observed that there are chances that confirmed_flush_lsn may move
backward depending on the feedback messages received from the
downstream system. It was suspected that the backward movement of
confirmed_flush_lsn may result in data duplication issues. Earlier we
were able to successfully reproduce the issue with two_phase enabled
subscriptions (see[2]). Now on further analysing, it seems possible
that data duplication issues may happen without two-phase as well.

With the attached injection-point patch and test-script (provided by
Hou-San), the data duplication issue is reproduced without using the
twophase option. The problem arises when changes applied by the table
sync worker are duplicated by the apply worker if the
confirmed_flush_lsn moves backward after the table sync is completed.

The error expected on sub after executing the test script:
# ERROR:  conflict detected on relation "public.tab2": conflict=insert_exists
# DETAIL:  Key already exists in unique index "tab2_a_key", modified
in transaction 
# Key (a)=(2); existing local tuple (2); remote tuple (2).

The general steps followed by attached script are:

1. After adding a new table, tab2, to the publication, refresh the
subscription to initiate a table sync for tab2. Before the state
reaches SYNCDONE, insert some data into tab2. This new insertion will
be replicated by the table sync worker.
2. Disable the subscription and stop the apply worker before changing
the state to READY.
3. Re-enable the subscription and wait for the table sync to finish.
Notice that the origin position should not have progressed since step
1.
4. Disable and re-enable the subscription. Given that the origin
position is less than the slot's confirmed_flush_lsn, control the
walsender to stop when the confirmed_flush_lsn moves backward.
5. Disable and re-enable the subscription once more, causing the slot
to retain the previous confirmed_flush_lsn, and the insertion from
step 2 will be replicated again.

The test script uses 3 injection points to control the race condition
mentioned in above steps. Also the LOG_SNAPSHOT_INTERVAL_MS is
increased to prevent the restart_lsn from increasing beyond the
insert. To fix this issue, we need to  prevent confirmed_flush_lsn
from moving backward. Attached the fix patch for the same.
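
While following the steps above, these standard views can be used to watch the
positions involved (a sketch; run the first on the publisher and the second on
the subscriber):

-- Publisher: the slot position that should not move backwards
SELECT slot_name, confirmed_flush_lsn, restart_lsn FROM pg_replication_slots;

-- Subscriber: the origin progress the apply worker resumes from
SELECT external_id, remote_lsn, local_lsn FROM pg_replication_origin_status;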

With the given script, the problem reproduces on Head and PG17. We are
trying to reproduce the issue on PG16 and below where injection points
are not there.

[1]: 
https://www.postgresql.org/message-id/OS3PR01MB5718BC899AEE1D04755C6E7594832%40OS3PR01MB5718.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/OS0PR01MB57164AB5716AF2E477D53F6F9489A%40OS0PR01MB5716.jpnprd01.prod.outlook.com

thanks
Shveta
#!/bin/bash

# The intent of this script is to reproduce the issue due to confirmed_lsn
# backward movement without the use of two_phase option.
# The problem arises when changes applied by the table sync worker are duplicated by
# the apply worker if the confirmed_flush_lsn moves backward after the table sync.
#
# This script reproduces this issue with the help of three injection points.
#
# After this script run, the error expected on sub due to above stated issue:
# ERROR:  conflict detected on relation "public.tab2": conflict=insert_exists
# DETAIL:  Key already exists in unique index "tab2_a_key", modified in transaction 
#	Key (a)=(2); existing local tuple (2); remote tuple (2).

port_primary=5432
port_subscriber=5434

echo '=='
echo '=Clean up='
echo '=='

pg_ctl stop -D data_primary
pg_ctl stop -D data_subscriber

rm -rf data_* *log

echo '==='
echo '=Set up primary server='
echo '==='

initdb -D data_primary

cat << EOF >> data_primary/postgresql.conf
wal_level = logical 
port = $port_primary
#standby_slot_names = 'physical'
#log_replication_commands = 'on'
#max_slot_wal_keep_size = 64kB

max_wal_senders=550
max_worker_processes=1000 
max_replication_slots=550 
log_replication_commands = 'on' 
checkpoint_timeout = 1d 
shared_buffers = 6GB 
max_worker_processes = 32 
max_parallel_maintenance_workers = 24 
max_parallel_workers = 32 
synchronous_commit = on 
checkpoint_timeout = 1d 
max_wal_size = 24GB 
min_wal_size = 15GB 
autovacuum = off

wal_sender_timeout = 6000s
wal_receiver_timeout = 6000s

max_prepared_transactions = 10
log_min_messages = 'debug1'

EOF


pg_ctl -D data_primary start -w -l primary.log 

psql -d postgres -p $port_primary -c "SELECT * FROM pg_create_physical_replication_slot('physical');"

echo '==='
echo '=Set up subscriber='
echo '==='

initdb -D data_subscriber

cat << EOF >> data_subscriber/postgresql.conf
port = $port_subscriber


checkpoint_timeout = 1h
shared_buffers = '8GB'
wal_buffers = '1GB'
max_connections = '5000'
max_wal_size = 20GB
min_wal_size = 10GB
max_wal_send

Re: Vacuum statistics

2025-05-13 Thread Alena Rybakina

Hi!

On 12.05.2025 08:30, Amit Kapila wrote:

> On Fri, May 9, 2025 at 5:34 PM Alena Rybakina  wrote:
>>
>> I did a rebase and finished the part with storing statistics separately from
>> the relation statistics - now it is possible to disable the collection of
>> statistics for relations using GUCs, and this allows us to solve the problem
>> with the memory consumed.
>
> I think this patch is trying to collect data similar to what we do for
> pg_stat_statements for SQL statements. So, can't we follow a similar
> idea such that these additional statistics will be collected once some
> external module like pg_stat_statements is enabled? That module should
> be responsible for accumulating and resetting the data, so we won't
> have this memory consumption issue.

The idea is good; it will require one hook for the pgstat_report_vacuum
function, and the extvac_stats_start and extvac_stats_end functions can be
run if the extension is loaded, so as not to add more hooks.
But I see a problem here with tracking deleted objects for which
statistics are no longer needed. There are two solutions to this and I
don't like either of them, to be honest.
The first way is to add a background process that will go through the
table with the saved statistics and check whether the relation or the
database is still relevant, and if not, delete the vacuum statistics
information for it. This may be resource-intensive. The second way is to
add hooks for dropping databases and relations (functions dropdb,
index_drop, heap_drop_with_catalog).

> BTW, how will these new statistics be used to autotune a vacuum?

Yes, but they are collected on demand - by GUC.

> And do we need all the statistics proposed by this patch?

Regarding this issue, it was discussed here and so far we have come to
the conclusion that these statistics are needed for a deep understanding
of how vacuum works [0] [1] [2].


[0] 
https://www.postgresql.org/message-id/0B6CBF4C-CC2A-4200-9126-CE3A390D938B%40upgrade.com


[1] 
https://www.postgresql.org/message-id/6732acf8ce0f31025b535ae1a64568750924a887.camel%40moonset.ru


[2] 
https://www.postgresql.org/message-id/5AA8FFD5-6DE2-4A31-8E00-AE98F738F5D1%40upgrade.com



--
Regards,
Alena Rybakina
Postgres Professional





Re: Backward movement of confirmed_flush resulting in data duplication.

2025-05-13 Thread Dilip Kumar
On Tue, May 13, 2025 at 3:48 PM shveta malik  wrote:
>
> Hi All,
>
> It is a spin-off thread from earlier discussions at [1] and [2].
>
> While analyzing the slot-sync BF failure as stated in [1], it was
> observed that there are chances that confirmed_flush_lsn may move
> backward depending on the feedback messages received from the
> downstream system. It was suspected that the backward movement of
> confirmed_flush_lsn may result in data duplication issues. Earlier we
> were able to successfully reproduce the issue with two_phase enabled
> subscriptions (see[2]). Now on further analysing, it seems possible
> that data duplication issues may happen without two-phase as well.

Thanks for the detailed explanation. Before we focus on patching the
symptoms, I’d like to explore whether the issue can be addressed on
the subscriber side. Specifically, have we analyzed if there’s a way
to prevent the subscriber from moving the LSN backward in the first
place? That might lead to a cleaner and more robust solution overall.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

2025-05-13 Thread Salvatore Dipietro
On Thu, 1 May 2025 at 14:50, Nathan Bossart  wrote:
> So...
>
> * The ISB does seem to have a positive effect without commit 3d0b4b1
>   applied.
>
> * With commit 3d0b4b1 applied, removing the ISB seems to have a positive
>   effect at high concurrencies.  This is especially pronounced in the
>   pgbench test.
>
> * With commit 3d0b4b1 applied, removing the ISB doesn't change much at
>   lower concurrencies, and there might even be a small regression.
>
> * At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress
>   some test_shm_mq tests.  Removing the ISB instruction appears to help in
>   some cases, but not all.

Based on your findings Nathan, what is the best way to proceed for this change?
Do we need more validation for it? If yes, which kind?

Thanks,
Salvatore




Re: queryId constant squashing does not support prepared statements

2025-05-13 Thread Dmitry Dolgov
> On Mon, May 12, 2025 at 06:40:43PM GMT, Sami Imseih wrote:
> > The static variables was only part of the concern, another part was
> > using A_Expr to carry this information, which will have impact on lots
> > of unrelated code.
>
> What would be the problem if A_Expr carries an extra pointer to a List?
> It already had other fields, rexpr, lexpr and location that could be no-op.

They can be empty sometimes, but the new fields will be empty 99% of the
time. This is a clear sign to me that this information does not belong to
a node for "infix, prefix, and postfix expressions", don't you think?




Re: Disable parallel query by default

2025-05-13 Thread Greg Sabino Mullane
On Tue, May 13, 2025 at 4:37 PM Scott Mead  wrote:

> I'll open by proposing that we prevent the planner from automatically
> selecting parallel plans by default


That seems a pretty heavy hammer, when we have things like
parallel_setup_cost that should be tweaked first.
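
For context, the knob-tuning route referred to here is roughly the following;
the GUC names and defaults are the standard ones, and the values are purely
illustrative:

-- Make parallel plans look more expensive instead of disabling them outright
SET parallel_setup_cost = 100000;   -- default 1000
SET parallel_tuple_cost = 1.0;      -- default 0.1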

> The recommendation that I give to users is pretty straightforward: "Disable 
> automatic parallel query, enable it for queries where you find substantial
> savings and can control the rate of execution."  I always tell users that
> if they're using parallel query for anything that should execute in less
> than 5 minutes, they're probably pushing on the wrong tuning strategy as
> the load induced by the parallel query infrastructure is likely going to
> negate the savings that they're getting.
>

Five minutes?! That's not been my experience. Not claiming parallelism is
perfect yet, but there are plenty of parallel performance savings under the
five minute mark.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: Disable parallel query by default

2025-05-13 Thread Lukas Fittl
On Tue, May 13, 2025 at 4:36 PM Scott Mead  wrote:

> I'm looking forward to the upcoming monitoring in e7a9496 (Add two
> attributes to pg_stat_database for parallel workers activity), it will be
> easier to empirically prove that parallel query is being used.  I don't
> think the patch goes far enough though, we really need the ability to
> pinpoint the query and the specific variables used that triggered the
> parallel plan.  When we tell a user that parallel query is in-use and
> suspected, it is almost always met with "no, we don't use that feature".
> Users do not even realize that it's happening and quickly ask for a list of
> all queries that have ever undergone parallel execution.  It's pretty much
> impossible to give an accurate list of these because there is no
> instrumentation available (even in the new patch) to get to the per-query
> level.


Independent of your point at hand, it's worth noting that pg_stat_statements
has also gained the two fields parallel_workers_to_launch and
parallel_workers_launched in 18 (see cf54a2c0), that would allow breaking
this down on a per-query basis and getting a better assessment as to which
queries are using parallelism (and whether sufficient parallel workers were
available during execution).

That doesn't help with the aspect of which parameters cause parallel plans
to be used of course (in case a query is flipping between plans), but may
help narrow it down.
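
For example, once those columns are available, a rough per-query breakdown
could look like this (a sketch using the column names mentioned above):

SELECT queryid, calls,
       parallel_workers_to_launch,
       parallel_workers_launched
FROM pg_stat_statements
WHERE parallel_workers_to_launch > 0
ORDER BY parallel_workers_to_launch DESC;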

Best,
Lukas

-- 
Lukas Fittl


spurious pg_amcheck test failure

2025-05-13 Thread Andres Freund
Hi,

In a rare CI failure I just noticed:

https://cirrus-ci.com/task/4901437240508416
https://api.cirrus-ci.com/v1/artifact/task/4901437240508416/testrun/build/testrun/pg_amcheck/002_nonesuch/log/regress_log_002_nonesuch

[11:38:06.031](0.000s) #   Failed test 'schema exclusion pattern overrides all 
inclusion patterns stderr /(?^:pg_amcheck: warning: skipping database 
"template1": amcheck is not installed)/'
#   at /Users/admin/pgsql/src/bin/pg_amcheck/t/002_nonesuch.pl line 400.
[11:38:06.031](0.000s) #   'pg_amcheck: warning: skipping 
database "another_db": amcheck is not installed
# '
# doesn't match '(?^:pg_amcheck: warning: skipping database "template1": 
amcheck is not installed)'
[11:38:06.031](0.000s) not ok 107 - schema exclusion pattern overrides all 
inclusion patterns stderr /(?^:pg_amcheck: error: no relations to check)/
[11:38:06.031](0.000s) #   Failed test 'schema exclusion pattern overrides all 
inclusion patterns stderr /(?^:pg_amcheck: error: no relations to check)/'
#   at /Users/admin/pgsql/src/bin/pg_amcheck/t/002_nonesuch.pl line 400.
[11:38:06.031](0.000s) #   'pg_amcheck: warning: skipping 
database "another_db": amcheck is not installed
# '
# doesn't match '(?^:pg_amcheck: error: no relations to check)'


I don't really understand how there can be an *occasional* reordering of the
database list - afaict the pg_amcheck query has an ORDER BY that should
prevent that?
") AS combined_records"
"\nORDER BY pattern_id NULLS LAST, datname");

Greetings,

Andres Freund




Re: Statistics Import and Export

2025-05-13 Thread Hari Krishna Sunder
We found a minor issue when testing statistics import while upgrading from
versions older than v14 (we have VACUUM and ANALYZE disabled).
Commit 3d351d916b20534f973eda760cde17d96545d4c4 changed the default value for
reltuples from 0 to -1. So when such tables are imported they get the pg13
default of 0, which in pg18 is treated as "vacuumed and seen to be empty"
instead of "never yet vacuumed". The planner then proceeds to pick seq scans
even if there are indexes on these tables.
This is a very narrow edge case and the next VACUUM or ANALYZE will fix it,
but the performance of these tables immediately after the upgrade is
considerably affected.

Can we instead use -1 if the version is older than 14, and reltuples is 0?
This will have the unintended consequence of treating a truly empty table
as "never yet vacuumed", but that should be fine as empty tables are going
to be fast regardless of the plan picked.
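
A hypothetical post-upgrade check for the affected tables might look like this
(standard catalogs; the exact condition the patch uses may differ):

SELECT c.oid::regclass AS table_name
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind = 'r'
  AND c.reltuples = 0
  AND s.last_vacuum IS NULL AND s.last_autovacuum IS NULL
  AND s.last_analyze IS NULL AND s.last_autoanalyze IS NULL;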

PS: This is my first patch, so apologies for any issues with the patch.


On Fri, Apr 4, 2025 at 7:06 PM Nathan Bossart 
wrote:

> On Fri, Apr 04, 2025 at 07:32:48PM -0400, Corey Huinker wrote:
> > This patch shrinks the array size to 1 for versions < 9.4, which keeps
> the
> > modern code fairly elegant.
>
> Committed.
>
> --
> nathan
>
>
>


0001-Stats-import-Fix-default-reltuples-on-versions-older.patch
Description: Binary data


Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part

2025-05-13 Thread David E. Wheeler
On May 9, 2025, at 15:50, Robert Haas  wrote:

> # We have the kluge of having separate "_tz" functions to support
> # non-immutable datetime operations, but that way doesn't seem like
> # it's going to scale well to multiple sources of mutability.
> 
> But I'm not sure I understand why it matters that there are multiple
> sources of mutability here. Maybe I'm missing a piece of the puzzle
> here.

I read that to mean “we’re not going to add another json_path_exists_* function 
for every potentially immutable JSONPath function.” But I take your point that 
it could be generalized for *any* mutable function. In which case maybe it 
should be renamed?

Best,

David



signature.asc
Description: Message signed with OpenPGP


Disable parallel query by default

2025-05-13 Thread Scott Mead
Hello Hackers, 

Over the last 24 months, I've noticed a pattern amongst users with unexpected 
plan flips landing on parallel plans.

77cd477 (9.6 beta) defaulted parallel query on (max_parallel_degree = 2), it's 
been nine years and I'd like to open the discussion to see what our thoughts 
are.  Especially since it seems that the decision was made for 9.6 beta 
testing, and never really revisited. 

I'll open by proposing that we prevent the planner from automatically selecting 
parallel plans by default, opting instead to allow users to set their 
max_parallel_workers_per_gather as needed.  IOW: lets make the default 
max_parallel_workers_per_gather=0 for V18 forward.

Just to be clear, my concern isn't with parallel query in general, the issue we 
see is when high-frequency, low-latency queries start executing with 
parallelism on their own (i.e. the traditional plan flip with a twist).  Given 
that max_parallel_workers_per_gather is dynamic and easily configured per 
session (or even per query with something like the pg_hint_plan extension), 
dissuading the planner from opting in to parallelism by default will contain 
the fallout that we see when plans flip to parallel execution.

What is the fallout?  When a high-volume, low-latency query flips to parallel 
execution on a busy system, we end up in a situation where the database is 
effectively DDoSing itself with a very high rate of connection establishment and 
tear-down requests.  Even if the query ends up being faster (it generally does 
not), the CPU requirements for the same workload rapidly double or worse, with 
most of it being spent in the OS (context switch, fork(), destroy()).  When 
looking at the database, you'll see a high load average, and high wait for CPU 
with very little actual work being done within the database.  

For an example of scale, we have seen users with low connection rates (<= 5 / 
minute) suddenly spike to between 2000 and 3000 connect requests per minute 
until the system grinds to a halt.  

I'm looking forward to the upcoming monitoring in e7a9496 (Add two attributes 
to pg_stat_database for parallel workers activity), it will be easier to 
empirically prove that parallel query is being used.  I don't think the patch 
goes far enough though, we really need the ability to pinpoint the query and 
the specific variables used that triggered the parallel plan.  When we tell a 
user that parallel query is in-use and suspected, it is almost always met with 
"no, we don't use that feature".  Users do not even realize that it's happening 
and quickly ask for a list of all queries that have ever undergone parallel 
execution.  It's pretty much impossible to give an accurate list of these 
because there is no instrumentation available (even in the new patch) to get to 
the per-query level.

When a user says "I'm not using parallel query" we have to walk through 
circumstantial evidence of its use.  I typically combine IPC:BgWorkerShutDown, 
IPC:ParallelFinish, IO:DataFileRead (this helps nail it for sequential scans) 
with a high rate of connection establishment.  When you look at all of these 
together, it is still hard to see that parallelism is the cause, but when we 
disable automated plan selection, system stability returns.
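
As a sketch, the kind of circumstantial check described above usually starts
from pg_stat_activity, looking for the standard wait events associated with
parallel workers (the spellings below are the documented wait event names):

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event IN ('BgWorkerShutdown', 'ParallelFinish', 'DataFileRead')
GROUP BY 1, 2
ORDER BY 3 DESC;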

The recommendation that I give to users is pretty straightforward: "Disable 
automatic parallel query, enable it for queries where you find substantial 
savings and can control the rate of execution."  I always tell users that if 
they're using parallel query for anything that should execute in less than 5 
minutes, they're probably pushing on the wrong tuning strategy as the load 
induced by the parallel query infrastructure is likely going to negate the 
savings that they're getting.  
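
In configuration terms, the recommendation above boils down to something like
the following sketch (standard GUCs; values are illustrative):

-- Cluster-wide: stop the planner from picking parallel plans on its own
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();

-- Per session/query: opt back in where it is known to pay off
BEGIN;
SET LOCAL max_parallel_workers_per_gather = 4;
-- ... run the known-expensive query here ...
COMMIT;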

I'm curious to hear what others think of this proposal.  I've dealt with so 
many of these over the last 24 months, most of them causing strife along the 
way, that I'm interested in what others think.  

--
Scott Mead
Amazon Web Services
sc...@meads.us

Note: When testing the attached patch, there are failures in misc_sanity.out 
and misc_functions.out (replication origin name is too long).  I assume these 
are unrelated to my attached patch.

v1-0001-Disable-parallel-query-by-default.patch
Description: Binary data


Re: Small fixes needed by high-availability tools

2025-05-13 Thread Mihail Nikalayeu
Hello, everyone!

> On Mon, 12 May 2025 at 18:42, Andrey Borodin  wrote:
>> >> Problem: user might try to cancel a locally committed transaction and if we 
>> >> do so we will show non-replicated data as committed. This leads to 
>> >> losing data with UPSERTs.
>> > >
>> > > Could you explain why specifically UPSERTs would lose data (vs any
>> > other user workload) in cancellations during SyncRepWaitForLSN?
>>
>> Upserts change data conditionally. That's where the observed effect affects 
>> written data. But the root problem is observing non-replicated data; it 
>> only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and 
>> retrying it.
>> 1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
>> 2. JDBC cancels query by after default timeout
>> 3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written

> Right. I think upsert is a red herring here. Any system trying to
> implement idempotency/exactly once delivery will be built around a
> similar pattern. Check if a transaction has already been executed, if
> not run the transaction, commit, on failure retry. This is
> particularly vulnerable to the visibility issue because the retry is
> likely to land on the partitioned off leader.

I think UPSERT is just one specific case here. Any data that becomes
visible and then disappears can cause a variety of issues.

For example, the system receives a callback from a payment system,
marks an order as "PAID," commits the transaction, and returns a 200
response to the payment system (so it won't retry the callback).
However, if the transaction is lost due to a new primary, we end up
with an order that is paid in the real world, but the system is
unaware of it.
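
A minimal sketch of the retry pattern from the quoted steps above, using a
hypothetical table, shows why the cancelled wait is dangerous:

-- First attempt: the row is inserted locally, then the session waits in
-- SyncRepWaitForLSN and the client cancels after its timeout.
INSERT INTO orders (id, status) VALUES (42, 'PAID') ON CONFLICT (id) DO NOTHING;
-- Retry: the locally committed (but never replicated) row already exists, so
-- this is a no-op that writes no WAL and returns success immediately.
INSERT INTO orders (id, status) VALUES (42, 'PAID') ON CONFLICT (id) DO NOTHING;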

And yes, that patch has actually been applied on top of HEAD by most
PG cloud providers for over four years now [0].

> One idea to solve this problem could be that whenever we cancel
> sync_rep_wait, we set some system-wide flag that indicates that any
> new transaction must ensure that all the current data is replicated to
> the synchronous standby. Once we ensure that we have waited for
> pending transactions to replicate, we can toggle back that system-wide
> flag. Now, if the system restarts for any reason during such a wait,
> we can use your idea to disallow new connections until the standby
> quorum is established.

It might not necessarily be a flag—it could be some LSN value instead.
Also, it's not just about a "new transaction," but about any new
snapshot that could see data not yet replicated to the synchronous
standby.

Best regards,
Mikhail.

[0]: 
https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50




Re: Allow reading LSN written by walreciever, but not flushed yet

2025-05-13 Thread Mihail Nikalayeu
Hello, everyone!

> Or might want LSN "flush everything you have written, and return that LSN". 
> That will also do the trick, but is not necessary.
I think it is a better option. Less things may go wrong in such a case.

Best regards,
Mikhail.




Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part

2025-05-13 Thread Florents Tselai


> On 13 May 2025, at 2:07 PM, David E. Wheeler  wrote:
> 
> On May 9, 2025, at 15:50, Robert Haas  wrote:
> 
>> # We have the kluge of having separate "_tz" functions to support
>> # non-immutable datetime operations, but that way doesn't seem like
>> # it's going to scale well to multiple sources of mutability.
>> 
>> But I'm not sure I understand why it matters that there are multiple
>> sources of mutability here. Maybe I'm missing a piece of the puzzle
>> here.
> 
> I read that to mean “we’re not going to add another json_path_exists_* 
> function for every potentially immutable JSONPath function. But I take your 
> point that it could be generalized for *any* mutable function. In which case 
> maybe it should be renamed?
> 
> Best,
> 
> David
> 

We discussed this a bit during the APFS:

As Robert said—and I agree—renaming the existing _tz family would be more 
trouble than it’s worth, given the need for deprecations, migration paths, etc. 
If we were designing this today, suffixes like _stable or _volatile might have 
been more appropriate, but at this point, we’re better off staying consistent 
with the _tz family.

So the path forward seems to be:

- Put these new functions under the jsonb_path_*_tz family.

- Raise an error if they’re used in the non-_tz versions.

- Document this behavior clearly.

I’ll make sure to follow the patterns in the existing _tz functions closely.
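
For reference, the split already works this way for the datetime() method,
which is the pattern the new string methods would follow (a sketch based on the
documented behavior of the _tz family):

-- Requires a timezone-dependent comparison, so only the _tz variant succeeds;
-- the plain jsonb_path_exists() raises an error for this query.
SELECT jsonb_path_exists_tz('["2015-08-01 12:00:00-05"]',
                            '$[*] ? (@.datetime() < "2015-08-02".datetime())');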

Other thoughts and head’s up are, of course, welcome.

Patch CF entry: https://commitfest.postgresql.org/patch/5270/

Last updated Sept 24, so it will also need a rebase to account for changes in 
jsonpath_scan.l. I’ll get to that shortly.

Re: fix: propagate M4 env variable to flex subprocess

2025-05-13 Thread Andres Freund
Hi,

On 2025-05-12 23:14:59 -0400, J. Javier Maestro wrote:
> The pgflex wrapper runs flex with an explicit environment, so it doesn't
> inherit environment variables from the parent process. However, flex can
> use the M4 env variable and/or the PATH (via execvp) to find the m4 macro
> processor.

> Thus, since flex honors the M4 env variable, it should be propagated to the
> subprocess environment if it's set in the parent environment. This patch
> fixes it.

I think it probably was not intentional to fully specify the environment,
rather than just *adding* FLEX_TMP_DIR to the caller's environment. Bilal, I
think you wrote this originally, do you recall?

It seems like an issue beyond just M4...

Greetings,

Andres Freund




ARM/ARM64 SpinLockAcquire and SpinLockRelease are not full barriers

2025-05-13 Thread Yura Sokolov
Good day.

During a conversation about another patch, Andres pointed out that
SpinLockAcquire has the semantics of a full memory barrier [1].

spin.h also states:

> Load and store operations in calling code are guaranteed not to be
> reordered with respect to these operations, because they include a compiler
> barrier.

But on the ARM/ARM64 platform, tas in s_lock.h is implemented as a simple call
to __sync_lock_test_and_set. And GCC's documentation states that it has only
acquire semantics, and __sync_lock_release (used as the implementation of
SpinLockRelease) has only release semantics [2].

Compiling and disassembling postgresql with the -O2 optimization level on arm64,
we can see that SpinLockAcquire is compiled down to a call to
__aarch64_swp4_sync (in the optimistic case), which has the disassembly:

00662cc0 <__aarch64_swp4_sync>:
  662cc0:   d503245f   bti  c
  662cc4:   90001bf0   adrp x16, 9de000 
  662cc8:   39582210   ldrb w16, [x16, #1544]
  662ccc:   3470       cbz  w16, 662cd8
  662cd0:   b8a08020   swpa w0, w0, [x1]
  662cd4:   d65f03c0   ret
  662cd8:   2a0003f0   mov  w16, w0
  662cdc:   885f7c20   ldxr w0, [x1]
  662ce0:   88117c30   stxr w17, w16, [x1]
  662ce4:   35d1       cbnz w17, 662cdc
  662ce8:   d5033bbf   dmb  ish
  662cec:   d65f03c0   ret

Here we see the 'swpa' instruction, which has only acquire semantics [3].

And SpinLockRelease generates an 'stlr' instruction, which has release
semantics, IIRC.

Godbolt shows a noticeable difference between __sync_lock_test_and_set and
__atomic_exchange_n(ATOMIC_SEQ_CST) (and between __sync_lock_release and
__atomic_store_n(ATOMIC_SEQ_CST)) using GCC, even with march=armv7 [4]:
- __atomic_exchange_n and __atomic_store_n have a 'dmb ish' instruction both
before and after the load/store operations,
- but __sync_lock_test_and_set has 'dmb ish' only after the load/store
operations, and __sync_lock_release only before the store operation. (You can
see the same pattern in __aarch64_swp4_sync above in case the 'swpa'
instruction is not present.)

Therefore I think neither SpinLockAcquire nor SpinLockRelease has full
memory barrier semantics on ARM/ARM64 at the moment when compiled with GCC.

Surprisingly, Clang puts 'dmb ish' both before and after the load/store in
__sync_lock_test_and_set, but does not do the same for __sync_lock_release.

As a simple fix I suggest putting __sync_synchronize before
__sync_lock_test_and_set and after __sync_lock_release.

Initially I wanted to use __atomic_exchange_n and __atomic_store_n. But in
the absence of support for atomic instructions, their inner loop doesn't look
safe against reordering, since it uses an ldar+stlr loop:
- it looks like previous operations could be reordered after 'ldar';
- similarly, following operations could be reordered before 'stlr'.
There's a very relevant discussion in GCC's bugzilla [5].
Probably this hypothesis is not valid. I believe we have to ask someone
closely familiar with ARM internals to be sure.

# Alternative.

As an alternative, we may fix comments about
SpinLockAcquire/SpinLockRelease to state they provide acquire and release
barrier semantics only. And then check all their uses and set appropriate
memory barriers where their usage as full memory barriers is detected (as
in [1]).

[1]
https://postgr.es/m/5ccgykypol3azijw2chqnpg3rhuwjtwsmbfs3pgcqm7fu6laus%40wppo6zcfszay
[2]
https://gcc.gnu.org/onlinedocs/gcc-15.1.0/gcc/_005f_005fsync-Builtins.html#index-_005f_005fsync_005flock_005ftest_005fand_005fset
[3]
https://developer.arm.com/documentation/100069/0606/Data-Transfer-Instructions/SWPA--SWPAL--SWP--SWPL--SWPAL--SWP--SWPL
[4] https://godbolt.org/z/h5P9fjzWd
[5] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

-- 
regards
Yura Sokolov aka funny-falconFrom 1315faa9a01c8d768fd750ed643bae4f3eefc48a Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Tue, 13 May 2025 18:57:46 +0300
Subject: [PATCH v1] Make SpinLockAcquire and SpinLockRelease full memory
 barriers.

Andres Freund stated SpinLockAcquire is used as a full memory barrier at
least once [1].

Code comment in spin.h also states:

> Load and store operations in calling code are guaranteed not to be
> reordered with respect to these operations, because they include a
> compiler barrier.

But the ARM/ARM64 code used __sync_lock_test_and_set and __sync_lock_release,
which only have acquire and release semantics respectively.

Fix it by adding __sync_synchronize before __sync_lock_test_and_set and
after __sync_lock_release.
---
 src/include/storage/s_lock.h | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 2f73f9fcf57..7ee94302df1 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -258,10 +258,14 @@ typedef int slock_t;
 static __inline__ int
 tas(volatile slock_t *lock)
 {
+	__sync_synchronize();
 	return __sync_lock_test_and_set(lock, 1);
 }
 
-#defin

Re: Add explicit initialization for all PlannerGlobal fields

2025-05-13 Thread Richard Guo
On Tue, May 13, 2025 at 8:26 PM David Rowley  wrote:
> On Tue, 13 May 2025 at 04:03, Richard Guo  wrote:
> > While adding a new field to PlannerGlobal in another patch, I noticed
> > that although most fields are explicitly initialized, a few are not.
> > This doesn't cause any functional issues, since makeNode() zeroes all
> > fields by default.  However, the inconsistency stood out to me, so I
> > wrote the attached patch to explicitly initialize the remaining fields
> > for clarity and consistency.
> >
> > Does this seem worthwhile?  Or should we simply rely on makeNode() for
> > zero-initialization and consider this unnecessary?
>
> These should be zeroed explicitly.

Thank you for confirming.  I've pushed this patch to master.

Thanks
Richard




Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part

2025-05-13 Thread David E. Wheeler
On May 13, 2025, at 16:24, Florents Tselai  wrote:

> As Robert said—and I agree—renaming the existing _tz family would be more 
> trouble than it’s worth, given the need for deprecations, migration paths, 
> etc. If we were designing this today, suffixes like _stable or _volatile 
> might have been more appropriate, but at this point, we’re better off staying 
> consistent with the _tz family.

I get the pragmatism, and don’t want to over-bike-shed, but what a wart to live 
with. [I just went back and re-read Robert’s post, and didn’t realize he used 
exactly the same expression!] Would it really be too effortful to create 
_stable or _volatile functions and leave the _tz functions as a sort of legacy?

Or maybe there’s a nice backronym we could come up with for _tz. 

Best,

David






signature.asc
Description: Message signed with OpenPGP


Re: Memoize ANTI and SEMI JOIN inner

2025-05-13 Thread Richard Guo
On Tue, May 13, 2025 at 10:51 PM Andres Freund  wrote:
> This is failing on CI:
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5636
> https://api.cirrus-ci.com/v1/artifact/task/5411026402803712/testrun/build-32/testrun/regress/regress/regression.diffs
>
> diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/memoize.out 
> /tmp/cirrus-ci-build/build-32/testrun/regress/regress/results/memoize.out
> --- /tmp/cirrus-ci-build/src/test/regress/expected/memoize.out  2025-04-12 
> 11:24:10.866868945 +
> +++ /tmp/cirrus-ci-build/build-32/testrun/regress/regress/results/memoize.out 
>   2025-04-12 11:32:44.454864476 +
> @@ -525,7 +525,7 @@
>   ->  Unique (actual rows=2.67 loops=N)
> ->  Sort (actual rows=67.33 loops=N)
>   Sort Key: t2_1.a
> - Sort Method: quicksort  Memory: 27kB
> + Sort Method: quicksort  Memory: 18kB
>   ->  Seq Scan on tab_anti t2_1 (actual 
> rows=100.00 loops=N)
>  (15 rows)
>
>
> There shouldn't be Memory mentioned in tests added to the tree.

Thanks for pointing this out.  You're right — Memory usage should be
hidden from the output of EXPLAIN ANALYZE.  I'm a bit surprised the
existing explain_memoize function doesn't already handle that; I
guess it's because the plans in memoize.sql don't currently include
any Sort nodes.  In any case, I've added a regexp_replace to filter
out 'Memory: \d+kB' in explain_memoize.
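
For illustration, the normalization amounts to something like the following
replacement applied to each EXPLAIN output line inside explain_memoize()
(the exact patch may differ slightly):

SELECT regexp_replace('Sort Method: quicksort  Memory: 27kB',
                      'Memory: \d+kB', 'Memory: NkB');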

Thanks
Richard


v2-0001-Enable-use-of-Memoize-for-ANTI-joins.patch
Description: Binary data


Re: Logical Replication of sequences

2025-05-13 Thread Nisha Moond
On Sat, May 3, 2025 at 7:28 PM vignesh C  wrote:
>
>
> There was one pending open comment #6 from [1]. This has been
> addressed in the attached patch.

Thank you for the patches, here are my comments for patch-004: sequencesync.c

copy_sequences()
---
1)
+ if (first)
+ first = false;
+ else
+ appendStringInfoString(seqstr, ", ");

We can avoid additional variable here, suggestion -
if (seqstr->len > 0)
  appendStringInfoString(seqstr, ", ");


2)
+ else
+ {
+ *sequence_sync_error = true;
+ append_mismatched_sequences(mismatched_seqs, seqinfo);
+ }

I think *sequence_sync_error = true can be removed from here, as we
can avoid setting it for every mismatch, as it is already set at the
end of the function if any sequence mismatches are found.


3)
+ if (message_level_is_interesting(DEBUG1))
+ {
+ /* LOG all the sequences synchronized during current batch. */
+ for (int i = 0; i < curr_batch_seq_count; i++)
+ {
+ LogicalRepSequenceInfo *done_seq;
...
+
+ ereport(DEBUG1,
+ errmsg_internal("logical replication synchronization for
subscription \"%s\", sequence \"%s\" has finished",
+ get_subscription_name(subid, false),
+ done_seq->seqname));
+ }
+ }

3a) I think the DEBUG1 log can be moved inside the while loop just
above, to avoid traversing the list again unnecessarily.


LogicalRepSyncSequences():
-
4)
+ /*
+ * Sequence synchronization failed due to a parameter mismatch. Set the
+ * failure time to prevent immediate initiation of the sequencesync
+ * worker.
+ */
+ if (sequence_sync_error)
+ {
+ logicalrep_seqsyncworker_set_failuretime();
+ ereport(LOG,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("sequence synchronization failed because the parameters
between the publisher and subscriber do not match for all
sequences"));
+ }

I think saying "sequence synchronization failed" could be misleading,
as the matched sequences will still be synced successfully. It might
be clearer to reword it to something like:
"sequence synchronization failed for some sequences because the
parameters between the publisher and subscriber do not match."


--
Thanks,
Nisha




Re: Backward movement of confirmed_flush resulting in data duplication.

2025-05-13 Thread Amit Kapila
On Tue, May 13, 2025 at 4:22 PM Dilip Kumar  wrote:
>
> On Tue, May 13, 2025 at 3:48 PM shveta malik  wrote:
> >
> > Hi All,
> >
> > It is a spin-off thread from earlier discussions at [1] and [2].
> >
> > While analyzing the slot-sync BF failure as stated in [1], it was
> > observed that there are chances that confirmed_flush_lsn may move
> > backward depending on the feedback messages received from the
> > downstream system. It was suspected that the backward movement of
> > confirmed_flush_lsn may result in data duplication issues. Earlier we
> > were able to successfully reproduce the issue with two_phase enabled
> > subscriptions (see[2]). Now on further analysing, it seems possible
> > that data duplication issues may happen without two-phase as well.
>
> Thanks for the detailed explanation. Before we focus on patching the
> symptoms, I’d like to explore whether the issue can be addressed on
> the subscriber side. Specifically, have we analyzed if there’s a way
> to prevent the subscriber from moving the LSN backward in the first
> place? That might lead to a cleaner and more robust solution overall.
>

The subscriber doesn't move the LSN backwards, it only shares the
information with the publisher, which is the latest value of remote
LSN tracked by the origin. Now, as explained in email [1], the
subscriber doesn't persistently store/advance the LSN, for which it
doesn't have to do anything like DDLs, or any other non-published
DMLs. However, subscribers need to send confirmation of such LSNs for
synchronous replication. This is commented in the code as well, see
comments in CreateDecodingContext (It might seem like we should error
out in this case, but it's pretty common for a client to acknowledge a
LSN it doesn't have to do anything for ...). As mentioned in email[1],
persisting the LSN information that the subscriber doesn't have to do
anything with could be a noticeable performance overhead.

I think it is better to deal with this in the publisher by not
allowing it to move confirm_flush LSN backwards, as Shveta proposed.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1%2BzWQwOe5G8zCYGvErnaXh5%2BDbyg_A1Z3uywSf_4%3DT0UA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Allow reading LSN written by walreciever, but not flushed yet

2025-05-13 Thread Alexander Kukushkin
Hi Fujii,


On Tue, 13 May 2025 at 13:13, Fujii Masao 
wrote:

> In this case, doesn't the flush LSN typically catch up to the write LSN on
> node2
> after a few seconds? Even if the walreceiver exits while there's still
> written
> but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed
> by
> calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when
> selecting
> the most advanced node? No?
>

I think it is a bit more complex than that. There are also cases when we
want to ensure that there are "healthy" standby nodes when a switchover is
requested.
Here, "healthy" could mean something like "according to the write LSN, it
is not lagging by more than 16MB", or similar.
Now it is possible to extract this value using
pg_stat_get_wal_receiver()/pg_stat_wal_receiver, but it works only when the
walreceiver process is alive.
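
Just to make that concrete, the kind of check an HA tool effectively
performs could be sketched as follows (purely illustrative; the function
name and threshold are made up, and the LSN arithmetic is plain uint64
math):

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Illustrative only: a standby is considered "healthy" if its write LSN is
 * not more than max_lag_bytes (e.g. 16MB) behind the primary's current LSN.
 */
static bool
standby_write_lsn_is_healthy(XLogRecPtr primary_lsn,
                             XLogRecPtr standby_write_lsn,
                             uint64 max_lag_bytes)
{
    if (standby_write_lsn >= primary_lsn)
        return true;

    return (primary_lsn - standby_write_lsn) <= max_lag_bytes;
}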


> >>> Caveat: we already have a function pg_last_wal_receive_lsn(), which in
> fact returns flushed LSN, not written. I propose to add a new function
> which returns LSN actually written. Internals of this function are already
> implemented (GetWalRcvWriteRecPtr()), but unused.
>
> GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
> when the walreceiver restarts. This behavior is OK for your purpose?
>

IMO, most HA tools are prepared for that. They can't rely only on the
write/flush LSN, because the standby may be replaying WAL from the archive
using restore_command, in which case only the replay LSN is progressing.
That is, they are supposed to be doing something like max(write_lsn,
replay_lsn).
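
FWIW, exposing the written LSN should amount to little more than a thin
SQL-callable wrapper around GetWalRcvWriteRecPtr(). A rough sketch, modelled
on pg_last_wal_receive_lsn() (the function name here is hypothetical, and
the pg_proc.dat wiring is omitted):

#include "postgres.h"
#include "fmgr.h"
#include "access/xlogdefs.h"
#include "replication/walreceiver.h"
#include "utils/pg_lsn.h"

/*
 * Sketch of a hypothetical pg_last_wal_receive_write_lsn(): return the LSN
 * written (but possibly not yet flushed) by the walreceiver, or NULL if
 * nothing has been written yet.  Note that walrcv->writtenUpto can move
 * backwards after a walreceiver restart, as discussed above.
 */
Datum
pg_last_wal_receive_write_lsn(PG_FUNCTION_ARGS)
{
    XLogRecPtr  recptr;

    recptr = GetWalRcvWriteRecPtr();

    if (recptr == 0)
        PG_RETURN_NULL();

    PG_RETURN_LSN(recptr);
}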

-- 
Regards,
--
Alexander Kukushkin


Re: Backward movement of confirmed_flush resulting in data duplication.

2025-05-13 Thread Alexander Kukushkin
Hi Dilip,

On Wed, 14 May 2025 at 08:29, Dilip Kumar  wrote:

> What I meant wasn’t that the subscriber is moving the confirmed LSN
> backward, nor was I suggesting we fix it by persisting the LSN on the
> subscriber side. My point was: the fact that the subscriber is sending
> an LSN older than one it has already sent, does that indicate a bug on
> the subscriber side?  And if so, should the logic be fixed there?
>

In my experience, client applications do a lot of surprisingly unwise
things.
However, that doesn't mean the server should blindly accept whatever LSN
the client sends.
I tend to agree with Amit, we shouldn't allow confirmed_flush_lsn to move
backwards.

-- 
Regards,
--
Alexander Kukushkin


Re: Backward movement of confirmed_flush resulting in data duplication.

2025-05-13 Thread Amit Kapila
On Wed, May 14, 2025 at 11:59 AM Dilip Kumar  wrote:
>
> On Wed, May 14, 2025 at 9:16 AM Amit Kapila  wrote:
> >
> > On Tue, May 13, 2025 at 4:22 PM Dilip Kumar  wrote:
> > >
> > > On Tue, May 13, 2025 at 3:48 PM shveta malik wrote:
> > > >
> > > > Hi All,
> > > >
> > > > It is a spin-off thread from earlier discussions at [1] and [2].
> > > >
> > > > While analyzing the slot-sync BF failure as stated in [1], it was
> > > > observed that there are chances that confirmed_flush_lsn may move
> > > > backward depending on the feedback messages received from the
> > > > downstream system. It was suspected that the backward movement of
> > > > confirmed_flush_lsn may result in data duplication issues. Earlier we
> > > > were able to successfully reproduce the issue with two_phase enabled
> > > > subscriptions (see[2]). Now on further analysing, it seems possible
> > > > that data duplication issues may happen without two-phase as well.
> > >
> > > Thanks for the detailed explanation. Before we focus on patching the
> > > symptoms, I’d like to explore whether the issue can be addressed on
> > > the subscriber side. Specifically, have we analyzed if there’s a way
> > > to prevent the subscriber from moving the LSN backward in the first
> > > place? That might lead to a cleaner and more robust solution overall.
> > >
> >
> > The subscriber doesn't move the LSN backwards; it only shares with the
> > publisher the latest remote LSN value tracked by the origin. Now, as
> > explained in email [1], the subscriber doesn't persistently
> > store/advance the LSN for changes it doesn't have to apply, such as
> > DDLs or any other non-published DMLs. However, subscribers need to
> > send confirmation of such LSNs for synchronous replication. This is
> > commented in the code as well, see the comments in
> > CreateDecodingContext ("It might seem like we should error out in this
> > case, but it's pretty common for a client to acknowledge a LSN it
> > doesn't have to do anything for ..."). As mentioned in email [1],
> > persisting LSN information that the subscriber doesn't have to do
> > anything with could be a noticeable performance overhead.
>
> Thanks for your response.
>
> What I meant wasn’t that the subscriber is moving the confirmed LSN
> backward, nor was I suggesting we fix it by persisting the LSN on the
> subscriber side. My point was: the fact that the subscriber is sending
> an LSN older than one it has already sent, does that indicate a bug on
> the subscriber side?  And if so, should the logic be fixed there?
>
> I understand this might not be feasible, and it may not even be a bug
> on the subscriber side, it could be an intentional part of the design.
>

Right, that is how the subscriber/publisher communication is currently designed.

> But my question was whether we’ve already considered and ruled out
> that possibility.
>

That is what I explained in my previous response. Basically, to
achieve what you are saying, we would need to persist the remote LSN
values by advancing the origin even for changes the subscriber doesn't
need to apply, such as DDLs.

-- 
With Regards,
Amit Kapila.




Re: Logical Replication of sequences

2025-05-13 Thread vignesh C
On Wed, 14 May 2025 at 09:55, Nisha Moond  wrote:
>
> On Sat, May 3, 2025 at 7:28 PM vignesh C  wrote:
> >
> >
> > There was one pending open comment #6 from [1]. This has been
> > addressed in the attached patch.
>
> Thank you for the patches, here are my comments for patch-004: sequencesync.c
>
> copy_sequences()
> ---
> 1)
> + if (first)
> + first = false;
> + else
> + appendStringInfoString(seqstr, ", ");
>
> We can avoid additional variable here, suggestion -
> if (seqstr->len > 0)
>   appendStringInfoString(seqstr, ", ");

Modified
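
For reference, the resulting pattern then looks roughly like this (a
self-contained sketch of the suggested approach, not the exact code from
the patch):

#include "postgres.h"
#include "lib/stringinfo.h"
#include "nodes/pg_list.h"

/* Illustrative sketch: build "seq1, seq2, ..." without a "first" flag. */
static void
append_sequence_names(StringInfo seqstr, List *seqnames)
{
    ListCell   *lc;

    foreach(lc, seqnames)
    {
        if (seqstr->len > 0)
            appendStringInfoString(seqstr, ", ");
        appendStringInfoString(seqstr, (char *) lfirst(lc));
    }
}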

> 2)
> + else
> + {
> + *sequence_sync_error = true;
> + append_mismatched_sequences(mismatched_seqs, seqinfo);
> + }
>
> I think *sequence_sync_error = true can be removed from here, as we
> can avoid setting it for every mismatch, as it is already set at the
> end of the function if any sequence mismatches are found.

Modified

> 3)
> + if (message_level_is_interesting(DEBUG1))
> + {
> + /* LOG all the sequences synchronized during current batch. */
> + for (int i = 0; i < curr_batch_seq_count; i++)
> + {
> + LogicalRepSequenceInfo *done_seq;
> ...
> +
> + ereport(DEBUG1,
> + errmsg_internal("logical replication synchronization for
> subscription \"%s\", sequence \"%s\" has finished",
> + get_subscription_name(subid, false),
> + done_seq->seqname));
> + }
> + }
>
> 3a) I think the DEBUG1 log can be moved inside the while loop just
> above, to avoid traversing the list again unnecessarily.

Modified

> LogicalRepSyncSequences():
> -
> 4)
> + /*
> + * Sequence synchronization failed due to a parameter mismatch. Set the
> + * failure time to prevent immediate initiation of the sequencesync
> + * worker.
> + */
> + if (sequence_sync_error)
> + {
> + logicalrep_seqsyncworker_set_failuretime();
> + ereport(LOG,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("sequence synchronization failed because the parameters
> between the publisher and subscriber do not match for all
> sequences"));
> + }
>
> I think saying "sequence synchronization failed" could be misleading,
> as the matched sequences will still be synced successfully. It might
> be clearer to reword it to something like:
> "sequence synchronization failed for some sequences because the
> parameters between the publisher and subscriber do not match."

Modified

These comments are fixed in the v20250514 version of the patch attached at [1].

[1] - 
https://www.postgresql.org/message-id/CALDaNm3GXa-kKTe3oqmKA8oniHvZfgYUXG8mVczv4GJzFwG7bg%40mail.gmail.com

Regards,
Vignesh




Re: Limiting overshoot in nbtree's parallel SAOP index scans

2025-05-13 Thread Amit Kapila
On Thu, Oct 17, 2024 at 5:03 AM Matthias van de Meent wrote:
>
> On Thu, 17 Oct 2024 at 00:33, Peter Geoghegan  wrote:
> >
> > On Wed, Oct 16, 2024 at 5:48 PM Matthias van de Meent wrote:
> > > In v17 and the master branch you'll note 16 buffer hits for the test
> > > query. However, when we use more expensive btree compare operations
> > > (e.g. by adding pg_usleep(1) to both btint8cmp and btint4cmp), the
> > > buffer access count starts to vary a lot and skyrockets to 30+ on my
> > > machine, in some cases reaching >100 buffer hits. After applying my
> > > patch, the buffer access count is capped to a much more agreeable
> > > 16-18 hits - it still shows signs of overshooting the serial bounds,
> > > but the number of buffers we overshoot our target is capped and thus
> > > significantly lower.
> >
> > It's not exactly capped, though. Since in any case you're always prone
> > to getting extra leaf page reads at the end of each primitive index
> > scan. That's not something that's new to Postgres 17, though.
>
> True, but the SAOP-enabled continued overshoot _is_ new: previously,
> each backend would only get up to one additional buffer access for
> every SAOP scan entry, while now it's only limited by outer SAOP
> bounds and index size.
>

IIUC, the problem you are trying to solve is that we may miss
primitive index scans in the case of parallel index scans. The problem
can happen when the second parallel backend (say b-2) has seized the
page after the first parallel backend (say b-1) has released the
current page, but before b-1 could check whether it can schedule
another primitive index scan. And you want to fix it by delaying b-1's
release of the current page in some cases; if so, that has some
downsides as well, as pointed out by Peter. Is my understanding
correct? If so, then we may also need to prove that releasing the page
at a later point doesn't harm any other cases.

I am not sure I have understood the problem completely, so could you
please explain, in somewhat more layman's terms, how it works before
17 and after 17?

BTW, I also want to clarify my understanding of primitive index scans
and how they have changed in PG17: is this related to how we optimize
SAOP scans by reducing the number of leaf page scans?

-- 
With Regards,
Amit Kapila.