[PATCH] Remove unused #include's in src/backend/utils/adt/*

2025-06-16 Thread Aleksander Alekseev
Hi,

clangd indicates that certain #include's are redundant. Removing them
will speed up the build process a bit.

-- 
Best regards,
Aleksander Alekseev


v1-0001-Remove-unused-include-s-in-src-backend-utils-adt.patch
Description: Binary data


Re: Non-reproducible AIO failure

2025-06-16 Thread Konstantin Knizhnik
One more update: with the proposed patch (a memory barrier before 
`ConditionVariableBroadcast` in `pgaio_io_process_completion`, and 
replacing the bit fields with `uint8`) the problem is not reproduced on my 
system during 5 seconds.
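
For reference, a minimal sketch of the described change (the exact barrier
primitive and the "cv" field name are my assumptions, not necessarily the
actual patch):

```
/*
 * Hedged sketch only: ensure all prior stores to the IO handle's state
 * are visible to other backends before any waiter is woken.  The exact
 * barrier primitive and the "cv" field name are assumptions here.
 */
pg_write_barrier();
ConditionVariableBroadcast(&ioh->cv);
```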






Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

2025-06-16 Thread Alexander Korotkov
Dear Kuroda-san,

On Mon, Jun 16, 2025 at 12:11 PM Hayato Kuroda (Fujitsu)
 wrote:
> Thanks for pushing the fix patch! BTW, I have a few comments for your commits.
> Can you check and include them if needed?
>
> 01.
> ```
> $node->append_conf('postgresql.conf',
> "shared_preload_libraries = 'injection_points'");
> ```
>
> No need to set shared_preload_libraries in 046/047. ISTM it must be set when
> we enable the statistics.
>
> 02.
> We should also check whether the injection_points can be installed or not.
> See check_extension() and its callers.

Thank you!  All of these totally make sense.  The updated patch is attached.

--
Regards,
Alexander Korotkov
Supabase


v3-0001-Improve-runtime-and-output-of-tests-for-replicati.patch
Description: Binary data


[PATCH] Add an ldflags_sl meson build option

2025-06-16 Thread Matt Smith (matts3)
Currently there doesn't seem to be a way to pass shared library-specific flags 
via a meson build option.

This patch allows the existing ldflags_sl to be set via a build option.

Patch created against the 17.4 release.
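
For example, with this patch applied the option could be set at configure
time like the following (flags purely illustrative; meson's bracket syntax
keeps a comma-containing flag as one array element):

```
meson setup build "-Dldflags_sl=['-Wl,-z,relro']"
```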
diff --git a/meson.build b/meson.build
index 42a4d25bfd7..68690e5de52 100644
--- a/meson.build
+++ b/meson.build
@@ -98,7 +98,7 @@ cxxflags_mod = []
 
 ldflags = []
 ldflags_be = []
-ldflags_sl = []
+ldflags_sl = get_option('ldflags_sl')
 ldflags_mod = []
 
 test_c_args = []
diff --git a/meson_options.txt b/meson_options.txt
index 246cecf3827..defeec3aafa 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -76,6 +76,8 @@ option('darwin_sysroot', type: 'string', value: '',
 option('rpath', type: 'boolean', value: true,
   description: 'Embed shared library search path in executables')
 
+option('ldflags_sl', type: 'array', value: [],
+  description: 'Shared library ldflags')
 
 # External dependencies
 


Re: [PATCH] Add an ldflags_sl meson build option

2025-06-16 Thread Peter Eisentraut

On 16.06.25 06:10, Matt Smith (matts3) wrote:
Currently there doesn't seem to be a way to pass shared library-specific 
flags via a meson build option.


This patch allows the existing ldflags_sl to be set via a build option.


What uses do you have in mind for this?





Re: pg_upgrade fails with an error "object doesn't exist"

2025-06-16 Thread Daniel Gustafsson
> On 16 Jun 2025, at 09:29, Vaibhav Dalvi  wrote:

> I'm able to create the object as shown below:
> 
> postgres=# CREATE OR REPLACE FUNCTION pg_catalog.nont_ext_func() RETURNS char 
> AS $$ BEGIN return 'v'; END; $$ LANGUAGE plpgsql;
> CREATE FUNCTION
> 
> Why can't we strictly restrict object creation in pg_catalog?

Do you have allow_system_table_mods set to ON by any chance?  As Laurenz said,
such creation is already restricted, but it can be circumvented by using said
GUC (which is *not* intended for production usage).

--
Daniel Gustafsson





Re: Possibly hard-to-read message

2025-06-16 Thread Daniel Gustafsson
> On 12 Jun 2025, at 08:27, Peter Eisentraut  wrote:

> The rest looks good to me.

FTR this was committed last week but I missed following up here.

--
Daniel Gustafsson





Re: [PATCH] Fix incomplete memory clearing in OAuth authentication

2025-06-16 Thread Daniel Gustafsson
> On 13 Jun 2025, at 18:41, Taras Kloba  wrote:
> 
> Hi hackers,
> 
> I discovered a minor security issue in the OAuth authentication code where 
> sensitive bearer tokens are not completely cleared from memory.
> 
> ## The Issue
> 
> In src/backend/libpq/auth-oauth.c, the oauth_exchange() function attempts to 
> clear the bearer token from memory using explicit_bzero(), but it only clears 
> inputlen bytes. Since the buffer is allocated with pstrdup(), which allocates 
> strlen(input) + 1 bytes, the null terminator byte remains uncleared.

Maybe I'm lacking imagination, but I fail to see how it's a security issue to
not set a byte to \0 when it is known to be \0?
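
To make the byte layout concrete, a standalone sketch of the pattern under
discussion (simplified names, with strdup() standing in for pstrdup(); not the
actual auth-oauth.c code; explicit_bzero() is in <string.h> on glibc):

```
#include <stdlib.h>
#include <string.h>

static void
use_and_wipe_token(const char *input)
{
    size_t  inputlen = strlen(input);
    char   *token = strdup(input);  /* allocates inputlen + 1 bytes */

    /* ... use the bearer token ... */

    /*
     * Clearing only inputlen bytes leaves token[inputlen] untouched, but
     * that byte is the NUL terminator: it was '\0' before the wipe and
     * is still '\0' afterwards, so no secret data survives.
     */
    explicit_bzero(token, inputlen);
    free(token);
}
```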

--
Daniel Gustafsson





Re: Conflict detection for update_deleted in logical replication

2025-06-16 Thread shveta malik
On Mon, Jun 16, 2025 at 9:29 AM Zhijie Hou (Fujitsu)
 wrote:
>
> 0001:
> * Removed the slot acquisition as suggested above.
>
> 0002:
> * Addressed the comments above.
>

Thanks for the patches.

In advance_conflict_slot_xmin(), if new_xmin is the same as the slot's
current xmin, shall we simply return without doing any slot update?
The assert below indicates that new_xmin can be the same as the slot's
current xmin:
Assert(TransactionIdPrecedesOrEquals(MyReplicationSlot->data.xmin, new_xmin));
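
A minimal sketch of the suggested early-return (names taken from the quoted
patch; not actual patch code):

```
/*
 * Sketch only: if the computed xmin hasn't advanced, skip the slot
 * update (and any save to disk) entirely.
 */
if (TransactionIdEquals(MyReplicationSlot->data.xmin, new_xmin))
    return;     /* nothing to do */
```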

thanks
Shveta




RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

2025-06-16 Thread Hayato Kuroda (Fujitsu)
Dear Alexander,

> Thank you!  All of these totally make sense.  The updated patch is attached.

Thanks for the update. I found another point.

```
-# Another 2M rows; that's about 260MB (~20 segments) worth of WAL.
+# Another 50K rows; that's about 86MB (~5 segments) worth of WAL.
 $node->safe_psql('postgres',
-   q{insert into t (b) select md5(i::text) from generate_series(1,2000000) s(i)}
+   q{insert into t (b) select repeat(md5(i::text),50) from generate_series(1,50000) s(i)}
 );
```

I think the perl function advance_wal() can be used instead of doing an actual
INSERT command, because nothing refers to the replicated result. The same
applies to 046/047.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Conflict detection for update_deleted in logical replication

2025-06-16 Thread Amit Kapila
On Thu, Jun 12, 2025 at 11:34 AM Zhijie Hou (Fujitsu)
 wrote:
>

Few comments on v36 patches:
==========================
1. In advance_conflict_slot_xmin(), we first save the slot to disk,
then update its effective xmin, and then do the required xmin
computation. Now, if we don't save the slot every time, there is a
risk that its value can go backwards after a restart. But OTOH, for
physical slots maintained by walsender for physical replication, we
also don't save the physical slot. However, still the system works,
see discussion in email: [1].

As per my understanding, even if the conflict_slot's xmin moved back
after restart, it shouldn't cause any problem. Because it will anyway
be moved ahead in the next cycle, and there won't be any rows that
will get removed but are required for conflict detection. If this is
correct, then we don't need to save the slot in
advance_conflict_slot_xmin().

2.
+ *
+ * Issue a warning if track_commit_timestamp is not enabled when
+ * check_commit_ts is set to true.
+ *
+ * Issue a warning if the subscription is being disabled.
+ */
+void
+CheckSubConflictInfoRetention(bool retain_conflict_info, bool check_commit_ts,
+   bool disabling_sub)
+{
+ if (!retain_conflict_info)
+ return;
+
+ if (check_commit_ts && !track_commit_timestamp)
+ ereport(WARNING,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("commit timestamp and origin data required for detecting
conflicts won't be retained"),
+ errhint("Consider setting \"%s\" to true.",
+ "track_commit_timestamp"));
+
+ if (disabling_sub)
+ ereport(WARNING,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("deleted rows to detect conflicts would not be removed until
the subscription is enabled"),
+ errhint("Consider setting %s to false.",
+ "retain_conflict_info"));

The quoted comments atop this function just say what is apparent
from the code. It would be better if the comments explained why we allow
proceeding when the above conditions are not met.

I think we can probably add a check here that this option requires
wal_level = replica as the launcher needs to create a physical slot to
retain the required info.
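
For instance, something like the following (a sketch of the suggested check;
exact wording and placement are illustrative only):

```
/*
 * Sketch only: the launcher needs to create a physical replication
 * slot to retain the required info, which needs wal_level >= replica.
 */
if (retain_conflict_info && wal_level < WAL_LEVEL_REPLICA)
    ereport(ERROR,
            errcode(ERRCODE_INVALID_PARAMETER_VALUE),
            errmsg("\"retain_conflict_info\" requires \"wal_level\" >= \"replica\""));
```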

3. Isn't the new check for logical slots in
check_new_cluster_subscription_configuration() somewhat redundant with
the previous check done in
check_new_cluster_logical_replication_slots()? Can't we combine both?

Apart from this, I have made a number of changes in the comments and a
few other cosmetic changes in the attached.

[1]: https://www.postgresql.org/message-id/28c8bf-68470780-3-51b29480%4089454035

-- 
With Regards,
Amit Kapila.
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b8f9bf573ea..16310cf474f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8088,8 +8088,8 @@ SCRAM-SHA-256$:&l
   
   
If true, the information (e.g., dead tuples, commit timestamps, and
-   origins) on the subscriber that is still useful for conflict detection
-   is retained.
+   origins) on the subscriber that is useful for conflict detection is
+   retained.
   
  
 
diff --git a/doc/src/sgml/ref/create_subscription.sgml 
b/doc/src/sgml/ref/create_subscription.sgml
index be90088bcd0..0e49bf09eca 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -443,18 +443,18 @@ CREATE SUBSCRIPTION subscription_name
  
   Specifies whether the information (e.g., dead tuples, commit
-  timestamps, and origins) on the subscriber that is still useful for
-  conflict detection is retained. The default is
-  false. If set to true, an additional replication
-  slot named pg_conflict_detection
+  timestamps, and origins) required for conflict detection on the
+  subscriber is retained. The default is false.
+  If set to true, a replication slot named
+  pg_conflict_detection
   will be created on the subscriber to prevent the conflict information
   from being removed.
  
 
  
   Note that the information useful for conflict detection is retained
-  only after the creation of the additional slot. You can verify the
-  existence of this slot by querying pg_replication_slots.
+  only after the creation of the slot. You can verify the existence of
+  this slot by querying pg_replication_slots.
   And even if multiple subscriptions on one node enable this option,
   only one replication slot will be created.
  
@@ -468,7 +468,7 @@ CREATE SUBSCRIPTION subscription_name
 
  
-  This option cannot be enabled if the publisher is also a physical 
standby.
+  This option cannot be enabled if the publisher is a physical standby.
  
 

diff --git a/src/backend/commands/subscriptioncmds.c 
b/src/backend/commands/subscriptioncmds.c
index 9938fbe3e57..396cfcce1f6 100644
--- a/src/bac

Re: Commitfest app release at half June

2025-06-16 Thread Jelte Fennema-Nio
On Sat, 31 May 2025 at 13:47, Jelte Fennema-Nio  wrote:
>
> I realized two smallish commitfest app improvements were merged but
> not deployed yet for quite some time.
> 1. There's now a similar summary at the top of the "Personal
> Dashboard" page, as is on regular commitfest pages. Thanks to Akshat
> Jaimini.
> 2. Selecting the committer for a committed patch from the dropdown is
> changed to use a searchable dropdown, and the names in the dropdown
> are now sorted by firstname first instead of last name first (to match
> how they are actually displayed in the dropdown).

These changes are now live, in addition to a fix for the link to the
homepage from the dashboard, which was contributed by Matthias but
which I failed to mention in the previous email.




Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Daniel Gustafsson
> On 16 Jun 2025, at 11:26, Evgeniy Gorbanev  wrote:

> I ran tests for pg_dump and they all passed. Logs attached.

Files=7, Tests=11918, 35 wallclock secs ( 0.59 usr  0.07 sys +  7.92 cusr  4.32 
csys = 12.90 CPU)

That seems like a low number of tests executed; with ZStd enabled I see over
13200 tests in the log.  Are you sure you are building with ZStd compression
enabled?

--
Daniel Gustafsson





Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Evgeniy Gorbanev

On 16.06.2025 15:00, Daniel Gustafsson wrote:

On 16 Jun 2025, at 10:52, Evgeniy Gorbanev  wrote:

On 16.06.2025 14:25, Daniel Gustafsson wrote:

On 16 Jun 2025, at 10:14, Evgeniy Gorbanev  wrote:
In src/bin/pg_dump/compress_zstd.c, the Zstd_read function always
returns true. But if you look at the Zstd_gets and Zstd_getc functions,
where Zstd_read is called via CFH->read_func, it is expected that
the Zstd_read function can also return false. In case of
a read-from-file error, the process is expected to terminate, but
pg_dump will continue the process.
I assume that after checking
if (cnt == 0)
should be
return false;

   if (cnt == 0)
- break;
+ return false;

As cnt is coming from fread() returning false here would be wrong as you cannot
from 0 alone know if it's EOF or an error.  Instead it needs to inspect the
stream with feof() and ferror() to know how to proceed.

The feof() check is in the calling functions, e.g. in the Zstd_getc
function.

That function is using feof/ferror to log an appropriate error message; what
you are proposing is to consider all cases of EOF as an error.  If you test
that you will see lots of tests starting to fail.

--
Daniel Gustafsson


I ran tests for pg_dump and they all passed. Logs attached.
Please check.

make -C ../../../src/backend generated-headers
make[1]: Entering directory '/home/user/postgres/src/backend'
make -C ../include/catalog generated-headers
make[2]: Entering directory '/home/user/postgres/src/include/catalog'
make[2]: Nothing to be done for 'generated-headers'.
make[2]: Leaving directory '/home/user/postgres/src/include/catalog'
make -C nodes generated-header-symlinks
make[2]: Entering directory '/home/user/postgres/src/backend/nodes'
make[2]: Nothing to be done for 'generated-header-symlinks'.
make[2]: Leaving directory '/home/user/postgres/src/backend/nodes'
make -C utils generated-header-symlinks
make[2]: Entering directory '/home/user/postgres/src/backend/utils'
make -C adt jsonpath_gram.h
make[3]: Entering directory '/home/user/postgres/src/backend/utils/adt'
make[3]: 'jsonpath_gram.h' is up to date.
make[3]: Leaving directory '/home/user/postgres/src/backend/utils/adt'
make[2]: Leaving directory '/home/user/postgres/src/backend/utils'
make[1]: Leaving directory '/home/user/postgres/src/backend'
rm -rf '/home/user/postgres'/tmp_install
/bin/mkdir -p '/home/user/postgres'/tmp_install/log
make -C '../../..' DESTDIR='/home/user/postgres'/tmp_install install >'/home/user/postgres'/tmp_install/log/install.log 2>&1
make -j1  checkprep >>'/home/user/postgres'/tmp_install/log/install.log 2>&1
PATH="/home/user/postgres/tmp_install/home/user/postgres/buildroot/bin:/home/user/postgres/src/bin/pg_dump:$PATH" LD_LIBRARY_PATH="/home/user/postgres/tmp_install/home/user/postgres/buildroot/lib" INITDB_TEMPLATE='/home/user/postgres'/tmp_install/initdb-template  initdb --auth trust --no-sync --no-instructions --lc-messages=C --no-clean '/home/user/postgres'/tmp_install/initdb-template >>'/home/user/postgres'/tmp_install/log/initdb-template.log 2>&1
echo "# +++ tap check in src/bin/pg_dump +++" && rm -rf '/home/user/postgres/src/bin/pg_dump'/tmp_check && /bin/mkdir -p '/home/user/postgres/src/bin/pg_dump'/tmp_check && cd . && TESTLOGDIR='/home/user/postgres/src/bin/pg_dump/tmp_check/log' TESTDATADIR='/home/user/postgres/src/bin/pg_dump/tmp_check' PATH="/home/user/postgres/tmp_install/home/user/postgres/buildroot/bin:/home/user/postgres/src/bin/pg_dump:$PATH" LD_LIBRARY_PATH="/home/user/postgres/tmp_install/home/user/postgres/buildroot/lib" INITDB_TEMPLATE='/home/user/postgres'/tmp_install/initdb-template  PGPORT='65432' top_builddir='/home/user/postgres/src/bin/pg_dump/../../..' PG_REGRESS='/home/user/postgres/src/bin/pg_dump/../../../src/test/regress/pg_regress' share_contrib_dir='/home/user/postgres/tmp_install/home/user/postgres/buildroot/share/' /usr/bin/prove -I ../../../src/test/perl/ -I .  t/*.pl
# +++ tap check in src/bin/pg_dump +++
t/001_basic.pl  ok
t/002_pg_dump.pl .. ok
t/003_pg_dump_with_server.pl .. ok
t/004_pg_dump_parallel.pl . ok
t/005_pg_dump_filterfile.pl ... ok
t/006_pg_dumpall.pl ... ok
t/010_dump_connstr.pl . ok
All tests successful.
Files=7, Tests=11918, 35 wallclock secs ( 0.59 usr  0.07 sys +  7.92 cusr  4.32 csys = 12.90 CPU)
Result: PASS


Re: pglogical3 : support

2025-06-16 Thread Amit Kapila
On Sat, Jun 14, 2025 at 2:05 PM Perumal Raj  wrote:
>
> Hi Team,
>
> I am looking to upgrade pg17 with near-zero downtime using logical
> replication (master <-> master). The current pglogical2 (open source) has some
> limitations with 'REPLICA IDENTITY FULL' tables.
>
> Do we have any plan to make pglogical3 open source in the near future? 😊  Or is
> there a better way to achieve the same using open source packages?
>

The PostgreSQL community doesn't support pglogical. You can contact the
pglogical community with your questions about it.

-- 
With Regards,
Amit Kapila.




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2025-06-16 Thread jian he
for v45.

+ foreach_ptr(CookedConstraint, ccon, cookedConstraints)
+ {
+ if (!ccon->skip_validation && ccon->contype == CONSTR_CHECK)
+ {
+ Bitmapset  *attnums = NULL;
+
+ pull_varattnos((Node *) ccon->expr, 1, &attnums);
+
+ /*
+ * Add check only if it contains tableoid
+ * (TableOidAttributeNumber).
+ */
+ if (bms_is_member(TableOidAttributeNumber -
FirstLowInvalidHeapAttributeNumber,
+  attnums))
+ {
+ NewConstraint *newcon;
+
+ newcon = (NewConstraint *) palloc0(sizeof(NewConstraint));
+ newcon->name = ccon->name;
+ newcon->contype = ccon->contype;
+ newcon->qual = ccon->expr;
+
+ tab->constraints = lappend(tab->constraints, newcon);
+ }
+ }
+ }

We need to expand the virtual generated column here; otherwise,
bms_is_member() would not be correct.
Consider a case like:

CREATE TABLE pp (f1 INT, f2 INT generated always as (f1 +
tableoid::int)) PARTITION BY RANGE (f1);
CREATE TABLE pp_1 (f2 INT generated always as (f1 + tableoid::int), f1 int);
ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (-1) TO (10);
CREATE TABLE pp_2 (f2 INT generated always as (f1 + tableoid::int), f1 int);
ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (10) TO (20);
ALTER TABLE PP add check (f2 > 0);

ALTER TABLE pp MERGE PARTITIONS (pp_1, pp_2) INTO pp_12;

In this context, the MERGE PARTITIONS command needs to evaluate the constraint
"pp_f2_check" again on pp_12.

The attached minor diff fixes this problem.
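
The general shape of such a fix might be the following (a hedged sketch only;
the attached diff is authoritative, and the expansion helper's name and
signature are my assumption from the virtual generated columns work):

```
/*
 * Hedged sketch -- see the attached diff for the real fix.  The idea:
 * expand virtual generated columns in the constraint expression before
 * collecting attnos, so a tableoid reference hidden inside a generation
 * expression (as in the example above) is seen.  The helper name and
 * signature below are assumptions for illustration.
 */
Node       *expr = expand_generated_columns_in_expr((Node *) ccon->expr, rel, 1);
Bitmapset  *attnums = NULL;

pull_varattnos(expr, 1, &attnums);
```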


check_constraint_if_it_contains_tableoid.no-cfbot
Description: Binary data


No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Evgeniy Gorbanev

Hello.

In src/bin/pg_dump/compress_zstd.c, the Zstd_read function always
returns true. But if you look at the Zstd_gets and Zstd_getc functions,
where Zstd_read is called via CFH->read_func, it is expected that
the Zstd_read function can also return false. In case of
a read-from-file error, the process is expected to terminate, but
pg_dump will continue the process.
I assume that after checking
if (cnt == 0)
should be
return false;

Patch attached.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Reporter: Evgeniy Gorbanev (gorbanyo...@basealt.ru).

Organization: BaseALT (o...@basealt.ru).


Best regards,
Evgeniy
diff --git a/src/bin/pg_dump/compress_zstd.c b/src/bin/pg_dump/compress_zstd.c
index cb595b10c2d..d0a58861228 100644
--- a/src/bin/pg_dump/compress_zstd.c
+++ b/src/bin/pg_dump/compress_zstd.c
@@ -298,7 +298,7 @@ Zstd_read(void *ptr, size_t size, size_t *rdsize, CompressFileHandle *CFH)
 
 			/* If we have no more input to consume, we're done */
 			if (cnt == 0)
-break;
+return false;
 		}
 
 		while (input->pos < input->size)


Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Daniel Gustafsson
> On 16 Jun 2025, at 10:14, Evgeniy Gorbanev  wrote:

> In src/bin/pg_dump/compress_zstd.c, the Zstd_read function always
> returns true. But if you look at the Zstd_gets and Zstd_getc functions,
> where Zstd_read is called via CFH->read_func, it is expected that
> the Zstd_read function can also return false. In case of
> a read-from-file error, the process is expected to terminate, but
> pg_dump will continue the process.
> I assume that after checking
> if (cnt == 0)
> should be
> return false;

if (cnt == 0)
-   break;
+   return false;

As cnt is coming from fread() returning false here would be wrong as you cannot
from 0 alone know if it's EOF or an error.  Instead it needs to inspect the
stream with feof() and ferror() to know how to proceed.
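
In other words, roughly (a standalone sketch of the pattern, not the actual
compress_zstd.c fix):

```
#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the distinction described above: a zero return from fread()
 * alone cannot tell EOF apart from an I/O error, so the stream must be
 * inspected with ferror()/feof().
 */
static bool
read_chunk(FILE *fp, void *buf, size_t size, size_t *got)
{
    *got = fread(buf, 1, size, fp);
    if (*got == 0 && ferror(fp))
        return false;   /* genuine read error */
    return true;        /* data, or clean EOF (feof(fp) is true) */
}
```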

--
Daniel Gustafsson





Re: pg_upgrade fails with an error "object doesn't exist"

2025-06-16 Thread Vaibhav Dalvi
Hi Laurenz,

Thanks for the response.

>
>
> > I believe one of the following approaches should be considered to prevent
> > such failures:
> >
> > 1.  Restrict the creation of user objects within the `pg_catalog` schema.
>
> That's already the case:
>
> test=# CREATE TABLE pg_catalog.new ();
> ERROR:  permission denied to create "pg_catalog.new"
> DETAIL:  System catalog modifications are currently disallowed.
>
>
I'm able to create the object as shown below:

postgres=# CREATE OR REPLACE FUNCTION pg_catalog.nont_ext_func() RETURNS
char AS $$ BEGIN return 'v'; END; $$ LANGUAGE plpgsql;
CREATE FUNCTION

Why can't we strictly restrict object creation in pg_catalog?

Thanks,
Vaibhav


Re: relrewrite not documented at the top of heap_create_with_catalog()

2025-06-16 Thread Steven Niu
Hi, Michael,

The change looks good to me.

I made more change based on your patch to combine the old comment.

Regards,
Steven

Michael Paquier  wrote on Monday, June 16, 2025 at 14:49:

> Hi all,
>
> While looking at the heap code, I've found that relrewrite, a parameter
> that uses a trick to keep a link to the original relation being rewritten, is
> not documented at the top of heap_create_with_catalog(), contrary to all
> its other parameters.
>
> A simple patch is attached to document that.
>
> Thanks,
> --
> Michael
>


0001-Improve-the-comment-of-heap_create_with_catalog.patch
Description: Binary data


Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Tom Lane
Daniel Gustafsson  writes:
> On 16 Jun 2025, at 16:20, Tom Lane  wrote:
>> This API is actually quite bizarrely defined, if you ask me.
>> Returning the byte count is optional, but if you don't pay
>> attention to the byte count you cannot know if you got any
>> data.  At best, that's bug-encouraging.

> Agreed.  Given that many of the implementations in the gzip support code directly
> return the gzXXX functions I suspect it was modeled around GZip, but that's an
> unresearched guess.  The fact that this returns Z_NULL where the API defines
> NULL is unlikely to ever be an issue, but it also doesn't seem entirely neat...

After looking around a bit more I realized that this API is a complete
disaster: it's not only bug-prone, but there are actual bugs in
multiple callers, eg _ReadBuf() totally fails to detect early EOF as
it intends to.  We need to fix it, not slap some band-aids on.
Draft patch attached.

The write_func side is a bit of a mess too.  I fixed some obvious API
violations in the attached, but I think we need to do more, because
it seems that zero thought was given to reporting errors sanely.
The callers all assume that strerror() is what to use to report an
incomplete write, but this is not appropriate in every case, for
instance we need to pay attention to gzerror() in gzip cases.
I'm inclined to think that the best answer is to move error reporting
into the write_funcs, and just make the API spec be "write or die".
I've not tackled that here though.

I was depressed to find that Zstd_getc reads one byte into
an otherwise-uninitialized int and then returns the whole int.
Even if we're lucky enough for the variable to start off zero,
this would fail utterly on big-endian machines.  The only
reason we've not noticed is that the regression tests do not
reach Zstd_getc, nor most of the other getc_func implementations.
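
For illustration, the safe shape of such a getc implementation might be (a
sketch using the revised read_func signature from the draft patch; not the
actual fix):

```
/*
 * Sketch only: read the byte into an unsigned char rather than an int,
 * so no uninitialized bytes -- and no endianness dependency -- can leak
 * into the returned value.
 */
static int
Zstd_getc_sketch(CompressFileHandle *CFH)
{
    unsigned char ch;

    if (CFH->read_func(&ch, 1, CFH) != 1)
        pg_fatal("could not read from input file: %m");
    return ch;
}
```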

Now I'm afraid to look at the rest of the compress_io.h API.
If it's as badly written as these parts, there are more bugs
to be found.

regards, tom lane

diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c
index 5a30ebf9bf5..8db121b94a3 100644
--- a/src/bin/pg_dump/compress_gzip.c
+++ b/src/bin/pg_dump/compress_gzip.c
@@ -251,14 +251,14 @@ InitCompressorGzip(CompressorState *cs,
  *--
  */
 
-static bool
-Gzip_read(void *ptr, size_t size, size_t *rsize, CompressFileHandle *CFH)
+static size_t
+Gzip_read(void *ptr, size_t size, CompressFileHandle *CFH)
 {
 	gzFile		gzfp = (gzFile) CFH->private_data;
 	int			gzret;
 
 	gzret = gzread(gzfp, ptr, size);
-	if (gzret <= 0 && !gzeof(gzfp))
+	if (gzret < 0)
 	{
 		int			errnum;
 		const char *errmsg = gzerror(gzfp, &errnum);
@@ -267,10 +267,7 @@ Gzip_read(void *ptr, size_t size, size_t *rsize, CompressFileHandle *CFH)
  errnum == Z_ERRNO ? strerror(errno) : errmsg);
 	}
 
-	if (rsize)
-		*rsize = (size_t) gzret;
-
-	return true;
+	return (size_t) gzret;
 }
 
 static bool
@@ -278,7 +275,7 @@ Gzip_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 {
 	gzFile		gzfp = (gzFile) CFH->private_data;
 
-	return gzwrite(gzfp, ptr, size) > 0;
+	return gzwrite(gzfp, ptr, size) == size;
 }
 
 static int
diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index db9b38744c8..dad0c91fec5 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -123,21 +123,23 @@ struct CompressFileHandle
 	CompressFileHandle *CFH);
 
 	/*
-	 * Read 'size' bytes of data from the file and store them into 'ptr'.
-	 * Optionally it will store the number of bytes read in 'rsize'.
+	 * Read up to 'size' bytes of data from the file and store them into
+	 * 'ptr'.
 	 *
-	 * Returns true on success and throws an internal error otherwise.
+	 * Returns number of bytes read (this might be less than 'size' if EOF was
+	 * reached).  Throws error for error conditions other than EOF.
 	 */
-	bool		(*read_func) (void *ptr, size_t size, size_t *rsize,
+	size_t		(*read_func) (void *ptr, size_t size,
 			  CompressFileHandle *CFH);
 
 	/*
 	 * Write 'size' bytes of data into the file from 'ptr'.
 	 *
-	 * Returns true on success and false on error.
+	 * Returns true on success and false on error (it is caller's
+	 * responsibility to deal with error conditions).
 	 */
 	bool		(*write_func) (const void *ptr, size_t size,
-			   struct CompressFileHandle *CFH);
+			   CompressFileHandle *CFH);
 
 	/*
 	 * Read at most size - 1 characters from the compress file handle into
diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index e99f0cad71f..c94f9dcd4cf 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -583,6 +583,7 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 			return false;
 		}
 
+		errno = 0;
 		if (fwrite(state->buffer, 1, status, state->fp) != status)
 		{
 			errno = (errno) ? errno : ENOSPC;
@@ -598,8 +599,8 @@ LZ4Stream_write(const void *ptr, size_t

Re: Fix potential overflow risks from wcscpy and sprintf

2025-06-16 Thread Yan Haibo
Thank you, Peter. It seems the patch may have been lost during our earlier
communication, so I've reattached it here.
I hope it comes through correctly this time.
Best regards,
Haibo



From: Peter Eisentraut 
Sent: June 11, 2025 00:43
To: Yan Haibo ; pgsql-hackers@lists.postgresql.org 

Subject: Re: Fix potential overflow risks from wcscpy and sprintf

On 06.06.25 22:50, Yan Haibo wrote:
> This change stems from a recent static code analysis, which identified a
> minor potential overflow issue. I would appreciate it if someone could
> review the fix at their convenience.

Please provide more detail in each case what the issue is and how you
are proposing to fix it.

Note that PostgreSQL code is regularly run through various static
analysis tools, so if the issue were super-obvious, it would probably
have been noticed by now.



0001-Mitigate-potential-overflow-risks-from-wcscpy-and-sp.patch
Description:  0001-Mitigate-potential-overflow-risks-from-wcscpy-and-sp.patch


Re: Amcheck verification of GiST and GIN

2025-06-16 Thread Tomas Vondra
Thanks.

I went through the patches, polished the commit messages and did some
minor tweaks in patch 0002 (to make the variable names a bit more
consistent, and reduce the scope a little bit). I left it as a separate
patch to make the changes clearer, but it should be merged into 0002.

Please read through the commit messages, and let me know if I got some
of the details wrong (or not clear enough). Otherwise I plan to start
pushing this soon (~tomorrow).


regards

-- 
Tomas VondraFrom cb24bb068582a39df9e9e59c2a9347889e896cf2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 9 Jun 2025 01:42:52 +0200
Subject: [PATCH v8 1/6] amcheck: Add gin_index_check on a multicolumn index

Adds a regression test with gin_index_check() on a multicolumn index,
to verify it's handled correctly and improve test coverage for code
introduced by 14ffaece0fb5.

Author: Arseniy Mukhin 
Reviewed-by: Andrey M. Borodin 
Discussion: https://postgr.es/m/CAE7r3MJ611B9TE=yqbbncewp7-k64vws+sjk7xf6fjux77u...@mail.gmail.com
---
 contrib/amcheck/expected/check_gin.out | 12 
 contrib/amcheck/sql/check_gin.sql  | 10 ++
 2 files changed, 22 insertions(+)

diff --git a/contrib/amcheck/expected/check_gin.out b/contrib/amcheck/expected/check_gin.out
index b4f0b110747..8dd01ced8d1 100644
--- a/contrib/amcheck/expected/check_gin.out
+++ b/contrib/amcheck/expected/check_gin.out
@@ -76,3 +76,15 @@ SELECT gin_index_check('gin_check_jsonb_idx');
 
 -- cleanup
 DROP TABLE gin_check_jsonb;
+-- Test GIN multicolumn index
+CREATE TABLE "gin_check_multicolumn"(a text[], b text[]);
+INSERT INTO gin_check_multicolumn (a,b) values ('{a,c,e}','{b,d,f}');
+CREATE INDEX "gin_check_multicolumn_idx" on gin_check_multicolumn USING GIN(a,b);
+SELECT gin_index_check('gin_check_multicolumn_idx');
+ gin_index_check 
+-
+ 
+(1 row)
+
+-- cleanup
+DROP TABLE gin_check_multicolumn;
diff --git a/contrib/amcheck/sql/check_gin.sql b/contrib/amcheck/sql/check_gin.sql
index 66f42c34311..11caed3d6a8 100644
--- a/contrib/amcheck/sql/check_gin.sql
+++ b/contrib/amcheck/sql/check_gin.sql
@@ -50,3 +50,13 @@ SELECT gin_index_check('gin_check_jsonb_idx');
 
 -- cleanup
 DROP TABLE gin_check_jsonb;
+
+-- Test GIN multicolumn index
+CREATE TABLE "gin_check_multicolumn"(a text[], b text[]);
+INSERT INTO gin_check_multicolumn (a,b) values ('{a,c,e}','{b,d,f}');
+CREATE INDEX "gin_check_multicolumn_idx" on gin_check_multicolumn USING GIN(a,b);
+
+SELECT gin_index_check('gin_check_multicolumn_idx');
+
+-- cleanup
+DROP TABLE gin_check_multicolumn;
-- 
2.49.0

From 04b0c0c718dd109b9b4598d316b27daab281688d Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 9 Jun 2025 01:44:59 +0200
Subject: [PATCH v8 2/6] amcheck: Fix checks of entry order for GIN indexes

This tightens a couple checks in checking GIN indexes, which might have
resulted in incorrect results (false positives/negatives).

* The code skipped ordering checks if the entries were for different
  attributes (for multi-column GIN indexes), possibly missing some cases
  of data corruption. But the attribute number is part of the ordering,
  so we can check that.

* The root page was skipped when checking entry order, but that is
  unnecessary. The root page is subject to the same ordering rules, we
  can process it just like any other page.

* The high key on the right-most page was not checked, but that is
  needed only for inner pages (we don't store the high key for those).
  For leaf pages we can check the high key just fine.

* Correct the detection of split pages. If the page gets split, the
  cached parent key is creater than the current child key (not less, as
  the core incorrectly expected).

Issues reported by Arseniy Mikhin, along with a proposed patch. Review
by Andrey M. Borodin, cleanup and improvements by me.

Author: Arseniy Mukhin 
Reviewed-by: Andrey M. Borodin 
Discussion: https://postgr.es/m/CAE7r3MJ611B9TE=yqbbncewp7-k64vws+sjk7xf6fjux77u...@mail.gmail.com
---
 contrib/amcheck/meson.build |   1 +
 contrib/amcheck/t/006_verify_gin.pl | 199 
 contrib/amcheck/verify_gin.c|  37 +++---
 3 files changed, 219 insertions(+), 18 deletions(-)
 create mode 100644 contrib/amcheck/t/006_verify_gin.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index b33e8c9b062..1f0c347ed54 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -49,6 +49,7 @@ tests += {
   't/003_cic_2pc.pl',
   't/004_verify_nbtree_unique.pl',
   't/005_pitr.pl',
+  't/006_verify_gin.pl',
 ],
   },
 }
diff --git a/contrib/amcheck/t/006_verify_gin.pl b/contrib/amcheck/t/006_verify_gin.pl
new file mode 100644
index 000..7fdde170e06
--- /dev/null
+++ b/contrib/amcheck/t/006_verify_gin.pl
@@ -0,0 +1,199 @@
+
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test:

Re: Per-role disabling of LEAKPROOF requirements for row-level security?

2025-06-16 Thread Tom Lane
Yugo Nagata  writes:
> Andreas Lind  wrote:
>> I dug into the code and noticed that restrictinfo->leakproof is only
>> being checked in two places (createplan.c and equivclass.c), so it seems
>> fairly easy to only selectively enforce it. Then there's the question of
>> how to configure it. I can think of a few possible ways:
>> 
>> 1) Add a BYPASSLEAKPROOF role attribute that can only be granted by a
>> superuser, similar to the BYPASSRLS flag.
>> 2) Add a session variable, eg. enable_security_leakproof, that can only
>> be set or granted to another role by a superuser.
>> 3) Make it a property of the individual POLICY that grants access to the
>> table. This would be a bit more granular than a global switch, but
>> there'd be some ambiguity when multiple policies are involved.

> I'm not sure whether multi-tenant applications fall into the category where
> LEAKPROOFness isn't considered important, since security is typically a key
> concern for users of such systems.

Yeah, ISTM that you might as well just disable the RLS policy as
ignore leakproofness, because it is completely trivial to examine
supposedly-hidden data if you can apply a non-leakproof function
to it.

So I like #3 the best.  We already have the ability to specify that
particular policies apply to just specific users, but it seems like
what you want here is the inverse: to be able to name specific users
that are exempt from a given policy.  (While that's not absolutely
essential, without it you might need very long and hard-to-maintain
lists of every-role-but-that-one.)  It doesn't seem to me to be
unreasonable to extend CREATE/ALTER POLICY in that direction.
Perhaps like

 CREATE POLICY name ON table_name
     [ AS { PERMISSIVE | RESTRICTIVE } ]
     [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ]
     [ TO { role_name | PUBLIC | CURRENT_ROLE | CURRENT_USER | SESSION_USER } [, ...] ]
+    [ EXCEPT { role_name | PUBLIC | CURRENT_ROLE | CURRENT_USER | SESSION_USER } [, ...] ]
     [ USING ( using_expression ) ]
     [ WITH CHECK ( check_expression ) ]

(Not sure that EXCEPT PUBLIC is sensible; also we'd need a decision
about what to do if the same role appears in both lists.)
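
Usage under that (hypothetical, not implemented) syntax might then look like:

```
-- Hypothetical example of the sketched EXCEPT clause; table, role,
-- and expression are made up for illustration:
CREATE POLICY tenant_isolation ON accounts
    FOR ALL
    TO PUBLIC
    EXCEPT auditor
    USING (tenant_id = current_setting('app.tenant_id')::int);
```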

regards, tom lane




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2025-06-16 Thread Sergey Sargsyan
Hey Mihail,

I've started looking at the patches today, mostly the STIR part. Seems
solid, but I've got a question about validation. Why are we still grabbing
tids from the main index and sorting them?

I think it's to avoid duplicate errors when adding tuples from STIP to the
main index, but couldn't we just suppress that error during validation and
skip the new tuple insertion if it already exists?

The main index may get huge after building, and iterating over it in a
single thread and then sorting tids can be time-consuming.

At least I guess one can skip it when STIP is empty. But, I think we could
skip it altogether by figuring out what to do with duplicates, making
concurrent and non-concurrent index creation almost identical in speed
(only locking and atomicity would differ).

p.s. I noticed that `stip.c` has a lot of functions that don't follow the
Postgres coding style of return type on separate line.

On Mon, Jun 16, 2025, 6:41 PM Mihail Nikalayeu 
wrote:

> Hello, everyone!
>
> Rebased, patch structure and comments available here [0]. Quick
> introduction poster - here [1].
>
> Best regards,
> Mikhail.
>
> [0]:
> https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%40mail.gmail.com#52c97e004b8f628473124c05e3bf2da1
> [1]:
> https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf
>


Re: amcheck support for BRIN indexes

2025-06-16 Thread Andrey Borodin
Hi!

Nice feature!

> On 8 Jun 2025, at 17:39, Arseniy Mukhin  wrote:
> 
> 


+#define BRIN_MAX_PAGES_PER_RANGE   131072

I'd personally prefer some words on where this limit comes from. I'm not a
BRIN pro, just curious.
Or, at least, maybe we could use the form 128 * 1024, if it's just a sane limit.
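
That is, perhaps simply:

```
/* illustrative only -- making the provenance of the limit explicit */
#define BRIN_MAX_PAGES_PER_RANGE    (128 * 1024)
```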



> On 8 Jun 2025, at 17:39, Arseniy Mukhin  wrote:
> 


A bit more detailed commit message would be very useful.

The test coverage is very decent. The number of possible corruptions in the tests
is impressive. I don't have enough experience with BRIN to say how different
they are, but I want to ask whether you are sure that these types of corruption
will be portable across big-endian machines and the like.

Copyright year in all new files should be 2025.

Some documentation about brin_index_check() would be handy, especially about 
its parameters. Perhaps, somewhere near gin_index_check() in amcheck.sgml.

brin_check_ereport() reports ERRCODE_DATA_CORRUPTED. We should distinguish 
between ERRCODE_INDEX_CORRUPTED which is a stormbringer and 
ERRCODE_DATA_CORRUPTED which is an actual disaster.

If it's not very difficult, it would be great to use the read_stream
infrastructure. See btvacuumscan() in nbtree.c calling
read_stream_begin_relation() for example. We cannot use it in logical scans in
B-tree/GiST/GIN, but maybe BRIN can take some advantage of this new shiny
technology.

+   state->revmapbuf = ReadBufferExtended(idxrel, MAIN_FORKNUM, 
state->revmapBlk, RBM_NORMAL,
+   
  state->checkstrategy);
+   LockBuffer(state->revmapbuf, BUFFER_LOCK_SHARE);
// usage of state->revmapbuf
+   LockBuffer(state->revmapbuf, BUFFER_LOCK_UNLOCK);
// more usage of state->revmapbuf
+   ReleaseBuffer(state->revmapbuf);

I hope you know what you are doing. BRIN concurrency is not known to me at all.


That's all for first pass through patches. Thanks for working on it!


Best regards, Andrey Borodin.



RE: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Devulapalli, Raghuveer
Attached is a simple reproducer. It passes with clang v16 at -O0, but fails
with clang 17 and 18, and only when built with -O0.

Build command: clang main.c -O0

Hope this helps. 
Raghuveer

> -Original Message-
> From: John Naylor 
> Sent: Sunday, June 15, 2025 7:39 PM
> To: Andy Fan 
> Cc: Jesper Pedersen ; Nathan Bossart
> ; Tomas Vondra ; Devulapalli,
> Raghuveer ; pgsql-
> hack...@lists.postgresql.org; Shankaran, Akash 
> Subject: Re: Improve CRC32C performance on SSE4.2
> 
> On Sun, Jun 15, 2025 at 8:32 AM Andy Fan  wrote:
> >
> > Jesper Pedersen  writes:
> >
> > Hi,
> >
> > Thank you Nathan, Tomas and Jesper for the answers. The patch at [0]
> > works for me and I could work with master smoothly now.
> 
> Pushed, thanks for testing! I'll do some more testing to see what 
> versions/levels
> are affected and file a bug report, but it'll be a few days before I get to 
> it.
> 
> --
> John Naylor
> Amazon Web Services
> 

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>      /* ssize_t */
#include <immintrin.h>      /* SSE4.2 and AVX-512 intrinsics */

__attribute__((target("sse4.2")))
static uint32_t
crc32c_scalar(const unsigned char *data, ssize_t len, uint32_t crc)
{
const unsigned char *p = data;
const unsigned char *pend = p + len;

while (p + 8 <= pend)
{
crc = (uint32_t) _mm_crc32_u64(crc, *((const uint64_t *) p));
p += 8;
}

/* Process remaining full four bytes if any */
if (p + 4 <= pend)
{
crc = _mm_crc32_u32(crc, *((const unsigned int *) p));
p += 4;
}

/* go byte by byte: */
while (p < pend)
{
crc = _mm_crc32_u8(crc, *p);
p++;
}

return crc;
}

#define clmul_lo_(a, b) (_mm512_clmulepi64_epi128((a), (b), 0))
#define clmul_hi_(a, b) (_mm512_clmulepi64_epi128((a), (b), 17))

__attribute__((target("avx512vl,vpclmulqdq")))
static uint32_t
crc32c_avx512(const unsigned char* data, ssize_t length, uint32_t crc)
{
/* adjust names to match generated code */
uint32_t crc0 = crc;
size_t  len = length;
const unsigned char *buf = data;

if (len >= 64)
{
const unsigned char *end = buf + len;
const unsigned char *limit = buf + len - 64;
__m128i z0;

/* First vector chunk. */
__m512i x0 = _mm512_loadu_si512((const void *) buf),
y0;
__m512i k;

k = _mm512_broadcast_i32x4(_mm_setr_epi32(0x740eef02, 0, 
0x9e4addf8, 0));
x0 = 
_mm512_xor_si512(_mm512_castsi128_si512(_mm_cvtsi32_si128(crc0)), x0);
buf += 64;

/* Main loop. */
while (buf <= limit)
{
y0 = clmul_lo_(x0, k), x0 = clmul_hi_(x0, k);
x0 = _mm512_ternarylogic_epi64(x0, y0, 
_mm512_loadu_si512((const void *) buf), 0x96);
buf += 64;
}

/* Reduce 512 bits to 128 bits. */
k = _mm512_setr_epi32(0x1c291d04, 0, 0xddc0152b, 0, 0x3da6d0cb, 
0, 0xba4fc28e, 0, 0xf20c0dfe, 0, 0x493c7d27, 0, 0, 0, 0, 0);
y0 = clmul_lo_(x0, k), k = clmul_hi_(x0, k);
y0 = _mm512_xor_si512(y0, k);
z0 = _mm_ternarylogic_epi64(_mm512_castsi512_si128(y0), 
_mm512_extracti32x4_epi32(y0, 1), _mm512_extracti32x4_epi32(y0, 2), 0x96);
z0 = _mm_xor_si128(z0, _mm512_extracti32x4_epi32(x0, 3));

/* Reduce 128 bits to 32 bits, and multiply by x^32. */
crc0 = _mm_crc32_u64(0, _mm_extract_epi64(z0, 0));
crc0 = _mm_crc32_u64(crc0, _mm_extract_epi64(z0, 1));
len = end - buf;
}

return crc0;
}


static uint8_t randomval()
{
return (rand() % 255);
}

int main() {
const int size = 64;
/* Initialize to random values */
unsigned char arr[size];
srand(42);
for (size_t ii = 0; ii < size; ++ii) {
arr[ii] = randomval();
}

/* Compute crc32c using simple scalar methods and SIMD method */
uint32_t avxcrc = crc32c_avx512(arr, size, 0x);
uint32_t scalar_crc = crc32c_scalar(arr, size, 0x);

/* ASSERT values are the same */
if (scalar_crc != avxcrc) {
printf("Failed! ");
}
else {
printf("Success! ");
}
printf("0x%x, 0x%x\n", scalar_crc, avxcrc);
return 0;
}



Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2025-06-16 Thread Sergey Sargsyan
Thank you for the information. Tomorrow, I will also run a few tests to
measure the time required to collect tids from the index; however, since I
do not work with vanilla postgres, the results may vary.

If the results indicate that this procedure is time-consuming, I may
develop an additional patch specifically for b-tree indexes, as they are
the default and most commonly used type.

Best regards,
Sergey


On Mon, Jun 16, 2025, 11:01 PM Mihail Nikalayeu 
wrote:

> Hello, Sergey!
>
> > I think it's to avoid duplicate errors when adding tuples from STIP to
> the main index,
> > but couldn't we just suppress that error during validation and skip the
> new tuple insertion if it already exists?
>
> In some cases, it is not possible:
> – Some index types (GiST, GIN, BRIN) do not provide an easy way to
> detect such duplicates.
> – When we are building a unique index, we cannot simply skip
> duplicates, because doing so would also skip the rows that should
> prevent the unique index from being created (unless we add extra logic
> for B-tree indexes to compare TIDs as well).
>
> > The main index may get huge after building, and iterating over it in a
> single thread and then sorting tids can be time consuming.
> My tests indicate that the overhead is minor compared with the time
> spent scanning the heap and building the index itself.
>
> > At least I guess one can skip it when STIP is empty.
> Yes, that’s a good idea; I’ll add it later.
>
> > p.s. I noticed that `stip.c` has a lot of functions that don't follow
> the Postgres coding style of return type on separate line.
> Hmm... I’ll fix that as well.
>
> Best regards,
> Mikhail.
>


Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Tomas Vondra
On 6/16/25 19:56, Tom Lane wrote:
> ...
> 
> After looking around a bit more I realized that this API is a complete
> disaster: it's not only bug-prone, but there are actual bugs in
> multiple callers, eg _ReadBuf() totally fails to detect early EOF as
> it intends to.  We need to fix it, not slap some band-aids on.
> Draft patch attached.
> 
> The write_func side is a bit of a mess too.  I fixed some obvious API
> violations in the attached, but I think we need to do more, because
> it seems that zero thought was given to reporting errors sanely.
> The callers all assume that strerror() is what to use to report an
> incomplete write, but this is not appropriate in every case, for
> instance we need to pay attention to gzerror() in gzip cases.
> I'm inclined to think that the best answer is to move error reporting
> into the write_funcs, and just make the API spec be "write or die".
> I've not tackled that here though.
> 
> I was depressed to find that Zstd_getc reads one byte into
> an otherwise-uninitialized int and then returns the whole int.
> Even if we're lucky enough for the variable to start off zero,
> this would fail utterly on big-endian machines.  The only
> reason we've not noticed is that the regression tests do not
> reach Zstd_getc, nor most of the other getc_func implementations.
> 
> Now I'm afraid to look at the rest of the compress_io.h API.
> If it's as badly written as these parts, there are more bugs
> to be found.
> 

Well, that doesn't sound great :-( Most of this is likely my fault, as
the one who pushed e9960732a961 (and some commits that built on it).
Sorry about that. I'll take a fresh look at the commits this week, but
considering I missed the issues before commit ...

For a moment I was worried about breaking ABI when fixing this in the
backbranches, but I guess that's not an issue for tools like pg_dump.


regards

-- 
Tomas Vondra





Re: Amcheck verification of GiST and GIN

2025-06-16 Thread Tomas Vondra
On 6/16/25 21:09, Arseniy Mukhin wrote:
> On Mon, Jun 16, 2025 at 6:58 PM Tomas Vondra  wrote:
>>
>> Thanks.
>>
>> I went through the patches, polished the commit messages and did some
>> minor tweaks in patch 0002 (to make the variable names a bit more
>> consistent, and reduce the scope a little bit). I left it as a separate
>> patch to make the changes clearer, but it should be merged into 0002.
>>
>> Please read through the commit messages, and let me know if I got some
>> of the details wrong (or not clear enough). Otherwise I plan to start
>> pushing this soon (~tomorrow).
> 
> LGTM.
> Noticed a few typos in messages:
> in v8-0002-amcheck-Fix-checks-of-entry-order-for-GIN-indexes.patch
>- parent key is creator
>- as the core incorrectly expected
> and 'Arseniy Mikhin' in some patches.
> 

Thanks for noticing those typos, especially the one in the name.


regards

-- 
Tomas Vondra





Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-06-16 Thread shveta malik
On Wed, Jun 11, 2025 at 2:31 AM Masahiko Sawada  wrote:
>
> I think it's the user's responsibility to keep at least one logical
> slot. It seems that setting wal_level to 'logical' would be the most
> reliable solution for this case. We might want to provide a way to
> keep 'logical' WAL level somehow but I don't have a good idea for now.
>

Okay, let me think more on this.

>
> Considering cascading replication cases too, 2) could be tricky as
> cascaded standbys need to propagate the information of logical slot
> creation up to the most upstream server.
>

Yes, I understand the challenges here.

Thanks for the v2 patches, few concerns:


1)
Now, when the slot on the standby is invalidated due to effective_wal_level
switching back to replica, and we then restart the standby, it fails to
restart even if wal_level is explicitly changed to logical in the conf file.

FATAL:  logical replication slot "slot_st" exists, but logical
decoding is not enabled
HINT:  Change "wal_level" to be "replica" or higher.


2)
I see that when the primary switches its effective wal_level back to
replica while the standby has wal_level=logical in its conf file, the
standby shows this status:

postgres=# show wal_level;
 wal_level
-----------
 logical

postgres=# show effective_wal_level;
 effective_wal_level
---------------------
 replica

Is this correct? Can effective_wal_level be < wal_level at any time? I
feel it can be greater but never less.

3)
When the standby invalidates obsolete slots because effective_wal_level on
the primary changed to replica, it logs the following:
LOG:  invalidating obsolete replication slot "slot_st2"
DETAIL:  Logical decoding on standby requires "wal_level" >= "logical"
on the primary server

Shall we update this message as well to mention slot presence on the primary?
DETAIL:  Logical decoding on standby requires "wal_level" >= "logical"
or presence of logical slot on the primary server.

4)
I see that the slotsync worker is running all the time now as against
the previous behaviour where it will not start if wal_level is less
than logical or switched to '< logical' anytime. Even with wal_level
and effective_wal_level set to replica, slot-sync keeps on attempting
synchronization. This does not look correct. I think we need to find a
way to stop sot-sync worker when effective_wal_level is switched to
replica from logical.

5)
Can you please help me understand the changes at [1].

a) Why is it needed when we have the code logic at [2]?
b) In [1], why do we check n_inuse_logical_slots on the standby and then
make decisions? Why not disable logical decoding directly, just like
[2]?

[1]:
+ if (xlrec.wal_level == WAL_LEVEL_LOGICAL)
+ {
+ /*
+ * If the primary increase WAL level to 'logical', we can
+ * unconditionally enable the logical decoding on the standby.
+ */
+ UpdateLogicalDecodingStatus(true);
+ }
+ else if (xlrec.wal_level == WAL_LEVEL_REPLICA &&
+ pg_atomic_read_u32(&ReplicationSlotCtl->n_inuse_logical_slots) == 0)
+ {
+ /*
+ * Disable the logical decoding if there is no in-use logical slot
+ * on the standby.
+ */
+ UpdateLogicalDecodingStatus(false);
+ }


[2]:
+ else if (info == XLOG_LOGICAL_DECODING_STATUS_CHANGE)
+ {
+ bool logical_decoding;
+
+ memcpy(&logical_decoding, XLogRecGetData(record), sizeof(bool));
+ UpdateLogicalDecodingStatus(logical_decoding);
+
+ /*
+ * Invalidate logical slots if we are in hot standby and the primary
+ * disabled the logical decoding.
+ */
+ if (!logical_decoding && InRecovery && InHotStandby)
+ InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_LEVEL,
+0, InvalidOid,
+InvalidTransactionId);
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->logicalDecodingEnabled = logical_decoding;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ }


thanks
Shveta




Re: Avoid possible dereference null pointer (contrib/postgres_fdw/postgres_fdw.c)

2025-06-16 Thread Fujii Masao




On 2025/06/17 4:48, Ranier Vilela wrote:

Hi.

In the function *estimate_path_cost_size* the parameter
fpextra can be NULL.


Yes.



It is necessary to always check its validity,
as is already done in other parts of the source.

patch attached.


 			adjust_foreign_grouping_path_cost(root, pathkeys,
 											  retrieved_rows, width,
-											  fpextra->limit_tuples,
+											  fpextra ? fpextra->limit_tuples : 0.0,
 											  &disabled_nodes,
 											  &startup_cost, &run_cost);

I couldn't find a query that would reach this code path with
fpextra == NULL, but I agree the current code is fragile.
So I think it's a good idea to add the check before accessing
the field.

Regards,

--
Fujii Masao
NTT DATA Japan Corporation





Re: fix: propagate M4 env variable to flex subprocess

2025-06-16 Thread Peter Eisentraut

On 28.05.25 20:42, J. Javier Maestro wrote:
On Wed, May 28, 2025 at 6:08 PM Andres Freund  wrote:


Hi,

On 2025-05-17 23:32:24 -0400, J. Javier Maestro wrote:
 > On Tue, May 13, 2025 at 11:54 AM Andres Freund  wrote:
 > > Bilal, I think you wrote this originally, do you recall?
 > >
 > > It seems like an issue beyond just M4...
 > >
 >
 > IIRC the rest of the tools in the environment have ways to be specified via
 > Meson options (BISON, FLEX, PERL) so the only issue I see is Flex not being
 > able to find the specific m4 binary. What other issue(s) are you
 > considering?

PATH, LD_LIBRARY_PATH, ...

I think this really should just add to the environment, rather than
supplant it.


Ah, understood. That definitely looks like a better option.

Do you want to write a patch like that? Otherwise I can.


Sure, I've attached the new patch. Let me know what you think, and if
it's OK, what the next steps are to get the patch merged into main!


This patch looks right to me.

I would wait for the PG19 branching at this point, unless there is a 
concrete need for backpatching.






Re: pg_upgrade fails with an error "object doesn't exist"

2025-06-16 Thread Daniel Gustafsson
> On 17 Jun 2025, at 04:58, Tom Lane  wrote:

> There are enough moving parts in that area
> already that I'm not eager to add more constraints to what pg_dump
> needs to do.

Agreed.  If we were to do anything, I think a check in pg_upgrade would be more
appropriate than altering pg_dump (but I'm not sure it's worth spending the
cycles in every upgrade to test for this; the check phase is already quite
extensive).

--
Daniel Gustafsson





Re: Replication slot is not able to sync up

2025-06-16 Thread Amit Kapila
On Mon, Jun 16, 2025 at 9:27 AM shveta malik  wrote:
>
> Thanks Peter and Amit for feedback. I have updated the patch.
>


+ When slot-synchronization setup is done as recommended, and
+ slot-synchronization is performed the very first time either automatically
+ or by 
+ pg_sync_replication_slots,
+ then for the synchronized slot to be created and persisted on the standby,
+ one condition must be met. The logical replication slot on the primary
+ must reach a state where the WALs and system catalog rows retained by
+ the slot are also present on the corresponding standby server. This is
+ needed to prevent any data loss and to allow logical replication
to continue
+
...

This whole paragraph sounds like a duplicate of its previous section,
and the line alignment in the first paragraph has some issues.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Add an ldflags_sl meson build option

2025-06-16 Thread Matt Smith (matts3)
I have a Bazel toolchain that sets different linker flags for exe/shared lib.

I wanted to build postgresql shared libs with meson using the same toolchain 
flags that I build all my other 3rd party deps in Bazel.

Currently for postgresql builds, we set LDFLAGS to the Bazel toolchain 
executable flags which means that any postgresql shared_library() gets the exe 
flags.

I noticed ldflags_sl was already set up to pass through shared_library() flags,
but I couldn't find a way to set it via meson build options.

From: Peter Eisentraut 
Sent: Monday, June 16, 2025 9:32 PM
To: Matt Smith (matts3) ; pgsql-hack...@postgresql.org 

Subject: Re: [PATCH] Add an ldflags_sl meson build option

On 16.06.25 06:10, Matt Smith (matts3) wrote:
> Currently there doesn't seem to be a way to pass shared library-specific
> flags via a meson build option.
>
> This patch allows the existing ldflags_sl to be set via a build option.

What uses do you have in mind for this?



Re: Init connection time grows quadratically

2025-06-16 Thread Потапов Александр

To be more precise, I used a constant number of threads (128 and 1024) to
compare with previous results. The quadratic dependency exists everywhere; see
the new graph.

> Q: Did you check that pgbench or the OS does not have
> O(n_active_connections) or O(n_active_threads) overhead per worker
> during thread creation or connection establishment, e.g. by varying
> the number of threads used to manage these N clients? I wouldn't be
> surprised if there are inefficiencies in e.g. the threading- or
> synchronization model that cause O(N) per-thread overhead, or O(N^2)
> overall when you have one thread per connection.
 


Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Evgeniy Gorbanev



On 16.06.2025 14:25, Daniel Gustafsson wrote:

On 16 Jun 2025, at 10:14, Evgeniy Gorbanev  wrote:
In src/bin/pg_dump/compress_zstd.c, the Zstd_read function always
returns true. But if you look at the Zstd_gets and Zstd_getc functions,
where Zstd_read is called via CFH->read_func, it is expected that
the Zstd_read function can also return false. In case of
a read-from-file error, the process is expected to terminate, but
pg_dump will continue the process.
I assume that after checking
if (cnt == 0)
should be
return false;

if (cnt == 0)
-   break;
+   return false;

As cnt is coming from fread() returning false here would be wrong as you cannot
from 0 alone know if it's EOF or an error.  Instead it needs to inspect the
stream with feof() and ferror() to know how to proceed.
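
For illustration, here is a minimal standalone sketch of that distinction
(a hypothetical helper, not pg_dump code): fread() returning 0 is ambiguous
until feof()/ferror() is consulted.

#include <stdio.h>
#include <stdlib.h>

static size_t
read_some(FILE *fp, char *buf, size_t len)
{
    size_t  total = 0;

    while (total < len)
    {
        size_t  cnt = fread(buf + total, 1, len - total, fp);

        if (cnt == 0)
        {
            if (ferror(fp))
            {
                fprintf(stderr, "read error\n");
                exit(EXIT_FAILURE);
            }
            break;              /* feof(fp) holds: clean end of stream */
        }
        total += cnt;
    }
    return total;
}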

--
Daniel Gustafsson


The feof() check is in the calling functions, e.g. in the Zstd_getc
function.


Regards, Evgeniy





Re: pg_upgrade fails with an error "object doesn't exist"

2025-06-16 Thread Vaibhav Dalvi
Hi Daniel,

Thanks for your response.

On Mon, Jun 16, 2025 at 1:27 PM Daniel Gustafsson  wrote:

> > On 16 Jun 2025, at 09:29, Vaibhav Dalvi 
> wrote:
>
> > I'm able to create the object as shown in the below:
> >
> > postgres=# CREATE OR REPLACE FUNCTION pg_catalog.nont_ext_func() RETURNS
> char AS $$ BEGIN return 'v'; END; $$ LANGUAGE plpgsql;
> > CREATE FUNCTION
> >
> > Why can't we strictly restrict object creation in pg_catalog?
>
> Do you have allow_system_table_mods set to ON by any chance?  As Laurenz
> said,
> such creation is already restricted, but it can be circumvented by using
> said
> GUC (which is *not* intended for production usage).
>
> --
> Daniel Gustafsson
>

It's OFF.
postgres=# select version();
                                                version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 18beta1 on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu
14.2.0-4ubuntu2~24.04) 14.2.0, 64-bit
(1 row)
postgres=# show allow_system_table_mods ;
 allow_system_table_mods
--------------------------
 off
(1 row)
postgres=# CREATE FUNCTION pg_catalog.nont_ext_func() RETURNS char AS $$
BEGIN return 'v'; END; $$ LANGUAGE plpgsql;
CREATE FUNCTION
postgres=# \df+ nont_ext_func
                                                                                List of functions
   Schema   |     Name      | Result data type | Argument data types | Type | Volatility | Parallel |  Owner  | Security | Leakproof? | Access privileges | Language | Internal name | Description
------------+---------------+------------------+---------------------+------+------------+----------+---------+----------+------------+-------------------+----------+---------------+-------------
 pg_catalog | nont_ext_func | character        |                     | func | volatile   | unsafe   | vaibhav | invoker  | no         |                   | plpgsql  |               |
(1 row)

Regards,
Vaibhav


Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Daniel Gustafsson
> On 16 Jun 2025, at 10:52, Evgeniy Gorbanev  wrote:
> 
> 16.06.2025 14:25, Daniel Gustafsson writes:
>>> On 16 Jun 2025, at 10:14, Evgeniy Gorbanev  wrote:
>>> In src/bin/pg_dump/compress_zstd.c, the Zstd_read function always
>>> returns true. But if you look at the Zstd_gets and Zstd_getc functions,
>>> where Zstd_read is called via CFH->read_func, it is expected that
>>> the Zstd_read function can also return false. In case of
>>> a read-from-file error, the process is expected to terminate, but
>>> pg_dump will continue the process.
>>> I assume that after checking
>>> if (cnt == 0)
>>> should be
>>> return false;
>>   if (cnt == 0)
>> - break;
>> + return false;
>> 
>> As cnt is coming from fread() returning false here would be wrong as you 
>> cannot
>> from 0 alone know if it's EOF or an error.  Instead it needs to inspect the
>> stream with feof() and ferror() to know how to proceed.
> 
> The feof() check is in the calling functions, e.g. in the Zstd_getc
> function.

That function is using feof/ferror to log an appropriate error message, what
you are proposing is to consider all cases of EOF as an error.  If you test
that you will see lots of test starting to fail.

--
Daniel Gustafsson





RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

2025-06-16 Thread Hayato Kuroda (Fujitsu)
Dear Alexander,

Thanks for pushing the fix patch! BTW, I have few comments for your commits.
Can you check and include them if needed?

01.
```
$node->append_conf('postgresql.conf',
"shared_preload_libraries = 'injection_points'");
```

No need to set shared_preload_libraries in 046/047. ISTM it must be set when we
enable the statistics.

02.
We should also check whether the injection_points can be installed or not.
You can check check_extension() and callers.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Evgeniy Gorbanev

16.06.2025 15:43, Daniel Gustafsson writes:

On 16 Jun 2025, at 11:26, Evgeniy Gorbanev  wrote:
I ran tests for pg_dump and they all passed. Logs attached.

Files=7, Tests=11918, 35 wallclock secs ( 0.59 usr  0.07 sys +  7.92 cusr  4.32 
csys = 12.90 CPU)

That seems like a low number of tests executed, with ZStd enabled I see over
13200 tests in the log.  Are you sure you are building with ZStd compression
enabled?

--
Daniel Gustafsson



Yes, you are right. Now I see the errors.




Re: [PATCH] Split varlena.c into varlena.c and bytea.c

2025-06-16 Thread Aleksander Alekseev
Hi Peter,

> >> The proposed patch extracts the code dealing with bytea from varlena.c
> >> into a separate file, as proposed previously [1]. It merely changes
> >> the location of the existing functions. There are no other changes.
> >
> > Rebased.
>
> I think this is reasonable.  varlena.c is pretty big, and bytea is a
> reasonable subset to take out.  We already have a corresponding header
> file bytea.h, so the naming fits.
>
> I wonder if makeStringAggState() is still useful.  This was factored out
> so that it could be shared between bytea_string_agg_transfn() and
> string_agg_transfn(), but now that these are in separate files, it seems
> to me that it might be easier now to just write the required code inline.

Agreed. I assume you meant only bytea.c. Since varlena.c uses
makeStringAggState() in several places, I chose to keep it there.

> Here are some refinements to the includes:
>
> --- a/src/backend/utils/adt/bytea.c
> +++ b/src/backend/utils/adt/bytea.c
> @@ -15,13 +15,20 @@
>   #include "postgres.h"
>
>   #include "access/detoast.h"
> -#include "catalog/pg_collation.h"
> +#include "catalog/pg_collation_d.h"
> +#include "catalog/pg_type_d.h"
>   #include "common/int.h"
> -#include "funcapi.h"
> +#include "fmgr.h"
>   #include "libpq/pqformat.h"
> +#include "port/pg_bitutils.h"
>   #include "utils/builtins.h"
>   #include "utils/bytea.h"
> +#include "utils/fmgroids.h"
> +#include "utils/fmgrprotos.h"
> +#include "utils/memutils.h"
> +#include "utils/sortsupport.h"
>   #include "utils/varlena.h"
> +#include "varatt.h"
>
> Especially funcapi.h was apparently not used.  The other additions are
> required includes that came previously via indirect includes.

Thanks, fixed. clangd complained that utils/fmgroids.h is not going to
be used, just as it complained about funcapi.h before. So I chose not to
include fmgroids.h.

PFA patch v3.

-- 
Best regards,
Aleksander Alekseev


v3-0001-Split-varlena.c-into-varlena.c-and-bytea.c.patch
Description: Binary data


Re: Fwd: dsm_registry: Add detach and destroy features

2025-06-16 Thread Nathan Bossart
On Mon, Jun 16, 2025 at 09:02:22AM +0900, Sungwoo Chang wrote:
> On Sat, Jun 14, 2025 at 6:50 AM, Nathan Bossart wrote:
>> Could your use-case be handled with a DSA?  On the other thread [0], we're
>> talking about adding a GetNamedDSA() function, which returns a DSA that you
>> can use to allocate and free shared memory as needed.  In theory you could
>> even detach the DSA if you no longer needed it in a backend, although
>> that's probably unnecessary.
> 
> My use-case requires access to the shared memory object through a named key.
> Even if we migrate the code to NamedDSA, within the DSA we will need some sort
> of map between the named key and the object to access. So I think NamedDSA
> won't be the solution.

Right, you'd need some other shared space for the DSA pointers.  In the
other thread, I'm using a dshash table (created via GetNamedDSMHash()) to
store those for test_dsm_registry [0].  There are probably other ways to do
this.

> How about when we call destroy, we check if there are other processes
> attached to it and if so, we throw an exception? I checked the C++ Boost
> interprocess library [0], and it looks like that's the way Boost does it.
> This way, we can avoid the aforementioned "partitioned" scenario.

That might work.

[0] https://postgr.es/m/aEyX-9k5vlK2lxjz%40nathan
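
For concreteness, here is a sketch of the destroy-time check being discussed,
in backend style. Note this is hypothetical: the registry currently tracks no
per-entry reference count, so "refcnt" is an assumed field.

static void
dsm_registry_destroy_sketch(DSMRegistryEntry *entry, const char *name)
{
    /* refuse to destroy while any other backend is still attached */
    if (pg_atomic_read_u32(&entry->refcnt) > 1)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_IN_USE),
                 errmsg("cannot destroy \"%s\": other backends are attached",
                        name)));

    dsm_unpin_segment(entry->handle);
}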

-- 
nathan




Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)

2025-06-16 Thread Andres Freund
Hi,

On 2025-06-16 17:28:31 +0300, Yura Sokolov wrote:
> 04.06.2025 00:04, Andres Freund writes:
> > Hi,
> > 
> > On 2025-06-02 21:20:33 +0300, Yura Sokolov wrote:
> >> But still problem of spin lock contention is here.
> > 
> > I still would like to see a reproducer for this.
> 
> For the problem in sinvaladt.c we have no synthetic reproducer. But the
> version with maxMsgNum changed to an atomic solved the customer's issue.

TBH, I don't see a point in continuing with this thread without something that
others can test.  I rather doubt that the right fix here is to just change the
lock model over, but without a repro I can't evaluate that.

Greetings,

Andres Freund




Re: relrewrite not documented at the top of heap_create_with_catalog()

2025-06-16 Thread Yugo Nagata
On Mon, 16 Jun 2025 16:51:46 +0800
Steven Niu  wrote:

> Hi, Michael,
> 
> The change looks good to me.
> 
> I made more change based on your patch to combine the old comment.

I think it's a good idea to move the description of heap_create_with_catalog
directly above the function itself, as it seems better to keep the explanation
close to the function definition rather than placing it before related 
functions.
A similar change was made to heap_drop_with_catalog in commit 49ce6fff1d34.

I'm not sure whether this should be merged into the original patch, though.

I have a very minor comment on the initial patch.

+ * relrewrite: link to original relation during a table rewrite.

I wonder if the period at the end is unnecessary, since this is not a full
sentence, and for consistency with the other lines.

Best regards,
Yugo Nagata

> Regards,
> Steven
> 
> On Mon, Jun 16, 2025 at 14:49, Michael Paquier wrote:
> 
> > Hi all,
> >
> > While looking at the heap code, I've found that relrewrite, parameter
> > used a trick to keep a link to the original relation rewritten, is not
> > documented at the top of heap_create_with_catalog() contrary to all
> > its other parameters.
> >
> > A simple patch is attached to document that.
> >
> > Thanks,
> > --
> > Michael
> >


-- 
Yugo Nagata 




Re: Read-Write optimistic lock (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)

2025-06-16 Thread Yura Sokolov
04.06.2025 00:04, Andres Freund writes:
> Hi,
> 
> On 2025-06-02 21:20:33 +0300, Yura Sokolov wrote:
>> But still problem of spin lock contention is here.
> 
> I still would like to see a reproducer for this.

For the problem in sinvaladt.c we have no synthetic reproducer. But the
version with maxMsgNum changed to an atomic solved the customer's issue.

For the problem on XLogRecoveryCtlData->info_lck we have a reproducer which
shows that a follower with a lot of logical walsenders lags less and uses
less CPU with the patched version... but only on a 4-socket Xeon(R) Gold
6348H. There is no benefit from the patch on a 2-socket Intel(R) Xeon(R)
Gold 5220R.

>> So I propose to introduce another spin lock type capable for Exclusive and
>> Shared lock modes (i.e. Write/Read modes) and use it in this two places.
> 
> I am vehemently opposed to that. We should work towards getting rid of
> spinlocks, not introduce more versions of spinlocks. Userspace spinlocks
> largely are a bad idea. We should instead make properly queued locks cheaper
> (e.g. by having an exclusive-only lock, which can be cheaper to release on
> common platforms).

1. When will the faster lock arrive?
2. An "exclusive-only" lock will never scale in a scenario with few writers
and a lot of readers. That is a direct consequence of Amdahl's Law.


To be honest, our fix for XLogRecoveryCtlData->info_lck is based on an
optimistic read-write lock. I tried to make a more regular read-write
spin lock in the previous patch version to make it more familiar.

But after playing a bit with variants using a simple C program [1], I found
the optimistic lock is the only really scalable solution for the problem
(aside from direct use of atomic operations).

So the attached version contains the optimistic read-write lock used in
these two places.

[1] https://pastebin.com/CwSPkeGi
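
For readers unfamiliar with the pattern, here is a minimal self-contained
sketch of the version-check idea (illustrative names only; this is not the
attached patch, which additionally falls back to a plain spin lock when no
native 64-bit atomics are available):

#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
    _Atomic uint64_t version;   /* odd while a writer is in progress */
    uint64_t    data;           /* the protected value */
} OptRWLockSketch;

/* Reader: no stores to the lock word; retry if a writer interfered. */
static uint64_t
opt_read(OptRWLockSketch *lk)
{
    for (;;)
    {
        uint64_t    v1,
                    v2,
                    result;

        v1 = atomic_load_explicit(&lk->version, memory_order_acquire);
        if (v1 & 1)
            continue;           /* writer active, spin */

        /* strictly, data should also be read with relaxed atomics */
        result = lk->data;

        /* order the data read before the version re-check */
        atomic_thread_fence(memory_order_acquire);
        v2 = atomic_load_explicit(&lk->version, memory_order_relaxed);
        if (v1 == v2)
            return result;      /* version unchanged: read was consistent */
    }
}

/* Writer (single writer assumed): odd version marks the critical section. */
static void
opt_write(OptRWLockSketch *lk, uint64_t newval)
{
    uint64_t    v = atomic_load_explicit(&lk->version, memory_order_relaxed);

    atomic_store_explicit(&lk->version, v + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    lk->data = newval;
    atomic_store_explicit(&lk->version, v + 2, memory_order_release);
}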

-- 
regards
Yura Sokolov aka funny-falcon

From 04ca9870a9bb1036aede6ee32b616908a065341a Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 2 Jun 2025 19:26:20 +0300
Subject: [PATCH v5] Read-Write optimistic spin lock.

There are a couple of places where a spin lock is used just to separate
readers and writers, with a lot of readers and one or a few writers.
Those places are a source of contention on huge system setups because
readers create a bottleneck on the always-exclusive slock_t.

Mitigate it by introducing an optimistic read-write spin lock.
A read is performed without any write operation on the lock itself.
Correctness of the read is checked by comparing the lock version before
and after the read.

Use this RWSpin lock for sending invalidations and notifications of
walsenders.

If no native 64-bit atomic support is present, fall back to a regular
spin lock.
---
 src/backend/access/transam/xlogrecovery.c |  70 +--
 src/backend/storage/ipc/sinvaladt.c   |  14 +--
 src/backend/storage/lmgr/Makefile |   1 +
 src/backend/storage/lmgr/meson.build  |   1 +
 src/backend/storage/lmgr/rwoptspin.c  |  33 +
 src/include/storage/rwoptspin.h   | 139 ++
 src/tools/pgindent/typedefs.list  |   1 +
 7 files changed, 217 insertions(+), 42 deletions(-)
 create mode 100644 src/backend/storage/lmgr/rwoptspin.c
 create mode 100644 src/include/storage/rwoptspin.h

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..2c651d7b83d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -56,7 +56,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
-#include "storage/spin.h"
+#include "storage/rwoptspin.h"
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
@@ -364,7 +364,7 @@ typedef struct XLogRecoveryCtlData
 	RecoveryPauseState recoveryPauseState;
 	ConditionVariable recoveryNotPausedCV;
 
-	slock_t		info_lck;		/* locks shared variables shown above */
+	RWOptSpin	info_lck;		/* locks shared variables shown above */
 } XLogRecoveryCtlData;
 
 static XLogRecoveryCtlData *XLogRecoveryCtl = NULL;
@@ -471,7 +471,7 @@ XLogRecoveryShmemInit(void)
 		return;
 	memset(XLogRecoveryCtl, 0, sizeof(XLogRecoveryCtlData));
 
-	SpinLockInit(&XLogRecoveryCtl->info_lck);
+	RWOptSpinInit(&XLogRecoveryCtl->info_lck);
 	InitSharedLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
 	ConditionVariableInit(&XLogRecoveryCtl->recoveryNotPausedCV);
 }
@@ -1668,7 +1668,7 @@ PerformWalRecovery(void)
 	 * we had just replayed the record before the REDO location (or the
 	 * checkpoint record itself, if it's a shutdown checkpoint).
 	 */
-	SpinLockAcquire(&XLogRecoveryCtl->info_lck);
+	RWOptSpinAcquire(&XLogRecoveryCtl->info_lck);
 	if (RedoStartLSN < CheckPointLoc)
 	{
 		XLogRecoveryCtl->lastReplayedReadRecPtr = InvalidXLogRecPtr;
@@ -1686,7 +1686,7 @@ PerformWalRecovery(void)
 	XLogRecoveryCtl->recoveryLastXTime = 0;
 	XLogRecoveryCtl->currentChunkStartTime = 0;
 	XLogRecoveryCtl->recoveryPauseState = RECOVERY_NOT_PAUSED;
-	SpinLockRelease(&XLogRecoveryCtl->info

Re: Fix potential overflow risks from wcscpy and sprintf

2025-06-16 Thread Yan Haibo
Thank you, Peter. It seems the patch may have been lost during our earlier 
communication, so I've reattached it here.
I hope it comes through correctly this time.
Best regards,
Haibo


From: Peter Eisentraut 
Sent: June 11, 2025 00:43
To: Yan Haibo ; pgsql-hackers@lists.postgresql.org 
Subject: Re: Fix potential overflow risks from wcscpy and sprintf

On 06.06.25 22:50, Yan Haibo wrote:
> This change stems from a recent static code analysis, which identified a
> minor potential overflow issue. I would appreciate it if someone could
> review the fix at their convenience.

Please provide more detail in each case what the issue is and how you
are proposing to fix it.

Note that PostgreSQL code is regularly run through various static
analysis tools, so if the issue were super-obvious, it would probably
have been noticed by now.



0001-Mitigate-potential-overflow-risks-from-wcscpy-and-sp.patch
Description:  0001-Mitigate-potential-overflow-risks-from-wcscpy-and-sp.patch


Re: Re: Fix potential overflow risks from wcscpy and sprintf

2025-06-16 Thread Yan Haibo
Thank you, Tom. I agree that fixing the sprintf usage is not well-timed at the 
moment, so I've removed that change.
Regarding the use of wcsncpy with LOCALE_NAME_MAX_LENGTH - 1, it is a 
precaution in case the input string is not null-terminated.
Thanks again,
Haibo


From: Tom Lane 
Sent: June 16, 2025 11:28
To: Yan Haibo 
Cc: Peter Eisentraut ; pgsql-hackers@lists.postgresql.org 
Subject: Re: Re: Fix potential overflow risks from wcscpy and sprintf

Yan Haibo  writes:
> Thank you, Peter. It seems the patch may have been lost during our earlier 
> communication, so I've reattached it here.
> I hope it comes through correctly this time.

Thanks for the patch.

Using wcsncpy in search_locale_enum() seems fine, assuming it exists
on Windows (note that code is Windows-only, possibly explaining why
we've not seen other static-analysis reports).  I doubt there's any
actual bug there, since we're relying on Windows' own
LOCALE_NAME_MAX_LENGTH constant; but I agree the chain of reasoning
is kind of long.  (But shouldn't you write LOCALE_NAME_MAX_LENGTH
not LOCALE_NAME_MAX_LENGTH - 1?)

I'm unexcited about the guc.c changes.  There is visibly no bug
there.  The only reason to change it would be if we were going
to introduce a strict project policy against using sprintf(),
which we're not likely to given that there are hundreds of other
occurrences in our code base.  I don't see a reason to think
that these three are more pressing than the others.

regards, tom lane


0001-Mitigate-potential-overflow-risks-from-wcscpy.patch
Description: 0001-Mitigate-potential-overflow-risks-from-wcscpy.patch


Re: Allow pg_dump --statistics-only to dump foreign table statistics?

2025-06-16 Thread Corey Huinker
>
> I skimmed through the main thread for the feature [0] (which seems to be so
> long that it sometimes doesn't load completely), and I didn't see anything
> directly related to the topic.  There was some discussion about importing
> foreign relation stats with the new functions instead of remote table
> samples, but that's a different thing


If we aren't exporting stats for foreign tables then that is an oversight;
the intention was always to fetch all available statistics for all
relations. I can't offhand think of where we even have the option to
exclude them.


Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Tom Lane
I think the real problem here is that what the code is doing is
precisely not what is documented in compress_io.h:

/*
 * Read 'size' bytes of data from the file and store them into 'ptr'.
 * Optionally it will store the number of bytes read in 'rsize'.
 *
 * Returns true on success and throws an internal error otherwise.
 */
bool(*read_func) (void *ptr, size_t size, size_t *rsize,
  CompressFileHandle *CFH);

I've not checked to see what the other users of this API do, but
if they're all like this then we need to fix that comment.
Otherwise we probably need to make them all alike; even if there's
not a bug today, one will soon appear from someone believing the
comment.

regards, tom lane




Re: Add XMLNamespaces to XMLElement

2025-06-16 Thread Jim Jones
rebase

-- 
Jim
From 9c36deb99e600fe9a76fd1942081f3031bbd6216 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Mon, 16 Jun 2025 14:43:11 +0200
Subject: [PATCH v8] Add XMLNamespaces option to XMLElement

This patch adds the scoped option XMLNamespaces to the XMLElement()
function, as described in ISO/IEC 9075-14:2023, section 11.2,
"XML lexically scoped options".

Syntax:
  xmlnamespaces(uri AS prefix, ...)
  xmlnamespaces(DEFAULT uri, ...)
  xmlnamespaces(NO DEFAULT, ...)

* prefix: Namespace prefix to associate with the URI.
* uri:The namespace URI.
* DEFAULT uri:Specifies the default namespace within the scope.
* NO DEFAULT: Specifies that no default namespace is in effect.

== Examples ==

  SELECT xmlelement(NAME "foo", xmlnamespaces('http:/x.y' AS bar));
  SELECT xmlelement(NAME "foo", xmlnamespaces(DEFAULT 'http:/x.y'));
  SELECT xmlelement(NAME "foo", xmlnamespaces(NO DEFAULT));
---
 doc/src/sgml/func.sgml  |  57 +-
 src/backend/parser/gram.y   | 100 +--
 src/backend/parser/parse_clause.c   |   7 +-
 src/backend/parser/parse_expr.c |  80 +
 src/backend/utils/adt/ruleutils.c   |  79 +++--
 src/backend/utils/adt/xml.c |  55 +-
 src/include/nodes/primnodes.h   |   4 +-
 src/include/utils/xml.h |   6 +
 src/test/regress/expected/xml.out   | 259 
 src/test/regress/expected/xml_1.out | 197 +
 src/test/regress/expected/xml_2.out | 259 
 src/test/regress/sql/xml.sql| 133 ++
 12 files changed, 1202 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c67688cbf5..10b5682542 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -14720,7 +14720,10 @@ SELECT xmlconcat('', '
 
 
-xmlelement ( NAME name , XMLATTRIBUTES ( attvalue  AS attname  , ... )  , content , ... ) xml
+xmlelement ( NAME name
+, XMLATTRIBUTES ( attvalue  AS attname  , ... ) 
+, XMLNAMESPACES ( {regular-nsuri AS nsprefix | DEFAULT default-nsuri | NO DEFAULT} , ... ) 
+, content , ... ) xml
 
 
 
@@ -14733,7 +14736,39 @@ SELECT xmlconcat('', 'PostgreSQL data type.  The
  argument(s) within XMLATTRIBUTES generate attributes
  of the XML element; the content value(s) are
- concatenated to form its content.
+ concatenated to form its content. The arguments within XMLNAMESPACES
+ constructs namespace declarations from values provided in nsuri
+ and nsprefix, which correspond to the URI of a namespace and
+ its prefix, respectively. The option DEFAULT can be used to set the
+ default namespace declaration (without a prefix) to the URI provided in default-nsuri.
+ The option NO DEFAULT states that a namespace scope has no default namespace. A valid
+ XMLNAMESPACES item must fulfill the following conditions:
+
+
+
+ 
+  Only a single DEFAULT declaration item within the same scope.
+ 
+
+
+ 
+  No two nsuri can be equal within the same scope.
+ 
+
+
+ 
+  No nsprefix can be equal to xml or xmlns,
+  and no nsuri can be equal to http://www.w3.org/2000/xmlns/
+  or to http://www.w3.org/XML/1998/namespace, as they are already bound to standard XML declarations.
+ 
+
+
+ 
+  The value of a regular-nsuri cannot be a zero-length string.
+ 
+
+   
+
 
 
 
@@ -14756,6 +14791,24 @@ SELECT xmlelement(name foo, xmlattributes(current_date as bar), 'cont', 'ent');
              xmlelement
 -------------------------------------
  <foo bar="2007-01-26">content</foo>
+
+SELECT xmlelement(NAME "foo:root", xmlnamespaces('http:/foo.bar/' AS foo), 'content');
+
+   xmlelement
+-
+ content
+
+ SELECT xmlelement(NAME "root", xmlnamespaces(DEFAULT 'http:/foo.bar/'), 'content');
+
+ xmlelement
+-
+ content
+
+ SELECT xmlelement(NAME "root", xmlnamespaces(NO DEFAULT), 'content');
+
+  xmlelement
+---
+ content
 ]]>
 
 
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 50f53159d5..e605d173e9 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -136,6 +136,12 @@ typedef struct KeyActions
 	KeyAction *deleteAction;
 } KeyActions;
 
+typedef struct XmlElementOpts
+{
+	List	   *xml_attributes;
+	List	   *xml_namespaces;
+} XmlElementOpts;
+
 /* ConstraintAttributeSpec yields an integer bitmask of these flags: */
 #define CAS_NOT_DEFERRABLE			0x01
 #define CAS_DEFERRABLE0x02
@@ -186,7 +192,7 @@ static Node *makeNotExpr(Node *expr, int location);
 static Node *makeAArrayExpr(List *elements, int location, int end_location);
 static Node *makeSQLValueFunction(SQLValueFunctionOp op, int32 typmod,
   int location);
-static Node *makeXmlExpr(XmlExprOp op, char *name, List *named_args,
+s

Re: pg_upgrade fails with an error "object doesn't exist"

2025-06-16 Thread Tom Lane
Daniel Gustafsson  writes:
> On 16 Jun 2025, at 09:29, Vaibhav Dalvi  
> wrote:
>> Why can't we strictly restrict object creation in pg_catalog?

> Do you have allow_system_table_mods set to ON by any chance?  As Laurenz said,
> such creation is already restricted, but it can be circumvented by using said
> GUC (which is *not* intended for production usage).

I think that setting only applies to creating or modifying *tables*,
not functions.  The point of it is to keep you from breaking the C
code's assumptions about the layout of system catalogs.

Having said that, I don't see a problem here.  You're not going
to be able to create/modify functions in pg_catalog unless you
are superuser (or a superuser gave you permissions you shouldn't
have).  There are already a near-infinite number of ways
for a superuser to break the system, so this one isn't making it
detectably worse.  Furthermore, there are legitimate use-cases
for adding/changing functions there.  (I recall that the old
"adminpack" extension used to do so, for example, and there are
probably others that still do.)

regards, tom lane




Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Daniel Gustafsson
> On 16 Jun 2025, at 15:56, Tom Lane  wrote:

> I've not checked to see what the other users of this API do, but
> if they're all like this then we need to fix that comment.

AFAICT all other callers of this API are throwing an error with pg_fatal, and
so does the function in question for ZStd decompression errors.  If we handle
the case of fread() returning 0 to indicate an error like the below *untested
sketch* (with a better error message) this function is fully API compliant as
well.

/* If we have no more input to consume, we're done */
if (cnt == 0)
+   {
+   if (ferror(unconstify(void *, input->src)))
+       pg_fatal("could not read data to decompress: %m");
+
break;
+   }

If this seems like a good approach then Zstd_getc can be simplified as well,
since it no longer needs to call ferror; it still needs to check feof though.

--
Daniel Gustafsson





Re: Amcheck verification of GiST and GIN

2025-06-16 Thread Arseniy Mukhin
On Mon, Jun 16, 2025 at 6:58 PM Tomas Vondra  wrote:
>
> Thanks.
>
> I went through the patches, polished the commit messages and did some
> minor tweaks in patch 0002 (to make the variable names a bit more
> consistent, and reduce the scope a little bit). I left it as a separate
> patch to make the changes clearer, but it should be merged into 0002.
>
> Please read through the commit messages, and let me know if I got some
> of the details wrong (or not clear enough). Otherwise I plan to start
> pushing this soon (~tomorrow).

LGTM.
Noticed a few typos in messages:
in v8-0002-amcheck-Fix-checks-of-entry-order-for-GIN-indexes.patch
   - parent key is creator
   - as the core incorrectly expected
and 'Arseniy Mikhin' in some patches.


Best regards,
Arseniy Mukhin




Re: Per-role disabling of LEAKPROOF requirements for row-level security?

2025-06-16 Thread Yugo Nagata
Hi,

On Sat, 19 Apr 2025 18:14:01 +0200
Andreas Lind  wrote:

> Hi hackers,
> 
> Row-level security is an awesome feature. I was originally won over by
> the simple mental model of implicitly adding WHERE clauses to all
> queries, and that it generally comes for free when they can be
> satisfied by Index Conds.
> 
> I've recently helped implement row-level security in a fairly big
> multi-tenant application with about 100 tables. RLS is being used
> "voluntarily" by the application, which switches to a role that is
> subject to RLS as the first thing after starting a transaction. I
> realize that this doesn't offer bulletproof tenant isolation, but it
> still makes a lot of sense as a defense-in-depth measure.
> 
> It took some time to work through various performance regressions
> caused by selectivity estimates changing due to "WHERE tenantId = ..."
> implicitly being added to all relations in all queries. Extended
> statistics (dependencies) helped with that, as tenantId is highly
> correlated with the other id columns that we were already querying and
> joining on.
> 
> The roughest edge turned out to be the well-known challenge with the
> LEAKPROOFness of built-in functions, previously discussed in earlier
> threads [1], in blog posts [2], on Stack Overflow [3] and pointed out
> in the docs for Row Security Policies [4].
> 
> I'm aware that the usual workaround is to ALTER FUNCTION ... LEAKPROOF
> for each of the relevant built-in functions after making an evaluation
> of whether that's acceptable for the given use case. We weren't able to
> do this because we're using a hosted Postgres solution where it's not
> permitted. We've tried to talk them (AWS) into offering a way to do it,
> but they haven't budged yet.
> 
> As in the earlier thread we were affected by enum_eq not being
> leakproof, which caused our indexes involving enum columns to not be
> fully utilized while RLS is active (because WHERE clauses on the enum
> columns cannot be implemented as Index Conds). We mostly overcame that
> by replacing each index with a series of partial ones, each with
> WHERE enum_column = ...
> 
> We also had some functional indexes that either explicitly used a
> non-leakproof built-in function, or implicitly exercised a non-leakproof
> conversion function such as textin or int8out. Here we ended up cheating
> the system by adding LEAKPROOF wrappers around the built-in functions,
> similar to [5].
> 
> The most difficult thing to overcome was GIN indexes on bigint[]
> columns not being used because the built-in arrayoverlap/arraycontained/
> array_eq/... functions aren't LEAKPROOF. We eventually figured out that
> we could make our own version of gin_array_ops which used functions
> marked as LEAKPROOF, and then recreate all the GIN indices with that.
> 
> My hunch is that there's a big class of applications like ours where
> enforcing LEAKPROOFness isn't really important. When application code is
> interacting with the database via its own role, you're not that
> concerned with invisible tuples leaking out via error messages or
> elaborate timing attacks. That seems more relevant when your
> adversaries have direct access to the database and are able to create
> their own functions and execute arbitrary queries. I think it would be
> extremely useful to be able to forefeit the LEAKPROOF guarantees when
> the database is being accessed by an application in exchange for
> avoiding the footguns described above.
> 
> I dug into the code and noticed that restrictinfo->leakproof is only
> being checked in two places (createplan.c and equivclass.c), so it seems
> fairly easy to only selectively enforce it. Then there's the question of
> how to configure it. I can think of a few possible ways:
> 
> 1) Add a BYPASSLEAKPROOF role attribute that can only be granted by a
>superuser, similar to the BYPASSRLS flag.
> 2) Add a session variable, eg. enable_security_leakproof, that can only
>be set or granted to another role by a superuser.
> 3) Make it a property of the individual POLICY that grants access to the
>table. This would be a bit more granular than a global switch, but
>there'd be some ambiguity when multiple policies are involved.
> 
> I took a stab at implementing solution 1) above, mostly to understand if
> I had found the right places in the code and how hard it'd be. This is
> my first time hacking on Postgres, and I'm not really sure if it's OK to
> introduce this extra check in a possibly hot code path. I thought about
> instead adding a "does the current user have bypassleakproof" flag in
> some struct available to the planner, but this is beyond my current
> abilities. Also, it's probably better to get some feedback on the idea
> before spending too much time optimizing.
> 
> What do you think?

I'm not sure whether multi-tenant applications fall into the category where
LEAKPROOFness isn't considered important, since security is typically a key
concern for users of such systems.

Even if t

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2025-06-16 Thread Melanie Plageman
On Fri, Jun 13, 2025 at 10:12 AM John Naylor  wrote:
>
> (Swapping this part back in my brain as well...) I actually don't
> think we need that where clause anymore since mwm can be super low
> now, and it's a bit mysterious what it was trying to accomplish. Maybe
> we can just use the lowest fill factor to reduce WAL -- having a few
> dozen pages should push it over the memory limit, regardless of how
> many dead tuples are on each pages.

Test in attached patch seems to do the job on 32 bit and 64 bit when tested.

- Melanie
From 17485392a10d12f1c04d8dddf90c1c89f641ee95 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 13 Jun 2025 09:25:40 -0400
Subject: [PATCH] Test that vacuum removes tuples older than OldestXmin

If vacuum fails to prune a tuple killed before OldestXmin, it will
decide to freeze its xmax and later error out in pre-freeze checks.

Add a test reproducing this scenario to the recovery suite which creates
a table on a primary, updates the table to generate dead tuples for
vacuum, and then, during the vacuum, uses a replica to force
GlobalVisState->maybe_needed on the primary to move backwards and
precede the value of OldestXmin set at the beginning of vacuuming the
table.

This test is coverage for a case fixed in 83c39a1f7f3. The test was
originally committed to master in aa607980aee but later reverted in
efcbb76efe4 due to test instability.

The test requires multiple index passes. In Postgres 17+, vacuum uses a
TID store for the dead TIDs that is very space efficient. With the old
minimum maintenance_work_mem of 1 MB, it required a large number of dead
rows to generate enough dead TIDs to force multiple index
vacuuming passes. Once the source code changes were made to allow a
minimum maintenance_work_mem value of 64kB, the test could be made much
faster and more stable.

Author: Melanie Plageman
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAAKRu_ZhG5NEQU-h7m%3DaeocxRze4ALt5swuKM45bN0HRQBccew%40mail.gmail.com
---
 src/test/recovery/meson.build |   3 +-
 .../recovery/t/048_vacuum_horizon_floor.pl| 266 ++
 2 files changed, 268 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/048_vacuum_horizon_floor.pl

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 92429d28402..52993c32dbb 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -55,7 +55,8 @@ tests += {
   't/044_invalidate_inactive_slots.pl',
   't/045_archive_restartpoint.pl',
   't/046_checkpoint_logical_slot.pl',
-  't/047_checkpoint_physical_slot.pl'
+  't/047_checkpoint_physical_slot.pl',
+  't/048_vacuum_horizon_floor.pl'
 ],
   },
 }
diff --git a/src/test/recovery/t/048_vacuum_horizon_floor.pl b/src/test/recovery/t/048_vacuum_horizon_floor.pl
new file mode 100644
index 000..141db78f87c
--- /dev/null
+++ b/src/test/recovery/t/048_vacuum_horizon_floor.pl
@@ -0,0 +1,266 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Test that vacuum prunes away all dead tuples killed before OldestXmin
+#
+# This test creates a table on a primary, updates the table to generate dead
+# tuples for vacuum, and then, during the vacuum, uses the replica to force
+# GlobalVisState->maybe_needed on the primary to move backwards and precede
+# the value of OldestXmin set at the beginning of vacuuming the table.
+
+# Set up nodes
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 'physical');
+
+$node_primary->append_conf(
+	'postgresql.conf', qq[
+hot_standby_feedback = on
+log_recovery_conflict_waits = true
+autovacuum = off
+log_min_messages = INFO
+maintenance_work_mem = 64
+]);
+$node_primary->start;
+
+my $node_replica = PostgreSQL::Test::Cluster->new('standby');
+
+$node_primary->backup('my_backup');
+$node_replica->init_from_backup($node_primary, 'my_backup',
+	has_streaming => 1);
+
+$node_replica->start;
+
+my $test_db = "test_db";
+$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
+
+# Save the original connection info for later use
+my $orig_conninfo = $node_primary->connstr();
+
+my $table1 = "vac_horizon_floor_table";
+
+# Long-running Primary Session A
+my $psql_primaryA =
+  $node_primary->background_psql($test_db, on_error_stop => 1);
+
+# Long-running Primary Session B
+my $psql_primaryB  =
+  $node_primary->background_psql($test_db, on_error_stop => 1);
+
+# The TIDStore which vacuum uses to store dead items is optimized for its
+# target system. On a 32-bit system, our example requires around 9000 rows to
+# have enough dead items spread out across enough pages to fill the TIDStore
+# and trigger a second round of index vacuuming. We could get away with fewer
+# rows on 64-bit systems, but it doesn't seem worth the special case.
+my $nrows = 9000;
+
+# Because vacuum's first pass, pruning, is where we use the GlobalVisState to
+# check tuple visibility, Globa

Returning nbtree posting list TIDs in DESC order during backwards scans

2025-06-16 Thread Peter Geoghegan
Currently, nbtree always returns individual posting list TIDs to the
scan in ASC order -- even during a backwards scan (more on why the
deduplication patch deliberately did things that way later). But
allowing backwards scans to return TIDs in "somewhat descending order"
like this results in needless unpinning and re-pinning of the same
heap page. This seems worth fixing.

Patch goals
===

nbtree scans ought to guarantee that they'll return all TIDs in
whatever order they appear on the page, when read either
front-to-back, or back-to-front (whichever order matches the scan's
current direction). Individual posting list TIDs should also be
returned in either front-to-back or back-to-front order, also
according to the scan's direction. In general, the order that TIDs are
returned in ought to never be affected by the use of deduplication --
including during backwards scans.

Attached patch teaches nbtree's _bt_readpage function to return TIDs
from a posting list in DESC order during backwards scans, bringing
nbtree in line with the ideal described in the previous paragraph.
This approach seems more principled to me, and is likely to have a
small performance benefit.
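
To make the intended behavior concrete, here is a simplified sketch of the
direction-sensitive loop (illustrative only; the real _bt_readpage code
differs in how items are saved):

static void
report_posting_tids(ItemPointer posting, int nposting, ScanDirection dir,
                    void (*report) (ItemPointer))
{
    if (ScanDirectionIsForward(dir))
    {
        for (int i = 0; i < nposting; i++)
            report(&posting[i]);    /* ASC heap TID order */
    }
    else
    {
        for (int i = nposting - 1; i >= 0; i--)
            report(&posting[i]);    /* DESC heap TID order */
    }
}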

I'll submit this to the first commitfest for Postgres 19.

Test case
=

Here's an example of the problem on HEAD, that the patch fixes:

-- Setup table with low cardinality index:
create table duplicate_test(dup int4);
create index on duplicate_test (dup);
insert into duplicate_test
select val from generate_series(1, 18) val,
generate_series(1,1000) dups_per_val;

-- Forwards scan, gets 23 buffer hits:
explain analyze
select ctid, * from
duplicate_test
where dup < 5
order by dup;

-- Equivalent backwards scan, gets 42 buffer hits on HEAD:
explain analyze
select ctid, * from
duplicate_test
where dup < 5
order by dup desc;

With the patch applied, *both* queries get the expected 23 buffer hits.

I am generally suspicious of cases where backwards scans are at a
gratuitous disadvantage relative to equivalent forwards scans. See
previous work in commit c9c0589f and commit 1bd4bc85 for earlier
examples of fixing problems that have that same quality to them. This
might be the last such patch that I'll ever have to write.

Sorting so->killedItems[] in _bt_killitems
==

Making this change in _bt_readpage creates a problem in _bt_killitems:
it currently expects posting list TIDs in ascending order (the loop
invariant relies on that), which is why the deduplication patch didn't
just do things like this in _bt_readpage to begin with. The patch
deals with that new problem by qsort'ing the so->killedItems[] array
(at the top of _bt_killitems) such that the items always appear
exactly once, in the expected ASC order.

Another benefit of the qsort approach in _bt_killitems is that it'll
gracefully handle scrollable cursor scans that happen to scroll back
and forth on the same leaf page. This might result in the same dead
TID being added to the so->killedItems[] array multiple times, in
neither ASC or DESC order. That also confuses the loop inside
_bt_killitems, in a way that needlessly prevents setting LP_DEAD bits
(this is a preexisting problem, not a problem created by the changes
that the patch makes to _bt_readpage). Having duplicate TIDs in
so->killedItems[] like this isn't particularly likely, but it seems
worth having proper handling for all the same, on general principle.

I tend to doubt that the overhead of the qsort will ever really
matter, but if it does then we can always limit it to backwards scans
(meaning limit it to cases where so->currPos.dir indicates that
_bt_readpage processed the page in the backwards scan direction), and
drop the goal of dealing with duplicate so->killedItems[] TIDs
gracefully.

Note that the qsort is completely useless during standard forwards
scans. It's hard to tell those apart from scrollable cursor scans that
briefly changed directions "within a page", though, so I'm inclined to
just always do the qsort, rather than trying to cleverly avoid it.
That's what happens in this v1 of the patch (we qsort unconditionally
in _bt_killitems).
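
For reference, a minimal sketch of the sort-plus-dedup step (generic ints
standing in for the so->killedItems[] entries):

#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int     x = *(const int *) a;
    int     y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Sort ascending, then squeeze out duplicates in place; returns new count. */
static int
sort_and_uniq(int *items, int n)
{
    int     j = 0;

    qsort(items, n, sizeof(int), cmp_int);
    for (int i = 0; i < n; i++)
    {
        if (j == 0 || items[i] != items[j - 1])
            items[j++] = items[i];
    }
    return j;
}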

-- 
Peter Geoghegan


v1-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch
Description: Binary data


Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Tomas Vondra
On 6/16/25 17:41, Daniel Gustafsson wrote:
>> On 16 Jun 2025, at 16:20, Tom Lane  wrote:
>>
>> Daniel Gustafsson  writes:
 On 16 Jun 2025, at 15:56, Tom Lane  wrote:
 I've not checked to see what the other users of this API do, but
 if they're all like this then we need to fix that comment.
>>
>>> AFAICT all other callers of this API are throwing an error with pg_fatal, 
>>> and
>>> so does the function in question for ZStd decompression errors.
>>
>> I think I agree that we need to handle the ferror() case inside the
>> read_func for consistency.  But there is another problem, which is
>> that neither of its callers are paying attention to the API: as
>> defined, the read_func can never return anything but "true",
>> which is how Zstd_read behaves.  But both the callers in
>> compress_zstd.c seem to think they should test its result to
>> detect EOF.  
> 
> Attached is a stab at fixing both the error handling in read_func as well as
> the callers misuse of the API.
> 
>> Apparently, our tests do not cover the case of an
>> unexpected EOF.
> 
> I admittedly ran out of steam when looking at adding something like this to 
> our
> pg_dump tests.
> 
>> This API is actually quite bizarrely defined, if you ask me.
>> Returning the byte count is optional, but if you don't pay
>> attention to the byte count you cannot know if you got any
>> data.  At best, that's bug-encouraging.
> 
> Agreed.  Given that many of implementations in the gzip support code directly
> return the gzXXX function I suspect it was modeled around GZip but thats an
> unresearched guess.  The fact that this returns Z_NULL where the API defines
> NULL is unlikely to ever be an issue, but it also doesn't seem entirely neat..
> 

Yes. It's been a while since this commit, but I recall we started with a
gzip-only implementation. 16 introduced this API, but it's definitely
the case it was based on the initial gzip code.

Regarding the Z_NULL, I believe it has always been ignored like this, at
least since 9.1. The code simply returns what gzgets() returns, and then
compares that to NULL, etc. Is there a better way to deal with
Z_NULL? I suppose we could explicitly check/translate Z_NULL to NULL,
although Z_NULL is simply defined as 0. I don't recall if NULL has some
additional magic.

regards

-- 
Tomas Vondra





Re: Re: Re: Fix potential overflow risks from wcscpy and sprintf

2025-06-16 Thread Tom Lane
Yan Haibo  writes:
> Regarding the use of wcsncpy with LOCALE_NAME_MAX_LENGTH - 1, it is a 
> precaution in case the input string is not null-terminated.

I don't think it's a "precaution".  I think it's introducing a real
bug (that is, failure on a locale name of exactly the max allowed
length) to prevent a hypothetical bug.

regards, tom lane




Avoid possible dereference null pointer (contrib/postgres_fdw/postgres_fdw.c)

2025-06-16 Thread Ranier Vilela
 Hi.

In the function *estimate_path_cost_size* the parameter
fpextra can be NULL.

It is necessary to always check its validity,
as is already done in other parts of the source.

patch attached.

Best regards,
Ranier Vilela


fix-possible-dereference-null-pointer-postgres_fdw.patch
Description: Binary data


Re: pg_dump --with-* options

2025-06-16 Thread Corey Huinker
>
> I noticed that --statistics (i.e., the current --with-statistics) causes
> statistics to be dumped even when used with --data-only or --schema-only.
> So, as far as I understand, here are the possible combinations of dump
> targets and options:
>

Those should also be mutually exclusive, and I'll write up a patch to add
them to the checks.


>
>   schema, data, stats:   --statistics
>   schema, data:  (default)
>   schema, stats: --schema-only --statistics
>   data, stats:   --data-only --statistics
>   schema only:   --schema-only
>   data only: --data-only
>   stats only:--statistics-only
>
> This makes me wonder if --no-data and --no-schema are still necessary.
> They were also introduced in v18, but might now be redundant. If so,
> should we consider removing them?
>
> If we do keep them, we could also use --no-schema --statistics to
> dump data and statistics, but I find --data-only --statistics more
> intuitive.
>

I think this is the exact sort of confusion caused by having two of the
three types default to on in all circumstances, and one default to off in
one special circumstance.

Let's keep this simple, and have all three types default to on in all
circumstances.


Avoid possible dereference null pointer (src/backend/utils/cache/relcache.c)

2025-06-16 Thread Ranier Vilela
Hi.

The function *ScanPgRelation* can return an invalid tuple.
It is necessary to check the function's return value,
as is already done in other parts of the source.

patch attached.

Best regards,
Ranier Vilela


fix-possible-dereference-null-pointer-relcache.patch
Description: Binary data


Re: psql: tab-completion support for COPY ... TO/FROM STDIN, STDOUT, and PROGRAM

2025-06-16 Thread Yugo Nagata
On Thu, 5 Jun 2025 16:52:00 +0900
Yugo Nagata  wrote:

> On Thu, 5 Jun 2025 10:08:35 +0900
> Yugo Nagata  wrote:
> 
> > Hi,
> > 
> > Currently, tab completion for COPY only suggests filenames after TO or
> > FROM, even though STDIN, STDOUT, and PROGRAM are also valid syntax options.
> > 
> > I'd like to propose improving the tab completion behavior as described in
> > the subject, so that these keywords are suggested appropriately, and 
> > filenames
> > are offered as potential command names after the PROGRAM keyword.
> > 
> > I've attached this proposal as a patch series with the following three 
> > parts:
> 
> I'm sorry but the previous patches were accidentally broken and didn't work.
> I've attached fixed patches. 
>  
> > 
> > 0001: Refactor match_previous_words() to remove direct use of 
> > rl_completion_matches()
> > 
> > This is a preparatory cleanup. Most completions in match_previous_words() 
> > already use 
> > COMPLETE_WITH* macros, which wrap rl_completion_matches(). However, some 
> > direct calls
> > still remain.
> > 
> > This patch replaces the remaining direct calls with COMPLETE_WITH_FILES or 
> > COMPLETE_WITH_GENERATOR, improving consistency and readability.
> > 
> > 0002: Add tab completion support for COPY ... TO/FROM STDIN, STDOUT, and 
> > PROGRAM
> > 
> > This is the main patch. It extends tab completion to suggest STDIN, STDOUT, 
> > and PROGRAM
> > after TO or FROM. After PROGRAM, filenames are suggested as possible 
> > command names.
> > 
> > To support this, a new macro COMPLETE_WITH_FILES_PLUS is introduced. This 
> > allows
> > combining literal keywords with filename suggestions in the completion list.
> > 
> > 0003: Improve tab completion for COPY option lists
> > 
> > Currently, only the first option in a parenthesized list is suggested 
> > during completion,
> > and nothing is suggested after a comma.
> > 
> > This patch enables suggestions after each comma, improving usability when 
> > specifying
> > multiple options.
> > 
> > Although not directly related to the main proposal, I believe this is a 
> > helpful enhancement
> > to COPY tab completion and included it here for completeness.
> > 
> > I’d appreciate your review and feedback on this series.


I've attached rebased patches.

Best regards,
Yugo Nagata


-- 
Yugo Nagata 
From 27ce6ff40386e0d43dc3dbf33ee0996365910655 Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Thu, 5 Jun 2025 09:39:09 +0900
Subject: [PATCH v3 3/3] Improve tab completion for COPY option lists

Previously, only the first option in a parenthesized list was suggested
during tab completion. Subsequent options after a comma were not completed.
This commit enhances the behavior to suggest valid options after each comma.
---
 src/bin/psql/tab-complete.in.c | 47 +++---
 1 file changed, 26 insertions(+), 21 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index e1605f8b73e..9f6ec4e61b2 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -3310,27 +3310,32 @@ match_previous_words(int pattern_id,
 		COMPLETE_WITH("WITH (", "WHERE");
 
 	/* Complete COPY  FROM|TO [PROGRAM]  WITH ( */
-	else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAnyExcept("PROGRAM"), "WITH", "(") ||
-			 Matches("COPY|\\copy", MatchAny, "FROM|TO", "PROGRAM", MatchAny, "WITH", "("))
-		COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
-	  "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
-	  "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING", "DEFAULT",
-	  "ON_ERROR", "LOG_VERBOSITY", "REJECT_LIMIT");
-
-	/* Complete COPY  FROM|TO [PROGRAM]  WITH (FORMAT */
-	else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAnyExcept("PROGRAM"), "WITH", "(", "FORMAT") ||
-			 Matches("COPY|\\copy", MatchAny, "FROM|TO", "PROGRAM", MatchAny, "WITH", "(", "FORMAT"))
-		COMPLETE_WITH("binary", "csv", "text");
-
-	/* Complete COPY  FROM [PROGRAM]  WITH (ON_ERROR */
-	else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAnyExcept("PROGRAM"), "WITH", "(", "ON_ERROR") ||
-			 Matches("COPY|\\copy", MatchAny, "FROM|TO", "PROGRAM", MatchAny, "WITH", "(", "ON_ERROR"))
-		COMPLETE_WITH("stop", "ignore");
-
-	/* Complete COPY  FROM [PROGRAM]  WITH (LOG_VERBOSITY */
-	else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAnyExcept("PROGRAM"), "WITH", "(", "LOG_VERBOSITY") ||
-			 Matches("COPY|\\copy", MatchAny, "FROM|TO", "PROGRAM", MatchAny, "WITH", "(", "LOG_VERBOSITY"))
-		COMPLETE_WITH("silent", "default", "verbose");
+	else if (HeadMatches("COPY|\\copy", MatchAny, "FROM|TO", MatchAnyExcept("PROGRAM"), "WITH", "(*") ||
+			 HeadMatches("COPY|\\copy", MatchAny, "FROM|TO", "PROGRAM", MatchAny, "WITH", "(*"))
+	{
+		if (!HeadMatches("COPY|\\copy", MatchAny, "FROM|TO", MatchAnyExcept("PROGRAM"), "WITH", "(*)") &&
+			!HeadMatches("COPY|\\copy", MatchAny, "FROM|TO", "PROGRAM", MatchAny, "WITH", "(*)"))
+		{
+			/* We're in an unfinished parenthesized option list. */
+			if (en

Re: Non-reproducible AIO failure

2025-06-16 Thread Andres Freund
Hi,

On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote:
> One more update: with the proposed patch (memory barrier before
> `ConditionVariableBroadcast` in `pgaio_io_process_completion`

I don't see how that barrier could be required for correctness -
ConditionVariableBroadcast() is a barrier itself (as the comment above the
call notes, too).


On 2025-06-15 14:48:43 +0300, Konstantin Knizhnik wrote:
> Also I think that replacing bitfields with `uint8` and may be even with
> `int`, is good idea at least to avoids false sharing.

I don't think there's false sharing here. And even if there were, the
granularity at which false sharing occurs is a cache line size, so either 64
or 128 byte.


I unfortunately can't repro this issue so far.  I don't think it's the same
problem as my patch fixes, so I'll push my patch.  How exactly did you
reproduce the problem?

Greetings,

Andres Freund




Re: CHECKPOINT unlogged data

2025-06-16 Thread Nathan Bossart
On Mon, Jun 16, 2025 at 04:36:59PM +0200, Christoph Berg wrote:
> I spent some time digging through the code, but I'm still not entirely
> sure what's happening. There are several parts to it:
> 
> 1) the list of buffers to flush is determined at the beginning of the
> checkpoint, so running a 2nd FLUSH_UNLOGGED checkpoint will not make
> the running checkpoint write these
> 
> 2) running CHECKPOINT updates the checkpoint flags in shared memory so
> I think the currently running checkpoint picks "MODE FAST" up and
> speeds up. (But I'm not entirely sure, the call stack is quite deep
> there.)
> 
> 3) running CHECKPOINT (at least when waiting for it) seems to actually
> start a new checkpoint, so FLUSH_UNLOGGED should still be effective.
> (See the code arount "start_cv" in checkpointer.c)
> 
> Admittedly, adding these points together raises some question marks
> about the flag handling, so I would welcome clarification by someone
> more knowledgeable in this area.

I think you've got it right.  With CHECKPOINT_WAIT set, RequestCheckpoint()
will wait for a new checkpoint to start, at which point we know that the
new flags have been seen by the checkpointer.  If an immediate checkpoint
is pending, CheckpointWriteDelay() will skip sleeping in the
currently-running one, so the current checkpoint will be "upgraded" to
immediate in some sense, but IIUC there will still be another immediate
checkpoint after it completes.  But AFAICT it doesn't pick up
FLUSH_UNLOGGED until the next checkpoint begins.

Another thing to note is what I mentioned earlier:

+   Note that the server may consolidate concurrently requested checkpoints or
+   restartpoints.  Such consolidated requests will contain a combined set of
+   options.  For example, if one session requested an immediate checkpoint and
+   another session requested a non-immediate checkpoint, the server may combine
+   these requests and perform one immediate checkpoint.

-- 
nathan




Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 16 Jun 2025, at 15:56, Tom Lane  wrote:
>> I've not checked to see what the other users of this API do, but
>> if they're all like this then we need to fix that comment.

> AFAICT all other callers of this API are throwing an error with pg_fatal, and
> so does the function in question for ZStd decompression errors.

I think I agree that we need to handle the ferror() case inside the
read_func for consistency.  But there is another problem, which is
that neither of its callers are paying attention to the API: as
defined, the read_func can never return anything but "true",
which is how Zstd_read behaves.  But both the callers in
compress_zstd.c seem to think they should test its result to
detect EOF.  Apparently, our tests do not cover the case of an
unexpected EOF.

This API is actually quite bizarrely defined, if you ask me.
Returning the byte count is optional, but if you don't pay
attention to the byte count you cannot know if you got any
data.  At best, that's bug-encouraging.

regards, tom lane




Re: [PATCH] Remove unused #include's in src/backend/utils/adt/*

2025-06-16 Thread Aleksander Alekseev
Hi Tom,

> The removals of  bother me a bit; I believe those used
> to be necessary on some platforms.  Maybe nowadays everybody is
> close enough to POSIX that you can safely extrapolate from what
> clangd says on your own machine, but it's not zero-risk.

Sounds good to me, let's keep .

-- 
Best regards,
Aleksander Alekseev


v2-0001-Remove-unused-include-s-in-src-backend-utils-adt.patch
Description: Binary data


Re: Non-reproducible AIO failure

2025-06-16 Thread Konstantin Knizhnik


On 16/06/2025 6:11 pm, Andres Freund wrote:

Hi,

On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote:

One more update: with the proposed patch (memory barrier before
`ConditionVariableBroadcast` in `pgaio_io_process_completion`

I don't see how that barrier could be required for correctness -
ConditionVariableBroadcast() is a barrier itself (as the comment above the
call notes, too).


On 2025-06-15 14:48:43 +0300, Konstantin Knizhnik wrote:

Also I think that replacing bitfields with `uint8` and maybe even with
`int` is a good idea, at least to avoid false sharing.

I don't think there's false sharing here. And even if there were, the
granularity at which false sharing occurs is a cache line size, so either 64
or 128 bytes.


I unfortunately can't repro this issue so far.  I don't think it's the same
problem as my patch fixes, so I'll push my patch.  How exactly did you
reproduce the problem?

Greetings,

Andres Freund


Sorry, I was not sure that the spinlock (used in 
`ConditionVariableBroadcast`) enforces a memory barrier. Certainly in this 
case adding a memory barrier here is not needed.


Concerning false sharing - I suspected that the compiler could extract the 
bitfield from a word loaded before the read barrier. But that seems to be 
impossible (if the barrier is correctly recognized by the compiler), and 
moreover I have reproduced it with disabled optimization, so it's unlikely 
the compiler tries to eliminate some reads here. And I agree with your 
argument about the cache line: even replacing uint8 with int will not 
prevent it.


But unfortunately it means that the problem is not fixed. I just did the 
same test as you proposed:


c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM 
large;") -T 1 -P 10


with the following config changes:
io_max_concurrency=1
io_combine_limit=1
synchronize_seqscans=false
restart_after_crash=false
max_parallel_workers_per_gather=0
fsync=off

Postgres was built with the following options: --without-icu --enable-debug 
--enable-cassert CFLAGS=-O0

As I wrote - it takes about 1 second to get this assertion failure.
I can try to do it once again.
Looks like the problem is better reproduced with disabled optimizations.


Re: pglogical3 : support

2025-06-16 Thread Bruce Momjian
On Mon, Jun 16, 2025 at 02:50:18PM +0530, Amit Kapila wrote:
> On Sat, Jun 14, 2025 at 2:05 PM Perumal Raj  wrote:
> >
> > Hi Team,
> >
> > I am looking to upgrade pg17 with near-zero downtime using logical 
> > replication (master <-> master). The current pglogical2 (open) has some 
> > limitations with 'REPLICA IDENTITY FULL' tables.
> >
> > Do we have any plan to make pglogical3 open source in the near future? 😊  Or is 
> > there a better way to achieve the same using open source packages?
> >
> 
> The PostgreSQL community doesn't support pglogical. You can contact
> the pglogical community for your questions on the same.

FYI, I think the community is here:

https://github.com/2ndQuadrant/pglogical

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.




Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Tom Lane
Tomas Vondra  writes:
> For a moment I was worried about breaking ABI when fixing this in the
> backbranches, but I guess that's not an issue for tools like pg_dump.

Yeah, I think it'd be okay to change compress_io.h APIs in the back
branches; it's hard to see how anything outside pg_dump+pg_restore
would be depending on that.

regards, tom lane




Re: Allow pg_dump --statistics-only to dump foreign table statistics?

2025-06-16 Thread Nathan Bossart
On Mon, Jun 16, 2025 at 03:39:07PM -0400, Corey Huinker wrote:
> If we aren't exporting stats for foreign tables then that is an oversight;
> the intention was always to fetch all available statistics for all
> relations. I can't offhand think of where we even have the option to
> exclude them.

getRelationStatistics() enumerates the relation kinds:

if ((relkind == RELKIND_RELATION) ||
(relkind == RELKIND_PARTITIONED_TABLE) ||
(relkind == RELKIND_INDEX) ||
(relkind == RELKIND_PARTITIONED_INDEX) ||
(relkind == RELKIND_MATVIEW))

The proposed patch [0] adds RELKIND_FOREIGN_TABLE to this list.  That
appears to be the only missing relation kind that ANALYZE handles.
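
Presumably the patched check just adds the missing kind:

    if ((relkind == RELKIND_RELATION) ||
        (relkind == RELKIND_PARTITIONED_TABLE) ||
        (relkind == RELKIND_INDEX) ||
        (relkind == RELKIND_PARTITIONED_INDEX) ||
        (relkind == RELKIND_MATVIEW) ||
        (relkind == RELKIND_FOREIGN_TABLE))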

[0] 
https://postgr.es/m/attachment/177608/v1-0001-pg_dump-Allow-pg_dump-to-dump-the-statistics-for-.patch

-- 
nathan




Re: pg_dump --with-* options

2025-06-16 Thread Nathan Bossart
On Mon, Jun 16, 2025 at 03:35:48PM -0400, Corey Huinker wrote:
> I think this is the exact sort of confusion caused by having two of the
> three types default to on in all circumstances, and one default to off in
> one special circumstance.

I revisited the main thread to see how folks voted.  There are a lot of
messages over a long period of time, and folks may have changed their mind
since, but this is what I saw:

off-by-default: Mullane, Haas, Davis, Bossart
on-by-default: Huinker, Lane, Hagander, Frost

In fact, there seems to have been general agreement in 2024 that stats
_should_ be on by default [0].  So perhaps there's not as strong of a
consensus as we thought.  Maybe we should ask for any new/updated votes.

> Let's keep this simple, and have all three types default to on in all
> circumstances.

Assuming we did turn on stats by default, what is the minimum set of new
flags in v18 you'd like to see?

[0] 
https://postgr.es/m/e16cd9caf4f5229a152d318d70b4d323a03e3539.camel%40j-davis.com

-- 
nathan




Re: Per-role disabling of LEAKPROOF requirements for row-level security?

2025-06-16 Thread Nathan Bossart
On Mon, Jun 16, 2025 at 01:36:20PM -0400, Tom Lane wrote:
> There might be a genuine hazard if something thinks it can substitute
> use of enum_cmp for enum_eq, as indeed would happen in e.g. mergejoin.

Hm.  Wouldn't that be a mergejoin bug?  I guess I'm not sure how to
reconcile the leakproof criteria in the documentation with the idea that
we can't mark a suite of functions as leakproof if any individual one isn't
leakproof.

-- 
nathan




Re: [PATCH] Remove unused #include's in src/backend/utils/adt/*

2025-06-16 Thread Tom Lane
Aleksander Alekseev  writes:
> clangd indicates that certain #include's are redundant. Removing them
> will speed up the build process a bit.

The removals of <math.h> bother me a bit; I believe those used
to be necessary on some platforms.  Maybe nowadays everybody is
close enough to POSIX that you can safely extrapolate from what
clangd says on your own machine, but it's not zero-risk.

regards, tom lane




Re: pg_upgrade fails with an error "object doesn't exist"

2025-06-16 Thread Daniel Gustafsson
> On 16 Jun 2025, at 10:59, Vaibhav Dalvi  
> wrote:

> It's OFF.
> postgres=# select version();
> version   
>   
> 
>  PostgreSQL 18beta1 on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 
> 14.2.0-4ubuntu2~24.04) 14.2.0, 64-bit
> (1 row)
> postgres=# show allow_system_table_mods ;
>  allow_system_table_mods 
> -
>  off
> (1 row)
> postgres=# CREATE FUNCTION pg_catalog.nont_ext_func() RETURNS char AS $$ 
> BEGIN return 'v'; END; $$ LANGUAGE plpgsql;
> CREATE FUNCTION

I stand corrected; I misremembered the extent to which we prohibit creation in
pg_catalog via that GUC.  It still feels like a case of getting to keep both
pieces when breaking it, but I wonder if we shouldn't make it harder to break?

--
Daniel Gustafsson





Re: CHECKPOINT unlogged data

2025-06-16 Thread Christoph Berg
Re: Laurenz Albe
> How about the following for added clarity:
> 
>Running an explicit CHECKPOINT is not required during
>normal operation; the system schedules checkpoints automatically 
> (controlled
>by the settings in ).
>However, it can be useful to perform an explicit checkpoint immediately
>before shutting down the server or performing an online file system backup,
>if you want the checkpoint implicit in these operations to be as fast as
>possible.  In particular, UNLOGGED table data only get
>flushed to disk during a shutdown checkpoint, so you should use the option
>FLUSH_UNLOGGED for an explicit checkpoint right before a
>shutdown.

Thanks for the suggestions.

I've taken this one, but replaced "be as fast as possible" with "have to
write out less data" to avoid confusion with the FAST mode.

> Oh, interesting; I wasn't aware of that.  That means that running CHECKPOINT
> (FLUSH_UNLOGGED) is less useful than I thought, since there is often already
> a spread checkpoint running.  Would it be useful to recommend that you should
> run the command twice when aiming for a fast shutdown?

I spent some time digging through the code, but I'm still not entirely
sure what's happening. There are several parts to it:

1) the list of buffers to flush is determined at the beginning of the
checkpoint, so running a 2nd FLUSH_UNLOGGED checkpoint will not make
the already-running checkpoint write those buffers

2) running CHECKPOINT updates the checkpoint flags in shared memory, so
I think the currently running checkpoint picks up "MODE FAST" and
speeds up. (But I'm not entirely sure; the call stack is quite deep
there.)

3) running CHECKPOINT (at least when waiting for it) seems to actually
start a new checkpoint, so FLUSH_UNLOGGED should still be effective.
(See the code around "start_cv" in checkpointer.c)

Admittedly, adding these points together raises some question marks
about the flag handling, so I would welcome clarification by someone
more knowledgeable in this area.

> I find this somewhat confusing; how about
>   How about:

Taken unmodified, thanks!

Christoph
>From 19062d3739b3217781631dd78ed6ad0ed0df3bc5 Mon Sep 17 00:00:00 2001
From: Christoph Berg 
Date: Wed, 11 Jun 2025 16:32:23 +0200
Subject: [PATCH v5 1/2] Rename checkpoint options "immediate" and "flush-all"

There were two names in use for fast checkpoints, "immediate" and
"fast". While "immediate" was the prevalent spelling, one of the two
user-visible places was "pg_basebackup --checkpoint=fast" using the
other spelling. Moreover, the "immediate" naming clashed with shutdowns
where a "fast" shutdown used an "immediate" checkpoint and an
"immediate" shutdown doesn't write a checkpoint at all.

Rename "immediate" checkpoints to "fast" throughout the code. The
user-visible change here is that checkpoint log records will now also
say that a "fast" checkpoint is starting.

The CHECKPOINT_FLUSH_ALL flag was really all about also flushing
UNLOGGED buffers, so rename it to reflect that. Likewise, the checkpoint
start log message now says "flush-unlogged" instead of "flush-all".
---
 doc/src/sgml/backup.sgml  |  2 +-
 doc/src/sgml/func.sgml|  2 +-
 doc/src/sgml/ref/checkpoint.sgml  |  2 +-
 src/backend/access/transam/xlog.c | 24 +--
 src/backend/commands/dbcommands.c | 14 +--
 src/backend/commands/tablespace.c |  2 +-
 src/backend/postmaster/checkpointer.c | 22 -
 src/backend/storage/buffer/bufmgr.c   |  6 ++---
 src/backend/tcop/utility.c|  2 +-
 src/include/access/xlog.h |  6 ++---
 .../recovery/t/041_checkpoint_at_promote.pl   |  2 +-
 11 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 25b8904baf7..5f7489afbd1 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -991,7 +991,7 @@ SELECT pg_backup_start(label => 'label', fast => false);
  usually preferable as it minimizes the impact on the running system.  If you
  want to start the backup as soon as possible, pass true as
  the second parameter to pg_backup_start and it will
- request an immediate checkpoint, which will finish as fast as possible using
+ request a fast checkpoint, which will finish as fast as possible using
  as much I/O as possible.
 
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c67688cbf5f..a49eb8a2af9 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28920,7 +28920,7 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
 will be stored.)
 If the optional second parameter is given as true,
 it specifies executing pg_backup_start as quickly
-as possible.  This forces an immediate checkpoint which will cause a
+as possible.  This forces a fast che

Re: Re: Fix potential overflow risks from wcscpy and sprintf

2025-06-16 Thread Tom Lane
Yan Haibo  writes:
> Thank you, Peter. It seems the patch may have been lost during our earlier 
> communication, so I've reattached it here.
> I hope it comes through correctly this time.

Thanks for the patch.

Using wcsncpy in search_locale_enum() seems fine, assuming it exists
on Windows (note that code is Windows-only, possibly explaining why
we've not seen other static-analysis reports).  I doubt there's any
actual bug there, since we're relying on Windows' own
LOCALE_NAME_MAX_LENGTH constant; but I agree the chain of reasoning
is kind of long.  (But shouldn't you write LOCALE_NAME_MAX_LENGTH
not LOCALE_NAME_MAX_LENGTH - 1?)
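
That is, something along these lines (illustrative only; the real code
lives in the Windows-only search_locale_enum()):

    wchar_t     test_locale[LOCALE_NAME_MAX_LENGTH];

    /* copy up to the full buffer, then force termination explicitly */
    wcsncpy(test_locale, pStr, LOCALE_NAME_MAX_LENGTH);
    test_locale[LOCALE_NAME_MAX_LENGTH - 1] = L'\0';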

I'm unexcited about the guc.c changes.  There is visibly no bug
there.  The only reason to change it would be if we were going
to introduce a strict project policy against using sprintf(),
which we're not likely to given that there are hundreds of other
occurrences in our code base.  I don't see a reason to think
that these three are more pressing than the others.

regards, tom lane




Re: No error checking when reading from file using zstd in pg_dump

2025-06-16 Thread Daniel Gustafsson
> On 16 Jun 2025, at 16:20, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>>> On 16 Jun 2025, at 15:56, Tom Lane  wrote:
>>> I've not checked to see what the other users of this API do, but
>>> if they're all like this then we need to fix that comment.
> 
>> AFAICT all other callers of this API are throwing an error with pg_fatal, and
>> so does the function in question for ZStd decompression errors.
> 
> I think I agree that we need to handle the ferror() case inside the
> read_func for consistency.  But there is another problem, which is
> that neither of its callers are paying attention to the API: as
> defined, the read_func can never return anything but "true",
> which is how Zstd_read behaves.  But both the callers in
> compress_zstd.c seem to think they should test its result to
> detect EOF.  

Attached is a stab at fixing both the error handling in read_func as well as
the callers' misuse of the API.

> Apparently, our tests do not cover the case of an
> unexpected EOF.

I admittedly ran out of steam when looking at adding something like this to our
pg_dump tests.

> This API is actually quite bizarrely defined, if you ask me.
> Returning the byte count is optional, but if you don't pay
> attention to the byte count you cannot know if you got any
> data.  At best, that's bug-encouraging.

Agreed.  Given that many of the implementations in the gzip support code
directly return the result of the gzXXX function, I suspect it was modeled
around GZip, but that's an unresearched guess.  The fact that this returns
Z_NULL where the API defines NULL is unlikely to ever be an issue, but it
also doesn't seem entirely neat.

--
Daniel Gustafsson



0001-pg_dump-Handle-errors-in-reading-ZStd-streams.patch
Description: Binary data



Re: CHECKPOINT unlogged data

2025-06-16 Thread Christoph Berg
Re: Nathan Bossart
> I think you've got it right.  With CHECKPOINT_WAIT set, RequestCheckpoint()
> will wait for a new checkpoint to start, at which point we know that the
> new flags have been seen by the checkpointer.  If an immediate checkpoint
> is pending, CheckpointWriteDelay() will skip sleeping in the
> currently-running one, so the current checkpoint will be "upgraded" to
> immediate in some sense, but IIUC there will still be another immediate
> checkpoint after it completes.  But AFAICT it doesn't pick up
> FLUSH_UNLOGGED until the next checkpoint begins.

> Another thing to note is what I mentioned earlier:

Thanks. I now have this:

  
   If a checkpoint is already running when a CHECKPOINT
   is issued, a new checkpoint is queued.  The server will consolidate multiple
   concurrently requested checkpoints or restartpoints and merge their options.
   For example, if one session requests a fast checkpoint and another session
   requests a spread checkpoint, the server combines these requests and
   performs one fast checkpoint.  Queuing a fast checkpoint will also switch a
   currently running spread checkpoint to run fast.
  

Christoph
>From 19062d3739b3217781631dd78ed6ad0ed0df3bc5 Mon Sep 17 00:00:00 2001
From: Christoph Berg 
Date: Wed, 11 Jun 2025 16:32:23 +0200
Subject: [PATCH v6 1/2] Rename checkpoint options "immediate" and "flush-all"

There were two names in use for fast checkpoints, "immediate" and
"fast". While "immediate" was the prevalent spelling, one of the two
user-visible places was "pg_basebackup --checkpoint=fast" using the
other spelling. Moreover, the "immediate" naming clashed with shutdowns
where a "fast" shutdown used an "immediate" checkpoint and an
"immediate" shutdown doesn't write a checkpoint at all.

Rename "immediate" checkpoints to "fast" throughout the code. The
user-visible change here is that checkpoint log records will now also
say that a "fast" checkpoint is starting.

The CHECKPOINT_FLUSH_ALL flag was really all about also flushing
UNLOGGED buffers, so rename it to reflect that. Likewise, the checkpoint
start log message now says "flush-unlogged" instead of "flush-all".
---
 doc/src/sgml/backup.sgml  |  2 +-
 doc/src/sgml/func.sgml|  2 +-
 doc/src/sgml/ref/checkpoint.sgml  |  2 +-
 src/backend/access/transam/xlog.c | 24 +--
 src/backend/commands/dbcommands.c | 14 +--
 src/backend/commands/tablespace.c |  2 +-
 src/backend/postmaster/checkpointer.c | 22 -
 src/backend/storage/buffer/bufmgr.c   |  6 ++---
 src/backend/tcop/utility.c|  2 +-
 src/include/access/xlog.h |  6 ++---
 .../recovery/t/041_checkpoint_at_promote.pl   |  2 +-
 11 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 25b8904baf7..5f7489afbd1 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -991,7 +991,7 @@ SELECT pg_backup_start(label => 'label', fast => false);
  usually preferable as it minimizes the impact on the running system.  If you
  want to start the backup as soon as possible, pass true as
  the second parameter to pg_backup_start and it will
- request an immediate checkpoint, which will finish as fast as possible using
+ request a fast checkpoint, which will finish as fast as possible using
  as much I/O as possible.
 
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c67688cbf5f..a49eb8a2af9 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28920,7 +28920,7 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
 will be stored.)
 If the optional second parameter is given as true,
 it specifies executing pg_backup_start as quickly
-as possible.  This forces an immediate checkpoint which will cause a
+as possible.  This forces a fast checkpoint which will cause a
 spike in I/O operations, slowing any concurrently executing queries.


diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml
index db011a47d04..10a433e4757 100644
--- a/doc/src/sgml/ref/checkpoint.sgml
+++ b/doc/src/sgml/ref/checkpoint.sgml
@@ -37,7 +37,7 @@ CHECKPOINT
   
 
   
-   The CHECKPOINT command forces an immediate
+   The CHECKPOINT command forces a fast
checkpoint when the command is issued, without waiting for a
regular checkpoint scheduled by the system (controlled by the settings in
).
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47ffc0a2307..4badd87515c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6505,7 +6505,7 @@ PerformRecoveryXLogAction(void)
 	else
 	{
 		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-		  CHECKPOINT_IMMEDIATE |
+		  C

Re: Per-role disabling of LEAKPROOF requirements for row-level security?

2025-06-16 Thread Nathan Bossart
Sorry for going on a bit of a tangent, but why is enum_eq not marked
leakproof when its code looks like this?

Datum
enum_eq(PG_FUNCTION_ARGS)
{
Oid a = PG_GETARG_OID(0);
Oid b = PG_GETARG_OID(1);

PG_RETURN_BOOL(a == b);
}

The only previous discussion I see [0] points to discussion about enum_cmp,
which seems to be more obviously non-leakproof due to its use of
enum_cmp_internal().
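
For contrast, enum_cmp is shaped roughly like this (paraphrased), and
enum_cmp_internal() can fall back to a catalog lookup that may error
out, which is what makes it hard to call leakproof:

    Datum
    enum_cmp(PG_FUNCTION_ARGS)
    {
        Oid         a = PG_GETARG_OID(0);
        Oid         b = PG_GETARG_OID(1);

        if (a == b)
            PG_RETURN_INT32(0);
        else if (enum_cmp_internal(a, b, fcinfo) > 0)
            PG_RETURN_INT32(1);
        else
            PG_RETURN_INT32(-1);
    }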

[0] https://postgr.es/m/14749.1550679857%40sss.pgh.pa.us

-- 
nathan




Re: Per-role disabling of LEAKPROOF requirements for row-level security?

2025-06-16 Thread Tom Lane
Nathan Bossart  writes:
> Sorry for going on a bit of a tangent, but why is enum_eq not marked
> leakproof when its code looks like this?

Perhaps it could be, but I'm not sure how useful that is if we don't
mark the remaining enum comparison functions leakproof.

There might be a genuine hazard if something thinks it can substitute
use of enum_cmp for enum_eq, as indeed would happen in e.g. mergejoin.

regards, tom lane




Re: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Nathan Bossart
On Mon, Jun 16, 2025 at 06:31:11PM +, Devulapalli, Raghuveer wrote:
> Attached is a simple reproducer. It passes with clang v16 -O0, but fails
> with 17 and 18 only when built with -O0..

I've just started looking into this, but the difference in code generated
for _mm512_castsi128_si512() between gcc, clang 16, and clang 17 looks
interesting.

-- 
nathan




Re: Re: Re: Fix potential overflow risks from wcscpy and sprintf

2025-06-16 Thread Yan Haibo
Thank you, Tom. You're absolutely right; this change is not necessary. I've 
updated the patch accordingly.
Best regards,
Haibo


From: Tom Lane 
Sent: June 16, 2025 12:46
To: Yan Haibo 
Cc: Peter Eisentraut ; pgsql-hackers@lists.postgresql.org 

Subject: Re: Re: Re: Fix potential overflow risks from wcscpy and sprintf

Yan Haibo  writes:
> Regarding the use of wcsncpy with LOCALE_NAME_MAX_LENGTH - 1, it is a 
> precaution in case the input string is not null-terminated.

I don't think it's a "precaution".  I think it's introducing a real
bug (that is, failure on a locale name of exactly the max allowed
length) to prevent a hypothetical bug.

regards, tom lane


0001-Mitigate-potential-overflow-risks-from-wcscpy.patch
Description: 0001-Mitigate-potential-overflow-risks-from-wcscpy.patch


RE: pg_recvlogical cannot create slots with failover=true

2025-06-16 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Sawada, Peter,

> > > I wonder if the option name --failover is ideal.  To me, it sounds like
> > > an action "do a failover!".  Also consider that pg_recvlogical has other
> > > action options such as --start and --create-slot, so it sounds a bit
> > > like those.
> >
> > Fair point.
> >
> > > Maybe something like --enable-failover would be better?
> >
> > Sounds better, but probably the --two-phase option has the same issue?
> >
> 
> Ideally, we should change both to maintain consistency across various
> slot options. OTOH, as we have already described these options as: "
> The --two-phase and --failover options can be specified with
> --create-slot.", it is clear that these are slot options. The previous
> version docs have a description: "The --two-phase can be specified
> with --create-slot to enable decoding of prepared transactions." which
> makes it even more clear that the two-phase is a slot option. The
> options are named similarly in pg_create_logical_replication_slot API
> and during CREATE SUBSCRIPTION, so, to some level, there is a
> consistency in naming of these options across all APIs/tools. But,
> their usage in a tool like pg_recvlogical could be perceived
> differently as well, so it is also okay to consider naming them
> differently.

Either name is fine with me, but I have a concern about the description. Now the
documentation says:

```
-t
--two-phase
Enables decoding of prepared transactions. This option may only be specified 
with --create-slot.
```

If we clarify that the option is aimed at the slot, should we follow the
description in protocol.sgml? I.e.,

```
-t
--two-phase
The slot supports decoding of two-phase commit. This option may only be 
specified with --create-slot.
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: pg_dump --with-* options

2025-06-16 Thread Jeff Davis
On Mon, 2025-06-16 at 16:09 -0500, Nathan Bossart wrote:
> So perhaps there's not as strong of a
> consensus as we thought.  Maybe we should ask for any new/updated
> votes.

Does it make any sense to be off by default in 18 and on in some later
release?

Regards
Jeff Davis





Re: Make COPY format extendable: Extract COPY TO format implementations

2025-06-16 Thread Michael Paquier
On Tue, Jun 17, 2025 at 08:50:37AM +0900, Sutou Kouhei wrote:
> OK. I'll implement the initial version with this
> design. (Allocating IDs local not shared.)

Sounds good to me.  Thanks Sutou-san!
--
Michael


signature.asc
Description: PGP signature


Re: pg_upgrade fails with an error "object doesn't exist"

2025-06-16 Thread Vaibhav Dalvi
Hi Tom,

Should we at least restrict dumping privileges for user objects inside
pg_catalog to avoid the pg_upgrade failure?

Regards,
Vaibhav


On Mon, Jun 16, 2025 at 7:11 PM Tom Lane  wrote:

> Daniel Gustafsson  writes:
> > On 16 Jun 2025, at 09:29, Vaibhav Dalvi 
> wrote:
> >> Why can't we strictly restrict object creation in pg_catalog?
>
> > Do you have allow_system_table_mods set to ON by any chance?  As Laurenz
> said,
> > such creation is already restricted, but it can be circumvented by using
> said
> > GUC (which is *not* intended for production usage).
>
> I think that setting only applies to creating or modifying *tables*,
> not functions.  The point of it is to keep you from breaking the C
> code's assumptions about the layout of system catalogs.
>
> Having said that, I don't see a problem here.  You're not going
> to be able to create/modify functions in pg_catalog unless you
> are superuser (or a superuser gave you permissions you shouldn't
> have).  There are already a near-infinite number of ways
> for a superuser to break the system, so this one isn't making it
> detectably worse.  Furthermore, there are legitimate use-cases
> for adding/changing functions there.  (I recall that the old
> "adminpack" extension used to do so, for example, and there are
> probably others that still do.)
>
> regards, tom lane
>


Re: BackendKeyData is mandatory?

2025-06-16 Thread Tom Lane
Tatsuo Ishii  writes:
> In the Frontend/Backend protocol, it is explained that after
> successful authentication the following messages can be sent from the
> backend to the frontend [1]:

> BackendKeyData
> ParameterStatus
> ReadyForQuery
> ErrorResponse
> NoticeResponse

> My question is whether BackendKeyData is mandatory or not. Currently
> Pgpool-II raises a fatal error if BackendKeyData is not sent before
> ReadyForQuery arrives. This is because without the message, frontend
> cannot send a CancelRequest message later on, as there's no secret
> key.

As you say, without BackendKeyData it's impossible to send a query
cancel, so we expect the server will always send that.

> I heard that some "PostgreSQL compatible" servers do not send
> BackendKeyData message to frontend. I wonder if this is a protocol
> violation.

I'd say so.  Maybe whoever that is doesn't care to support query
cancel.  They're within their rights to do that I guess, but
Pgpool-II does not have to support the case.  (A less incompatible
way of not supporting query cancel is to send dummy BackendKeyData
values and then just ignore cancel requests.  So I don't see that
you need to do anything towards this goal, if it is a goal and
not merely a broken implementation.)
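
For reference, the dummy message is cheap to produce; in protocol 3.0
BackendKeyData is just 'K' plus a fixed 12-byte payload (sketch only,
assuming a plain blocking socket):

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* emit a dummy BackendKeyData: 'K', int32 len, int32 PID, int32 key */
    static void
    send_dummy_backend_key_data(int sock)
    {
        char        buf[13];
        uint32_t    len = htonl(12);    /* length, including itself */
        uint32_t    pid = htonl(0);     /* dummy process ID */
        uint32_t    key = htonl(0);     /* dummy secret key */

        buf[0] = 'K';
        memcpy(buf + 1, &len, 4);
        memcpy(buf + 5, &pid, 4);
        memcpy(buf + 9, &key, 4);
        (void) send(sock, buf, sizeof(buf), 0);
    }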

regards, tom lane




Re: pg_dump --with-* options

2025-06-16 Thread Fujii Masao




On 2025/06/17 9:58, Nathan Bossart wrote:

On Mon, Jun 16, 2025 at 07:09:17PM -0400, Tom Lane wrote:

I find myself increasingly persuaded by Corey's point of view ...


+1


Can you clarify how using on-by-default would simplify things?
I'm not sure it actually makes the options simpler.

Regards,

--
Fujii Masao
NTT DATA Japan Corporation





BackendKeyData is mandatory?

2025-06-16 Thread Tatsuo Ishii
In the Frontend/Backend protocol, it is explained that after
successful authentication the following messages can be sent from the
backend to the frontend [1]:

BackendKeyData
ParameterStatus
ReadyForQuery
ErrorResponse
NoticeResponse

My question is whether BackendKeyData is mandatory or not. Currently
Pgpool-II raises a fatal error if BackendKeyData is not sent before
ReadyForQuery arrives. This is because without the message, frontend
cannot send a CancelRequest message later on, as there's no secret
key.

I heard that some "PostgreSQL compatible" servers do not send
BackendKeyData message to frontend. I wonder if this is a protocol
violation.

[1] 
https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-START-UP

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: pg_dump --with-* options

2025-06-16 Thread Tom Lane
Jeff Davis  writes:
> Does it make any sense to be off by default in 18 and on in some later
> release?

Probably not, especially if part of the argument for on-by-default
is to allow simplification of the switch set.  We don't get that
benefit if we ship with off-by-default, and we won't be able to
get it later.

I find myself increasingly persuaded by Corey's point of view ...

regards, tom lane




Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly

2025-06-16 Thread vignesh C
Hi Alexander,

While tracking the buildfarm for one of the other commits, I noticed this failure:
TRAP: failed Assert("s->data.restart_lsn >=
s->last_saved_restart_lsn"), File:
"../pgsql/src/backend/replication/slot.c", Line: 1813, PID: 3945797
postgres: standby: checkpointer (ExceptionalCondition+0x83) [0x55fa69b79f5e]
postgres: standby: checkpointer
(InvalidateObsoleteReplicationSlots+0x53c) [0x55fa69982171]
postgres: standby: checkpointer (CreateCheckPoint+0x9ad) [0x55fa6971feb2]
postgres: standby: checkpointer (CheckpointerMain+0x4b1) [0x55fa6996431c]
postgres: standby: checkpointer (postmaster_child_launch+0x130) [0x55fa69964b41]
postgres: standby: checkpointer (+0x40a1a7) [0x55fa699671a7]
postgres: standby: checkpointer (PostmasterMain+0x1563) [0x55fa6996aed6]
postgres: standby: checkpointer (main+0x7f0) [0x55fa6989f798]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f1876a54ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f1876a54d65]
postgres: standby: checkpointer (_start+0x21) [0x55fa696421a1]

Scorpion is failing for pg_basebackup's 020_pg_receivewal test at [1].
[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=scorpion&dt=2025-06-17%2000%3A40%3A46&stg=pg_basebackup-check

Regards,
Vignesh

On Mon, 16 Jun 2025 at 16:47, Alexander Korotkov  wrote:
>
> Dear Kuroda-san,
>
> On Mon, Jun 16, 2025 at 12:11 PM Hayato Kuroda (Fujitsu)
>  wrote:
> > Thanks for pushing the fix patch! BTW, I have few comments for your commits.
> > Can you check and include them if needed?
> >
> > 01.
> > ```
> > $node->append_conf('postgresql.conf',
> > "shared_preload_libraries = 'injection_points'");
> > ```
> >
> > No need to set shared_preload_libraries in 046/047. ISTM it must be set 
> > when we
> > enable the statistics.
> >
> > 02.
> > We should also check whether the injection_points can be installed or not.
> > You can check check_extension() and callers.
>
> Thank you!  All of these totally make sense.  The updated patch is attached.
>
> --
> Regards,
> Alexander Korotkov
> Supabase




Re: BackendKeyData is mandatory?

2025-06-16 Thread David G. Johnston
On Monday, June 16, 2025, Tatsuo Ishii  wrote:

>
> My question is, BackendKeyData is mandatory or not. Currently
> Pgpool-II raises a fatal error if BackendKeyData is not sent before
> ReadyForQuery arrives. This is because without the message, frontend
> cannot send a CancelRequest message later on, as there's no secret
> key.


I wouldn't expect a proxy to make a judgement here, but simply to forward
what does show up and otherwise stay silent.  If proxy-layer code is
needed to deal with its absence, ignoring the cancel attempt with a log
warning would be sufficient.  Otherwise, the user has made their choices
and this is an optional feature in practice (though resorting to
pg_cancel_backend() may be required for truly hung processes).

David J.


Re: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Andy Fan
"Devulapalli, Raghuveer"  writes:

> Great catch! From the intrinsic manual: 
>
> Cast vector of type __m128i to type __m512i; the upper 384 bits of the
> result are undefined.

Just curious: what kind of optimization (like what -O2 does) could
mask this issue?

> Replacing that with _mm512_zextsi128_si512 fixes the problem.

congratulations!

-- 
Best Regards
Andy Fan





RE: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Devulapalli, Raghuveer
> Just curious: what kind of optimization (like what -O2 does) could mask 
> this
> issue?

>= -O1. Only -O0 shows this problem.

Raghuveer 
 




RE: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Devulapalli, Raghuveer
Great catch! From the intrinsic manual: 

Cast vector of type __m128i to type __m512i; the upper 384 bits of the result 
are undefined.

Replacing that with _mm512_zextsi128_si512 fixes the problem. 
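
A minimal illustration of the difference between the two intrinsics
(requires AVX-512F):

    #include <immintrin.h>

    __m512i
    widen(__m128i x)
    {
        /* upper 384 bits undefined -- may expose stale register data: */
        __m512i u = _mm512_castsi128_si512(x);

        /* upper 384 bits guaranteed zero: */
        __m512i z = _mm512_zextsi128_si512(x);

        (void) u;
        return z;
    }

Presumably at -O1 and above the value happens to land in a register
whose upper lanes are already zero, which would explain why only -O0
exposes the stale bits.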

> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, June 16, 2025 3:14 PM
> To: Devulapalli, Raghuveer 
> Cc: John Naylor ; Andy Fan
> ; Jesper Pedersen ;
> Tomas Vondra ; pgsql-hackers@lists.postgresql.org;
> Shankaran, Akash 
> Subject: Re: Improve CRC32C performance on SSE4.2
> 
> On Mon, Jun 16, 2025 at 06:31:11PM +, Devulapalli, Raghuveer wrote:
> > Attached is a simple reproducer. It passes with clang v16 -O0, but
> > fails with 17 and 18 only when built with -O0..
> 
> I've just started looking into this, but the difference in code generated for
> _mm512_castsi128_si512() between gcc, clang 16, and clang 17 looks 
> interesting.
> 
> --
> nathan




Re: [BUG?] check_exclusion_or_unique_constraint false negative

2025-06-16 Thread Mihail Nikalayeu
Rebased.


v8-0001-Add-an-isolation-test-to-reproduce-a-dirty-snapsh.patch
Description: Binary data


v8-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patch
Description: Binary data


Re: pg_dump --with-* options

2025-06-16 Thread Fujii Masao




On 2025/06/17 4:35, Corey Huinker wrote:

I noticed that --statistics (i.e., the current --with-statistics) causes
statistics to be dumped even when used with --data-only or --schema-only.
So, as far as I understand, here are the possible combinations of dump
targets and options:


Those should also be mutually exclusive, and I'll write up a patch to add them 
to the checks.


--sequence-data behaves similarly, i.e., it still dumps sequence data
even when used with --schema-only. So I've been thinking of both
--statistics and --sequence-data as options that include additional data,
regardless of whether --*-only is specified.

It seems better to keep their behavior consistent to avoid confusing users.

Regards,

--
Fujii Masao
NTT DATA Japan Corporation





Re: Returning nbtree posting list TIDs in DESC order during backwards scans

2025-06-16 Thread Peter Geoghegan
On Mon, Jun 16, 2025 at 12:46 PM Peter Geoghegan  wrote:
> Making this change in _bt_readpage creates a problem in _bt_killitems:
> it currently expects posting list TIDs in ascending order (the loop
> invariant relies on that), which is why the deduplication patch didn't
> just do things like this in _bt_readpage to begin with. The patch
> deals with that new problem by qsort'ing the so->killedItems[] array
> (at the top of _bt_killitems) such that the items always appear
> exactly once, in the expected ASC order.

Actually, we can just completely get rid of so->killedItems[]. We can
replace it with a new 1 bit itemDead field in the BTScanPosItem struct
(the struct used for so->currPos.items[] entries). That way, the patch
avoids a qsort. The patch doesn't need to change the order of anything
(except of course for the order that _bt_readpage initially saves
posting list tuple TIDs in, when scanning backwards).

The problem with so->killedItems[] is that it stores things in
whatever order successive btgettuple calls find most convenient in the
kill_prior_tuple path. It turns out that it's even more convenient for
btgettuple if this kill_prior_tuple path simply sets the relevant
(prior/just-returned) item's so->currPos.items[].itemDead field. There
is no need to allocate memory for so->killedItems[], and no need to
worry about running out of space in so->killedItems[] in cases
involving scrollable cursors that happen to scroll back and forth,
triggering kill_prior_tuple for the same TID. We'll simply set the
TID/item's so->currPos.items[].itemDead field a second or a third
time, which can't complicate things, since my new approach makes that
idempotent.

Attached is v2 of the patch, which works that way. This seems like a
better direction to take things in within _bt_killitems.

In general we're quite sensitive to the size of the BTScanPosItem
struct, since (at least for now) we routinely allocate
MaxTIDsPerBTreePage-many of them (i.e. 1,358 of them). 1,358 * 10
bytes is already too much. But there's no need to make the relevant
allocation even larger. I can steal a bit from
BTScanPosItem.tupleOffset for these new
so->currPos.items[].itemDead/BTScanPosItem.itemDead fields. That's
safe, since we only really need 15 bits for each
BTScanPosItem.tupleOffset offset.

--
Peter Geoghegan


v2-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch
Description: Binary data

