Re: Statistics Import and Export

2025-02-25 Thread Jeff Davis
On Tue, 2025-02-25 at 11:30 +0800, jian he wrote:
> maybe we can just use
> elog(ERROR, "pg_class entry for relid %u not found", reloid)));

Thank you.

> also in stats_lock_check_privileges.
> check_inplace_rel_lock related comments should be removed?

In-place update locking rules still apply when updating pg_class or
pg_database even if the current caller is not performing an in-place
update. It might be better to point instead to 
check_lock_if_inplace_updateable_rel()?

Regards,
Jeff Davis




Re: Remove wal_[sync|write][_time] from pg_stat_wal and track_wal_io_timing

2025-02-25 Thread Bertrand Drouvot
Hi,

On Tue, Feb 25, 2025 at 02:20:48PM +0900, Michael Paquier wrote:
> On Mon, Feb 24, 2025 at 01:08:03PM +, Bertrand Drouvot wrote:
> > I agree that we've to put everything in the picture (system with or without
> > cheap timing functions, lock contention and WAL flush disk time) and that we
> > can certainly find configuration/workload that would get benefits from a
> > dedicated track_wal_io_timing GUC.
> > 
> > PFA a patch to re-introduce the track_wal_io_timing GUC and to ensure that
> > the WAL write and fsync timing activities are tracked in pg_stat_io (and
> > pg_stat_get_backend_io()) only if both track_io_timing and
> > track_wal_io_timing are enabled.
> > 
> > Note that to change the a051e71e28a behavior, the attached patch adds an
> > extra "track_io_guc" parameter to pgstat_count_io_op_time() (like
> > pgstat_prepare_io_time already had in a051e71e28a).
> 
> +  bool  track_timing = track_io_timing && track_wal_io_timing;
> -  start = pgstat_prepare_io_time();
> +  start = pgstat_prepare_io_time(track_timing);
> 
> This does not represent exactly what I am understanding from the
> comment of upthread, because WAL IO timings would require both
> track_wal_io_timing and track_io_timing to be enabled with this patch.
> It seems to me that what we should do is to decouple completely the
> timings for WAL and non-WAL IO across the two GUCs track_wal_io_timing
> and track_io_timing, without any dependency between one and the other.
> This way, it is possible to only enable one of them without affecting
> the paths of the other, or both if you are ready to pay the price.

The idea was to not let track_io_timing alone enable the timing in the WAL
code path. My reasoning was: if you want to see timing in pg_stat_io then you
need to enable track_io_timing, but that's not enough if you also want to see
WAL timing; for that you also need to set track_wal_io_timing. Your proposal
also ensures that track_io_timing alone cannot enable the timing in the WAL
code path, with a clearer separation of duties, so it's probably better and
I'm fine with it.
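To spell out the difference (a sketch based on the snippets quoted above, not
the committed code):

    /* v1: WAL timing requires both GUCs */
    start = pgstat_prepare_io_time(track_io_timing && track_wal_io_timing);

    /* v2: fully decoupled -- each code path consults only its own GUC */
    start = pgstat_prepare_io_time(track_wal_io_timing);  /* WAL write/fsync */
    start = pgstat_prepare_io_time(track_io_timing);      /* non-WAL IO */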

> Two new things tracked in pg_stat_io are WALRead() and XLogPageRead(),
> which are used at recovery, and for consistency with the rest there is
> a good argument for controlling these as well with
> track_wal_io_timing, I guess.

Yeah, I thought about it too but decided not to change the READ part in v1
(because I think reads are less of a concern). OTOH, if you want to see the
READ timing then you need to set track_wal_io_timing, but once the recovery
is over you'll need to disable track_wal_io_timing again (if you don't want
to pay the price for the write/fsync activities).

OTOH, pre a051e71e28a, track_wal_io_timing impacted only the writes/fsyncs;
in that regard v1 was close to that.

I think both ideas (v1 / v2) have pros and cons and I don't have a strong
opinion on it (though I do prefer v1 a bit for the reasons mentioned above).

>  void
>  pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> -instr_time start_time, uint32 cnt, uint64 bytes)
> +instr_time start_time, uint32 cnt, uint64 bytes,
> +bool track_io_guc)
> 
> Not much a fan of the new argument to pass the GUC value, which could
> be error prone.  It would be simpler to check that start_time is 0
> instead.  There is no need to change the other callers of
> pgstat_count_io_op_time() if we do that.

Yeah, I thought about it too:

-   if (track_io_guc)
+   if (!INSTR_TIME_IS_ZERO(start_time))

INSTR_TIME_IS_ZERO is cheap so it is fine by me.
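For context, the sentinel pattern we are converging on looks roughly like this
(a sketch following the snippets above, not the exact committed code):

    /* A zeroed start time doubles as "timing is not tracked" */
    instr_time
    pgstat_prepare_io_time(bool track_io_guc)
    {
        instr_time  io_start;

        if (track_io_guc)
            INSTR_TIME_SET_CURRENT(io_start);
        else
            INSTR_TIME_SET_ZERO(io_start);

        return io_start;
    }

    /* ... and in pgstat_count_io_op_time(), no extra parameter is needed: */
    if (!INSTR_TIME_IS_ZERO(start_time))
    {
        /* compute and accumulate the elapsed time */
    }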

> pgstat_count_backend_io_op_time() would have triggered an assertion
> failure, it needs to be adjusted.

With v2 in place, yeah.

> With more tweaks applied to the docs, the attached is my result.

In the track_io_timing GUC description, shouldn't we also mention the
wal object restriction? Something like:

"
in pg_stat_database, pg_stat_io (if object is not wal), in the output of the
pg_stat_get_backend_io() function (if object is not wal)
"

Also, in the track_wal_io_timing GUC description, add the wal object
restriction for the function:

"
and in the output of the pg_stat_get_backend_io() function for the object wal
"

The proposed doc changes are in the attached .txt (which applies on top of v2).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6186648ccaf..e55700f35b8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8378,9 +8378,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;
 displayed in 
 pg_stat_database,
 
-pg_stat_io, in the output of the
+pg_stat_io (if object
+is not wal), in the output of the
 
-pg_stat_get_backend_io() function, in the
+pg_st

Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.

2025-02-25 Thread Shubham Khanna
On Fri, Feb 21, 2025 at 8:31 AM Peter Smith  wrote:
>
> On Thu, Feb 20, 2025 at 5:44 PM Shubham Khanna
>  wrote:
> >
> > On Mon, Feb 17, 2025 at 9:49 AM Peter Smith  wrote:
> > >
> ...
> > > ==
> > > 1 GENERAL. Option Name?
> > >
> > > Wondering why the patch is introducing more terminology like
> > > "cleanup"; if we are dropping publications then why not say "drop"?
> > > Also I am not sure if "existing" means anything because you cannot
> > > cleanup/drop something that is not "existing".
> > >
> > > IOW, why not call this the "--drop-publications" option?
> > >
> >
> > I have retained the option name as '--cleanup-existing-publications'
> > for now. However, I understand the concern regarding terminology and
> > will revise it in the next patch version once there is a consensus on
> > the final name.
> >
>
> BTW, Is it even correct to assume that the user has to choose between
> (a) clean everything or (b) clean nothing? That is why I felt
> something that mentions only ‘publication’ gives more flexibility.
> Later, when some DROP  feature might be added then we can
> make additional --drop-XXX switches, or a --drop-all switch for
> cleaning everything
>
> //
>
> Some more review comments for patch v8-0001
>
> ==
> Commit message
>
> 1.
> To prevent accidental removal of user-created publications on the subscriber,
> this cleanup process only targets publications that existed on the source
> server and were replicated to the subscriber.
> Any manually created publications on the subscriber will remain intact.
>
> ~
>
> Is this scenario (manually created publications on the subscriber)
> possible to do?
> If yes, then there should be a test case for it
> If not, then maybe this whole paragraph needs rewording.
>
> ~~~

Fixed.

>
> 2.
> This feature is optional and only takes effect when the
> '--cleanup-existing-publications' option is explicitly specified.
> Since publication cleanup is not required when upgrading streaming replication
> clusters, this option provides users with the flexibility to enable or skip 
> the
> cleanup process as needed.
>
> ~
>
> AFAICT, this is pretty much just saying that the switch is optional
> and that the feature only does something when the switch is specified
> by the user. But, doesn't all that just go without saying?
>
>

Fixed.

> ==
> src/bin/pg_basebackup/pg_createsubscriber.c
>
> 3.
> + /* Drop all existing publications if requested */
> + if (drop_publications)
> + {
> + pg_log_info("Dropping all existing publications in database \"%s\"",
> + dbinfo[i].dbname);
> + drop_all_publications(conn, dbinfo[i].dbname);
> + }
> +
>   /*
>   * Since the publication was created before the consistent LSN, it is
>   * available on the subscriber when the physical replica is promoted.
>   * Remove publications from the subscriber because it has no use.
> + * Additionally, drop publications existed before this command if
> + * requested.
>   */
>   drop_publication(conn, &dbinfo[i]);
> ~
>
> 3a.
> The logic of this code seems strange - e.g. it is optionally dropping
> all publications, and then dropping yet another one again. Won't that
> be already dropped as part of the drop_all?
>
> ~
>

Fixed.

> 3b.
> Anyway, perhaps the logic should be encapsulated in a new helper
> check_and_drop_existing_publications() function to match the existing
> 'check_and_drop_existing_subscriptions()
>
> ~
>

Added a new function 'check_and_drop_existing_publications()'.
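For reference, the shape of that helper is roughly the following (a sketch
mirroring check_and_drop_existing_subscriptions(); the flag name is
illustrative, not the actual patch code):

    static void
    check_and_drop_existing_publications(PGconn *conn,
                                         struct LogicalRepInfo *dbinfo,
                                         bool drop_all)
    {
        if (drop_all)                          /* full cleanup requested */
            drop_all_publications(conn, dbinfo->dbname);
        else
            drop_publication(conn, dbinfo);    /* only the one we created */
    }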

> 3c.
> I felt logs like "Dropping all existing publications in database
> \"%s\"" should be logged within the function drop_all_publications,
> not at the caller.
>
> ~~~

Fixed.

>
> _drop_one_publication:
>
> 4.
> -/*
> - * Remove publication if it couldn't finish all steps.
> - */
> +/* Helper function to drop a single publication by name. */
>  static void
> -drop_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
> +_drop_one_publication(PGconn *conn, const char *pubname, const char *dbname)
>
> A better name for this helper might be drop_publication_by_name()
>
> ~~~

Removed this function.

>
> 5.
> + appendPQExpBuffer(query, "DROP PUBLICATION %s", pubname);
> + pg_log_debug("Executing: %s", query->data);
> + pg_log_info("Dropping publication %s in database \"%s\"", pubname,
> dbinfo->dbname);
>
> That debug seems more typically written as:
> pg_log_debug("command is: %s", query->data);
>
> ~~~

Fixed.

>
> drop_all_publications:
>
> 6.
> +/* Drop all publications in the database. */
> +static void
> +drop_all_publications(PGconn *conn, const char *dbname)
> +{
> + PGresult   *res;
> + int count = 0;
>
> The 'count' variable seems unused.
>
> ~~~

Removed the usage of 'count'.

>
> 7.
> + for (int i = 0; i < PQntuples(res); i++)
> + {
> + char*pubname_esc = PQescapeIdentifier(conn, PQgetvalue(res, i, 0),
> + strlen(PQgetvalue(res, i, 0)));
> +
> + _drop_one_publication(conn, pubname_esc, dbname);
> + PQfreemem(pubname_esc);
> + count++;
> + }
>
> Instead of escaping the name here, and also e

Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.

2025-02-25 Thread Shubham Khanna
On Fri, Feb 21, 2025 at 3:06 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shubham,
>
> Thanks for updating the patch. Here are comments.
>
> 01.
> ```
> /*
>  * Create the subscriptions, adjust the initial location for logical
>  * replication and enable the subscriptions. That's the last step for logical
>  * replication setup. If 'drop_publications' is true, existing publications on
>  * the subscriber will be dropped before creating new subscriptions.
>  */
> static void
> setup_subscriber(struct LogicalRepInfo *dbinfo, const char *consistent_lsn,
>  bool drop_publications)
> ```
>
> Even drop_publications is false, at least one publication would be dropped. 
> The
> argument is not correct. I prefer previous name.
>

Fixed.

> 02.
> ```
> /* Drop all existing publications if requested */
> if (drop_publications)
> {
> pg_log_info("Dropping all existing publications in 
> database \"%s\"",
> dbinfo[i].dbname);
> drop_all_publications(conn, dbinfo[i].dbname);
> }
>
> /*
>  * Since the publication was created before the consistent 
> LSN, it is
>  * available on the subscriber when the physical replica is 
> promoted.
>  * Remove publications from the subscriber because it has no 
> use.
>  * Additionally, drop publications existed before this 
> command if
>  * requested.
>  */
> drop_publication(conn, &dbinfo[i]);
> ```
>
> It is quite strange that drop_publication() is called even when 
> drop_all_publications() is called.
>

Fixed.

> 03.
> Also, I'm not sure pg_log_info() is needed here. _drop_one_publication() 
> outputs dropping publications.
>

Fixed.

> 04.
> ```
> /* Helper function to drop a single publication by name. */
> static void
> _drop_one_publication(PGconn *conn, const char *pubname, const char *dbname)
> ```
>
> Functions recently added do not have prefix "_". I think it can be removed.
>

Removed this function.

> 05.
> ```
> pg_log_debug("Executing: %s", query->data);
> pg_log_info("Dropping publication %s in database \"%s\"", pubname, 
> dbinfo->dbname);
> ```
> In _drop_one_publication(), dbname is used only for the message. Can we move 
> to pg_log_info()
> outside the function and reduce the argument?
>

Fixed.

> 06.
> Also, please start with lower case.
>

Fixed.

> 07.
> Also, please preserve the message as much as possible. Previously they are:
> ```
> pg_log_info("dropping publication \"%s\" in database \"%s\"", 
> dbinfo->pubname, dbinfo->dbname);
> pg_log_debug("command is: %s", str->data);
> ```
>

Fixed.

> 08.
> I feel we must update made_publication.
>

Fixed.

The attached patch at [1] contains the suggested changes.

[1] - 
https://www.postgresql.org/message-id/CAHv8Rj%2BzmkwunNubo%2B_Gp26S_qXbD%3Dp%2BrMfEnPnjiEE1A6GTXQ%40mail.gmail.com

Thanks and regards,
Shubham Khanna.




Security Label Inheritance

2025-02-25 Thread Damien Clochard

Hi

My name's Damien Clochard and I'm the main developer of the PostgreSQL 
Anonymizer extension. This extension relies heavily on security labels. 
Among other things, users can add labels to define a "masking rule" upon 
a column and to declare that a role is "masked".


More details on how this works are in the documentation:
https://postgresql-anonymizer.readthedocs.io/en/latest/dynamic_masking/

Currently a security label is applied to one and only one object. In 
particular, a label does not apply to the objects that inherit from the 
object it is originally associated with.


Consider the following:

ALTER DATABASE foo SET session_preload_libraries = 'anon';
ALTER DATABASE foo SET anon.transparent_dynamic_masking = TRUE;

CREATE TABLE people AS SELECT 5432 AS id, 'slonik' AS name;
CREATE EXTENSION anon;
SECURITY LABEL FOR anon ON COLUMN people.name IS 'MASKED WITH VALUE 
NULL';

CREATE ROLE extern INHERIT;
SECURITY LABEL FOR anon ON ROLE extern IS 'MASKED';
GRANT USAGE ON SCHEMA public TO extern;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO extern;
CREATE ROLE joe LOGIN IN ROLE extern;

When connecting as role joe, I will have all the privileges granted to 
extern. However, joe is not affected by the security label defined on 
the extern group. He can read the personal data.


\connect - joe

SELECT * FROM people;
  id  |  name
------+--------
 5432 | slonik

I need to explicitly change the role for the label to be applied; in 
that case the mask upon the people.name column is applied:


SET ROLE extern;

SELECT * FROM people;
  id  | name
------+------
 5432 |

I am not suggesting that this behaviour should change but I receive a 
lot of feedback from users showing some confusion as they would expect 
that the INHERIT clause would also apply to security labels, just like 
granted privileges.


A similar confusion occurs with inherited tables:

\connect - postgres
CREATE TABLE french (eat_frogs BOOLEAN) INHERITS(people);
INSERT INTO french VALUES (0, 'damien', FALSE);
GRANT SELECT ON TABLE french TO extern;

SET ROLE extern;

The mask is applied to the parent table

SELECT * FROM people;
  id  |  name
------+--------
 5432 |
    0 |

But it is not applied to the child table

SELECT * FROM french;
 id |  name  | eat_frogs
----+--------+-----------
  0 | damien | f

Again most users expect that the masking rule on people.name would also 
be applied to french.name.



So my first question is: do you think it would be helpful to update the 
SECURITY LABEL command documentation to clarify that security labels are 
not affected by object inheritance?


My second question is more open: do you think it would be worth adding 
a new way to declare that a security label applies to an object and all 
the objects that inherit from it? As I understand it, this would concern 
only roles and tables.


Maybe a new optional `[ [WITH] INHERIT | NOINHERIT ]` syntax at the end 
of the SECURITY LABEL command


Something like this:

SECURITY LABEL FOR anon ON ROLE extern IS 'MASKED' WITH INHERIT;

SECURITY LABEL FOR anon ON COLUMN people.name
  IS 'MASKED WITH VALUE NULL'
  WITH INHERIT;

The default would be NOINHERIT and all extensions that rely on the 
current behaviour would continue to work without any change.


Let me know if I missed anything or if there's another way to achieve 
this kind of security label inheritance.
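For the record, on the role side I can approximate this today by resolving 
membership at lookup time instead of at labeling time. A rough sketch of 
such a lookup (labels on roles live in the shared catalog pg_shseclabel):

SELECT l.label
FROM pg_shseclabel l
JOIN pg_roles r ON r.oid = l.objoid
WHERE l.provider = 'anon'
  AND l.classoid = 'pg_authid'::regclass
  AND pg_has_role(current_user, r.oid, 'MEMBER');

With this, joe would pick up the 'MASKED' label placed on extern, but it 
only covers roles, not inherited tables.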


Regards,

--
Damien





Re: Improve CRC32C performance on SSE4.2

2025-02-25 Thread John Naylor
On Tue, Feb 25, 2025 at 3:17 AM Devulapalli, Raghuveer
 wrote:
>
> > Here's another idea to make it more automatic: Give up on initializing every
> > capability at once.
>
> I'm not sure I like giving this up. Initializing and running the CPUID check 
> with the attribute constructor is very valuable for two reasons: (1) you get 
> everything done at load time before main and (2) you don’t have to run the 
> cpuid check for every feature (popcount, crc32c, or anything else you add in 
> the future) multiple times. It keeps the cpuid functionality in a central 
> place, which makes it a modular design.

I agree it would be preferable to make a centralized check work.

> On MSVC, we could have the first SIMD feature call pg_cpucap_initialize(), 
> which runs CPUID and stores the CPU features. Any subsequent call can skip it 
> (because it has already been initialized) by using a static variable or some 
> other approach. Does this make sense?

Correct me if I'm misunderstanding, but this sounds like in every
frontend program we'd need to know what the first call was, which
seems less maintainable than just initializing at the start of every
frontend program.
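
For clarity, the guarded-init variant described above would look something
like this (a sketch; the capability flag is illustrative):

    #include <stdbool.h>

    static bool pg_cpucap_initialized = false;
    static bool pg_cpucap_has_sse42 = false;    /* illustrative flag */

    /* Run CPUID once and cache the results; later calls only pay a branch. */
    static inline void
    pg_cpucap_initialize(void)
    {
        if (!pg_cpucap_initialized)
        {
            /* run CPUID here and fill in the capability flags */
            pg_cpucap_initialized = true;
        }
    }

    bool
    pg_crc32c_sse42_available(void)
    {
        pg_cpucap_initialize();     /* every capability check calls this */
        return pg_cpucap_has_sse42;
    }

That is, each check self-initializes, at the cost of a branch per call.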


--
John Naylor
Amazon Web Services




Re: Changing shared_buffers without restart

2025-02-25 Thread Dmitry Dolgov
> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal about how to
coordinate shared memory resizing between backends. The rest is more or less
the same; feedback about the coordination is appreciated. It's a lot to read,
but the main differences are:

1. Allowing a GUC value change to be decoupled from actually applying it, sort
of a "pending" change. The idea is to let custom logic be triggered from an
assign hook, which then takes responsibility for what happens later and how
the change is going to be applied. This allows the regular GUC infrastructure
to be used in cases where a value change requires some complicated processing.
I was trying to keep the change minimally invasive; it's also still missing
GUC reporting.

2. The shared memory resizing patch became more complicated due to the
coordination between backends. The current implementation was chosen from a
few more or less equal alternatives, which evolve along the following lines:

* There should be one "coordinator" process overseeing the change. Having the
postmaster fulfill this role, as in this patch, seems like a natural idea, but
it poses certain challenges since the postmaster doesn't have locking
infrastructure. Another option would be to elect a single backend as
coordinator, which would handle the postmaster as a special case. If there is
ever a dedicated "coordinator" worker in Postgres, it would be useful here.

* The coordinator uses EmitProcSignalBarrier to reach out to all other backends
and trigger the resize process. Backends join a Barrier to synchronize and wait
until everyone is finished.

* There is some resizing state stored in shared memory, which is there to
handle backends that were for some reason late or didn't receive the signal.
What to store there is open for discussion.

* Since we want to make sure all processes share the same understanding of what
the NBuffers value is, any failure is mostly a hard stop: rolling back the
change would need coordination as well, which sounds a bit too complicated for
now.
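
To sketch the intended flow (every name below is illustrative, not the
patch's actual identifiers):

    typedef struct ShmemResizeCtl
    {
        pg_atomic_uint32 target_nbuffers;   /* pending NBuffers value */
        Barrier          barrier;           /* backends rendezvous here */
    } ShmemResizeCtl;

    static ShmemResizeCtl *ResizeCtl;       /* lives in shared memory */

    /* Coordinator: publish the pending size, then signal all backends */
    static void
    resize_coordinate(int pending_nbuffers)
    {
        pg_atomic_write_u32(&ResizeCtl->target_nbuffers, pending_nbuffers);
        EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
    }

    /* Each backend, when absorbing the ProcSignal barrier */
    static void
    resize_absorb(void)
    {
        BarrierAttach(&ResizeCtl->barrier);
        resize_local_mappings();    /* illustrative: mremap() the segments */
        BarrierArriveAndWait(&ResizeCtl->barrier, 0);   /* wait for everyone */
        BarrierDetach(&ResizeCtl->barrier);
        NBuffers = (int) pg_atomic_read_u32(&ResizeCtl->target_nbuffers);
    }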

We've tested this change manually for now, although it might be useful to try
out injection points. The testing strategy, which has caught plenty of bugs,
was simply to run a pgbench workload against a running instance and change
shared_buffers on the fly. Some more subtle cases were verified by manually
injecting delays to trigger the expected scenarios.

To reiterate, here is the patch breakdown:

Patches 1-3 prepare the infrastructure and shared memory layout. They could be
useful even with a multithreaded PostgreSQL, when there will be no need for
shared memory. I assume that in the multithreaded world there will still be a
need for a contiguous chunk of memory to share between threads, and its layout
would be similar to the one with shared memory mappings. Note that patch nr 2
is going away as soon as I get to implementing shared memory address
reservation, but for now it's needed.

Patch 4 is a new addition to handle "pending" GUC changes.

Patch 5 actually does the resizing. It's shared memory specific of course, and
it utilizes the Linux-specific mremap, meaning there are open portability
questions.

Patch 6 is somewhat independent, but quite convenient to have. It also utilizes
the Linux-specific call memfd_create.

I would like to get some feedback on the synchronization part. While waiting,
I'll proceed with implementing shared memory address space reservation, and
Ashutosh will continue with buffer eviction to support shared memory reduction.
>From d88185fb3b4a3a0e102a3af52f4fb5564468db15 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Wed, 19 Feb 2025 17:43:13 +0100
Subject: [PATCH v2 1/6] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways the shared memory can be organized.

Introduce the possibility of allocating multiple shared memory mappings, where
a single mapping is associated with a specified shared memory segment. There
is only a fixed number of available segments; currently only the one main
shared memory segment is allocated. A new shared memory API is introduced,
extended with a segment as a new parameter. As a path of least resistance, the
original API is kept in place, utilizing the main shared memory segment.
---
 src/backend/port/posix_sema.c   |   4 +-
 src/backend/port/sysv_sema.c|   4 +-
 src/backend/port/sysv_shmem.c   | 138 ++-
 src/backend/port/win32_sema.c   |   2 +-
 src/backend/storage/ipc/ipc.c   |   4 +-
 src/backend/storage/ipc/ipci.c  |  63 +++--
 src/backend/storage/ipc/shmem.c | 141 +++-
 src/backend/storage/lmgr/lwlock.c   |   5 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/ipc.h   |   2 +-
 src/includ

Re: Restrict copying of invalidated replication slots

2025-02-25 Thread Amit Kapila
On Tue, Feb 25, 2025 at 1:03 AM Masahiko Sawada  wrote:
>
> I've checked if this issue exists also on v15 or older, but IIUC it
> doesn't exist, fortunately. Here is the summary:
>
> Scenario-1: the source gets invalidated before the copy function
> fetches its contents for the first time. In this case, since the
> source slot's restart_lsn is already an invalid LSN it raises an error
> appropriately. In v15, we have only one slot invaldation reason, WAL
> removal, therefore we always reset the slot's restart_lsn to
> InvalidXlogRecPtr.
>
> Scenario-2: the source gets invalidated before the copied slot is
> created (i.e., between first content copy and
> create_logical/physical_replication_slot()). In this case, the copied
> slot could have a valid restart_lsn value that however might point to
> a WAL segment that might have already been removed. However, since
> copy_restart_lsn will be an invalid LSN (=0), it's caught by the
> following if condition:
>
> if (copy_restart_lsn < src_restart_lsn ||
> src_islogical != copy_islogical ||
> strcmp(copy_name, NameStr(*src_name)) != 0)
> ereport(ERROR,
> (errmsg("could not copy replication slot \"%s\"",
> NameStr(*src_name)),
>  errdetail("The source replication slot was
> modified incompatibly during the copy operation.")));
>
> Scenario-3: the source gets invalidated after creating the copied slot
> (i.e. after create_logical/physical_replication_slot()). In this case,
> since the newly copied slot have the same restart_lsn as the source
> slot, both slots are invalidated.
>

Which part of the code will cover Scenario-3? Shouldn't we give ERROR
for Scenario-3 as well?

-- 
With Regards,
Amit Kapila.




Re: long-standing data loss bug in initial sync of logical replication

2025-02-25 Thread Amit Kapila
On Tue, Feb 25, 2025 at 3:56 PM Benoit Lobréau
 wrote:
>
> After reading the thread and doing a bit of testing, the problem seems
> significant and is still present. The fact that it's probably not well
> known makes it more concerning, in my opinion. I was wondering what
> could be done to help move this topic forward (given my limited abilities)?
>

You can help with the review/test of the proposed patch. Also, help
with the performance impact of the patch, if possible. Shlok has done
some performance testing of the patch, which you can reproduce
independently, and Sawada-San has asked for more performance tests in
his last email (1), which you can also help with.

(1) - 
https://www.postgresql.org/message-id/CAD21AoDoWc8MWTyKtmNF_606bcW6J0gV%3D%3Dr%3DVmPXKUN-e3o9ew%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Security Label Inheritance

2025-02-25 Thread Andres Freund
Hi,

On February 25, 2025 10:08:44 AM GMT+01:00, Damien Clochard 
 wrote:
>So my first question is : Do you think it would be helpful to update the 
>SECURITY LABEL command documentation to clarify that security labels are not 
>concerned by object inheritance ?

Couldn't hurt.


>My second question is more open : do you think it would be worth adding a new 
>way to declare that a security label applies to an object and all its 
>inheritants ?  As I understand this would concern only roles and tables.
>
>Maybe a new optional `[ [WITH] INHERIT | NOINHERIT ]` syntax at the end of the 
>SECURITY LABEL command
>
>Something like this :
>
>SECURITY LABEL FOR anon ON ROLE extern IS 'MASKED' WITH INHERIT;
>
>SECURITY LABEL FOR anon ON COLUMN people.name
>  IS 'MASKED WITH VALUE NULL'
>  WITH INHERIT;
>
>The default would be NOINHERIT and all extensions that rely on the current 
>behaviour would continue to work without any change.

I doubt that is viable. That'd mean we somehow need to teach the label
infrastructure about all kinds of inheritance *and* make that recursive label
collection fast. The caching right now uses generic infrastructure; it
certainly couldn't with inheritance support. That'd be a fair bit of
infrastructure.

Greetings, 

Andres 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Conflict detection for multiple_unique_conflicts in logical replication

2025-02-25 Thread Nisha Moond
On Mon, Feb 24, 2025 at 7:39 AM Peter Smith  wrote:
>
> Hi Nisha.
>
> Some review comments for patch v1-0001
>
> ==
> GENERAL
>
> 1.
> This may be a basic/naive question, but it is unclear to me why we
> care about the stats of confl_multiple_unique_conflicts?
>
> I can understand it would be useful to get multiple conflicts logged
> at the same time so the user doesn't have to stumble across them one
> by one when fixing them, but as far as the stats go, why do we care
> about this stat? Also, since you don't distinguish between multiple
> insert conflicts versus multiple update conflicts the stat usefulness
> seemed even more dubious.
>
> (because of this question, I skipped reviewing some parts of this
> patch related to stats)
>

IMO, tracking multiple_unique_conflicts, like other conflict stats,
helps users understand their workload better, especially since in
these cases, stats aren't gathered under either insert_exists or
update_exists.
I'll wait for others' opinions too on the need for the stats in this case.

> ~~~
>
> 12.
> +ok( $logfile =~
> +   qr/conflict detected on relation \"public.conf_tab\":
> conflict=multiple_unique_conflicts*\n.*DETAIL:.* The remote tuple
> violates multiple unique constraints on the local table./,
> + 'multiple_unique_conflicts detected during update');
>
> (same as #12)
>
> Won't it be more useful to also log the column name causing the
> conflict? Otherwise, if there are 100s of unique columns and just 2 of
> them are conflicts how will the user know what to look for when fixing
> it?
>

The conflicting column details are logged. In the test case, only the
header line of the DETAIL message is compared to keep it simple.
For example, the full LOG message will look like:

ERROR:  conflict detected on relation "public.conf_tab":
conflict=multiple_unique_conflicts
DETAIL:  The remote tuple violates multiple unique constraints on the
local table.
Key already exists in unique index "conf_tab_pkey", modified
locally in transaction 757 at 2025-02-25 14:00:56.955403+05:30.
Key (a)=(2); existing local tuple (2, 2, 2); remote tuple (2, 3, 4).
Key already exists in unique index "conf_tab_b_key", modified
locally in transaction 758 at 2025-02-25 14:00:56.957092+05:30.
Key (b)=(3); existing local tuple (3, 3, 3); remote tuple (2, 3, 4).
Key already exists in unique index "conf_tab_c_key", modified
locally in transaction 759 at 2025-02-25 14:00:56.957337+05:30.
Key (c)=(4); existing local tuple (4, 4, 4); remote tuple (2, 3, 4).


Addressed comments and attached the v2 patch.

--
Thanks,
Nisha


v2-0001-Implement-the-conflict-detection-for-multiple_uni.patch
Description: Binary data


Re: SQL Property Graph Queries (SQL/PGQ)

2025-02-25 Thread Vladlen Popolitov

Hi!

I would like to ask about CREATE PROPERTY GRAPH. From my point of view it
performs a very strict check of types, including a check of the typmod.

Example:

CREATE TABLE userslist (
  user_id INT primary key,
  name   VARCHAR(40)
);

CREATE TABLE groupslist (
  group_id INT primary key,
  name   VARCHAR(30)
);

CREATE TABLE memberslist (
 member_id INT primary key,
 user_id INT,
 group_id INT,
 CONSTRAINT members_users_fk1 FOREIGN KEY (user_id) REFERENCES userslist 
(user_id),
 CONSTRAINT members_groups_fk1 FOREIGN KEY (group_id) REFERENCES 
groupslist (group_id)

);

CREATE PROPERTY GRAPH members_pg
  VERTEX TABLES (
    userslist
      KEY (user_id)
      LABEL luser
      PROPERTIES ALL COLUMNS,
    groupslist
      KEY (group_id)
      LABEL lgroup
      PROPERTIES ALL COLUMNS
  )
  EDGE TABLES (
    memberslist
      KEY (member_id)
      SOURCE KEY (user_id) REFERENCES userslist (user_id)
      DESTINATION KEY (group_id) REFERENCES groupslist (group_id)
      LABEL lmember
      PROPERTIES ALL COLUMNS
  );

The last command fails with an error:
ERROR:  property "name" data type modifier mismatch: 19 vs. 14
DETAIL:  In a property graph, a property of the same name has to have
the same type modifier in each label.

Labels luser and lgroup both have a property "name", originating from tables
userslist and groupslist with types VARCHAR(40) and VARCHAR(30). The current
implementation treats them as fields with different types. I think they
should not be treated as different types. A typmod is an additional
constraint; it does not create another type.
The documentation says: "modifiers, that is optional constraints attached
to a type declaration, such as char(5) or numeric(30,2)".

Operations on values using different typmods do not require a cast; they
are treated as the same type.

Also I would add that Oracle executes the code above without error.

I propose to exclude the typmod check from the type-equivalence check in
src/backend/commands/propgraphcmds.c.

I suppose there is some other reason for this check (not a requirement of
the SQL:2023 standard, but an internal dependency), but I have not figured
out that reason.

--
Best regards,

Vladlen Popolitov.




Re: Add an option to skip loading missing publication to avoid logical replication failure

2025-02-25 Thread vignesh C
On Tue, 18 Feb 2025 at 16:53, Amit Kapila  wrote:
>
> On Tue, Feb 18, 2025 at 2:24 PM vignesh C  wrote:
> >
> > On Fri, 14 Feb 2025 at 15:36, Amit Kapila  wrote:
> > >
> > > Did you try to measure the performance impact of this change? We can
> > > try a few cases where DDL and DMLs are involved, missing publication
> > > (drop publication and recreate after a varying number of records to
> > > check the impact).
> >
> > Since we don't have an exact scenario to compare with the patch
> > (because, in the current HEAD, when the publication is missing, an
> > error is thrown and the walsender/worker restarts), I compared the
> > positive case, where records are successfully replicated to the
> > subscriber, as shown below. For the scenario with the patch, I ran the
> > same test, where the publication is dropped before the insert,
> > allowing the walsender to check whether the publication is present.
> > The test results, which represent the median of 7 runs with the
> > execution time in milliseconds, are provided below:
> >
> > Branch/records |  100   |  1000  |  10000  | 100000 | 1000000
> > Head           |  1.214 |  2.548 | 10.823  |  90.3  | 951.833
> > Patch          |  1.215 | 2.5485 | 10.8545 |  90.94 | 955.134
> > % diff         |  0.082 |  0.020 |  0.291  |  0.704 |   0.347
> >
> > I noticed that the overhead of the patch in these test runs is
> > negligible. The scripts used for execution are attached.
> >
>
> You have used the synchronous_standby_name to evaluate the performance
> which covers other parts of replication than the logical decoding. It
> would be better to test using pg_recvlogical.

Here are the test runs with pg_recvlogical. The test results, which
represent the median of 10 runs with the execution time in milliseconds,
are provided below:

Branch/records |  100   |  1000  | 10000 | 100000 | 1000000
Head           |  9.95  | 15.26  | 62.62 | 536.57 | 8480.83
Patch          |  9.218 | 10.32  | 23.05 | 143.83 | 4852.43
% diff         |  7.356 | 32.38  | 63.19 | 73.193 |  42.783

We observe that test execution with the patch performs between 7.35
percent and 73.19 percent better. This is because, in HEAD, after loading
the publication and verifying that it is valid, the walsender must
continue processing to output the change. In contrast, with the patch,
outputting the change is skipped since the publication does not exist.

The attached file contains the script that was used for testing. The
NUM_RECORDS count should be changed accordingly for each of the tests,
and when running the test with the patch, uncomment the drop publication
command.

Regards,
Vignesh
#!/bin/bash

#

SLOT_NAME=test
PLUGIN_NAME=pgoutput
NUM_RECORDS=100
LOOP=10

#

for i in `seq 1 $LOOP`
do
# Cleanup previous result

pg_ctl stop -D data
rm -rf data logfile

# Initialize an instance

initdb -D data -U postgres -c wal_level=logical

# Start the instance
pg_ctl -D data -l logfile start

# Create a table
psql -U postgres -c "CREATE TABLE foo (id int);"
psql -U postgres -c "CREATE publication pub1 for table foo;"
#psql -U postgres -c "drop publication pub1;"

# Create a replication slot
psql -U postgres -c "SELECT * FROM pg_create_logical_replication_slot('$SLOT_NAME', '$PLUGIN_NAME')"

# Insert tuples (this transaction will be decoded)
psql -U postgres -c "INSERT INTO foo VALUES (generate_series(1, $NUM_RECORDS))"

# Confirm current WAL location
WAL_POS=$(psql -qtAX -U postgres -c "SELECT * FROM pg_current_wal_lsn()")


t1=$(($(date +%s%N)/1000))
echo $t1 > run_${i}.dat
# Run pg_recvlogical till the current WAL location
(time pg_recvlogical -d postgres -U postgres --start -S $SLOT_NAME -E $WAL_POS -f - -o publication_names='pub1' -o proto_version=4 ) &>> run_${i}.dat
t2=$(($(date +%s%N)/1000))
echo $t2 >> run_${i}.dat
t3=$((t2-t1))
echo "execution time=$t3" >> run_${i}.dat
done



Re: Fix logging for invalid recovery timeline

2025-02-25 Thread Benoit Lobréau

On 2/24/25 11:33 PM, Michael Paquier wrote:
> On Mon, Feb 24, 2025 at 05:30:35PM +, David Steele wrote:
>> +/* translator: %s is a backup_label or
>> pg_control file */
>
> See for example PostmasterMain() with the "/* translator: %s is a
> configuration file */".

Thank you Michael and David.
I never paid attention to those...

--
Benoit Lobréau
Consultant
http://dalibo.com
From 9a12cad29afb86f2277bf76bc53757702715afd5 Mon Sep 17 00:00:00 2001
From: benoit 
Date: Fri, 21 Feb 2025 14:09:56 +0100
Subject: [PATCH] Adjust logging for invalid recovery timeline

The original message doesn't mention where the checkpoint was read from.
It refers to the "Latest checkpoint"; that terminology is used for some
records from the control file (specifically, in the pg_controldata output).
The backup label uses "Checkpoint location".
---
 src/backend/access/transam/xlogrecovery.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 915cb330295..65335e542ad 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -849,7 +849,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		ereport(FATAL,
 (errmsg("requested timeline %u is not a child of this server's history",
 		recoveryTargetTLI),
- errdetail("Latest checkpoint is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
+		/* translator: %s is a backup_label or pg_control file */
+ errdetail("Latest checkpoint in %s is at %X/%X on timeline %u, but in the history of the requested timeline, the server forked off from that timeline at %X/%X.",
+		   haveBackupLabel ? "backup_label" : "pg_control",
 		   LSN_FORMAT_ARGS(CheckPointLoc),
 		   CheckPointTLI,
 		   LSN_FORMAT_ARGS(switchpoint))));
-- 
2.48.1



Re: long-standing data loss bug in initial sync of logical replication

2025-02-25 Thread Benoit Lobréau

Hi,

After reading the thread and doing a bit of testing, the problem seems 
significant and is still present. The fact that it's probably not well 
known makes it more concerning, in my opinion. I was wondering what 
could be done to help move this topic forward (given my limited abilities)?


--
Benoit Lobréau
Consultant
http://dalibo.com





Re: SQL Property Graph Queries (SQL/PGQ)

2025-02-25 Thread Ashutosh Bapat
On Tue, Feb 25, 2025 at 3:17 PM Vladlen Popolitov
 wrote:
>
> Hi!
>
>   I would ask about CREATE GRAPH PROPERTY. From my point of view it makes
> very strong
> check of types including check of typmod.
>
>   Example:
>
> CREATE TABLE userslist (
>user_id INT primary key,
>name   VARCHAR(40)
> );
>
> CREATE TABLE groupslist (
>group_id INT primary key,
>name   VARCHAR(30)
> );
>
> CREATE TABLE memberslist (
>   member_id INT primary key,
>   user_id INT,
>   group_id INT,
>   CONSTRAINT members_users_fk1 FOREIGN KEY (user_id) REFERENCES userslist
> (user_id),
>   CONSTRAINT members_groups_fk1 FOREIGN KEY (group_id) REFERENCES
> groupslist (group_id)
> );
>
> CREATE PROPERTY GRAPH members_pg
>VERTEX TABLES (
>  userslist
>KEY (user_id)
>LABEL luser
>PROPERTIES ALL COLUMNS,
>  groupslist
>KEY (group_id)
>LABEL lgroup
>PROPERTIES ALL COLUMNS
>   )
>edge TABLES (
>  memberslist
>KEY (member_id)
>SOURCE KEY (user_id) REFERENCES userslist (user_id)
>DESTINATION KEY (group_id) REFERENCES groupslist (group_id)
>LABEL lmember
>PROPERTIES ALL COLUMNS
>   );
>
> Last command fails with error:
> ERROR:  property "name" data type modifier mismatch: 19 vs. 14
> DETAIL:  In a property graph, a property of the same name has to have
> the same type modifier in each label.
>
> Labels luser and lgroup both have property "name" originated from tables
> userslist and groupslist
> with types VARCHAR(40) and VARCHAR(30). Current implementation treats
> them as fields with
> different types. I think, they should not be treated as different types.
> Typmod is additional
> constrain, it does not create another type.
> Documentation includes:"modifiers, that is optional constraints attached
> to a type declaration,
> such as char(5) or numeric(30,2)"
>
> Operations with values using different typmod do not require cast, they
> treated as the
> same type.

When a property is projected in a COLUMNS clause, it will be added to
the SELECT list of the rewritten query, which in turn could be a UNION
query. Having the same typmod for all the properties with the same
name makes it easy to write the SELECT list of the UNION. Otherwise, we
have to find a common typmod for all the properties. This restriction
is not expected to be permanent, but it makes the first set of patches
easy, and they are quite large as is. I think in a later version we
will lift this restriction, as well as the restriction that all
properties with the same name need the same datatype. According to
the SQL standard, they can be of different data types as long as
there's a common type to which they can all be cast. It may be
possible to lift this restriction in this version itself, if we find
enough developer and reviewer bandwidth.
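
To illustrate with the example upthread: projecting name over both vertex
labels rewrites to roughly a UNION over the label tables, which is why a
shared typmod keeps things simple (a sketch of the shape, not the literal
rewritten tree):

    -- With matching typmods the UNION arms line up as-is:
    SELECT name FROM userslist
    UNION ALL
    SELECT name FROM groupslist;

    -- With varchar(40) vs varchar(30), the rewriter would first have to
    -- pick a common typmod and insert casts, e.g.:
    SELECT name::varchar(40) FROM userslist
    UNION ALL
    SELECT name::varchar(40) FROM groupslist;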


-- 
Best Wishes,
Ashutosh Bapat




Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2025-02-25 Thread Ashutosh Bapat
Hi,

On Thu, Feb 20, 2025 at 5:28 PM Ashutosh Bapat
 wrote:
>
> On Tue, Feb 4, 2025 at 4:07 PM Ashutosh Bapat
>  wrote:
> >
> > If we are not interested in saving memory, there is a simpler way to
> > improve planning time by adding a hash table per equivalence class to
> > store the derived clauses, instead of a linked list, when the number
> > of derived clauses is higher than a threshold (say 32, same as the
> > threshold for join_rel_list). Maybe that approach will yield stable
> > planning time.
> >

I implemented the above idea in the attached patches. I also added the
following query, inspired by Alvaro's query, to summarise the
results:
with master_avgs as
(select code_tag, num_parts, num_joins, pwj, avg(planning_time_ms)
avg_pt, stddev(planning_time_ms) stddev_pt
from msmts where code_tag = 'master'
group by code_tag, num_parts, num_joins, pwj),
patched_avgs as
(select code_tag, num_parts, num_joins, pwj, avg(planning_time_ms)
avg_pt, stddev(planning_time_ms) stddev_pt
from msmts where code_tag = 'patched'
group by code_tag, num_parts, num_joins, pwj)
select num_parts,
num_joins,
format('s=%s%% md=%s%% pd=%s%%',
((m.avg_pt - p.avg_pt)/m.avg_pt * 100)::numeric(6, 2),
(m.stddev_pt/m.avg_pt * 100)::numeric(6, 2),
(p.stddev_pt/p.avg_pt * 100)::numeric(6, 2))
from master_avgs m join patched_avgs p using (num_parts, num_joins,
pwj) where not pwj order by 1, 2, 3;
\crosstabview 1 2 3

"not pwj" in the last line should be changed to "pwj" to get results with
enable_partitionwise_join = true.

With the attached patches, I observe the following results:

With PWJ disabled
 num_parts | 2                             | 3                            | 4                           | 5
-----------+-------------------------------+------------------------------+-----------------------------+-----------------------------
         0 | s=-4.44% md=17.91% pd=23.05%  | s=-0.83% md=11.10% pd=19.58% | s=0.87% md=4.04% pd=7.91%   | s=-35.24% md=7.63% pd=9.69%
        10 | s=30.13% md=118.18% pd=37.44% | s=-3.49% md=0.58% pd=0.49%   | s=-0.83% md=0.29% pd=0.35%  | s=-0.24% md=0.35% pd=0.32%
       100 | s=1.94% md=13.19% pd=4.08%    | s=-0.27% md=0.18% pd=0.44%   | s=7.04% md=3.05% pd=3.11%   | s=12.75% md=1.69% pd=0.81%
       500 | s=4.39% md=1.71% pd=1.33%     | s=10.17% md=1.28% pd=1.90%   | s=23.04% md=0.24% pd=0.58%  | s=30.87% md=0.30% pd=1.11%
      1000 | s=4.27% md=1.21% pd=1.97%     | s=13.97% md=0.44% pd=0.79%   | s=24.05% md=0.63% pd=1.02%  | s=30.77% md=0.77% pd=0.17%


Each cell is a triple (s, md, pd) where s is the improvement in planning
time with the patches in % as compared to master (higher is better), md is
the standard deviation as % of the average planning time on master, and pd
is the standard deviation as % of the average planning time with the
patches.

With PWJ enabled
 num_parts | 2                            | 3                            | 4                           | 5
-----------+------------------------------+------------------------------+-----------------------------+------------------------------
         0 | s=-94.25% md=6.98% pd=56.03% | s=44.10% md=141.13% pd=9.32% | s=42.71% md=46.00% pd=6.55% | s=-26.12% md=6.72% pd=15.20%
        10 | s=-25.89% md=4.29% pd=63.75% | s=-1.34% md=3.15% pd=3.26%   | s=0.31% md=4.13% pd=4.34%   | s=-1.34% md=3.10% pd=6.73%
       100 | s=-2.83% md=0.94% pd=1.31%   | s=-2.17% md=4.57% pd=4.41%   | s=0.98% md=1.59% pd=1.81%   | s=1.87% md=1.10% pd=0.79%
       500 | s=1.57% md=3.01% pd=1.70%    | s=6.99% md=1.58% pd=1.68%    | s=11.11% md=0.24% pd=0.62%  | s=11.65% md=0.18% pd=0.90%
      1000 | s=3.59% md=0.98% pd=1.78%    | s=10.83% md=0.88% pd=0.46%   | s=15.62% md=0.46% pd=0.13%  | s=16.38% md=0.63% pd=0.29%

The same numbers measured for the previous set of patches [1], which improve
both memory consumption and planning time:

With PWJ disabled
 num_parts | 2                             | 3                            | 4                            | 5
-----------+-------------------------------+------------------------------+------------------------------+--------------------------------
         0 | s=4.68% md=18.17% pd=22.09%   | s=-2.54% md=12.00% pd=13.81% | s=-2.02% md=3.84% pd=4.43%   | s=-69.14% md=11.06% pd=126.87%
        10 | s=-24.85% md=20.42% pd=35.69% | s=-4.31% md=0.73% pd=1.53%   | s=-14.97% md=0.32% pd=31.90% | s=-0.57% md=0.79% pd=0.50%
       100 | s=0.27% md=4.69% pd=1.55%     | s=4.16% md=0.29% pd=0.18%    | s=11.76% md=0.85% pd=0.49%   | s=15.76% md=1.64% pd=2.32%
       500 | s=0.54% md=1.88% pd=1.81%     | s=9.36% md=1.17% pd=0.87%    | s=21.45% md=0.74% pd=0.88%   | s=30.47% md=0.17% pd=1.17%
      1000 | s=3.22% md=1.36% pd=0.99%     | s=14.74% md=0.86% pd=0.44%   | s=24.50% md=0.36% pd=0.31%   | s=27.97% md=0.27% pd=0.25%

With PWJ enabled
 num_parts |  2  | 3
| 4  |  5
---+-+++-

Re: psql \dh: List High-Level (Root) Tables and Indexes

2025-02-25 Thread Sadeq Dousti
Thanks Greg and others for the feedback!

Please find attached the patch for implementing \dN (including \dNt, \dNi,
\dNit).

Best Regards,
Sadeq Dousti
From d8fde4b05eee95089548384c07b59304f2fecc1c Mon Sep 17 00:00:00 2001
From: Sadeq Dousti <3616518+msdou...@users.noreply.github.com>
Date: Tue, 25 Feb 2025 23:51:16 +0100
Subject: [PATCH v3] Add metacommand dN to psql

---
 doc/src/sgml/ref/psql-ref.sgml | 10 ++--
 src/bin/psql/command.c |  1 +
 src/bin/psql/describe.c| 49 ++
 src/bin/psql/help.c|  1 +
 src/bin/psql/tab-complete.in.c | 43 +++-
 src/test/regress/expected/psql.out | 81 ++
 src/test/regress/sql/psql.sql  | 15 ++
 7 files changed, 184 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cedccc14129..ea124e5a4d5 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1635,6 +1635,7 @@ SELECT $1 \parse stmt1
   
 \dE[Sx+] [ pattern ]
 \di[Sx+] [ pattern ]
+\dN[Sx+] [ pattern ]
 \dm[Sx+] [ pattern ]
 \ds[Sx+] [ pattern ]
 \dt[Sx+] [ pattern ]
@@ -1643,15 +1644,16 @@ SELECT $1 \parse stmt1
 
 
 In this group of commands, the letters E,
-i, m, s,
-t, and v
-stand for foreign table, index, materialized view,
+i, m, N,
+, s, t, and v
+stand for foreign table, index, materialized view, no partitions,
 sequence, table, and view,
 respectively.
 You can specify any or all of
 these letters, in any order, to obtain a listing of objects
 of these types.  For example, \dti lists
-tables and indexes.
+tables and indexes, and \dNti lists
+tables and indexes that are not partitions of any other relation.
 If x is appended to the command name, the results
 are displayed in expanded mode.
 If + is
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 0f27bf7a91f..cf65df42459 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1163,6 +1163,7 @@ exec_command_d(PsqlScanState scan_state, bool active_branch, const char *cmd)
 			case 'i':
 			case 's':
 			case 'E':
+			case 'N':
 success = listTables(&cmd[1], pattern, show_verbose, show_system);
 break;
 			case 'r':
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index e6cf468ac9e..ba1d6c09bc9 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4016,6 +4016,7 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	bool		showMatViews = strchr(tabtypes, 'm') != NULL;
 	bool		showSeq = strchr(tabtypes, 's') != NULL;
 	bool		showForeign = strchr(tabtypes, 'E') != NULL;
+	bool		showNoPartitions = strchr(tabtypes, 'N') != NULL;
 
 	int			ntypes;
 	PQExpBufferData buf;
@@ -4023,13 +4024,33 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	printQueryOpt myopt = pset.popt;
 	int			cols_so_far;
 	bool		translate_columns[] = {false, false, true, false, false, false, false, false, false};
+	char	   *no_partition_description = showNoPartitions ? " non-partition" : "";
+
+	/*
+	 * Note: Declarative table partitioning is only supported as of Pg 10.0.
+	 */
+	if (showNoPartitions && pset.sversion < 10)
+	{
+		char		sverbuf[32];
+
+		pg_log_error("The server (version %s) does not support declarative table partitioning.",
+	 formatPGVersionNumber(pset.sversion, false,
+		   sverbuf, sizeof(sverbuf)));
+		return true;
+	}
 
 	/* Count the number of explicitly-requested relation types */
 	ntypes = showTables + showIndexes + showViews + showMatViews +
 		showSeq + showForeign;
-	/* If none, we default to \dtvmsE (but see also command.c) */
+
 	if (ntypes == 0)
-		showTables = showViews = showMatViews = showSeq = showForeign = true;
+	{
+		if (showNoPartitions)
+			showTables = showIndexes = true;
+		else
+			/* If none, we default to \dtvmsE (but see also command.c) */
+			showTables = showViews = showMatViews = showSeq = showForeign = true;
+	}
 
 	initPQExpBuffer(&buf);
 
@@ -4155,6 +4176,9 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 			 "  AND n.nspname !~ '^pg_toast'\n"
 			 "  AND n.nspname <> 'information_schema'\n");
 
+	if (showNoPartitions)
+		appendPQExpBufferStr(&buf, " AND NOT c.relispartition\n");
+
 	if (!validateSQLNamePattern(&buf, pattern, true, false,
 "n.nspname", "c.relname", NULL,
 "pg_catalog.pg_table_is_visible(c.oid)",
@@ -4181,14 +4205,14 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 		if (pattern)
 		{
 			if (ntypes != 1)
-pg_log_error("Did not find any relations named \"%s\".",
-			 pattern);
+pg_log_error("Did not find any%s relations named \"%s\".",
+			 no_partition_de

Re: Make COPY format extendable: Extract COPY TO format implementations

2025-02-25 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Tue, 25 Feb 2025 17:14:43 -0800,
  Masahiko Sawada  wrote:

> I've attached updated patches.

Thanks.

I found one more missing trailing ".":

0002:

> --- a/src/backend/commands/copyfrom.c
> +++ b/src/backend/commands/copyfrom.c

> @@ -106,6 +106,145 @@ typedef struct CopyMultiInsertInfo

> +/*
> + * Built-in format-specific routines. One-row callbacks are defined in
> + * copyfromparse.c
> + */

copyfromparse.c -> copyfromparse.c.


Could you push them?


Thanks,
-- 
kou




Re: Statistics Import and Export

2025-02-25 Thread Jeff Davis
On Mon, 2025-02-24 at 09:54 -0500, Andres Freund wrote:
> Have you compared performance of with/without stats after these
> optimizations?

On an unoptimized build with asserts enabled, dumping the regression
database:

  --no-statistics: 1.0s
  master:  3.6s
  v3j-0001:3.0s
  v3j-0002:1.7s

I plan to commit the patches soon.

Regards,
Jeff Davis

From d617fb142158e0ca964e5bc8bb3351d993de6062 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Fri, 21 Feb 2025 23:31:04 -0500
Subject: [PATCH v3j 1/2] Avoid unnecessary relation stats query in pg_dump.

The few fields we need can be easily collected in getTables() and
getIndexes() and stored in RelStatsInfo.

Co-authored-by: Corey Huinker 
Co-authored-by: Jeff Davis 
Discussion: https://postgr.es/m/CADkLM=f0a43atd88xw4xcfayef25g-7htrhx_whv40hyocs...@mail.gmail.com
---
 src/bin/pg_dump/pg_dump.c | 145 --
 src/bin/pg_dump/pg_dump.h |   5 +-
 2 files changed, 64 insertions(+), 86 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index afd79287177..a1823914656 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -56,6 +56,7 @@
 #include "common/connect.h"
 #include "common/int.h"
 #include "common/relpath.h"
+#include "common/shortest_dec.h"
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/option_utils.h"
@@ -524,6 +525,9 @@ main(int argc, char **argv)
 	pg_logging_set_level(PG_LOG_WARNING);
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_dump"));
 
+	/* ensure that locale does not affect floating point interpretation */
+	setlocale(LC_NUMERIC, "C");
+
 	/*
 	 * Initialize what we need for parallel execution, especially for thread
 	 * support on Windows.
@@ -6814,7 +6818,8 @@ getFuncs(Archive *fout)
  *
  */
 static RelStatsInfo *
-getRelationStatistics(Archive *fout, DumpableObject *rel, char relkind)
+getRelationStatistics(Archive *fout, DumpableObject *rel, int32 relpages,
+	  float reltuples, int32 relallvisible, char relkind)
 {
 	if (!fout->dopt->dumpStatistics)
 		return NULL;
@@ -6839,6 +6844,9 @@ getRelationStatistics(Archive *fout, DumpableObject *rel, char relkind)
 		dobj->components |= DUMP_COMPONENT_STATISTICS;
 		dobj->name = pg_strdup(rel->name);
 		dobj->namespace = rel->namespace;
+		info->relpages = relpages;
+		info->reltuples = reltuples;
+		info->relallvisible = relallvisible;
 		info->relkind = relkind;
 		info->postponed_def = false;
 
@@ -6874,6 +6882,8 @@ getTables(Archive *fout, int *numTables)
 	int			i_relhasindex;
 	int			i_relhasrules;
 	int			i_relpages;
+	int			i_reltuples;
+	int			i_relallvisible;
 	int			i_toastpages;
 	int			i_owning_tab;
 	int			i_owning_col;
@@ -6924,7 +6934,7 @@ getTables(Archive *fout, int *numTables)
 		 "c.relowner, "
 		 "c.relchecks, "
 		 "c.relhasindex, c.relhasrules, c.relpages, "
-		 "c.relhastriggers, "
+		 "c.reltuples, c.relallvisible, c.relhastriggers, "
 		 "c.relpersistence, "
 		 "c.reloftype, "
 		 "c.relacl, "
@@ -7088,6 +7098,8 @@ getTables(Archive *fout, int *numTables)
 	i_relhasindex = PQfnumber(res, "relhasindex");
 	i_relhasrules = PQfnumber(res, "relhasrules");
 	i_relpages = PQfnumber(res, "relpages");
+	i_reltuples = PQfnumber(res, "reltuples");
+	i_relallvisible = PQfnumber(res, "relallvisible");
 	i_toastpages = PQfnumber(res, "toastpages");
 	i_owning_tab = PQfnumber(res, "owning_tab");
 	i_owning_col = PQfnumber(res, "owning_col");
@@ -7134,6 +7146,9 @@ getTables(Archive *fout, int *numTables)
 
 	for (i = 0; i < ntups; i++)
 	{
+		float		reltuples = strtof(PQgetvalue(res, i, i_reltuples), NULL);
+		int32		relallvisible = atoi(PQgetvalue(res, i, i_relallvisible));
+
 		tblinfo[i].dobj.objType = DO_TABLE;
 		tblinfo[i].dobj.catId.tableoid = atooid(PQgetvalue(res, i, i_reltableoid));
 		tblinfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_reloid));
@@ -7233,7 +7248,8 @@ getTables(Archive *fout, int *numTables)
 
 		/* Add statistics */
 		if (tblinfo[i].interesting)
-			getRelationStatistics(fout, &tblinfo[i].dobj, tblinfo[i].relkind);
+			getRelationStatistics(fout, &tblinfo[i].dobj, tblinfo[i].relpages,
+  reltuples, relallvisible, tblinfo[i].relkind);
 
 		/*
 		 * Read-lock target tables to make sure they aren't DROPPED or altered
@@ -7499,6 +7515,9 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 i_oid,
 i_indrelid,
 i_indexname,
+i_relpages,
+i_reltuples,
+i_relallvisible,
 i_parentidx,
 i_indexdef,
 i_indnkeyatts,
@@ -7552,6 +7571,7 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 	appendPQExpBufferStr(query,
 		 "SELECT t.tableoid, t.oid, i.indrelid, "
 		 "t.relname AS indexname, "
+		 "t.relpages, t.reltuples, t.relallvisible, "
 		 "pg_catalog.pg_get_indexdef(i.indexrelid) AS indexdef, "
 		 "i.indkey, i.indisclustered, "
 		 "c.contype, c.conname, "
@@ -7659,6 +7679,9 @@ getIndexes(Archive

Re: bug in stored generated column over domain with constraints.

2025-02-25 Thread jian he
hi.

Comments refined and minor aesthetic adjustments made.
From 22ef7ce384de3098fc19ae0bb9bc9777b269b8ec Mon Sep 17 00:00:00 2001
From: jian he 
Date: Wed, 26 Feb 2025 11:29:04 +0800
Subject: [PATCH v2 1/1] fix default insertion for stored generated column with
 domain over constraints.

create domain d3 as int check (value is not null);
create table t0(b int, a d3 GENERATED ALWAYS as (b + 11) stored);

insert into t0 values (1, default);
ERROR:  value for domain d3 violates check constraint "d3_check"

explain(costs off, verbose) insert into t0 values (1, default);
  QUERY PLAN
---
 Insert on public.t0
   ->  Result
 Output: 1, NULL::integer

We first evaluate the Result node. Domain coercion in the Result node may lead
to domain constraint violations. These domain constraints should not be checked in
ExecResult, as ExecComputeStoredGenerated will handle them.

context: https://git.postgresql.org/cgit/postgresql.git/commit/?id=0da39aa7667b06e16189d318f7850d559d446d52
discussion: https://postgr.es/m/CACJufxG59tip2+9h=rev-ykofjt0cbspvchhi0rtij8babb...@mail.gmail.com
---
 src/backend/executor/nodeModifyTable.c| 36 +
 src/backend/optimizer/prep/preptlist.c|  3 +-
 src/backend/parser/parse_coerce.c |  5 ++-
 src/backend/rewrite/rewriteHandler.c  | 39 ---
 src/backend/rewrite/rewriteManip.c|  3 +-
 src/include/parser/parse_coerce.h |  2 +-
 .../regress/expected/generated_stored.out | 32 +++
 src/test/regress/sql/generated_stored.sql | 15 +++
 8 files changed, 118 insertions(+), 17 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index b0fe50075ad..c467c735334 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -67,6 +67,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
@@ -216,13 +217,34 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
 		{
 			/* Normal case: demand type match */
 			if (exprType((Node *) tle->expr) != attr->atttypid)
-ereport(ERROR,
-		(errcode(ERRCODE_DATATYPE_MISMATCH),
-		 errmsg("table row type and query-specified row type do not match"),
-		 errdetail("Table has type %s at ordinal position %d, but query expects %s.",
-   format_type_be(attr->atttypid),
-   attno,
-   format_type_be(exprType((Node *) tle->expr)))));
+			{
+if (attr->attgenerated == '\0')
+{
+	ereport(ERROR,
+			errcode(ERRCODE_DATATYPE_MISMATCH),
+			errmsg("table row type and query-specified row type do not match"),
+			errdetail("Table has type %s at ordinal position %d, but query expects %s.",
+	   format_type_be(attr->atttypid),
+	   attno,
+	   format_type_be(exprType((Node *) tle->expr))));
+}
+else
+{
+	/* see rewriteTargetListIU comments for domain over generated column */
+	Oid			baseTypeId;
+	int32		baseTypeMod = attr->atttypmod;
+	baseTypeId = getBaseTypeAndTypmod(attr->atttypid, &baseTypeMod);
+
+	if (exprType((Node *) tle->expr) != baseTypeId)
+		ereport(ERROR,
+errcode(ERRCODE_DATATYPE_MISMATCH),
+errmsg("table row type and query-specified row type do not match"),
+errdetail("Table has type %s at ordinal position %d, but query expects %s.",
+		   format_type_be(attr->atttypid),
+		   attno,
+		   format_type_be(exprType((Node *) tle->expr))));
+}
+			}
 		}
 		else
 		{
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index c2a77503d4b..769ff600eaf 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -440,7 +440,8 @@ expand_insert_targetlist(PlannerInfo *root, List *tlist, Relation rel)
  att_tup->atttypmod,
  att_tup->attcollation,
  att_tup->attlen,
- att_tup->attbyval);
+ att_tup->attbyval,
+ true);
 /* Must run expression preprocessing on any non-const nodes */
 if (!IsA(new_expr, Const))
 	new_expr = eval_const_expressions(root, new_expr);
diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 0b5b81c7f27..e2db53690b6 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -1266,10 +1266,11 @@ coerce_to_specific_type(ParseState *pstate, Node *node,
  *		Build a NULL constant, then wrap it in CoerceToDomain
  *		if the desired type is a domain type.  This allows any
  *		NOT NULL domain constraint to be enforced at runtime.
+ *		If need_coerce is false, there is no need to wrap it in CoerceToDomain.
  */
 Node *
 coerce_null_to_domain(Oid typid, int32 typmod, Oid collation,
-	  int typlen, bool 

Re: Restrict copying of invalidated replication slots

2025-02-25 Thread Amit Kapila
On Tue, Feb 25, 2025 at 11:21 PM Masahiko Sawada  wrote:
>
> On Tue, Feb 25, 2025 at 2:36 AM Amit Kapila  wrote:
> >
> > >
> > > Scenario-3: the source gets invalidated after creating the copied slot
> > > (i.e. after create_logical/physical_replication_slot()). In this case,
> > > since the newly copied slot has the same restart_lsn as the source
> > > slot, both slots are invalidated.
> > >
> >
> > Which part of the code will cover Scenario-3? Shouldn't we give ERROR
> > for Scenario-3 as well?
>
> In scenario-3, the backend process executing
> pg_copy_logical/physical_replication_slot() already holds the new
> copied slot and its restart_lsn is the same or older than the source
> slot's restart_lsn. Therefore, if the source slot is invalidated at
> that timing, the copied slot is invalidated too, resulting in an error
> by the backend.
>

AFAICU, InvalidateObsoleteReplicationSlots() is not serialized with
this operation. So, isn't it possible that the source slot exists at
a later position in ReplicationSlotCtl->replication_slots and the
loop traversing the slots is already ahead of the point where the
newly copied slot is created? If so, the newly created slot won't be
marked as invalid whereas the source slot will be. I agree that even
in such a case, the newly created slot will also be marked as invalid
at a later point.
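For illustration, the window comes from a plain array walk (a
schematic sketch only; the real function also handles locking and
loop restarts, and invalidate_if_obsolete() is a hypothetical
stand-in):

/* Schematic sketch of the slot traversal; not the actual function. */
for (int i = 0; i < max_replication_slots; i++)
{
    ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

    if (!s->in_use)
        continue;

    /*
     * A copy that lands in a slot at an index the loop has already
     * passed is not examined in this pass, even if its source slot
     * gets invalidated right now.
     */
    invalidate_if_obsolete(s);  /* hypothetical helper */
}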

-- 
With Regards,
Amit Kapila.




Postmaster crashed during start

2025-02-25 Thread Srinath Reddy
Hi,
When we kill the postmaster using kill -9 and immediately start it again, the
new postmaster fails to start with

> FATAL:  pre-existing shared memory block (key 2495405, ID 360501) is still
> in use

HINT:  Terminate any old server processes associated with data directory

*We can reproduce this*

> kill -9 $(head -n 1 $PGDATA/postmaster.pid) & ./pg_ctl -D $PGDATA -l
> $PGDATA/logfile start


*Reason for the failure:*
When we kill the postmaster with SIGKILL, no cleanup happens, so the
shared memory segment is not explicitly detached. The kernel detaches
the segment from each attached process as that process dies, after
which shm_nattch drops to 0. But if we start a new postmaster before
the kernel has finished that cleanup, then while creating
postmaster.pid via CreateDataDirLockFile() the new postmaster checks
whether the previous shmem segment is still in use, relying on
shmStat.shm_nattch == 0 ? SHMSTATE_UNATTACHED : SHMSTATE_ATTACHED.
Since shm_nattch is still nonzero at that point, the new postmaster
concludes the segment is in use and refuses to start.
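For reference, here is a minimal, self-contained sketch of the kind of
check involved (illustrative only; the server's real logic lives in
sysv_shmem.c and is more involved):

#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Sketch only: report whether any process is still attached to a
 * System V shared memory segment.  Orphaned children of a SIGKILLed
 * postmaster keep shm_nattch nonzero for a short while.
 */
static int
segment_still_attached(int shmid)
{
    struct shmid_ds shmStat;

    if (shmctl(shmid, IPC_STAT, &shmStat) < 0)
        return -1;              /* segment gone, or not ours to inspect */

    return shmStat.shm_nattch > 0;
}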

Should we even consider this a bug, or should we leave it as-is? It
depends on how busy the kernel is: it simply has not yet had time to
clean up after the dead postmaster's processes, so the segment is not
yet fully detached and shm_nattch not yet decremented to zero.

thoughts?

Thanks and Regards
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com 


Re: Statistics Import and Export

2025-02-25 Thread Tom Lane
Corey Huinker  writes:
> We can still convert the "EXECUTE getAttributeStats" call to a Params call,
> but that involves creating an ExecuteSqlQueryParams(), which starts to
> snowball in the changes required.

Yeah, let's leave that for some other day.  It's not really apparent
that it'd buy us much performance-wise, though maybe the code would
net out cleaner.

To my mind the next task is to get the buildfarm green again by
fixing the expression-index-stats problem.  I can have a go at
that once Jeff pushes these patches, unless one of you are already
on it?

regards, tom lane




Re: long-standing data loss bug in initial sync of logical replication

2025-02-25 Thread Shlok Kyal
On Wed, 11 Dec 2024 at 12:37, Masahiko Sawada  wrote:
>
> On Tue, Oct 8, 2024 at 2:51 AM Shlok Kyal  wrote:
> >
> > On Wed, 31 Jul 2024 at 03:27, Masahiko Sawada  wrote:
> > >
> > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C  wrote:
> > > > >
> > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C  
> > > > > > wrote:
> > > > > >
> > > > > > BTW, I noticed that we don't take any table-level locks for Create
> > > > > > Publication .. For ALL TABLES (and Drop Publication). Can that 
> > > > > > create
> > > > > > a similar problem? I haven't tested so not sure but even if there 
> > > > > > is a
> > > > > > problem for the Create case, it should lead to some ERROR like 
> > > > > > missing
> > > > > > publication.
> > > > >
> > > > > I tested these scenarios, and as you expected, it throws an error for
> > > > > the create publication case:
> > > > > 2024-07-17 14:50:01.145 IST [481526] 481526  ERROR:  could not receive
> > > > > data from WAL stream: ERROR:  publication "pub1" does not exist
> > > > > CONTEXT:  slot "sub1", output plugin "pgoutput", in the change
> > > > > callback, associated LSN 0/1510CD8
> > > > > 2024-07-17 14:50:01.147 IST [481450] 481450  LOG:  background worker
> > > > > "logical replication apply worker" (PID 481526) exited with exit code
> > > > > 1
> > > > >
> > > > > The steps for this process are as follows:
> > > > > 1) Create tables in both the publisher and subscriber.
> > > > > 2) On the publisher: Create a replication slot.
> > > > > 3) On the subscriber: Create a subscription using the slot created by
> > > > > the publisher.
> > > > > 4) On the publisher:
> > > > > 4.a) Session 1: BEGIN; INSERT INTO T1;
> > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES
> > > > > 4.c) Session 1: COMMIT;
> > > > >
> > > > > Since we are throwing out a "publication does not exist" error, there
> > > > > is no inconsistency issue here.
> > > > >
> > > > > However, an issue persists with DROP ALL TABLES publication, where
> > > > > data continues to replicate even after the publication is dropped.
> > > > > This happens because the open transaction consumes the invalidation,
> > > > > causing the publications to be revalidated using old snapshot. As a
> > > > > result, both the open transactions and the subsequent transactions are
> > > > > getting replicated.
> > > > >
> > > > > We can reproduce this issue by following these steps in a logical
> > > > > replication setup with an "ALL TABLES" publication:
> > > > > On the publisher:
> > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1);
> > > > > In another session on the publisher:
> > > > > Session 2: DROP PUBLICATION
> > > > > Back in Session 1 on the publisher:
> > > > > COMMIT;
> > > > > Finally, in Session 1 on the publisher:
> > > > > INSERT INTO T1 VALUES (val2);
> > > > >
> > > > > Even after dropping the publication, both val1 and val2 are still
> > > > > being replicated to the subscriber. This means that both the
> > > > > in-progress concurrent transaction and the subsequent transactions are
> > > > > being replicated.
> > > > >
> > > > > I don't think locking all tables is a viable solution in this case, as
> > > > > it would require asking the user to refrain from performing any
> > > > > operations on any of the tables in the database while creating a
> > > > > publication.
> > > > >
> > > >
> > > > Indeed, locking all tables in the database to prevent concurrent DMLs
> > > > for this scenario also looks odd to me. The other alternative
> > > > previously suggested by Andres is to distribute catalog modifying
> > > > transactions to all concurrent in-progress transactions [1] but as
> > > > mentioned this could add an overhead. One possibility to reduce
> > > > overhead is that we selectively distribute invalidations for
> > > > catalogs-related publications but I haven't analyzed the feasibility.
> > > >
> > > > We need more opinions to decide here, so let me summarize the problem
> > > > and solutions discussed. As explained with an example in an email [1],
> > > > the problem related to logical decoding is that it doesn't process
> > > > invalidations corresponding to DDLs for the already in-progress
> > > > transactions. We discussed preventing DMLs in the first place when
> > > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in
> > > > progress. The solution discussed was to acquire
> > > > ShareUpdateExclusiveLock for all the tables being added via such
> > > > commands. Further analysis revealed that the same handling is required
> > > > for ALTER PUBLICATION ... ADD TABLES IN SCHEMA which means locking all
> > > > the tables in the specified schemas. Then DROP PUBLICATION also seems
> > > > to have similar symptoms which means in the worst case (where
> > > > publication is for ALL TABLES) we have to lock all the tables in the
> > 

Re: Postmaster crashed during start

2025-02-25 Thread Tom Lane
Srinath Reddy  writes:
> when we kill postmaster using kill -9 and start immediately it crashes with
>> FATAL:  pre-existing shared memory block (key 2495405, ID 360501) is still
>> in use

"Doctor, it hurts when I do this!"

"So don't do that!"

This is not a supported way of shutting down the postmaster, and it
never will be.  Use SIGINT, or SIGQUIT if you are in a desperate
hurry and are willing to have the next startup take longer.

I think the specific reason you are seeing this is that it takes
nonzero time for the postmaster's orphaned child processes to
notice that the postmaster is gone and terminate.  As long as
any of those children remain, the shared memory block will have
a nonzero reference count.  The new postmaster sees that and
refuses to start, for the very sound reason that it risks
data corruption if it brings up a new set of worker processes
while any of the old ones are still running.

regards, tom lane




Re: Statistics Import and Export

2025-02-25 Thread Jeff Davis
On Tue, 2025-02-25 at 22:40 -0500, Tom Lane wrote:
> To my mind the next task is to get the buildfarm green again by
> fixing the expression-index-stats problem.  I can have a go at
> that once Jeff pushes these patches, unless one of you are already
> on it?

I just committed them.

Regards,
Jeff Davis





Re: Statistics Import and Export

2025-02-25 Thread Corey Huinker
>
> To my mind the next task is to get the buildfarm green again by
> fixing the expression-index-stats problem.  I can have a go at
> that once Jeff pushes these patches, unless one of you are already
> on it?
>

Already on it, but I can step aside if you've got a clearer vision of how
to solve it.

My solution so far is to take all the v11+ (SELECT array_agg...) functions
and put them into a LATERAL, two of them filtered by attstattarget > 0 and
a new one aggregating attnames with no filter.

An alternative would be a new subselect for array_agg(attname) WHERE
in.indexprs IS NOT NULL, thus removing the extra compute for the indexes
that lack an index expression (i.e. most of them), and thus lack settable
stats (at least for now) and wouldn't be affected by the name-jitter issue
anyway.

I'm on the fence about how to handle pg_clear_attribute_stats(), leaning
toward overloaded functions.


Re: Extend postgres_fdw_get_connections to return remote backend pid

2025-02-25 Thread Sagar Shedge
> However, if PGconn is NULL, it seems like postgres_fdw_get_connections()
> wouldn't include that connection in the result. If the connection status
> is not CONNECTION_OK, it looks like the connection would be closed by
> pgfdw_reset_xact_state() before the local backend processes the query
> calling postgres_fdw_get_connections(). So, can remote_backend_pid really
> be NULL?
I agree on this point with the current implementation of postgres_fdw.
A connection (PGconn) can be in different states, such as
CONNECTION_OK, CONNECTION_BAD, CONNECTION_STARTED, etc. Currently,
connect_pg_server() makes sure to create connections with status
CONNECTION_OK, or aborts the current transaction if it fails. All
other intermediate states are handled within the libpq API
libpqsrv_connect_params. So there is no way for the connection
workflow to return a connection with a status other than
CONNECTION_OK, and we are safe here.
There is a case where the connection status can be set to
CONNECTION_BAD if we fail to read a result. But in that case we raise
an error and abort the transaction. As you mentioned,
pgfdw_reset_xact_state gets called in a transaction callback, which
makes sure the connection is closed at the end of the transaction. So
we look safe in the query-execution workflow as well.

The only thing that bothers me is the asynchronous workflow.
postgres_fdw uses libpq, and the library provides asynchronous APIs
[1]. These asynchronous APIs can set the connection status to
CONNECTION_BAD (during pqReadData). Currently, postgres_fdw makes sure
to close connections at the end of the query if something fails. But
suppose in the future we support SQL commands to initiate pipeline
mode and retrieve data asynchronously; couldn't we then end up with a
CONNECTION_BAD status persisting across queries?
   Other states might also occur during (and only during) an
asynchronous connection procedure. These indicate the current stage of
the connection procedure and might be useful to provide feedback to
the user for example.

[1] - 
https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PQCONNECTSTARTPARAMS
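
To make the concern concrete, here is a minimal sketch of the
documented asynchronous connection flow (illustrative only, not
postgres_fdw code); between polls, PQstatus() can legitimately report
intermediate states, and CONNECTION_BAD once an attempt has failed:

#include <libpq-fe.h>

/* Sketch only: asynchronous connection using the documented API. */
static PGconn *
connect_async(const char *conninfo)
{
    PGconn *conn = PQconnectStart(conninfo);

    if (conn == NULL || PQstatus(conn) == CONNECTION_BAD)
        return conn;            /* out of memory, or immediately bad */

    for (;;)
    {
        switch (PQconnectPoll(conn))
        {
            case PGRES_POLLING_OK:
                return conn;    /* PQstatus(conn) == CONNECTION_OK */
            case PGRES_POLLING_FAILED:
                return conn;    /* PQstatus(conn) == CONNECTION_BAD */
            default:
                /* PGRES_POLLING_READING/WRITING: wait on PQsocket(conn) */
                break;
        }
    }
}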





--
Sagar Dilip Shedge,
SDE AWS




Re: Spinlock can be released twice in procsignal.c

2025-02-25 Thread Andrey Borodin



> On 26 Feb 2025, at 00:34, Maksim.Melnikov  wrote:
> 
> In applied patch I removed spinlock release in if clause.

Looks like the oversight in 9d9b9d4. IMO the fix is correct.


Best regards, Andrey Borodin.



Re: Log connection establishment timings

2025-02-25 Thread Melanie Plageman
On Tue, Feb 25, 2025 at 3:23 PM Melanie Plageman
 wrote:
>
> Thanks for doing this! I have implemented your suggestion in attached v3.

I missed an include in the EXEC_BACKEND not defined case. attached v4
is fixed up.

- Melanie
From c493f0be7243bfe965d3493f07982ea1072d89b7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 25 Feb 2025 13:08:48 -0500
Subject: [PATCH v4] Add connection establishment duration logging

Add durations for several key parts of connection establishment when
log_connections is enabled.

For a new incoming connection, starting from when the postmaster gets a
socket from accept() and ending when the forked child backend is first
ready for query, there are multiple steps that could each take longer
than expected due to external factors. Provide visibility into
authentication and fork duration as well as the end-to-end connection
establishment time with logging.

To make this portable, the timestamps captured in the postmaster (socket
creation time, fork initiation time) are passed through the ClientSocket
and BackendStartupData structures instead of simply saved in backend
local memory inherited by the child process.

Reviewed-by: Bertrand Drouvot 
Reviewed-by: Jelte Fennema-Nio 
Reviewed-by: Jacob Champion 
Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com
---
 src/backend/postmaster/launch_backend.c | 20 
 src/backend/postmaster/postmaster.c |  8 
 src/backend/tcop/postgres.c | 24 
 src/backend/utils/init/globals.c|  2 ++
 src/backend/utils/init/postinit.c   | 13 +
 src/include/libpq/libpq-be.h|  2 ++
 src/include/miscadmin.h |  2 ++
 src/include/postmaster/postmaster.h |  7 +++
 src/include/tcop/backend_startup.h  |  3 +++
 src/tools/pgindent/typedefs.list|  1 +
 10 files changed, 82 insertions(+)

diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 47375e5bfaa..37b31069120 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -232,6 +232,10 @@ postmaster_child_launch(BackendType child_type, int child_slot,
 
 	Assert(IsPostmasterEnvironment && !IsUnderPostmaster);
 
+	/* Capture time Postmaster initiates fork for logging */
+	if (child_type == B_BACKEND)
+		INSTR_TIME_SET_CURRENT(((BackendStartupData *) startup_data)->fork_time);
+
 #ifdef EXEC_BACKEND
 	pid = internal_forkexec(child_process_kinds[child_type].name, child_slot,
 			startup_data, startup_data_len, client_sock);
@@ -240,6 +244,14 @@ postmaster_child_launch(BackendType child_type, int child_slot,
 	pid = fork_process();
 	if (pid == 0)/* child */
 	{
+		/* Calculate total fork duration in child backend for logging */
+		if (child_type == B_BACKEND)
+		{
+			INSTR_TIME_SET_CURRENT(conn_timing.fork_duration);
+			INSTR_TIME_SUBTRACT(conn_timing.fork_duration,
+((BackendStartupData *) startup_data)->fork_time);
+		}
+
 		/* Close the postmaster's sockets */
 		ClosePostmasterPorts(child_type == B_LOGGER);
 
@@ -618,6 +630,14 @@ SubPostmasterMain(int argc, char *argv[])
 	/* Read in the variables file */
 	read_backend_variables(argv[2], &startup_data, &startup_data_len);
 
+	/* Calculate total fork duration in child backend for logging */
+	if (child_type == B_BACKEND)
+	{
+		INSTR_TIME_SET_CURRENT(conn_timing.fork_duration);
+		INSTR_TIME_SUBTRACT(conn_timing.fork_duration,
+((BackendStartupData *) startup_data)->fork_time);
+	}
+
 	/* Close the postmaster's sockets (as soon as we know them) */
 	ClosePostmasterPorts(child_type == B_LOGGER);
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5dd3b6a4fd4..880f491a9f7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1685,7 +1685,14 @@ ServerLoop(void)
 ClientSocket s;
 
 if (AcceptConnection(events[i].fd, &s) == STATUS_OK)
+{
+	/*
+	 * Capture time that Postmaster got a socket from accept
+	 * (for logging connection establishment duration)
+	 */
+	INSTR_TIME_SET_CURRENT(s.creation_time);
 	BackendStartup(&s);
+}
 
 /* We no longer need the open socket in this process */
 if (s.sock != PGINVALID_SOCKET)
@@ -3511,6 +3518,7 @@ BackendStartup(ClientSocket *client_sock)
 
 	/* Pass down canAcceptConnections state */
 	startup_data.canAcceptConnections = cac;
+	INSTR_TIME_SET_ZERO(startup_data.fork_time);
 	bn->rw = NULL;
 
 	/* Hasn't asked to be notified about any bgworkers yet */
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f2f75aa0f88..abd7e648657 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4153,6 +4153,7 @@ PostgresMain(const char *dbname, const char *username)
 	volatile bool send_ready_for_query = true;
 	volatile bool idle_in_transaction_timeout_enab

Re: Parallel heap vacuum

2025-02-25 Thread Masahiko Sawada
On Tue, Feb 25, 2025 at 9:59 AM Melanie Plageman
 wrote:
>
> On Mon, Feb 24, 2025 at 8:15 PM Masahiko Sawada  wrote:
> >
> > What I can see from these results was that we might not benefit much
> > from parallelizing phase III, unfortunately. Although in the best case
> > the phase III got about 2x speedup, as for the total duration it's
> > about only 10% speedup. My analysis for these results matches what
> > John mentioned; phase III is already the fastest phase and accounts
> > only ~10% of the total execution time, and the overhead of shared
> > TidStore offsets the speedup of phase III.
>
> So, are you proposing to drop the patches for parallelizing phase III
> for now? If so, are you planning on posting a set of patches just to
> parallelize phase I? I haven't looked at the prelim refactoring
> patches to see if they have independent value. What do you think is
> reasonable for us to try and do in the next few weeks?

Given that we have only about one month until the feature freeze, I
think it's realistic to introduce only one of the two parallelisms for
PG18, and we might want to implement first the one that is more
beneficial and helpful for users. Since we found that parallel phase
III is not very efficient in many cases, I'm thinking that in terms of
PG18 development we might want to switch focus to parallel phase I,
and then go for phase III if we have time.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2025-02-25 Thread Dagfinn Ilmari Mannsåker
Hi,

While working on another round of the long option and fat comma style
cleanup, I noticed that the test for pg_upgrade --set-char-signedess
doesn't test what it's supposed to:

Masahiko Sawada  writes:

> diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl 
> b/src/bin/pg_upgrade/t/005_char_signedness.pl
> index 05c3014a27d..c024106863e 100644
> --- a/src/bin/pg_upgrade/t/005_char_signedness.pl
> +++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
> @@ -40,6 +40,23 @@ command_like(
>   qr/Default char data signedness:\s+unsigned/,
>   'updated default char signedness is unsigned in control file');
>  
> +# Cannot use --set-char-signedness option for upgrading from v18+
> +command_fails(
> + [
> + 'pg_upgrade', '--no-sync',
> + '-d', $old->data_dir,
> + '-D', $new->data_dir,
> + '-b', $old->config_data('--bindir'),
> + '-B', $new->config_data('--bindir'),
> + '-s', $new->host,
> + '-p', $old->port,
> + '-P', $new->port,
> + '-set-char-signedness', 'signed',

This is missing a dash, which causes the command to fail, but for the
wrong reason.  pg_upgrade seems to print all its errors on stdout, which
I guess is why the test uses plain command_fails() instead of
command_fails_like(). However, we have another function to deal with
this: command_checks_all().  Attached are patches that fix the above
test, and also convert the other command_fails() calls in the pg_upgrade
tests to test for specific messages.

- ilmari

>From b80c653c6635096345ec453bfe9445a82b7ab049 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 25 Feb 2025 22:27:36 +
Subject: [PATCH 1/2] pg_upgrade: fix --set-char-signedness failure test

The test had the option misspelled with only one dash, which fails, but
not with the expected message. Since pg_upgrade gives its error messages
on stdout(?!), use command_checks_all() instead of command_fails_like().
---
 src/bin/pg_upgrade/t/005_char_signedness.pl | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl b/src/bin/pg_upgrade/t/005_char_signedness.pl
index c024106863e..d186822ac77 100644
--- a/src/bin/pg_upgrade/t/005_char_signedness.pl
+++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
@@ -41,7 +41,7 @@ command_like(
 	'updated default char signedness is unsigned in control file');
 
 # Cannot use --set-char-signedness option for upgrading from v18+
-command_fails(
+command_checks_all(
 	[
 		'pg_upgrade', '--no-sync',
 		'-d', $old->data_dir,
@@ -51,9 +51,12 @@ command_fails(
 		'-s', $new->host,
 		'-p', $old->port,
 		'-P', $new->port,
-		'-set-char-signedness', 'signed',
+		'--set-char-signedness', 'signed',
 		$mode
 	],
+	1,
+	[qr/--set-char-signedness option cannot be used/],
+	[],
 	'--set-char-signedness option cannot be used for upgrading from v18 or later'
 );
 
-- 
2.43.0

>From 03d25df987029da7511ff5a6f4ce69d1aef16ea9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 25 Feb 2025 22:56:12 +
Subject: [PATCH 2/2] pg_upgrade: use command_checks_all() to check for errors

pg_upgrade prints its error messages on stdout, so we can't use
command_fails_like() to check that it fails for the right reason.
Instead, we have to use command_checks_all(), which this patch does.
---
 src/bin/pg_upgrade/t/002_pg_upgrade.pl   | 5 -
 src/bin/pg_upgrade/t/004_subscription.pl | 8 +++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 45ea94c84bb..cccba9dc3db 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -396,7 +396,7 @@ $oldnode->stop;
 # Cause a failure at the start of pg_upgrade, this should create the logging
 # directory pg_upgrade_output.d but leave it around.  Keep --check for an
 # early exit.
-command_fails(
+command_checks_all(
 	[
 		'pg_upgrade', '--no-sync',
 		'-d', $oldnode->data_dir,
@@ -408,6 +408,9 @@ command_fails(
 		'-P', $newnode->port,
 		$mode, '--check',
 	],
+	1,
+	[qr{check for ".*?does.not.exist" failed:}],
+	[],
 	'run of pg_upgrade --check for new instance with incorrect binary path');
 ok(-d $newnode->data_dir . "/pg_upgrade_output.d",
 	"pg_upgrade_output.d/ not removed after pg_upgrade failure");
diff --git a/src/bin/pg_upgrade/t/004_subscription.pl b/src/bin/pg_upgrade/t/004_subscription.pl
index 13773316e1d..3f9734f7b3f 100644
--- a/src/bin/pg_upgrade/t/004_subscription.pl
+++ b/src/bin/pg_upgrade/t/004_subscription.pl
@@ -130,7 +130,7 @@ $old_sub->safe_psql('postgres',
 
 $old_sub->stop;
 
-command_fails(
+command_checks_all(
 	[
 		'pg_upgrade',
 		'--no-sync',
@@ -144,6 +144,12 @@ command_fails(
 		$mode,
 		'--check',
 	],
+	1,
+	[
+		qr{\QYour installation contains subscriptions without origin},
+		qr{\Qrelations not in i (initialize) 

Re: Make COPY format extendable: Extract COPY TO format implementations

2025-02-25 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Tue, 25 Feb 2025 14:05:28 -0800,
  Masahiko Sawada  wrote:

> The first two patches are refactoring patches (+ small performance
> improvements). I've reviewed these patches again and attached the
> updated patches. I reorganized the function order and updated comments
> etc. I find that these patches are reasonably ready to push. Could you
> review these versions? I'm going to push them, barring objections and
> further comments.

Sure. Here are some minor comments:

0001:

Commit message:

> or CSV mode. The performance benchmark results showed ~5% performance
> gain intext or CSV mode.

intext -> in text

> --- a/src/backend/commands/copyto.c
> +++ b/src/backend/commands/copyto.c

> @@ -20,6 +20,7 @@

>  #include "commands/copy.h"
> +#include "commands/copyapi.h"

We can remove '#include "commands/copy.h"' because it's
included in copyapi.h. (0002 does it.)

> @@ -254,6 +502,35 @@ CopySendEndOfRow(CopyToState cstate)

> +/*
> + * Wrapper function of CopySendEndOfRow for text and CSV formats. Sends the
> + * the line termination and do common appropriate things for the end of row.
> + */

Sends the the line ->
Sends the line

> --- /dev/null
> +++ b/src/include/commands/copyapi.h

> + /* End a COPY TO. This callback is called once at the end of COPY FROM 
> */

The last "." is missing: ... COPY FROM.

0002:

Commit message:

> This change is a preliminary step towards making the COPY TO command
> extensible in terms of output formats.

COPY TO -> COPY FROM

> --- a/src/backend/commands/copyfromparse.c
> +++ b/src/backend/commands/copyfromparse.c

> @@ -1087,7 +1132,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext 
> *econtext,

>  static bool
> -CopyReadLine(CopyFromState cstate)
> +CopyReadLine(CopyFromState cstate, bool is_csv)

> @@ -1163,7 +1208,7 @@ CopyReadLine(CopyFromState cstate)

>  static bool
> -CopyReadLineText(CopyFromState cstate)
> +CopyReadLineText(CopyFromState cstate, bool is_csv)

We may want to add a comment about why we don't use "inline" or
"pg_attribute_always_inline" here:

https://www.postgresql.org/message-id/CAD21AoBNfKDbJnu-zONNpG820ZXYC0fuTSLrJ-UdRqU4qp2wog%40mail.gmail.com

> Yes, I'm not sure it's really necessary to make it inline since the
> benchmark results don't show much difference. Probably this is because
> the function has 'is_csv' in some 'if' branches but the compiler
> cannot optimize out the whole 'if' branches as most 'if' branches
> check 'is_csv' and other variables.

Or we can add "inline" (not "pg_attribute_always_inline") here
as a hint for the compiler.

> --- a/src/include/commands/copyapi.h
> +++ b/src/include/commands/copyapi.h

> @@ -52,4 +52,50 @@ typedef struct CopyToRoutine

> + /* End a COPY FROM. This callback is called once at the end of COPY 
> FROM */

The last "." is missing: ... COPY FROM.


I think that these patches are ready to push too.

Thanks,
-- 
kou




Re: psql \dh: List High-Level (Root) Tables and Indexes

2025-02-25 Thread Greg Sabino Mullane
The patch applies cleanly, and works as advertised. Nice work!

Quick notes:

* doc/src/sgml/ref/psql-ref.sgml

In the varlistentry section, the order should be the same as the other
places (N after m)

Line 1644 has an extra comma

Line 1651, maybe the example is simpler as \dNt to keep the wording better,
because "indexes that are not partitions" looks odd.

These bits:

  pg_log_error("Did not find any%s relations named \"%s\".",
no_partition_description, pattern);

are not good for translation. We want things simple with replaceable
args/constants, but not replaceable words.
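
One common fix is to emit a complete sentence per case, so each
translatable string stands on its own (a rough sketch; the
show_partitions flag here is just an illustrative name):

/* Sketch only: one self-contained message per case. */
if (show_partitions)
    pg_log_error("Did not find any relations named \"%s\".", pattern);
else
    pg_log_error("Did not find any non-partition relations named \"%s\".",
                 pattern);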

I think the myopt.title ones are fine.

* bin/psql/help.c:

\\dN[Sx+] [PATTERN] list relation, table, index (no partitions)

better as:

\\dN[Sx+] [PATTERN] list tables and indexes (no partitions)

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: Make COPY format extendable: Extract COPY TO format implementations

2025-02-25 Thread Masahiko Sawada
On Tue, Feb 25, 2025 at 3:52 PM Sutou Kouhei  wrote:
>
>

Thank you for reviewing the patches. I've addressed comments except
for the following comment:

> > --- a/src/backend/commands/copyfromparse.c
> > +++ b/src/backend/commands/copyfromparse.c
>
> > @@ -1087,7 +1132,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext 
> > *econtext,
>
> >  static bool
> > -CopyReadLine(CopyFromState cstate)
> > +CopyReadLine(CopyFromState cstate, bool is_csv)
>
> > @@ -1163,7 +1208,7 @@ CopyReadLine(CopyFromState cstate)
>
> >  static bool
> > -CopyReadLineText(CopyFromState cstate)
> > +CopyReadLineText(CopyFromState cstate, bool is_csv)
>
> We may want to add a comment why we don't use "inline" nor
> "pg_attribute_always_inline" here:
>
> https://www.postgresql.org/message-id/CAD21AoBNfKDbJnu-zONNpG820ZXYC0fuTSLrJ-UdRqU4qp2wog%40mail.gmail.com
>
> > Yes, I'm not sure it's really necessary to make it inline since the
> > benchmark results don't show much difference. Probably this is because
> > the function has 'is_csv' in some 'if' branches but the compiler
> > cannot optimize out the whole 'if' branches as most 'if' branches
> > check 'is_csv' and other variables.
>
> Or we can add "inline" not "pg_attribute_always_inline" here
> as a hint for compiler.

I think we should not add inline unless we see a performance
improvement. Also, I find that it would be independent of this
refactoring, so we can add it later if needed.

I've attached updated patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v34-0002-Refactor-COPY-FROM-to-use-format-callback-functi.patch
Description: Binary data


v34-0001-Refactor-COPY-TO-to-use-format-callback-function.patch
Description: Binary data


Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Nathan Bossart
On Tue, Feb 25, 2025 at 05:19:30PM -0500, Melanie Plageman wrote:
> Yes, so one thing you haven't said yet is if you are +1 on going
> forward with these patches in general.

Sorry, yes, I'm +1 in general.  It conceptually makes sense to me that we
should disregard frozen pages when deciding whether to do an insert vacuum,
and it's hard to argue with the results in your original post.  I also am
not overly concerned about worker starvation.  While this patch does give
higher priority to insert-only/mostly tables, it's also reducing the amount
of resources required to vacuum them, anyway.

-- 
nathan




Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Melanie Plageman
On Tue, Feb 25, 2025 at 3:05 PM Nathan Bossart  wrote:
>
> On Tue, Feb 25, 2025 at 01:52:28PM -0500, Robert Haas wrote:
> > Given that users could manually update the catalog, we have to be able
> > to tolerate bad data in the catalogs without the world ending. If that
> > code has to exist anyway, then it's not mandatory to cap. On the other
> > hand, there's no great virtue in refusing to correct data that we know
> > to be wrong. Unless there is some other consideration which makes one
> > way better than the other, this feels like author's choice.
>
> Maybe the most conservative choice is to simply follow the example of
> surrounding code.  If it's careful to cap relallvisible to relpages, also
> have it cap relallfrozen.  If not, don't.  *shrug*

Agreed. I've done this in attached v10. I handle relallfrozen values >
relpages in the second patch in the set when using the relallfrozen
value, so I think we are all good.

> In any case, I don't want to hold up this patch on this relatively minor
> point.  This seems like something we could pretty easily change in the
> future if needed.

Yes, so one thing you haven't said yet is if you are +1 on going
forward with these patches in general.

As for the code, I'm not 100% convinced I've got all the stats
import/export bits perfect (those are changing under my feet right now
anyway).


- Melanie
From bdeafa6bffbcbb895ceba5acc8433695ef1e29bf Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 16 Jan 2025 16:31:55 -0500
Subject: [PATCH v10 2/2] Trigger more frequent autovacuums with relallfrozen

Calculate the insert threshold for triggering an autovacuum of a
relation based on the number of unfrozen pages. By only considering the
"active" (unfrozen) portion of the table when calculating how many
tuples to add to the insert threshold, we can trigger more frequent
vacuums of insert-heavy tables and increase the chances of vacuuming
those pages when they still reside in shared buffers.

Reviewed-by: Greg Sabino Mullane
---
 doc/src/sgml/config.sgml  | 16 +--
 src/backend/postmaster/autovacuum.c   | 27 ---
 src/backend/utils/misc/postgresql.conf.sample |  4 +--
 3 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a8354576108..9d8e42cd3db 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8745,14 +8745,14 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;


 
- Specifies a fraction of the table size to add to
- autovacuum_vacuum_insert_threshold
- when deciding whether to trigger a VACUUM.
- The default is 0.2 (20% of table size).
- This parameter can only be set in the postgresql.conf
- file or on the server command line;
- but the setting can be overridden for individual tables by
- changing table storage parameters.
+Specifies a fraction of the active (unfrozen) table size to add to
+autovacuum_vacuum_insert_threshold
+when deciding whether to trigger a VACUUM.
+The default is 0.2 (20% of active table size).
+This parameter can only be set in the postgresql.conf
+file or on the server command line;
+but the setting can be overridden for individual tables by
+changing table storage parameters.
 

   
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ddb303f5201..0aca7d78b90 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2938,7 +2938,6 @@ relation_needs_vacanalyze(Oid relid,
 {
 	bool		force_vacuum;
 	bool		av_enabled;
-	float4		reltuples;		/* pg_class.reltuples */
 
 	/* constants from reloptions or GUC variables */
 	int			vac_base_thresh,
@@ -3052,7 +3051,11 @@ relation_needs_vacanalyze(Oid relid,
 	 */
 	if (PointerIsValid(tabentry) && AutoVacuumingActive())
 	{
-		reltuples = classForm->reltuples;
+		float4		pcnt_unfrozen = 1;
+		float4		reltuples = classForm->reltuples;
+		int32		relpages = classForm->relpages;
+		int32		relallfrozen = classForm->relallfrozen;
+
 		vactuples = tabentry->dead_tuples;
 		instuples = tabentry->ins_since_vacuum;
 		anltuples = tabentry->mod_since_analyze;
@@ -3061,11 +3064,29 @@ relation_needs_vacanalyze(Oid relid,
 		if (reltuples < 0)
 			reltuples = 0;
 
+		/*
+		 * If we have data for relallfrozen, calculate the unfrozen percentage
+		 * of the table to modify insert scale factor. This helps us decide
+		 * whether or not to vacuum an insert-heavy table based on the number
+		 * of inserts to the "active" part of the table.
+		 */
+		if (relpages > 0 && relallfrozen > 0)
+		{
+			/*
+			 * It could be the stats were updated manually and relallfrozen >
+			 * relpages. Clamp relallfrozen to relpages to avoid nonsensical
+			 * calculations.
+			 */
+			relallfrozen = Min(relallfrozen, relpages);
+			pcnt
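
For the calculation being described, a hedged sketch using the
variables from the hunk above (the exact form in the actual patch may
differ):

/* Sketch only: scale the insert threshold by the unfrozen fraction. */
relallfrozen = Min(relallfrozen, relpages);
pcnt_unfrozen = 1.0 - ((float4) relallfrozen / relpages);

vacinsthresh = (float4) vac_ins_base_thresh +
    vac_ins_scale_factor * reltuples * pcnt_unfrozen;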

Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Melanie Plageman
On Tue, Feb 25, 2025 at 1:52 PM Robert Haas  wrote:
>
> On Tue, Feb 25, 2025 at 11:03 AM Melanie Plageman
>  wrote:
> > This does however leave me with the question of how to handle the
> > original question of whether or not to cap the proposed relallfrozen
> > to the value of relallvisible when updating stats at the end of
> > vacuum. The current code in heap_vacuum_rel() caps relallvisible to
> > relpages, so capping relallfrozen to relallvisible would follow that
> > pattern. However, the other places relallvisible is updated do no such
> > capping (do_analyze_rel(), index_update_stats()). It doesn't seem like
> > there is a good reason to do it one place and not the others. So, I
> > suggest either removing all the caps and adding a WARNING or capping
> > the value in all places. Because users can now manually update these
> > values in pg_class, there wouldn't be a way to detect the difference
> > between a bogus relallfrozen value due to VM corruption or a bogus
> > value due to manual statistics intervention. This led me to think that
> > a WARNING and no cap would be more effective for heap_vacuum_rel().
>
> I mean, does it really make any difference one way or the other?
>
> Given that users could manually update the catalog, we have to be able
> to tolerate bad data in the catalogs without the world ending. If that
> code has to exist anyway, then it's not mandatory to cap. On the other
> hand, there's no great virtue in refusing to correct data that we know
> to be wrong. Unless there is some other consideration which makes one
> way better than the other, this feels like author's choice.

I realized that whether or not we add a WARNING is an independent
question from whether or not we cap these values. In these instances,
we happen to have just read the whole VM and so we can tell you if it
is broken in a particular way. If I want to write a patch to warn
users of visibility map corruption after calling
visibilitymap_count(), I could do that and it might be a good idea,
but it should probably be a separate commit anyway.

- Melanie




Re: Statistics Import and Export

2025-02-25 Thread Tom Lane
Corey Huinker  writes:
> My solution so far is to take all the v11+ (SELECT array_agg...) functions
> and put them into a LATERAL, two of them filtered by attstattarget > 0 and
> a new one aggregating attnames with no filter.

> An alternative would be a new subselect for array_agg(attname) WHERE
> in.indexprs IS NOT NULL, thus removing the extra compute for the indexes
> that lack an index expression (i.e. most of them), and thus lack settable
> stats (at least for now) and wouldn't be affected by the name-jitter issue
> anyway.

Yeah, I've been thinking about that.  I think that the idea of the
current design is that relatively few indexes will have explicit stats
targets set on them, so most of the time the sub-SELECTs produce no
data.  (Which is not to say that they're cheap to execute.)  If we
pull all the column names for all indexes then we'll likely bloat
pg_dump's working storage quite a bit.  Pulling them only for indexes
with expression columns should fix that, and as you say we don't need
the names otherwise.

I still fear that those sub-selects are pretty expensive in aggregate
-- they are basically forcing a nestloop join -- and maybe we need to
rethink that whole idea.

BTW, just as a point of order: it is not the case that non-expression
indexes are free of name-jitter problems.  That's because we don't
bother to rename index columns when the underlying table column is
renamed, thus:

regression=# create table t1 (id int primary key);
CREATE TABLE
regression=# \d t1_pkey
Index "public.t1_pkey"
 Column |  Type   | Key? | Definition 
+-+--+
 id | integer | yes  | id
primary key, btree, for table "public.t1"

regression=# alter table t1 rename column id to xx;
ALTER TABLE
regression=# \d t1_pkey
Index "public.t1_pkey"
 Column |  Type   | Key? | Definition 
+-+--+
 id | integer | yes  | xx
primary key, btree, for table "public.t1"

After dump-n-reload, this index's column will be named "xx".
That's not relevant to our current problem as long as we
don't store stats on such index columns, but it's plenty
relevant to the ALTER INDEX ... SET STATISTICS code.

> I'm on the fence about how to handle pg_clear_attribute_stats(), leaning
> toward overloaded functions.

I kinda felt that we didn't need to bother with an attnum-based
variant of pg_clear_attribute_stats(), since pg_dump has no
use for that.  I won't stand in the way if you're desperate to
do it, though.

regards, tom lane




Re: Log connection establishment timings

2025-02-25 Thread Fujii Masao




On 2025/02/26 6:36, Melanie Plageman wrote:

On Tue, Feb 25, 2025 at 3:23 PM Melanie Plageman
 wrote:


Thanks for doing this! I have implemented your suggestion in attached v3.


I missed an include in the EXEC_BACKEND not defined case. attached v4
is fixed up.


Thanks for updating the patch!

+   /* Capture time Postmaster initiates fork for logging */
+   if (child_type == B_BACKEND)
+   INSTR_TIME_SET_CURRENT(((BackendStartupData *) 
startup_data)->fork_time);

When log_connections is enabled, walsender connections are also logged.
However, with the patch, it seems the connection time for walsenders isn't 
captured.
Is this intentional?


With the current patch, when log_connections is enabled, the connection time
is always captured, which might introduce performance overhead, no? Some
users who enable log_connections may not want this extra detail and may want
to avoid such overhead. So, would it make sense to extend log_connections
with a new option like "timing" and log the connection time only when
"timing" is specified?


+   ereport(LOG,
+   errmsg("backend ready for query. 
pid=%d. socket=%d. connection establishment times (ms): total: %ld, fork: %ld, 
authentication: %ld",
+  MyProcPid,
+  (int) 
MyClientSocket->sock,

Why expose the socket's file descriptor? I'm not sure how users would use that 
information.


Including the PID seems unnecessary since it's already available via 
log_line_prefix with %p?


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Spinlock can be released twice in procsignal.c

2025-02-25 Thread Michael Paquier
On Wed, Feb 26, 2025 at 09:08:53AM +0500, Andrey Borodin wrote:
> Looks like the oversight in 9d9b9d4. IMO the fix is correct.

if (pg_atomic_read_u32(&slot->pss_pid) != 0)
{
-   SpinLockRelease(&slot->pss_mutex);
elog(LOG, "process %d taking over ProcSignal slot %d, but it's 
not empty",
 MyProcPid, MyProcNumber);
}

This fix is not correct.  No system function calls (well, basically
most of them), and even more so no PostgreSQL-specific calls, should
happen while holding a spinlock.  elog() is a good example of what not
to do.  One example: imagine a palloc failure happening while this
elog() is called with the spinlock still held.

The code should be restructured so that we read pss_pid while holding
the spinlock, release the spinlock, and then LOG.  Based on the
structure of this routine, you could just set a boolean to decide
whether something should be logged and delay the LOG until the
spinlock is released, because we don't intend an early exit like in
CleanupProcSignalState().  In ~16, ProcSignalInit() is able to LOG
things early, but the slot is taken over even if pss_pid is set, so
delaying the LOG until after updating ProcSignal does not matter.
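
In other words, something like this minimal sketch of the
restructuring (based on the quoted hunk; the committed fix may differ
in detail):

bool        slot_was_busy;

SpinLockAcquire(&slot->pss_mutex);
slot_was_busy = (pg_atomic_read_u32(&slot->pss_pid) != 0);
/* ... take over the slot as before ... */
SpinLockRelease(&slot->pss_mutex);

/* LOG only once the spinlock is no longer held. */
if (slot_was_busy)
    elog(LOG, "process %d taking over ProcSignal slot %d, but it's not empty",
         MyProcPid, MyProcNumber);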
--
Michael


signature.asc
Description: PGP signature


Re: Statistics Import and Export

2025-02-25 Thread Corey Huinker
On Tue, Feb 25, 2025 at 11:36 PM Tom Lane  wrote:

> Corey Huinker  writes:
> > My solution so far is to take all the v11+ (SELECT array_agg...)
> functions
> > and put them into a LATERAL, two of them filtered by attstattarget > 0
> and
> > a new one aggregating attnames with no filter.
>
> > An alternative would be a new subselect for array_agg(attname) WHERE
> > in.indexprs IS NOT NULL, thus removing the extra compute for the indexes
> > that lack an index expression (i.e. most of them), and thus lack settable
> > stats (at least for now) and wouldn't be affected by the name-jitter
> issue
> > anyway.
>
> Yeah, I've been thinking about that.  I think that the idea of the
> current design is that relatively few indexes will have explicit stats
> targets set on them, so most of the time the sub-SELECTs produce no
> data.  (Which is not to say that they're cheap to execute.)  If we
> pull all the column names for all indexes then we'll likely bloat
> pg_dump's working storage quite a bit.  Pulling them only for indexes
> with expression columns should fix that, and as you say we don't need
> the names otherwise.
>
> I still fear that those sub-selects are pretty expensive in aggregate
> -- they are basically forcing a nestloop join -- and maybe we need to
> rethink that whole idea.
>
> BTW, just as a point of order: it is not the case that non-expression
> indexes are free of name-jitter problems.  That's because we don't
> bother to rename index columns when the underlying table column is
> renamed, thus:
>

Ouch.


> After dump-n-reload, this index's column will be named "xx".
> That's not relevant to our current problem as long as we
> don't store stats on such index columns, but it's plenty
> relevant to the ALTER INDEX ... SET STATISTICS code.
>

The only way I can imagine those columns getting their own stats is if we
start adding stats for columns of partial indexes, in which case we'd just
bump the predicate to WHERE (i.indexprs IS NOT NULL OR i.indpred IS NOT
NULL)

Just to confirm, we ARE able to assume dense packing of attributes in an
index, and thus we can infer the attnum from the position of the attname in
the aggregated array, and there's no need to do a parallel array_agg of
attnums, yes?


>
> > I'm on the fence about how to handle pg_clear_attribute_stats(), leaning
> > toward overloaded functions.
>
> I kinda felt that we didn't need to bother with an attnum-based
> variant of pg_clear_attribute_stats(), since pg_dump has no
> use for that.  I won't stand in the way if you're desperate to
> do it, though.
>

I'm not desperate to slow this thread down, no. We'll stick with
attname-only.


Re: Statistics Import and Export

2025-02-25 Thread Tom Lane
Corey Huinker  writes:
> Just to confirm, we ARE able to assume dense packing of attributes in an
> index, and thus we can infer the attnum from the position of the attname in
> the aggregated array, and there's no need to do a parallel array_agg of
> attnums, yes?

Yes, absolutely, there are no dropped columns in indexes.  See
upthread discussion.

We could have avoided two sub-selects for attstattarget too,
on the same principle: just collect them all and sort it out
later.  That'd risk bloating pg_dump's storage, although maybe
we could have handled that by doing additional processing
while inspecting the results of getIndexes' query, so as not
to store anything in the common case.

regards, tom lane




Enhances pg_createsubscriber documentation for the -d option.

2025-02-25 Thread vignesh C
Hi,

Currently, pg_createsubscriber converts streaming replication to
logical replication using the specified database name. If the database
name is not provided, it is obtained from the --publisher-server
option. If the database name is not specified in either the --database
or --publisher-server option, an error is reported. This behavior is
not documented. The attached patch includes the necessary
documentation updates.

Regards,
Vignesh


0001-Enhances-pg_createsubscriber-documentation-for-the-d.patch
Description: Binary data


Re: Log connection establishment timings

2025-02-25 Thread Bertrand Drouvot
Hi,

On Wed, Feb 26, 2025 at 01:46:19PM +0900, Fujii Masao wrote:
> 
> 
> On 2025/02/26 6:36, Melanie Plageman wrote:
> > On Tue, Feb 25, 2025 at 3:23 PM Melanie Plageman
> >  wrote:
> > > 
> > > Thanks for doing this! I have implemented your suggestion in attached v3.

Thanks for the new patch version!

> + /* Capture time Postmaster initiates fork for logging */
> + if (child_type == B_BACKEND)
> + INSTR_TIME_SET_CURRENT(((BackendStartupData *) 
> startup_data)->fork_time);
> 
> When log_connections is enabled, walsender connections are also logged.
> However, with the patch, it seems the connection time for walsenders isn't 
> captured.
> Is this intentional?

Good point. I'm tempted to say that it should also be, especially because a
connection done as "psql replication=database" is of the "walsender" backend
type and would not be reported.

> With the current patch, when log_connections is enabled, the connection time 
> is always
> captured, and which might introduce performance overhead. No? Some users who 
> enable
> log_connections may not want this extra detail and want to avoid such 
> overhead.
> So, would it make sense to extend log_connections with a new option like 
> "timing" and
> log the connection time only when "timing" is specified?

+1, I also think it's a good idea to let users decide if they want the timing
measurement overhead (and it's common practice with track_io_timing,
track_wal_io_timing, the newly track_cost_delay_timing for example)

> + ereport(LOG,
> + errmsg("backend ready for 
> query. pid=%d. socket=%d. connection establishment times (ms): total: %ld, 
> fork: %ld, authentication: %ld",
> +MyProcPid,
> +(int) 
> MyClientSocket->sock,
> 
> Why expose the socket's file descriptor? I'm not sure how users would use 
> that information.
> 
> 
> Including the PID seems unnecessary since it's already available via 
> log_line_prefix with %p?

Yeah, we would get things like:

[539] LOG:  connection received: host=[local]
[539] LOG:  connection authenticated: user="postgres" method=trust 
(/home/postgres/postgresql/pg_installed/pg18/data/pg_hba.conf:117)
[539] LOG:  connection authorized: user=postgres database=postgres 
application_name=psql
[539] LOG:  backend ready for query. pid=539. socket=9. connection 
establishment times (ms): total: 2, fork: 0, authentication: 0

I also wonder if "backend ready for query" is worth it. Maybe something like:

2025-02-26 06:44:23.265 UTC [539] LOG:  connection establishment times 
(ms): total: 2, fork: 0, authentication: 0

would be good enough?

A few random comments:

=== 1

+typedef struct ConnectionTiming
+{
+   instr_time  fork_duration;
+   instr_time  auth_duration;
+} ConnectionTiming;

As it's all about instr_time, I wonder if we could use an enum + array instead.
That's probably just a matter of taste but that sounds more flexible to extend
(should we want to add more timing in the future).
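
Something along these lines, as a rough sketch of that alternative
(names invented here for illustration):

/* Rough sketch of the enum + array alternative; names are invented. */
typedef enum ConnectionTimingKind
{
    CONN_TIMING_FORK,
    CONN_TIMING_AUTH,
    /* future stages go here */
    NUM_CONN_TIMING_KINDS
} ConnectionTimingKind;

extern instr_time conn_timing[NUM_CONN_TIMING_KINDS];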

=== 2

+ConnectionTiming conn_timing = {0};

There is no padding in ConnectionTiming and anyway we just access its fields
so that's ok to initialize that way.

=== 3

Add a few words in the log_connections GUC doc? (Anyway we will have to if
Fujii-san's idea above about the timing is implemented.)

=== 4

+   /* Calculate total fork duration in child backend for logging */
+   if (child_type == B_BACKEND)
+   {
+   INSTR_TIME_SET_CURRENT(conn_timing.fork_duration);
+   INSTR_TIME_SUBTRACT(conn_timing.fork_duration,
+   
((BackendStartupData *) startup_data)->fork_time);
+   }
+
/* Close the postmaster's sockets */
ClosePostmasterPorts(child_type == B_LOGGER);

@@ -618,6 +630,14 @@ SubPostmasterMain(int argc, char *argv[])
/* Read in the variables file */
read_backend_variables(argv[2], &startup_data, &startup_data_len);

+   /* Calculate total fork duration in child backend for logging */
+   if (child_type == B_BACKEND)
+   {
+   INSTR_TIME_SET_CURRENT(conn_timing.fork_duration);
+   INSTR_TIME_SUBTRACT(conn_timing.fork_duration,
+   ((BackendStartupData *) 
startup_data)->fork_time);
+   }

Worth adding a helper function to avoid the code duplication?
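For instance (a sketch only; the exact signature and placement are
assumptions):

/* Sketch of a possible helper; signature and placement are assumptions. */
static void
record_fork_duration(const BackendStartupData *startup_data)
{
    INSTR_TIME_SET_CURRENT(conn_timing.fork_duration);
    INSTR_TIME_SUBTRACT(conn_timing.fork_duration, startup_data->fork_time);
}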

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [BUG]: the walsender does not update its IO statistics until it exits

2025-02-25 Thread Michael Paquier
On Tue, Feb 25, 2025 at 01:42:08PM +, Bertrand Drouvot wrote:
> Now we can see that the numbers increased for the relation object and that we
> get non zeros numbers for the wal object too (which makes fully sense).
> 
> With the attached patch applied, we would get the same numbers already in
> step 4. (means the stats are flushed without the need to wait for the 
> walsender
> to exit).

@@ -2793,6 +2794,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
 if (pq_flush_if_writable() != 0)
 WalSndShutdown();
 
+/*
+ * Report IO statistics
+ */
+pgstat_flush_io(false);
+(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
 /* If nothing remains to be sent right now ... */
 if (WalSndCaughtUp && !pq_is_send_pending())
 {

That's bad, and worse for a logical WAL sender: it means that we have
no idea what kind of I/O happens in this process until it exits, and
logical WAL senders can run forever.  This has been true since v16,
where we began tracking I/O.

A non-forced periodic flush like you are proposing here sounds OK to
me, but the flush could be positioned better in the loop.  If there is
a SIGUSR2 (aka got_SIGUSR2 is true), WAL senders
would shut down, so it seems rather pointless to do a flush just
before exiting the process in WalSndDone(), no?  I'd suggest to move
the flush attempt closer to where we wait for some activity, just
after WalSndKeepaliveIfNecessary().
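That is, roughly (a sketch against WalSndLoop(), reusing the flush calls
from the proposed patch):

			/* Send keepalive if the time has come */
			WalSndKeepaliveIfNecessary();

			/*
			 * Flush IO statistics periodically, non-forced, so that the
			 * stats are visible while the WAL sender is running and the
			 * calls stay cheap when nothing has changed.
			 */
			pgstat_flush_io(false);
			(void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);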
--
Michael


signature.asc
Description: PGP signature


Re: per backend WAL statistics

2025-02-25 Thread Michael Paquier
On Tue, Feb 25, 2025 at 03:00:35PM +, Bertrand Drouvot wrote:
> That makes fully sense. Done in 0004 attached. Somehow related to that, I've
> a patch in progress to address some of Rahila's comments ([1]) (the one 
> related
> to the AuxiliaryPidGetProc() call is relevant specially since a051e71e28a 
> where
> pgstat_tracks_backend_bktype() has been modified for B_WAL_RECEIVER, 
> B_WAL_SUMMARIZER
> and B_WAL_WRITER). I'll wait for 0004 to go in before sharing the patch.

Applied v9-0001 and v9-0003 as these were fine, with more
documentation added in pgstat.h for the new WAL structure, and the 
reason why it exists.  I've noticed the difference with bktype in
v9-0004 as the WAL part does not need this information when generating
its tuple, OK here.

Doing v9-0003 after v9-0002 felt a bit odd, changing twice the
signature of pg_stat_wal_build_tuple() to adapt with the split for the
reset timestamp.

-   values[4] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+   if (wal_stats.stat_reset_timestamp != 0)
+   values[4] = TimestampTzGetDatum(wal_stats.stat_reset_timestamp);
+   else
+   nulls[4] = true;

In patch v9-0002, is this nulls[4] required for the backend part?
--
Michael


signature.asc
Description: PGP signature


Re: Experimental tool to explore commitfest patches

2025-02-25 Thread Jacob Brazeal
I wanted to provide a quick update on the app [0]. Here are the main issues
I've seen flagged so far:

1. The ranking system needs improvement. Ideally it should promote *relevant,
important, ready-for-review* patches.
2. We should display contributor names as they appear in the commitfest app
(this is relevant because we have to correlate names from several different
systems.)

I will be working on both of these, but tonight I want to provide an update
on the ranking system. The app now predicts which committers might be a
good fit for a patch, and displays this information in the app. If you are
a committer and select your name in the queue, those patches will float to
the top. As a quick sanity check, most of the cases I've seen flagged so
far are correctly handled by the new system. Here are some more details on
how it works.

The new recommendation system is based on keywords. I used an LLM to
extract technical keywords from the mailing list threads associated with
the last 10,000 git commits, and then trained a logistic regression model
to match the keywords to committers. I'm no expert at this, but I did some
basic statistical validation of the result on a training/test split and got
decent results: around 44% of the model's top choices were correct, and
just to be safe, I show the top 3 predicted committers for each patch in
the UX. Looking at specific folks like Robert in our test dataset, about
77% of the results matched to him were commits he actually made
(precision), and we identify about 45% of his commits overall (recall).
So, not perfect, but actually pretty likely to tag a mailing list thread
to the person who will commit it.

In the UX, if you are one of the top 3 identified committers, you will also
see a list of the top keywords from the mailing thread that were associated
with you.

[0] https://patchwork-three.vercel.app/


Re: Spinlock can be released twice in procsignal.c

2025-02-25 Thread Maxim Orlov
On Wed, 26 Feb 2025 at 07:56, Michael Paquier  wrote:

>
> This fix is not correct.  No system function calls (well basically
> most of them) or even more no PostgreSQL-specific calls should happen
> while holding a spinlock.  elog() is a good example of what not to do.
> One example: imagine a palloc failure while holding this spinlock in
> this elog().


Indeed. PFA the correct one.


-- 
Best regards,
Maxim Orlov.


v2-0001-Avoid-double-spinlock-release.patch
Description: Binary data


Re: support virtual generated column not null constraint

2025-02-25 Thread ego alter
Hi, I've had a chance to review the patch.  As I am still getting familiar
with the executor part, my questions and feedback may be relatively trivial.

There are two minor issues I want to discuss:
1. The way of caching constraint-checking expr for virtual generated not
null
The existing array for caching constraint-checking expr is
 /* array of constraint-checking expr states */
ExprState **ri_ConstraintExprs;

the proposed changes for virtual generated not null in patch
+  /* array of virtual generated not null constraint-checking expr states */
+  ExprState **ri_VirGeneratedConstraintExprs;
/*
Could you explain the rationale behind adding this new field instead of
reusing ri_ConstraintExprs? The comment suggests it’s used specifically for
not null constraint-checking, but could it be extended to handle other
constraints in the future as well? I assume one benefit is that it
separates the handling of normal constraints from virtual ones, but I'd
appreciate more context on the decision.

2. The naming of the variable gen_virtualnn:
gen_virtualnn looks confusing at first glance; checkconstr seems more
straightforward.

/* And evaluate the check constraints for virtual generated column */
+  for (i = 0; i < bms_num_members(gen_virtual_cols); i++)
+  {
+  ExprState  *gen_virtualnn =
resultRelInfo->ri_VirGeneratedConstraintExprs[i];
+
+  if (gen_virtualnn && !ExecCheck(gen_virtualnn, econtext))
+  return attnums[i];
+  }
+

  /* And evaluate the constraints */
for (i = 0; i < ncheck; i++)
{
ExprState  *checkconstr = resultRelInfo->ri_ConstraintExprs[i];

/*
* NOTE: SQL specifies that a NULL result from a constraint expression
* is not to be treated as a failure.  Therefore, use ExecCheck not
* ExecQual.
*/
if (checkconstr && !ExecCheck(checkconstr, econtext))
return check[i].ccname;
}

/* NULL result means no error */
return NULL;

Best regards,
Xuneng

jian he  于2025年2月10日周一 21:34写道:

> hi.
>
> Virtual generated columns committed,
>
> https://git.postgresql.org/cgit/postgresql.git/commit/?id=83ea6c54025bea67bcd4949a6d58d3fc11c3e21b
>
> This patch is for implementing not null constraints on virtual
> generated columns.
>
> NOT NULL constraints on virtual generated columns mean that
> if we INSERT a row into the table and the evaluation of the generated
> expression results in a null value,
> an ERRCODE_NOT_NULL_VIOLATION error will be reported.
>
> main gotcha is in ExecConstraints, expand the generated expression
> and convert a not null constraint to a check constraint and evaluate it.
>


Re: Doc fix of aggressive vacuum threshold for multixact members storage

2025-02-25 Thread John Naylor
On Wed, Feb 5, 2025 at 2:23 AM Sami Imseih  wrote:
>
> I confirmed the 20GB value as is described here [1].
> 8k page can hold 409 member groups and each
> member group can hold 4 members, thus
> (2^32/(409 *4))*8192 = 20GB.
>
> I also fixed whitespace issues in v3.

It seems at a minimum this one-line patch is sufficient for the correction:

- storage occupied by multixacts members exceeds 2GB, aggressive vacuum
+ storage occupied by multixacts members exceeds about 10GB, aggressive vacuum

Commit c552e171d16e removed the percentage as part of a judgment call
on clarity, and I'm not sure that was wrong.

We could add the proposed language on "can grow up to about 20GB" at
the end of this paragraph, which seems more natural -- first mention
the amount that triggers aggressive vacuum, then the maximum size.
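For reference, the quoted arithmetic checks out; a throwaway program,
assuming the 409 member groups of 4 members per 8kB page from Sami's
mail and the 50% trigger implied by the 10GB/20GB split:

#include <stdio.h>

int
main(void)
{
	const double members_per_page = 409.0 * 4;	/* 1636 members per 8kB page */
	const double max_members = 4294967296.0;	/* 2^32 */
	const double page_size = 8192.0;
	double		max_bytes = max_members / members_per_page * page_size;

	/* ~21.5 GB of maximum members storage, i.e. "about 20GB" */
	printf("max:       %.1f GB\n", max_bytes / 1e9);
	/* aggressive vacuum triggers at half of that, ~10GB */
	printf("threshold: %.1f GB\n", max_bytes / 2.0 / 1e9);
	return 0;
}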

On Fri, Feb 21, 2025 at 9:14 PM Sami Imseih  wrote:
>
> > Maybe we could also add a comment in multixact.c to update the doc 
> > accordingly if
> > the computation changes? (I think that will be easy to miss).
>
> Thanks for the comments!
>
> I'd rather we not touch the .c file for this update. It's unlikely the actual
> computation will change.

I'm on the fence about putting a hint in the C file, but the
computation has changed in the past (see commit b4d4ce1d50bbdf), so
it's a reasonable idea.

-- 
John Naylor
Amazon Web Services




Re: Fix logging for invalid recovery timeline

2025-02-25 Thread Michael Paquier
On Tue, Feb 25, 2025 at 04:25:40PM +, David Steele wrote:
> On 2/25/25 04:07, Benoit Lobréau wrote:
>> Thank you Michael and David.
>> I never paid attention to those...

Don't worry about that.  There is a lot of Postgres-ism in this code
base, and we all keep learning stuff :D

> This looks good to me.

Applied on HEAD.
--
Michael


signature.asc
Description: PGP signature


Re: Postmaster crashed during start

2025-02-25 Thread Srinath Reddy
On Wed, Feb 26, 2025 at 9:50 AM Srinath Reddy  wrote:

>
>
> On Wed, Feb 26, 2025 at 9:23 AM Tom Lane  wrote:
>
>> Srinath Reddy  writes:
>> > when we kill postmaster using kill -9 and start immediately it crashes
>> with
>> >> FATAL:  pre-existing shared memory block (key 2495405, ID 360501) is
>> still
>> >> in use
>>
>> "Doctor, it hurts when I do this!"
>>
>> "So don't do that!"
>>
>> This is not a supported way of shutting down the postmaster, and it
>> never will be.  Use SIGINT, or SIGQUIT if you are in a desperate
>> hurry and are willing to have the next startup take longer.
>>
> I was actually trying to recreate a power-outage scenario using
> node->kill9() and node->start() in a custom TAP test; that's when I found this crash.
>
>
>>
>> I think the specific reason you are seeing this is that it takes
>> nonzero time for the postmaster's orphaned child processes to
>> notice that the postmaster is gone and terminate.  As long as
>> any of those children remain, the shared memory block will have
>> a nonzero reference count.  The new postmaster sees that and
>> refuses to start, for the very sound reason that it risks
>> data corruption if it brings up a new set of worker processes
>> while any of the old ones are still running.
>>
>> regards, tom lane
>>
>
> I am guessing that by "reference count to the shared memory block" you
> mean shm_nattch, right? I think this is incremented by 1 when a process
> attaches to the shmem segment using shmat(). In the postgres case it's
> the postmaster that attaches while creating the shmem segment, and it
> detaches when its on_shmem_exit callbacks run on a proper exit -- which
> doesn't happen if it dies suddenly (as with kill -9). Only detaching
> decrements shm_nattch by 1. AFAIK the child processes get to use the
> shmem segment but never attach or detach themselves, so they would not
> affect shm_nattch. So, as shm_nattch is not 0, PGSharedMemoryAttach
> thinks the shmem segment is still attached and in use.
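(Tom's "reference count" can be inspected directly with shmctl(); note
that mappings inherited across fork() do count as attachments, which is
why orphaned children keep the segment "in use".  A sketch; the server's
real check lives in src/backend/port/sysv_shmem.c:)

#include <sys/shm.h>

static int
segment_still_attached(int shmid)
{
	struct shmid_ds buf;

	if (shmctl(shmid, IPC_STAT, &buf) < 0)
		return 0;				/* segment is gone (or not accessible) */

	/*
	 * shm_nattch stays nonzero while any process, including orphaned
	 * children that inherited the mapping over fork(), is attached.
	 */
	return buf.shm_nattch > 0;
}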
>


Re: Improve CRC32C performance on SSE4.2

2025-02-25 Thread John Naylor
On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer
 wrote:
>
> > I agree it would be preferable to make a centralized check work.
>
> Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid 
> checks into one single place including the AVX512 popcount code. Any new 
> feature that requires CPUID information can access that information with 
> pg_cpu_have[FEATURE] defined in pg_cpucap.h and initialized in pg_cpucap.c. 
> v8 also had a typo in configure files which caused a build failure. It's fixed
> in v9.
>
> Pretty sure the ARM code paths need some correction. Let me know what you 
> think.

Thanks, I think this is a good direction. Some comments:

+static void pg_cpuid(int leaf, int subleaf, unsigned int* exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}

Our current configure still looks for __get_cpuid and __cpuid. We
committed checking these new ones fairly recently, and they were
further gated by USE_AVX512_POPCNT_WITH_RUNTIME_CHECK. It seems like
here we should do something like the following, where "+" lines are
from the patch and other lines are mine:

+static void
+pg_cpucap_x86(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};

#if defined(HAVE__GET_CPUID)
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
__cpuid(exx, 1);
#endif

+pg_cpucap[PG_CPU_FEATURE_SSE42] = (exx[2] & (1 << 20)) != 0;
+pg_cpucap[PG_CPU_FEATURE_PCLMUL] = (exx[2] & (1 << 1)) != 0;
+pg_cpucap[PG_CPU_FEATURE_POPCNT] = (exx[2] & (1 << 23)) != 0;
+/* osxsave */
+if ((exx[2] & (1 << 27)) == 0) {
+return;
+}
+/* avx512 os support */
+if (zmm_regs_available()) {
+return;
+}

// BTW, I do like the gating here that reduces the number of places
that have to know about zmm and xsave. (Side note: shouldn't that be
"if !(zmm_regs_available())"?)

+/* reset for second cpuid call on leaf 7 to check extended avx512 support */

exx[0] = exx[1] = exx[2] = exx[3] = 0;

#if defined(HAVE__GET_CPUID_COUNT)
__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUIDEX)
__cpuidex(exx, 7, 0);
#endif

+pg_cpucap[PG_CPU_FEATURE_AVX512F]  = (exx[1] & (1 << 16)) != 0;
+pg_cpucap[PG_CPU_FEATURE_AVX512BW]= (exx[1] & (1 << 30)) != 0;
+pg_cpucap[PG_CPU_FEATURE_AVX512VPOPCNTDQ] = (exx[2] & (1 << 14)) != 0;
+
+}

What do you think?

-#define PGCPUCAP_INIT   (1 << 0)
-#define PGCPUCAP_POPCNT (1 << 1)
-#define PGCPUCAP_VPOPCNT(1 << 2)
-#define PGCPUCAP_CRC32C (1 << 3)
-#define PGCPUCAP_CLMUL  (1 << 4)
+enum pg_cpucap__
+{
+PG_CPU_FEATURE_INIT  = 0,
+// X86
+PG_CPU_FEATURE_SSE42 = 1,
+PG_CPU_FEATURE_POPCNT= 2,
+PG_CPU_FEATURE_PCLMUL= 3,
[...]
+PG_CPU_FEATURE_ARMV8_CRC32C  = 100,

I'm not a fan of exposing these architecture-specific details to
places that consult the capabilities. That requires things like

+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
[...]
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)

I'd prefer to have 1 capability <-> one place to check at runtime for
architectures that need that, and to keep architecture details private
to the initialization step.  Even for things that test for which
function pointer to use, I think it's a cleaner interface to look at
one thing.
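Concretely, something in this spirit (illustrative names;
pg_comp_crc32c_hw stands in for whichever arch-specific implementation
was selected at initialization):

/* One slot per capability, not per ISA extension. */
typedef enum pg_capability
{
	PG_CAP_CRC32C,
	PG_CAP_POPCNT,
	NUM_PG_CAPS
} pg_capability;

extern bool pg_cpu_have(pg_capability cap);

static pg_crc32c
comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
	/* callers check exactly one thing, on every architecture */
	if (pg_cpu_have(PG_CAP_CRC32C))
		return pg_comp_crc32c_hw(crc, data, len);
	return pg_comp_crc32c_sb8(crc, data, len);	/* portable fallback */
}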

+static void
+pg_cpucap_arch()
+{
+/* WIP: configure checks */
+#ifdef __x86_64__
+pg_cpucap_x86();
+#else // ARM:
+pg_cpucap_arm();
+#endif
+}

If we're going to have a single file for the init step, we don't need
this -- we'd just have a different definition of
pg_cpucap_initialize() in each part, with a default that only adds the
"init" slot:

#if defined( __i386__ ) || defined(i386) || defined(_M_IX86) ||
  defined(__x86_64__) || defined(__x86_64) || defined(_M_AMD64)



#elif defined(__arm__) || defined(__arm) ||
   defined(__aarch64__) || defined(_M_ARM64)



#else

#endif

--

John Naylor
Amazon Web Services




Re: Parallel heap vacuum

2025-02-25 Thread Masahiko Sawada
On Tue, Feb 25, 2025 at 2:44 PM Melanie Plageman
 wrote:
>
> On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada  wrote:
> >
> > Given that we have only about one month until the feature freeze, I
> > find that it's realistic to introduce either one parallelism for PG18
> > and at least we might want to implement the one first that is more
> > beneficial and helpful for users. Since we found that parallel phase
> > III is not very efficient in many cases, I'm thinking that in terms of
> > PG18 development, we might want to switch focus to parallel phase I,
> > and then go for phase III if we have time.
>
> Okay, well let me know how I can be helpful. Should I be reviewing a
> version that is already posted?

Thank you so much. I'm going to submit the latest patches in a few
days for parallelizing phase I. I would appreciate it if you could
review that version.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Remove wal_[sync|write][_time] from pg_stat_wal and track_wal_io_timing

2025-02-25 Thread Michael Paquier
On Tue, Feb 25, 2025 at 08:12:53AM +, Bertrand Drouvot wrote:
> The idea was to not let track_io_timing alone enable the timing in the WAL
> code path. My reasoning was: if you want to see timing in pg_stat_io then you
> need to enable track_io_timing. But that's not enough if you also want to see
> WAL timing, then you also need to set track_wal_io_timing. Your proposal also
> ensures that "track_io_timing alone can not enable the timing in the WAL code 
> path",
> with a clear separation of duties, it's probably better so I'm fine with it.

One problem with this approach is that you would need to always pay
the price of the timings under track_io_timing if the WAL part is
enabled, so you'd lose flexibility compared to the past configuration.
Based on the complaint of upthread, a split makes things more
flexible, as it is possible to monitor one or the other or both.  I'm
OK to tweak things more as required, we still have time for that.

> Yeah, I though about it too but decided to not change the READ part in v1 
> (because
> I think they are less of a concern). OTOH, if you want to see the READ timing 
> then
> you need to set track_wal_io_timing but then once the recovery is over then 
> you'll
> need to disable track_wal_io_timing (if you don't want to pay the price for
> write/fsync activities).
> 
> OTOH pre a051e71e28a track_wal_io_timing was impacting only the write/fsyncs,
> in that regard v1 was close to that.

Still less consistent..  We've never tracked WAL reads in the stats up
to now, so we need to put it in one of these buckets.

> In the track_io_timing GUC description Shouldn't we also mention the
> wal object restriction, something like?
> 
> "
> in pg_stat_database, pg_stat_io (if object is not wal), in the output of the
> pg_stat_get_backend_io() function (if object is not wal)
> "
>
> The proposed doc changes are in the .txt attached (that applies on top of v2).

Sounds like a good set of additions as we have the two GUCs.

Bonus idea: We could also have more GUCs to control all that, or just
put everything in one single GUC that takes a list of values made of
pairs of (object,op), then set the timings only for the combinations
listed in the GUC.  That feels a bit over-engineered, but you could
deeply control the areas where the data is aggregated.  I'm not really
convinced it is strictly necessary, but, well, that would not be the
first time people tell me I'm wrong..  All things said, v2 sounds like
it has the right balance for now based on what I am reading, providing
the same control as pg_stat_wal previously, so I've used it for now.
--
Michael


signature.asc
Description: PGP signature


Re: Get rid of WALBufMappingLock

2025-02-25 Thread Michael Paquier
On Tue, Feb 25, 2025 at 05:19:29PM +0200, Alexander Korotkov wrote:
> It seems that I managed to reproduce the issue on my Raspberry PI 4.
> After running our test suite in a loop for 2 days I found one timeout.

Hmm.  It's surprising to not see a higher occurrence.  My buildfarm
host has caught that on its first run after the patch, for two
different animals which are both on the same machine.

> One way or another, we need protection against this situation any way.
> The updated patch is attached.  Now, after acquiring ReservedPtr it
> waits till OldPageRqstPtr gets initialized.  Additionally I've to
> implement more accurate calculation of OldPageRqstPtr.  I run tests
> with new patch on my Raspberry in a loop.  Let's see how it goes.

Perhaps you'd prefer that I do more tests with your patch?  This is
time-consuming for you.  This is not a review of the internals of the
patch, and I cannot give you access to the host, but if my stuff is
the only place where we have a good reproducibility of the issue, I'm
OK to grab some time and run a couple of checks to avoid again a
freeze of the buildfarm.
--
Michael


signature.asc
Description: PGP signature


Re: pgbench client-side performance issue on large scripts

2025-02-25 Thread Tom Lane
"Daniel Verite"  writes:
> For the moment I'll stay with my quick fix, then l'll try
> to come up with something to replace expr_scanner_get_lineno() .

I got nerd-sniped by this question and spent some time looking into
it.  ParseScript has got worse problems than just being slow: it's
actively buggy.  Notice that start_offset is set only once before
entering the loop, and doesn't change thereafter.  How is it that
we're getting sane line numbers at all?  The reason is that (1) if
we've not called yylex() at all yet, expr_scanner_offset() gives the
distance to the end of the string, since the yytext-ending NUL it's
looking for isn't there yet; and (2) expr_scanner_get_lineno() treats
the given start_offset as an upper bound, and won't complain if it
finds the NUL earlier than that.  So it gave the desired
line-number-of-the-current-token on all iterations after the first,
but on the first time through we get the line number of the script
end.  You can only see that in the case of \gset as the first command,
and I guess nobody noticed it yet.

Furthermore, it's not only ParseScript that's got O(N^2) problems;
so does process_backslash_command.  Your test case didn't show that
up, but a test with 50K backslash commands would.  We were actually
doing a strlen() of the whole string for each word of a backslash
command.  strlen() is likely faster than expr_scanner_get_lineno(),
but it's not so fast that O(N^2) effects don't matter.

The attached patch gets rid of both expr_scanner_offset() and
expr_scanner_get_lineno() altogether, in favor of using a new
function I added to psqlscan.l.  That uses the idea from plpgsql
of tracking the last-detected line end so that we don't have to
rescan prior lines over and over.  On my machine, parsing 50K-line
scripts goes from more than 10 seconds to perhaps 50 ms.
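The bookkeeping, in miniature (a generic sketch of the technique, not
the patch's actual code):

typedef struct LocationTracker
{
	const char *script;			/* the whole input buffer */
	int			seen_offset;	/* how far we have counted so far */
	int			seen_lineno;	/* line number at seen_offset, 1-based */
} LocationTracker;

static int
lineno_at(LocationTracker *t, int offset)
{
	/*
	 * Offsets only move forward in a lexer, so prior lines are never
	 * rescanned: the total work is O(N) over the whole script.
	 */
	while (t->seen_offset < offset)
	{
		if (t->script[t->seen_offset] == '\n')
			t->seen_lineno++;
		t->seen_offset++;
	}
	return t->seen_lineno;
}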

regards, tom lane

From 542c081fd2fe2b60713b6fabbc59598c449f4c9e Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Tue, 25 Feb 2025 16:30:48 -0500
Subject: [PATCH v1] Get rid of O(N^2) script-parsing overhead in pgbench.

pgbench wants to record the starting line number of each command
in its scripts.  It was computing that by scanning from the script
start and counting newlines, so that O(N^2) work had to be done
for an N-command script.  In a script with 50K lines, this adds
up to about 10 seconds on my machine.

To add insult to injury, the results were subtly wrong, because
expr_scanner_offset() scanned to find the NUL that flex inserts
at the end of the current token --- and before the first yylex
call, no such NUL has been inserted.  So we ended by computing the
script's last line number not its first one.  This was visible only
in case of \gset at the start of a script, which perhaps accounts
for the lack of complaints.

To fix, steal an idea from plpgsql and track the most recently
identified line start as we advance through the script.  This
is potentially still O(N^2) with sufficiently long lines, but
that seems unlikely to be a problem in practice.  Also adjust
a couple of other places that were invoking scans from script
start when they didn't really need to.  I made a new psqlscan
function psql_scan_get_location() that replaces both
expr_scanner_offset() and expr_scanner_get_lineno(), since
in practice expr_scanner_get_lineno() was only being invoked
to find the line number of the current parsing offset.

Reported-by: Daniel Verite 
Author: Tom Lane 
Discussion: https://postgr.es/m/84a8a89e-adb8-47a9-9d34-c13f7150e...@manitou-mail.org
---
 src/bin/pgbench/exprscan.l  | 74 ++---
 src/bin/pgbench/pgbench.c   | 14 +++---
 src/bin/pgbench/pgbench.h   |  4 +-
 src/fe_utils/psqlscan.l | 54 +
 src/include/fe_utils/psqlscan.h |  3 ++
 src/include/fe_utils/psqlscan_int.h |  4 ++
 6 files changed, 93 insertions(+), 60 deletions(-)

diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
index 8943a52e9f0..b4800937691 100644
--- a/src/bin/pgbench/exprscan.l
+++ b/src/bin/pgbench/exprscan.l
@@ -271,10 +271,14 @@ void
 expr_yyerror_more(yyscan_t yyscanner, const char *message, const char *more)
 {
 	PsqlScanState state = yyget_extra(yyscanner);
-	int			error_detection_offset = expr_scanner_offset(state) - 1;
+	int			lineno;
+	int			error_detection_offset;
 	YYSTYPE		lval;
 	char	   *full_line;
 
+	psql_scan_get_location(state, &lineno, &error_detection_offset);
+	error_detection_offset--;
+
 	/*
 	 * While parsing an expression, we may not have collected the whole line
 	 * yet from the input source.  Lex till EOL so we can report whole line.
@@ -289,7 +293,6 @@ expr_yyerror_more(yyscan_t yyscanner, const char *message, const char *more)
 	/* Extract the line, trimming trailing newline if any */
 	full_line = expr_scanner_get_substring(state,
 		   expr_start_offset,
-		   expr_scanner_offset(state),
 		   true);
 
 	syntax_error(expr_source, expr_lineno, full_line, expr_

Re: Parallel heap vacuum

2025-02-25 Thread Melanie Plageman
On Tue, Feb 25, 2025 at 5:14 PM Masahiko Sawada  wrote:
>
> Given that we have only about one month until the feature freeze, I
> find that it's realistic to introduce either one parallelism for PG18
> and at least we might want to implement the one first that is more
> beneficial and helpful for users. Since we found that parallel phase
> III is not very efficient in many cases, I'm thinking that in terms of
> PG18 development, we might want to switch focus to parallel phase I,
> and then go for phase III if we have time.

Okay, well let me know how I can be helpful. Should I be reviewing a
version that is already posted?

- Melanie




Re: pgbench client-side performance issue on large scripts

2025-02-25 Thread Tom Lane
I wrote:
> I got nerd-sniped by this question and spent some time looking into
> it.

On second look, I'd failed to absorb your point about how the main
loop of ParseScript doesn't need the line number at all; only if
it's a backslash command are we going to use that.  So we can
move the calculation to be done only after we see a backslash.

I'd spent a little time worrying about how the calculation was
really giving a wrong line number: typically, it'd return the
line number of the previous semicolon, since we haven't lexed
any further than that.  That could be fixed with more code,
but it's pretty pointless if we don't need the value in the
first place.

Also, I did a tiny bit of micro-optimization in the first
patch to remove the hazard that it'd still be O(N^2) with
very long input lines.

regards, tom lane

From 116fe4191f81736f55d851de9115218d3e3f132c Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Tue, 25 Feb 2025 18:13:51 -0500
Subject: [PATCH v2 1/2] Get rid of O(N^2) script-parsing overhead in pgbench.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

pgbench wants to record the starting line number of each command
in its scripts.  It was computing that by scanning from the script
start and counting newlines, so that O(N^2) work had to be done
for an N-command script.  In a script with 50K lines, this adds
up to about 10 seconds on my machine.

To add insult to injury, the results were subtly wrong, because
expr_scanner_offset() scanned to find the NUL that flex inserts
at the end of the current token --- and before the first yylex
call, no such NUL has been inserted.  So we ended by computing the
script's last line number not its first one.  This was visible only
in case of \gset at the start of a script, which perhaps accounts
for the lack of complaints.

To fix, steal an idea from plpgsql and track the current lexer
ending position and line count as we advance through the script.
(It's a bit simpler than plpgsql since we can't need to back up.)
Also adjust a couple of other places that were invoking scans
from script start when they didn't really need to.  I made a new
psqlscan function psql_scan_get_location() that replaces both
expr_scanner_offset() and expr_scanner_get_lineno(), since in
practice expr_scanner_get_lineno() was only being invoked to find
the line number of the current lexer end position.

Reported-by: Daniel Vérité 
Author: Tom Lane 
Discussion: https://postgr.es/m/84a8a89e-adb8-47a9-9d34-c13f7150e...@manitou-mail.org
---
 src/bin/pgbench/exprscan.l  | 74 ++---
 src/bin/pgbench/pgbench.c   | 14 +++---
 src/bin/pgbench/pgbench.h   |  4 +-
 src/fe_utils/psqlscan.l | 54 +
 src/include/fe_utils/psqlscan.h |  3 ++
 src/include/fe_utils/psqlscan_int.h |  4 ++
 6 files changed, 93 insertions(+), 60 deletions(-)

diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
index 8943a52e9f0..b4800937691 100644
--- a/src/bin/pgbench/exprscan.l
+++ b/src/bin/pgbench/exprscan.l
@@ -271,10 +271,14 @@ void
 expr_yyerror_more(yyscan_t yyscanner, const char *message, const char *more)
 {
 	PsqlScanState state = yyget_extra(yyscanner);
-	int			error_detection_offset = expr_scanner_offset(state) - 1;
+	int			lineno;
+	int			error_detection_offset;
 	YYSTYPE		lval;
 	char	   *full_line;
 
+	psql_scan_get_location(state, &lineno, &error_detection_offset);
+	error_detection_offset--;
+
 	/*
 	 * While parsing an expression, we may not have collected the whole line
 	 * yet from the input source.  Lex till EOL so we can report whole line.
@@ -289,7 +293,6 @@ expr_yyerror_more(yyscan_t yyscanner, const char *message, const char *more)
 	/* Extract the line, trimming trailing newline if any */
 	full_line = expr_scanner_get_substring(state,
 		   expr_start_offset,
-		   expr_scanner_offset(state),
 		   true);
 
 	syntax_error(expr_source, expr_lineno, full_line, expr_command,
@@ -336,12 +339,15 @@ expr_lex_one_word(PsqlScanState state, PQExpBuffer word_buf, int *offset)
 	/* And lex. */
 	lexresult = yylex(&lval, state->scanner);
 
-	/*
-	 * Save start offset of word, if any.  We could do this more efficiently,
-	 * but for now this seems fine.
-	 */
+	/* Save start offset of word, if any. */
 	if (lexresult)
-		*offset = expr_scanner_offset(state) - word_buf->len;
+	{
+		int			lineno;
+		int			end_offset;
+
+		psql_scan_get_location(state, &lineno, &end_offset);
+		*offset = end_offset - word_buf->len;
+	}
 	else
 		*offset = -1;
 
@@ -404,65 +410,35 @@ expr_scanner_finish(yyscan_t yyscanner)
 }
 
 /*
- * Get offset from start of string to end of current lexer token.
+ * Get a malloc'd copy of the lexer input string from start_offset
+ * to end of current lexer token.  If chomp is true, drop any trailing
+ * newline(s).
  *
  * We rely on the knowledge that flex modifies the scan buffer by storing
  * a NUL at the end 

Re: Statistics Import and Export

2025-02-25 Thread Jeff Davis
On Tue, 2025-02-25 at 15:31 -0500, Corey Huinker wrote:
> Documentation:
> 
> +         The currently-supported relation statistics are
> +         relpages with a value of type
> +         integer, reltuples with a
> value of
> +         type real, and
> relallvisible with a
> +         value of type integer.
> 
> Could we make this a bullet-list? Same for the required attribute
> stats and optional attribute stats. I think it would be more eye-
> catching and useful to people skimming to recall the name of a
> parameter, which is probably what most people will do after they've
> read it once to get the core concepts.

I couldn't make that look quite right. These functions are mostly for
use by pg_dump, and while documentation is necessary, I don't think we
should go so far as to make it "eye-catching". At least not until
things settle a bit.

> Question:
> 
> Do we want to re-compact the oids we consumed in pg_proc.dat?

Done.

> Specifically missing are:
> 
> * regclass not found
> * attribute is system column
> * scalars can't have mcelem
> * mcelem / mcelem freqs mismatch (parts 1 and 2)
> * scalars can't have elem_count_histogram
> * cannot set most_common_elems for range type

Done.

And committed.

Regards,
Jeff Davis





RE: Improve CRC32C performance on SSE4.2

2025-02-25 Thread Devulapalli, Raghuveer
> I agree it would be preferable to make a centralized check work.

Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid 
checks into one single place including the AVX512 popcount code. Any new 
feature that requires CPUID information can access that information with 
pg_cpu_have[FEATURE] defined in pg_cpucap.h and initialized in pg_cpucap.c. v8 
also had a typo in configure files which caused a build failure. It's fixed in
v9.

Pretty sure the ARM code paths need some correction. Let me know what you think.
 
> Correct me if I'm misunderstanding, but this sounds like in every frontend
> program we'd need to know what the first call was, which seems less
> maintainable than just initializing at the start of every frontend program.

No, sorry for the confusion, but that is not what I meant. Let's ignore the
attribute constructor for now. We can probably revisit this at a later point.

Raghuveer


v9-0001-Dispatch-CRC-computation-by-branching-rather-than.patch
Description: v9-0001-Dispatch-CRC-computation-by-branching-rather-than.patch


v9-0002-Rename-CRC-choose-files-to-cpucap-files.patch
Description: v9-0002-Rename-CRC-choose-files-to-cpucap-files.patch


v9-0003-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
Description: v9-0003-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch


v9-0004-Improve-CRC32C-performance-on-x86_64.patch
Description: v9-0004-Improve-CRC32C-performance-on-x86_64.patch


v9-0005-Move-all-cpuid-checks-to-one-location.patch
Description: v9-0005-Move-all-cpuid-checks-to-one-location.patch


Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2025-02-25 Thread Masahiko Sawada
On Tue, Feb 25, 2025 at 3:03 PM Dagfinn Ilmari Mannsåker
 wrote:
>
> Hi,
>
> While working on another round of the long option and fat comma style
> cleanup, I noticed that the test for pg_upgrade --set-char-signedness
> doesn't test what it's supposed to:
>
> Masahiko Sawada  writes:
>
> > diff --git a/src/bin/pg_upgrade/t/005_char_signedness.pl 
> > b/src/bin/pg_upgrade/t/005_char_signedness.pl
> > index 05c3014a27d..c024106863e 100644
> > --- a/src/bin/pg_upgrade/t/005_char_signedness.pl
> > +++ b/src/bin/pg_upgrade/t/005_char_signedness.pl
> > @@ -40,6 +40,23 @@ command_like(
> >   qr/Default char data signedness:\s+unsigned/,
> >   'updated default char signedness is unsigned in control file');
> >
> > +# Cannot use --set-char-signedness option for upgrading from v18+
> > +command_fails(
> > + [
> > + 'pg_upgrade', '--no-sync',
> > + '-d', $old->data_dir,
> > + '-D', $new->data_dir,
> > + '-b', $old->config_data('--bindir'),
> > + '-B', $new->config_data('--bindir'),
> > + '-s', $new->host,
> > + '-p', $old->port,
> > + '-P', $new->port,
> > + '-set-char-signedness', 'signed',
>
> This is missing a dash, which causes the command to fail, but for the
> wrong reason.  pg_upgrade seems to print all its errors on stdout, which
> I guess is why the tests use plain command_fails() instead of
> command_fails_like(). However, we have another function to deal with
> this: command_checks_all().  Attached are patches that fix the above
> test, and also convert the other command_fails() calls in the pg_upgrade
> tests to test for specific messages.

Thank you for the report. I agree with both points.

I believe that replacing command_fails() with command_checks_all() is
an improvement, whereas adding the missing dash to the
--set-char-signedness option is a bug fix. How about reorganizing
the patches as follows?

- 0001 patch just adds the dash to the --set-char-signedness.
- 0002 patch uses command_checks_all() in 002_pg_upgrade.pl,
004_subscription.pl, and 005_char_signedness.pl for checking the
output better.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: psql \dh: List High-Level (Root) Tables and Indexes

2025-02-25 Thread Sadeq Dousti
Dear Greg,

Thank you so much for the kind and prompt review!

Please find the patches attached. The second patch (0002) is where I
applied the requested changes.

Best regards,
Sadeq Dousti


On Wed, Feb 26, 2025 at 1:01 AM Greg Sabino Mullane 
wrote:

> The patch applies cleanly, and works as advertised. Nice work!
>
> Quick notes:
>
> * doc/src/sgml/ref/psql-ref.sgml
>
> In the varlistentry section, the order should be the same as the other
> places (N after m)
>
> Line 1644 has an extra comma
>
> Line 1651, maybe the example is simpler as \dNt to keep the wording
> better, because "indexes that are not partitions" looks odd.
>
> These bits:
>
>   pg_log_error("Did not find any%s relations named \"%s\".",
> no_partition_description, pattern);
>
> are not good for translation. We want things simple with replaceable
> args/constants, but not replaceable words.
>
> I think the myopt.title ones are fine.
>
> * bin/psql/help.c:
>
> \\dN[Sx+] [PATTERN] list relation, table, index (no partitions)
>
> better as:
>
> \\dN[Sx+] [PATTERN] list tables and indexes (no partitions)
>
> Cheers,
> Greg
>
> --
> Crunchy Data - https://www.crunchydata.com
> Enterprise Postgres Software Products & Tech Support
>
>
From 29a59cb46665312810c9a73a6964afd3db0e03f6 Mon Sep 17 00:00:00 2001
From: Sadeq Dousti <3616518+msdou...@users.noreply.github.com>
Date: Wed, 26 Feb 2025 01:23:52 +0100
Subject: [PATCH v4 2/2] Apply Greg's comments

---
 doc/src/sgml/ref/psql-ref.sgml |  8 
 src/bin/psql/describe.c| 19 +--
 src/bin/psql/help.c|  2 +-
 3 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index ea124e5a4d5..37b516fc558 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1635,8 +1635,8 @@ SELECT $1 \parse stmt1
   
 \dE[Sx+] [ pattern ]
 \di[Sx+] [ pattern ]
-\dN[Sx+] [ pattern ]
 \dm[Sx+] [ pattern ]
+\dN[Sx+] [ pattern ]
 \ds[Sx+] [ pattern ]
 \dt[Sx+] [ pattern ]
 \dv[Sx+] [ pattern ]
@@ -1645,15 +1645,15 @@ SELECT $1 \parse stmt1
 
 In this group of commands, the letters E,
 i, m, N,
-, s, t, and v
+s, t, and v
 stand for foreign table, index, materialized view, no partitions,
 sequence, table, and view,
 respectively.
 You can specify any or all of
 these letters, in any order, to obtain a listing of objects
 of these types.  For example, \dti lists
-tables and indexes, and \dNti lists
-tables and indexes that are not partitions of any other relation.
+tables and indexes, and \dNt lists
+tables that are not partitions of any other relation.
 If x is appended to the command name, the results
 are displayed in expanded mode.
 If + is
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index ba1d6c09bc9..d4be468de55 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4024,7 +4024,6 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	printQueryOpt myopt = pset.popt;
 	int			cols_so_far;
 	bool		translate_columns[] = {false, false, true, false, false, false, false, false, false};
-	char	   *no_partition_description = showNoPartitions ? " non-partition" : "";
 
 	/*
 	 * Note: Declarative table partitioning is only supported as of Pg 10.0.
@@ -4205,14 +4204,14 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 		if (pattern)
 		{
 			if (ntypes != 1)
-pg_log_error("Did not find any%s relations named \"%s\".",
-			 no_partition_description, pattern);
+pg_log_error("Did not find any relations named \"%s\".",
+			 pattern);
 			else if (showTables)
-pg_log_error("Did not find any%s tables named \"%s\".",
-			 no_partition_description, pattern);
+pg_log_error("Did not find any tables named \"%s\".",
+			 pattern);
 			else if (showIndexes)
-pg_log_error("Did not find any%s indexes named \"%s\".",
-			 no_partition_description, pattern);
+pg_log_error("Did not find any indexes named \"%s\".",
+			 pattern);
 			else if (showViews)
 pg_log_error("Did not find any views named \"%s\".",
 			 pattern);
@@ -4232,11 +4231,11 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 		else
 		{
 			if (ntypes != 1)
-pg_log_error("Did not find any%s relations.", no_partition_description);
+pg_log_error("Did not find any relations.");
 			else if (showTables)
-pg_log_error("Did not find any%s tables.", no_partition_description);
+pg_log_error("Did not find any tables.");
 			else if (showIndexes)
-pg_log_error("Did not find any%s indexes.", no_partition_description);
+pg_log_error("Did not find any indexes.");
 			else if (showViews)
 pg_log_error("Did not find any view

Re: Make COPY format extendable: Extract COPY TO format implementations

2025-02-25 Thread Masahiko Sawada
On Thu, Feb 20, 2025 at 6:48 PM Sutou Kouhei  wrote:
>
> Hi,
>
> In 
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Thu, 20 Feb 2025 15:28:26 -0800,
>   Masahiko Sawada  wrote:
>
> > Looking at the 0001 patch again, I have a question: we have
> > CopyToTextLikeOneRow() for both CSV and text format:
> >
> > +/* Implementation of the per-row callback for text format */
> > +static void
> > +CopyToTextOneRow(CopyToState cstate, TupleTableSlot *slot)
> > +{
> > +   CopyToTextLikeOneRow(cstate, slot, false);
> > +}
> > +
> > +/* Implementation of the per-row callback for CSV format */
> > +static void
> > +CopyToCSVOneRow(CopyToState cstate, TupleTableSlot *slot)
> > +{
> > +   CopyToTextLikeOneRow(cstate, slot, true);
> > +}
> >
> > These two functions pass different is_csv value to that function,
> > which is used as follows:
> >
> > +   if (is_csv)
> > +   CopyAttributeOutCSV(cstate, string,
> > +
> >  cstate->opts.force_quote_flags[attnum - 1]);
> > +   else
> > +   CopyAttributeOutText(cstate, string);
> >
> > However, we can know whether the format is CSV or text by checking
> > cstate->opts.csv_mode instead of passing is_csv. That way, we can
> > directly call CopyToTextLikeOneRow() but not via CopyToCSVOneRow() or
> > CopyToTextOneRow(). It would not help performance since we already
> > inline CopyToTextLikeOneRow(), but it looks simpler.
>
> This means the following, right?
>
> 1. We remove CopyToTextOneRow() and CopyToCSVOneRow()
> 2. We remove "bool is_csv" parameter from CopyToTextLikeOneRow()
>and use cstate->opts.csv_mode in CopyToTextLikeOneRow()
>instead of is_csv
> 3. We use CopyToTextLikeOneRow() for
>CopyToRoutineText::CopyToOneRow and
>CopyToRoutineCSV::CopyToOneRow
>
> If we use this approach, we can't remove the following
> branch in compile time:
>
> +   if (is_csv)
> +   CopyAttributeOutCSV(cstate, string,
> +   
> cstate->opts.force_quote_flags[attnum - 1]);
> +   else
> +   CopyAttributeOutText(cstate, string);
>
> We can remove the branch in compile time with the current
> approach (constant argument + inline).
>
> It may have a negative performance impact because the "if"
> is used many times with large data. (That's why we choose
> the constant argument + inline approach in this thread.)
>

Thank you for the explanation, I missed that fact. I'm fine with having is_csv.
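For readers following along, the pattern at work is constant propagation
through an inline function; a stand-alone sketch with stand-in names:

#include <stdbool.h>

typedef struct State { int dummy; } State;		/* stand-in for CopyToState */

static void out_csv(State *s)  { (void) s; }	/* CopyAttributeOutCSV stand-in */
static void out_text(State *s) { (void) s; }	/* CopyAttributeOutText stand-in */

static inline void
one_row(State *s, bool is_csv)
{
	/*
	 * When every call site passes a compile-time constant, inlining
	 * lets the compiler fold this branch away in each specialization.
	 */
	if (is_csv)
		out_csv(s);
	else
		out_text(s);
}

/* per-format callbacks: constant argument, hence no runtime branch */
static void text_one_row(State *s) { one_row(s, false); }
static void csv_one_row(State *s)  { one_row(s, true); }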

The first two patches are refactoring patches (+ small performance
improvements). I've reviewed these patches again and attached the
updated patches. I reorganized the function order and updated comments
etc. I find that these patches are reasonably ready to push. Could you
review these versions? I'm going to push them, barring objections and
further comments.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v33-0002-Refactor-COPY-FROM-to-use-format-callback-functi.patch
Description: Binary data


v33-0001-Refactor-COPY-TO-to-use-format-callback-function.patch
Description: Binary data


Re: Statistics Import and Export

2025-02-25 Thread Corey Huinker
On Tue, Feb 25, 2025 at 9:00 PM Jeff Davis  wrote:

> On Mon, 2025-02-24 at 09:54 -0500, Andres Freund wrote:
> > Have you compared performance of with/without stats after these
> > optimizations?
>
> On unoptimized build with asserts enabled, dumping the regression
> database:
>
>   --no-statistics: 1.0s
>   master:  3.6s
>   v3j-0001:3.0s
>   v3j-0002:1.7s
>
> I plan to commit the patches soon.
>
> Regards,
> Jeff Davis
>
>
+1 from me

We can still convert the "EXECUTE getAttributeStats" call to a Params call,
but that involves creating an ExecuteSqlQueryParams(), which starts to
snowball in the changes required.
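For the archives, such a helper would presumably be a thin wrapper
around libpq's PQexecParams(); ExecuteSqlQueryParams() is hypothetical
here, and pg_dump's real error handling is more involved:

static PGresult *
ExecuteSqlQueryParams(PGconn *conn, const char *query,
					  int nparams, const char *const *values)
{
	PGresult   *res;

	res = PQexecParams(conn, query, nparams,
					   NULL,		/* let the server infer parameter types */
					   values,
					   NULL, NULL,	/* text-format parameters */
					   0);			/* text-format result */

	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		pg_fatal("query failed: %s", PQerrorMessage(conn));

	return res;
}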


Re: Log connection establishment timings

2025-02-25 Thread Melanie Plageman
On Mon, Dec 16, 2024 at 6:26 PM Jelte Fennema-Nio  wrote:
>
> Two thoughts:
> 1. Would it make sense to also expose these timings in some pg_stat_xyz view?

So, right now there isn't an obvious place this would fit. There are
dynamic statistics views that are per connection -- like
pg_stat_ssl/gssapi. We could perhaps add authentication duration to
one of these. But I don't think this really helps users with the use
case I imagine for connection establishment logging. I thought that
people would look at the logs to see how their connection times have
changed over time -- for example if authentication started taking
longer. This also means a cumulative statistics view wouldn't be very
helpful. We could add a view to calculate connection establishment
median, standard deviation, average, etc, but that doesn't seem as
useful as the history of timings and how they have changed over that
time.

> 2. As a user I'd be curious to know how much of the time is spent on
> the network/client vs inside postgres. For example for the scram/sasl
> handshake, how much of the authentication_time is spent waiting on the
> first "read" after the server has called sendAuthRequest.

Well, with the current patch you can compare authentication duration
to total connection establishment duration to see how much time was
spent on authentication vs the rest of connection establishment.
Or do you mean adding another more granular level of timing: after
sendAuthRequest until PerformAuthentication() returns?
I'm not totally clear what "read" means in this context.

- Melanie




Re: [PATCH] Refactor SLRU to always use long file names

2025-02-25 Thread Aleksander Alekseev
Hi,

> > Here is an updated patch. The steps to test it manually are as follows.
> >
> > Compile and install PostgreSQL from the REL_17_STABLE branch:
> >
> > [...]
> >
> > As always, your feedback and suggestions are most welcomed.
>
> Rebased.

Rebased.

--
Best regards,
Aleksander Alekseev
From 43e7cdccf1aa96e6522a41518bf0437dda23cc4d Mon Sep 17 00:00:00 2001
From: Aleksander Alekseev 
Date: Wed, 11 Sep 2024 13:17:33 +0300
Subject: [PATCH v4] Always use long SLRU segment file names

PG17 introduced long SLRU segment file names (commit 4ed8f0913bfd). We used
short or long file names depending on SlruCtl->long_segment_names. This commit
refactors SLRU to always use long file names in order to simplify the code.

Aleksander Alekseev, reviewed by Michael Paquier
Discussion: https://postgr.es/m/CAJ7c6TOy7fUW9MuNeOWor3cSFnQg9tgz=mjXHDb94GORtM_Eyg@mail.gmail.com

(!!!) bump catversion and change the corresponding TODO FIXME line in pg_upgrade.h
---
 src/backend/access/transam/clog.c   |  2 +-
 src/backend/access/transam/commit_ts.c  |  3 +-
 src/backend/access/transam/multixact.c  |  6 +-
 src/backend/access/transam/slru.c   | 73 
 src/backend/access/transam/subtrans.c   |  2 +-
 src/backend/commands/async.c|  2 +-
 src/backend/storage/lmgr/predicate.c|  2 +-
 src/bin/pg_upgrade/pg_upgrade.c | 74 +
 src/bin/pg_upgrade/pg_upgrade.h |  6 ++
 src/bin/pg_verifybackup/t/003_corruption.pl |  2 +-
 src/include/access/slru.h   | 10 +--
 src/test/modules/test_slru/test_slru.c  |  8 +--
 12 files changed, 104 insertions(+), 86 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 48f10bec91e..f130403ac4b 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -810,7 +810,7 @@ CLOGShmemInit(void)
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "transaction", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
   "pg_xact", LWTRANCHE_XACT_BUFFER,
-  LWTRANCHE_XACT_SLRU, SYNC_HANDLER_CLOG, false);
+  LWTRANCHE_XACT_SLRU, SYNC_HANDLER_CLOG);
 	SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE);
 }
 
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 113fae1437a..59535526823 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -556,8 +556,7 @@ CommitTsShmemInit(void)
 	SimpleLruInit(CommitTsCtl, "commit_timestamp", CommitTsShmemBuffers(), 0,
   "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER,
   LWTRANCHE_COMMITTS_SLRU,
-  SYNC_HANDLER_COMMIT_TS,
-  false);
+  SYNC_HANDLER_COMMIT_TS);
 	SlruPagePrecedesUnitTests(CommitTsCtl, COMMIT_TS_XACTS_PER_PAGE);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index c1e2c42e1bb..b3b11c1a0ad 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1974,15 +1974,13 @@ MultiXactShmemInit(void)
   "multixact_offset", multixact_offset_buffers, 0,
   "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER,
   LWTRANCHE_MULTIXACTOFFSET_SLRU,
-  SYNC_HANDLER_MULTIXACT_OFFSET,
-  false);
+  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
 	SimpleLruInit(MultiXactMemberCtl,
   "multixact_member", multixact_member_buffers, 0,
   "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER,
   LWTRANCHE_MULTIXACTMEMBER_SLRU,
-  SYNC_HANDLER_MULTIXACT_MEMBER,
-  false);
+  SYNC_HANDLER_MULTIXACT_MEMBER);
 	/* doesn't call SimpleLruTruncate() or meet criteria for unit tests */
 
 	/* Initialize our shared state struct */
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9ce628e62a5..cc078d2e6ce 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -77,42 +77,22 @@
  *
  * "path" should point to a buffer at least MAXPGPATH characters long.
  *
- * If ctl->long_segment_names is true, segno can be in the range [0, 2^60-1].
- * The resulting file name is made of 15 characters, e.g. dir/123456789ABCDEF.
- *
- * If ctl->long_segment_names is false, segno can be in the range [0, 2^24-1].
- * The resulting file name is made of 4 to 6 characters, as of:
- *
- *  dir/1234   for [0, 2^16-1]
- *  dir/12345  for [2^16, 2^20-1]
- *  dir/123456 for [2^20, 2^24-1]
+ * segno can be in the range [0, 2^60-1]. The resulting file name is made
+ * of 15 characters, e.g. dir/123456789ABCDEF.
  */
 static inline int
 SlruFileName(SlruCtl ctl, char *path, int64 segno)
 {
-	if (ctl->long_segment_names)
-	{
-		/*
-		 * We could use 16 characters here but the disadvantage would be that
-		 * the SLRU segments will be hard to distinguish from WAL segments.
-		 *

Re: pgbench client-side performance issue on large scripts

2025-02-25 Thread Daniel Verite
Tom Lane wrote:

> > I see "pgbench -f 50k-select.sql" taking about 5.8 secs of CPU time,
> > out of a total time of 6.7 secs. When run with perf, this profile shows up:
> 
> You ran only a single execution of a 50K-line script?  This test
> case feels a little bit artificial.  Having said that ...

I guess my use case is unusual, otherwise the O(N^2) parse
time would have been noticed sooner, but it's genuine.

I want to see how much a long sequence of statement inside a
pipeline differs between pgbench and psql with
the proposed patch at [1] that implements pipelining in psql.
psql would not actually send statements until \endpipeline is
reached, whereas pgbench does.
In fact I'd be inclined to push much more statements in the pipeline
than 50k, but then the parse time issue really kicks in.

For the moment I'll stay with my quick fix, then l'll try
to come up with something to replace expr_scanner_get_lineno() .


[1] https://commitfest.postgresql.org/patch/5407/

Best regards,
-- 
Daniel Vérité 
https://postgresql.verite.pro/




Re: Adjusting hash join memory limit to handle batch explosion

2025-02-25 Thread Tomas Vondra
On 2/25/25 17:30, James Hunter wrote:
> On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra  wrote:
>>
>> I've pushed the first (and main) part of the patch series, after some
>> more cleanup and comment polishing.
> 
> Two comments on your merged patch --
> 
> First, it's easier to see what's going on if we overlook the logic to
> round to nearest power of two, and solve the optimization problem
> algebraically. Let T = the total memory needed to hash all input rows,
> and B = the size of per-batch metadata (= 2 * BLKSIZE, which is
> typically 16 KB). Then, solving the optimization problem, the minimum
> memory usage occurs at n = nbatches = SQRT(T / B) and w = workmem =
> SQRT(B * T).
> 
> (Here I am using "workmem" for the hash table's "space_allowed.")
> 
> The total working memory used, at the minimum, is always 2 * w: twice
> the optimal "workmem" ("space_allowed").
> 
> This says that the maximum input size that can be (optimally) hashed
> with the default 8 MB workmem (= work_mem * hash_mem_multiplier) is 4
> GB, and the total working memory used would actually be 16 MB.
> 
> Also, to hash 64 GB, or 16x as much, requires a 32 MB workmem, with 64
> MB of total working memory used. So "workmem" grows with the SQRT of
> T, the total hash memory needed; and total working memory is 2x
> "workmem."
> 

Yes, this is a nice way to explain the issue, and how we solve it. It's
probably better than the comment in my commit, I guess.
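(Spelling out the algebra for the archives, in James's notation: total
working memory is f(w) = w + n·B with n = T/w batches, so

    f(w) = w + (T/w)·B
    f'(w) = 1 − B·T/w² = 0   ⟹   w = √(B·T),  n = T/w = √(T/B)

and at the optimum f(w) = √(B·T) + √(B·T) = 2w, which is exactly the
"total working memory used is 2x workmem" observation above.  Plugging
in B = 16 KB: w = 8 MB gives T = w²/B = 4 GB, and T = 64 GB gives
w = 32 MB, matching James's numbers.)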

> Second -- the algebraic solution illustrates the difficulty in
> tracking and restricting working memory usage for Hash Joins! Your
> patch improves the "hash 64 GB" situation, because it eliminates 96 GB
> of per-batch metadata, by reducing n = nbatches from 8192 to 2048, at
> a cost of only 24 MB of workmem. Using the default 8 MB workmem,
> *actual* total working memory used would be 8 MB + 16 KB * (64 GB / 8
> MB) = 136 MB. By increasing workmem to 32 MB, total working memory is
> only 64 MB; so we save 72 MB overall. This is a good thing, but--
> 

Agreed.

> The "but" is that the customer really should have set their workmem to
> 64 MB, in the first place; and we should have taken half of that for
> the hash table, and left the other half for per-batch metadata.
> 
> -- OK,  but historically we have pretended that the per-batch metadata
> used no memory. So the customer should have set their workmem to 32
> MB, with the understanding that PostgreSQL would have actually used 64
> MB...
> 

Sure, we could have considered the per-batch metadata during planning.
And if we find we can't run the hash join, we'd "disable" (penalize) it
in some way. No argument there.

But that assumes we correctly estimate the number of batches during
planning, and it's easy to get that wrong. E.g. the nbatch explosion
cases are a good example.

And that's what my patch was aiming to improve. It does not matter how
the user sets the work_mem GUC, really.

> -- OK, but the customer *didn't* set their workmem to 32 MB. (If they
> had, we wouldn't need this patch -- but we *do* need this patch, which
> means the customer hasn't set their workmem high enough.) Why not?
> Well, because if they set it to 32 MB, they'd run OOM!
> 

Not sure I follow the reasoning here :-( If the query completed with a
lower work_mem value, it should complete with work_mem = 32MB, because
that reduces the amount of memory needed. But yes, it's possible they
hit OOM in both cases, it's an attempt to reduce the impact.

> -- So we are (secretly!) increasing the customer's workmem to 32 MB,
> but only for this particular Hash Join. The customer can't increase it
> to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just
> for this Hash Join, in the hopes that by doing so we'll avoid running
> OOM... which is good; but we don't *tell* the customer we've done
> this, and we just hope that the customer actually has 64 MB (= 2x
> workmem) free (because, if they don't, they'll run OOM anyway).
> 

Right. This is meant to be a best-effort mitigation for rare cases.

Maybe we should track/report it somehow, though. I mean, if 1% of hash
joins need this, you're probably fine. If 99% of hash joins hit it, you
probably really need a higher work_mem value because the hashed relation
is just too large.

But you have a point - maybe we should track/report this somewhere.
First step would be to make the total memory usage better visible in
explain (it's not obvious it does not include the per-batch metadata).

> All of this is to say that this patch illustrates the need for
> something like proposal [1], which allows PostgreSQL to set workmem
> limits on individual execution nodes, based on the optimizer's memory
> estimates. In the above patch, we're blindly making things better,
> without knowing whether we've made them good enough. (The customer is
> less likely to run OOM using 64 MB instead of 136 MB, but OOM is still
> possible since their workmem limit is 8 MB!)
> 
> In v.next of my patchset at [1] (should be done by end of day today) I
> will deal with the case discussed above by:
> 
> 1. Doubling Plan.workmem_limit whenever we halve nbatches (so we track
> the "workmem" needed by the hash table);
> 2. Displaying Plan.workmem_limit + Hash.nbatches * (2 * BLCKSIZE),
> inside EXPLAIN (work_mem on), (so we display to the customer our best
> estimate of the effective workmem limit).

Re: Fix logging for invalid recovery timeline

2025-02-25 Thread David Steele

On 2/25/25 04:07, Benoit Lobréau wrote:

On 2/24/25 11:33 PM, Michael Paquier wrote:

On Mon, Feb 24, 2025 at 05:30:35PM +, David Steele wrote:

+    /* translator: %s is a backup_label or
pg_control file */


See for example PostmasterMain() with the "/* translator: %s is a
configuration file */".


Thank you Michael and David.
I never paid attention to those...


This looks good to me.

Regards,
-David




Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.

2025-02-25 Thread Shubham Khanna
On Tue, Feb 25, 2025 at 4:50 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shubham,
>
> Thanks for updating the patch. Few comments.
>
> 01. CreateSubscriberOptions
> ```
> +   bool    cleanup_existing_publications;  /* drop all publications */
> ```
>
> My previous comment [1] #1 did not intend to update attributes. My point was
> only for the third argument of setup_subscriber().
>

Fixed.

> 02. setup_subscriber()
> ```
> +   pg_log_info("cleaning up existing publications on the 
> subscriber");
> ```
>
> I don't think this output is needed. check_and_drop_existing_publications() 
> has the similar output.
>

Fixed.

> 03. drop_publication_by_name()
> ```
> +
> +   appendPQExpBuffer(query, "DROP PUBLICATION %s", pubname_esc);
> +   pg_log_info("dropping publication %s in database \"%s\"", pubname, 
> dbname);
> +   pg_log_debug("command is: %s", query->data);
>
> ```
>
> Currently, appendPQExpBuffer() is done after the pg_log_info(). Please
> preserve the current ordering if there is no reason to change it. Also,
> the variable "str" is currently used instead of "query"; please follow
> that.
>

Fixed.

> 04. drop_publication_by_name()
> ```
> if (!dry_run)
> {
> -   res = PQexec(conn, str->data);
> +   res = PQexec(conn, query->data);
> if (PQresultStatus(res) != PGRES_COMMAND_OK)
> +   dbinfo->made_publication = false;
> +   else
> {
> -   pg_log_error("could not drop publication \"%s\" in 
> database \"%s\": %s",
> -dbinfo->pubname, 
> dbinfo->dbname, PQresultErrorMessage(res));
> -   dbinfo->made_publication = false;   /* don't try 
> again. */
> -
> -   /*
> -* Don't disconnect and exit here. This routine is 
> used by primary
> -* (cleanup publication / replication slot due to an 
> error) and
> -* subscriber (remove the replicated publications). 
> In both cases,
> -* it can continue and provide instructions for the 
> user to remove
> -* it later if cleanup fails.
> -*/
> +   pg_log_error("could not drop publication %s in 
> database \"%s\": %s",
> +pubname, dbname, 
> PQresultErrorMessage(res));
> }
> ```
>
> pg_log_error() is executed when the command succeeds: this is not correct.
>

Fixed.

> 05. 040_pg_createsubscriber.pl
> ```
> +# Set up node A as primary
> +my $node_a = PostgreSQL::Test::Cluster->new('node_a');
> +my $aconnstr = $node_a->connstr;
> +$node_a->init(allows_streaming => 'logical');
> +$node_a->append_conf('postgresql.conf', 'autovacuum = off');
> +$node_a->start;
> ```
>
> I don't think a new primary is needed. Can't you reuse node_p?
>

Fixed.

The attached patch contains the suggested changes.

Thanks and regards,
Shubham Khanna.


v10-0001-Support-for-dropping-all-publications-in-pg_crea.patch
Description: Binary data


Re: Update docs for UUID data type

2025-02-25 Thread Laurenz Albe
On Mon, 2025-02-24 at 21:04 -0500, Andy Alsup wrote:
> Please find the attached patch, which only addresses the UUID functions
> (in table format). I appreciate the comments related to the UUID datatype.
> If you feel like the additional content didn't add clarity, I certainly won't 
> argue.

Your patch looks good to me.

I didn't mean that adding more information about the "uuid" data type is
a bad thing.  Perhaps that additional paragraph could be

RFC 9562 defines 8 different UUID versions.  Each version has specific
requirements for generating new UUID values, and each version provides
distinct benefits and drawbacks.  PostgreSQL provides native support for
generating UUIDs using the UUIDv4 and UUIDv7 algorithms.  Alternatively,
UUID values can be generated outside of the database using any algorithm.
The data type uuid can be used to store any UUID, regardless of the origin
and the UUID version.

I would be happy if you added something like that again.

Yours,
Laurenz Albe




Re: [PoC] Federated Authn/z with OAUTHBEARER

2025-02-25 Thread Jacob Champion
On Mon, Feb 24, 2025 at 2:02 PM Jacob Champion
 wrote:
> Fair enough. I'll work on a patch to disallow it; best case, no one
> ever complains, and we've pruned an entire configuration from the list
> of things to worry about.

Here goes:

- 0001 fails configuration if the AsynchDNS feature is not built into libcurl.
- 0002 removes EINTR references from the validator documentation and
instead points authors towards our internal Wait APIs.
- 0003 is an optional followup to the const changes from upthread:
there's no need to memcpy() now, and anyone reading the code without
the history might wonder why I chose such a convoluted way to copy a
struct. :D

WDYT?

--Jacob


0002-oauth-Improve-validator-docs-on-interruptibility.patch
Description: Binary data


0001-oauth-Disallow-synchronous-DNS-in-libcurl.patch
Description: Binary data


0003-oauth-Simplify-copy-of-PGoauthBearerRequest.patch
Description: Binary data


Re: Redact user password on pg_stat_statements

2025-02-25 Thread Sami Imseih
>
> Well sure, but best effort is better than no effort at all. Preventing 
> CREATE/ALTER will catch 99% of items, and as I advocated, there really is no 
> reason for them to be in pg_stat_statements in the first place.
>
>>
>> The clients that set passwords could simply turn off track_utility on a 
>> user/transaction level while they are performing the action with
>> sensitive data.
>
>
> Good point, but that relies on the client to do the right thing, and requires 
> two extra steps.

Yes, I think relying on the client to do the right thing is a correct strategy.

> We already have things like \password in psql.  The most obvious
>helper feature we could add for this on the server side is to allow
> the password to be an out-of-line parameter:

>alter role joe set password $1;

Giving the client the ability to parametrize DDL commands seems like a good
idea overall, and it gives the client a more robust way to deal with
sensitive passwords. Of course, even if something like this becomes possible,
the responsibility is still on the client to ensure that they are not
logging the parameter values. So, the client doing the right thing
is still required.
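
To illustrate (purely hypothetical: utility commands such as ALTER ROLE
cannot take bind parameters today, so this assumes the proposed
server-side support existed), a libpq caller could then send the password
out-of-line, keeping it out of the query text that pg_stat_statements and
the server log capture:

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Sketch only: "ALTER ROLE ... PASSWORD $1" is the syntax proposed in
 * this thread, not something the server accepts today.
 */
static int
set_role_password(PGconn *conn, const char *password)
{
	const char *values[1] = {password};
	PGresult   *res;
	int			ok;

	res = PQexecParams(conn, "ALTER ROLE joe PASSWORD $1",
					   1, NULL, values, NULL, NULL, 0);
	ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
	if (!ok)
		fprintf(stderr, "ALTER ROLE failed: %s", PQerrorMessage(conn));
	PQclear(res);
	return ok;
}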

--

Sami
Amazon Web Services (AWS)




Commit fest 2025-03

2025-02-25 Thread vignesh C
Hi Everyone,

The next PostgreSQL CommitFest is starting in a few days, and I wanted
to remind everyone to register their patches to ensure they receive
the necessary attention and feedback. I've noticed that a few patches
are currently unregistered. If you're still evaluating a POC or
finalizing the design, you can choose to add it later. However, if a
patch is in good shape for review, I encourage you to register it now
to maximize its chances of getting feedback.

Here are a few patches that I noticed are not registered:
1) Federated Authn/z with OAUTHBEARER - Author - Daniel Gustafsson
https://www.postgresql.org/message-id/flat/d1b467a78e0e36ed85a09adf979d04cf124a9d4b.camel%40vmware.com
2) Update docs for UUID data type - Author - Andy Alsup
https://www.postgresql.org/message-id/flat/64b18b5cf27cf13237aec6c9fbbf7720c7107419.camel%40cybertec.at#0d201d3c76021bf121b858b53803763b
3) Lock-free XLog Reservation from WAL - Author - Zhiguo
https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
4) Restrict copying of invalidated replication slots - Author - Shlok Kyal
https://www.postgresql.org/message-id/flat/CANhcyEU65aH0VYnLiu%3DOhNNxhnhNhwcXBeT-jvRe1OiJTo_Ayg%40mail.gmail.com
5) MAX_BACKENDS size (comment accuracy) - Author - Jacob Brazeal
https://www.postgresql.org/message-id/CA%2BCOZaBO_s3LfALq%3Db%2BHcBHFSOEGiApVjrRacCe4VP9m7CJsNQ%40mail.gmail.com
6) Simplify the logic a bit (src/bin/scripts/reindexdb.c)- Author -
Ranier Vilela
https://www.postgresql.org/message-id/CAEudQArfqr0-s0VVPSEh=0kgOgBJvFNdGW=xsl5rbcr0wdm...@mail.gmail.com
7) ReplicationSlotRelease() crashes when the instance is in the single
user mode - Author - Hayato Kuroda
https://www.postgresql.org/message-id/flat/OSCPR01MB14966ED588A0328DAEBE8CB25F5FA2%40OSCPR01MB14966.jpnprd01.prod.outlook.com
8) long-standing data loss bug in initial sync of logical replication
- Author -  Shlok Kyal
https://www.postgresql.org/message-id/CANhcyEXvcwHE54UEnatUKEs7v6jwRQQi_4qC64ngCfw1zTd5hg%40mail.gmail.com

If any of these are ready, please add them to the CommitFest page. If
you think some are trivial and don’t need to be added, that’s fine
too.
Looking forward to a productive CommitFest!

Regards,
Vignesh




Re: Parallel heap vacuum

2025-02-25 Thread Melanie Plageman
On Mon, Feb 24, 2025 at 8:15 PM Masahiko Sawada  wrote:
>
> What I can see from these results was that we might not benefit much
> from parallelizing phase III, unfortunately. Although in the best case
> the phase III got about 2x speedup, as for the total duration it's
> about only 10% speedup. My analysis for these results matches what
> John mentioned; phase III is already the fastest phase and accounts
> only ~10% of the total execution time, and the overhead of shared
> TidStore offsets the speedup of phase III.

So, are you proposing to drop the patches for parallelizing phase III
for now? If so, are you planning on posting a set of patches just to
parallelize phase I? I haven't looked at the prelim refactoring
patches to see if they have independent value. What do you think is
reasonable for us to try and do in the next few weeks?

> > The same commit that made the parallel scanning patch more difficult
> > should also reduce the risk of having a large amount of freezing work
> > at once in the first place. (There are still plenty of systemic things
> > that can go wrong, of course, and it's always good if unpleasant work
> > finishes faster.)
>
> I think that vacuum would still need to scan a large amount of blocks
> of the table especially when it is very large and heavily modified.
> Parallel heap vacuum (only phase I) would be beneficial in case where
> autovacuum could not catch up. And we might want to consider using
> parallel heap vacuum also in autovacuum while integrating it with
> eagar freeze scan.

I'd be interested to hear more about this.

- Melanie




Re: Restrict copying of invalidated replication slots

2025-02-25 Thread Masahiko Sawada
On Tue, Feb 25, 2025 at 2:36 AM Amit Kapila  wrote:
>
> On Tue, Feb 25, 2025 at 1:03 AM Masahiko Sawada  wrote:
> >
> > I've checked if this issue exists also on v15 or older, but IIUC it
> > doesn't exist, fortunately. Here is the summary:
> >
> > Scenario-1: the source gets invalidated before the copy function
> > fetches its contents for the first time. In this case, since the
> > source slot's restart_lsn is already an invalid LSN it raises an error
> > appropriately. In v15, we have only one slot invalidation reason, WAL
> > removal, therefore we always reset the slot's restart_lsn to
> > InvalidXlogRecPtr.
> >
> > Scenario-2: the source gets invalidated before the copied slot is
> > created (i.e., between first content copy and
> > create_logical/physical_replication_slot()). In this case, the copied
> > slot could have a valid restart_lsn value that however might point to
> > a WAL segment that might have already been removed. However, since
> > copy_restart_lsn will be an invalid LSN (=0), it's caught by the
> > following if condition:
> >
> > if (copy_restart_lsn < src_restart_lsn ||
> > src_islogical != copy_islogical ||
> > strcmp(copy_name, NameStr(*src_name)) != 0)
> > ereport(ERROR,
> > (errmsg("could not copy replication slot \"%s\"",
> > NameStr(*src_name)),
> >  errdetail("The source replication slot was
> > modified incompatibly during the copy operation.")));
> >
> > Scenario-3: the source gets invalidated after creating the copied slot
> > (i.e. after create_logical/physical_replication_slot()). In this case,
> > since the newly copied slot has the same restart_lsn as the source
> > slot, both slots are invalidated.
> >
>
> Which part of the code will cover Scenario-3? Shouldn't we give ERROR
> for Scenario-3 as well?

In scenario-3, the backend process executing
pg_copy_logical/physical_replication_slot() already holds the new
copied slot and its restart_lsn is the same or older than the source
slot's restart_lsn. Therefore, if the source slot is invalidated at
that timing, the copied slot is invalidated too, resulting in an error
by the backend.
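
For reference, the shape of the check the patches add in
copy_replication_slot() is roughly the following (paraphrased from the
patchset; exact names and wording may differ):

	/*
	 * Refuse to copy a source slot that was invalidated by the time we
	 * read its contents (sketch; first_slot_contents is the copy of the
	 * source slot taken under its spinlock).
	 */
	if (first_slot_contents.data.invalidated != RS_INVAL_NONE)
		ereport(ERROR,
				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				errmsg("cannot copy invalidated replication slot \"%s\"",
					   NameStr(*src_name)));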

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Adjusting hash join memory limit to handle batch explosion

2025-02-25 Thread James Hunter
On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra  wrote:
>
> I've pushed the first (and main) part of the patch series, after some
> more cleanup and comment polishing.

Two comments on your merged patch --

First, it's easier to see what's going on if we overlook the logic to
round to nearest power of two, and solve the optimization problem
algebraically. Let T = the total memory needed to hash all input rows,
and B = the size of per-batch metadata (= 2 * BLKSIZE, which is
typically 16 KB). Then, solving the optimization problem, the minimum
memory usage occurs at n = nbatches = SQRT(T / B) and w = workmem =
SQRT(B * T).

(Here I am using "workmem" for the hash table's "space_allowed.")

The total working memory used, at the minimum, is always 2 * w: twice
the optimal "workmem" ("space_allowed").

This says that the maximum input size that can be (optimally) hashed
with the default 8 MB workmem (= work_mem * hash_mem_multiplier) is 4
GB, and the total working memory used would actually be 16 MB.

Also, to hash 64 GB, or 16x as much, requires a 32 MB workmem, with 64
MB of total working memory used. So "workmem" grows with the SQRT of
T, the total hash memory needed; and total working memory is 2x
"workmem."

Second -- the algebraic solution illustrates the difficulty in
tracking and restricting working memory usage for Hash Joins! Your
patch improves the "hash 64 GB" situation, because it eliminates 96 GB
of per-batch metadata, by reducing n = nbatches from 8192 to 2048, at
a cost of only 24 MB of workmem. Using the default 8 MB workmem,
*actual* total working memory used would be 8 MB + 16 KB * (64 GB / 8
MB) = 136 MB. By increasing workmem to 32 MB, total working memory is
only 64 MB; so we save 72 MB overall. This is a good thing, but--

The "but" is that the customer really should have set their workmem to
64 MB, in the first place; and we should have taken half of that for
the hash table, and left the other half for per-batch metadata.

-- OK,  but historically we have pretended that the per-batch metadata
used no memory. So the customer should have set their workmem to 32
MB, with the understanding that PostgreSQL would have actually used 64
MB...

-- OK, but the customer *didn't* set their workmem to 32 MB. (If they
had, we wouldn't need this patch -- but we *do* need this patch, which
means the customer hasn't set their workmem high enough.) Why not?
Well, because if they set it to 32 MB, they'd run OOM!

-- So we are (secretly!) increasing the customer's workmem to 32 MB,
but only for this particular Hash Join. The customer can't increase it
to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just
for this Hash Join, in the hopes that by doing so we'll avoid running
OOM... which is good; but we don't *tell* the customer we've done
this, and we just hope that the customer actually has 64 MB (= 2x
workmem) free (because, if they don't, they'll run OOM anyway).

All of this is to say that this patch illustrates the need for
something like proposal [1], which allows PostgreSQL to set workmem
limits on individual execution nodes, based on the optimizer's memory
estimates. In the above patch, we're blindly making things better,
without knowing whether we've made them good enough. (The customer is
less likely to run OOM using 64 MB instead of 136 MB, but OOM is still
possible since their workmem limit is 8 MB!)

In v.next of my patchset at [1] (should be done by end of day today) I
will deal with the case discussed above by:

1. Doubling Plan.workmem_limit whenever we halve nbatches (so we track
the "workmem" needed by the hash table);
2. Displaying Plan.workmem_limit + Hash.nbatches * (2 * BLCKSIZE),
inside EXPLAIN (work_mem on), (so we display to the customer our best
estimate of the effective workmem limit).

Thanks,
James

[1] 
https://www.postgresql.org/message-id/flat/CAJVSvF6s1LgXF6KB2Cz68sHzk%2Bv%2BO_vmwEkaon%3DH8O9VcOr-tQ%40mail.gmail.com




bug in stored generated column over domain with constraints.

2025-02-25 Thread jian he
hi.

create domain d1 as int not null;
create domain d2 as int check (value > 1);
create domain d3 as int check (value is not null);

create table t0(b int, a d3 GENERATED ALWAYS as (b + 11) stored);
insert into t0 values (1, default);
ERROR:  value for domain d3 violates check constraint "d3_check"

In the above example, a domain with a check constraint is not working as
intended; similarly, a domain with a not-null constraint will also not work.
Now table t0 cannot insert any data because of the check constraint
violation, so I think this is a bug (hope I didn't miss anything).

In ExecBuildProjectionInfo, we compile "values (1, default)" as a
targetlist expression (CONST, COERCETODOMAIN).
Then in ExecResult, ExecProject, ExecInterpExpr we evaluate the
compiled expression;
we fail at ExecEvalConstraintCheck, because we are effectively
evaluating ``null::d3``.

explain(costs off, verbose) insert into t0 values (1, default);
   QUERY PLAN

 Insert on public.t0
   ->  Result
 Output: 1, NULL::integer

The plan is ``NULL::integer``, which will not fail, but ``NULL::d3`` will fail.
That means the output plan is right; is it the execution that is wrong?


The main fix should be in rewriteTargetListIU.
UPDATE doesn't have this issue: since there is no Result node,
ExecComputeStoredGenerated will do the domain constraint check.

related:
https://git.postgresql.org/cgit/postgresql.git/commit/?id=0da39aa7667b06e16189d318f7850d559d446d52
From 1820d039b472ea71d3b80b98096a81bb5fc3857b Mon Sep 17 00:00:00 2001
From: jian he 
Date: Wed, 26 Feb 2025 01:05:46 +0800
Subject: [PATCH v1 1/1] fix default insertion for stored generated column with
 domain over constraints.

create domain d3 as int check (value is not null);
create table t0(b int, a d3 GENERATED ALWAYS as (b + 11) stored);

insert into t0 values (1, default);
ERROR:  value for domain d3 violates check constraint "d3_check"

explain(costs off, verbose) insert into t0 values (1, default);
  QUERY PLAN
---
 Insert on public.t0
   ->  Result
 Output: 1, NULL::integer

We first evaluate the Result node. Domain coercion in the Result node may lead
to domain constraint violations. These domain constraints should not be checked in
ExecResult, as ExecComputeStoredGenerated will handle them.

context: https://git.postgresql.org/cgit/postgresql.git/commit/?id=0da39aa7667b06e16189d318f7850d559d446d52
discussion: https://postgr.es/m/
---
 src/backend/executor/nodeModifyTable.c| 31 +++---
 src/backend/optimizer/prep/preptlist.c|  3 +-
 src/backend/parser/parse_coerce.c |  5 +--
 src/backend/rewrite/rewriteHandler.c  | 23 +++--
 src/backend/rewrite/rewriteManip.c|  3 +-
 src/include/parser/parse_coerce.h |  2 +-
 .../regress/expected/generated_stored.out | 32 +++
 src/test/regress/sql/generated_stored.sql | 15 +
 8 files changed, 102 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index b0fe50075ad..1db7c1e45b6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -67,6 +67,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
@@ -216,13 +217,33 @@ ExecCheckPlanOutput(Relation resultRel, List *targetList)
 		{
 			/* Normal case: demand type match */
 			if (exprType((Node *) tle->expr) != attr->atttypid)
-				ereport(ERROR,
-						(errcode(ERRCODE_DATATYPE_MISMATCH),
-						 errmsg("table row type and query-specified row type do not match"),
-						 errdetail("Table has type %s at ordinal position %d, but query expects %s.",
-								   format_type_be(attr->atttypid),
-								   attno,
-								   format_type_be(exprType((Node *) tle->expr)))));
+			{
+				if (attr->attgenerated == '\0')
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_DATATYPE_MISMATCH),
+							errmsg("table row type and query-specified row type do not match"),
+							errdetail("Table has type %s at ordinal position %d, but query expects %s.",
+									  format_type_be(attr->atttypid),
+									  attno,
+									  format_type_be(exprType((Node *) tle->expr))));
+				}
+				else
+				{
+					/* XXX explain why this is needed */
+					Oid			baseTypeId;
+					int32		baseTypeMod = attr->atttypmod;
+
+					baseTypeId = getBaseTypeAndTypmod(attr->atttypid, &baseTypeMod);
+
+					if (exprType((Node *) tle->expr) != baseTypeId)
+						ereport(ERROR,
+								errcode(ERRCODE_DATATYPE_MISMATCH),
+								errmsg("table row type and query-specified row type do not match"),
+								errdetail("Table has domain type %s at ordinal position %d, but query expects %s.",
+										  format_type_be(attr->atttypid),
+										  attno,
+										  format_type_be(exprType((Node *) tle->expr))));
+				}
+			}
 		}
 		else
 		{
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/opt

Re: Log connection establishment timings

2025-02-25 Thread Melanie Plageman
Thanks for taking a look!

On Mon, Jan 20, 2025 at 10:01 AM Bertrand Drouvot
 wrote:
>
> The patch needed a rebase due to 34486b6092e. I did it in v2 attached (it's
> a minor rebase due to the AmRegularBackendProcess() introduction in 
> miscadmin.h).
>
> v2 could rely on AmRegularBackendProcess() instead of AmClientBackendProcess()
> but I kept it with AmClientBackendProcess() to reduce "my" changes as 
> compared to
> v1.

Thanks for doing this! I have implemented your suggestion in attached v3.

> Regarding the TimestampTz vs instr_time choice, we have things like:
< -- snip -- >
> So with TimestampTz, we would:
>
> 1. return 0 if we moved the time backward
> 2. provide an inflated duration including the time jump (if the time move
> forward).
>
> But with instr_time (and on systems that support CLOCK_MONOTONIC) then
> pg_clock_gettime_ns() should not be affected by system time change IIUC.
>
> Though time changes are "rare", given the fact that those metrics could 
> provide
> "inaccurate" measurements during that particular moment (time change) then it
> might be worth considering instr_time instead for this particular metric.

Great point. This all makes sense. I've switched to using instr_time in v3.
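
For readers following along, the monotonic measurement pattern being
adopted is roughly the following minimal sketch, using the existing
macros from src/include/portability/instr_time.h:

#include "postgres.h"
#include "portability/instr_time.h"

static void
log_elapsed(void)
{
	instr_time	start;
	instr_time	duration;

	INSTR_TIME_SET_CURRENT(start);
	/* ... the work being measured, e.g. fork plus backend startup ... */
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);

	/* with CLOCK_MONOTONIC support, immune to system clock changes */
	elog(LOG, "elapsed: %.3f ms", INSTR_TIME_GET_MILLISEC(duration));
}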

- Melanie
From 660e2f3846fd1af74956a45559da5ea17d5dbc92 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 25 Feb 2025 13:08:48 -0500
Subject: [PATCH v3] Add connection establishment duration logging

Add durations for several key parts of connection establishment when
log_connections is enabled.

For a new incoming connection, starting from when the postmaster gets a
socket from accept() and ending when the forked child backend is first
ready for query, there are multiple steps that could each take longer
than expected due to external factors. Provide visibility into
authentication and fork duration as well as the end-to-end connection
establishment time with logging.

To make this portable, the timestamps captured in the postmaster (socket
creation time, fork initiation time) are passed through the ClientSocket
and BackendStartupData structures instead of simply saved in backend
local memory inherited by the child process.

Reviewed-by: Bertrand Drouvot 
Reviewed-by: Jelte Fennema-Nio 
Reviewed-by: Jacob Champion 
Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com
---
 src/backend/postmaster/launch_backend.c | 20 
 src/backend/postmaster/postmaster.c |  8 
 src/backend/tcop/postgres.c | 24 
 src/backend/utils/init/globals.c|  2 ++
 src/backend/utils/init/postinit.c   | 13 +
 src/include/libpq/libpq-be.h|  2 ++
 src/include/miscadmin.h |  2 ++
 src/include/postmaster/postmaster.h |  7 +++
 src/include/tcop/backend_startup.h  |  1 +
 src/tools/pgindent/typedefs.list|  1 +
 10 files changed, 80 insertions(+)

diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 47375e5bfaa..37b31069120 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -232,6 +232,10 @@ postmaster_child_launch(BackendType child_type, int child_slot,
 
 	Assert(IsPostmasterEnvironment && !IsUnderPostmaster);
 
+	/* Capture time Postmaster initiates fork for logging */
+	if (child_type == B_BACKEND)
+		INSTR_TIME_SET_CURRENT(((BackendStartupData *) startup_data)->fork_time);
+
 #ifdef EXEC_BACKEND
 	pid = internal_forkexec(child_process_kinds[child_type].name, child_slot,
 			startup_data, startup_data_len, client_sock);
@@ -240,6 +244,14 @@ postmaster_child_launch(BackendType child_type, int child_slot,
 	pid = fork_process();
 	if (pid == 0)/* child */
 	{
+		/* Calculate total fork duration in child backend for logging */
+		if (child_type == B_BACKEND)
+		{
+			INSTR_TIME_SET_CURRENT(conn_timing.fork_duration);
+			INSTR_TIME_SUBTRACT(conn_timing.fork_duration,
+((BackendStartupData *) startup_data)->fork_time);
+		}
+
 		/* Close the postmaster's sockets */
 		ClosePostmasterPorts(child_type == B_LOGGER);
 
@@ -618,6 +630,14 @@ SubPostmasterMain(int argc, char *argv[])
 	/* Read in the variables file */
 	read_backend_variables(argv[2], &startup_data, &startup_data_len);
 
+	/* Calculate total fork duration in child backend for logging */
+	if (child_type == B_BACKEND)
+	{
+		INSTR_TIME_SET_CURRENT(conn_timing.fork_duration);
+		INSTR_TIME_SUBTRACT(conn_timing.fork_duration,
+((BackendStartupData *) startup_data)->fork_time);
+	}
+
 	/* Close the postmaster's sockets (as soon as we know them) */
 	ClosePostmasterPorts(child_type == B_LOGGER);
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5dd3b6a4fd4..880f491a9f7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1685,7 +1685,14 @@ ServerLoop(void)
 ClientSocket s;
 
 

Re: Log connection establishment timings

2025-02-25 Thread Melanie Plageman
Thanks for taking a look!

On Mon, Jan 20, 2025 at 12:53 PM Jacob Champion
 wrote:
>
> On Mon, Jan 20, 2025 at 7:01 AM Bertrand Drouvot
>  wrote:
> > Though time changes are "rare", given the fact that those metrics could 
> > provide
> > "inaccurate" measurements during that particular moment (time change) then 
> > it
> > might be worth considering instr_time instead for this particular metric.
>
> +1, I think a CLOCK_MONOTONIC source should be used for this if we've got it.

Done in v3 (see [1]).

> For the EXEC_BACKEND case (which, to be honest, I don't know much
> about), I originally wondered if the fork_duration should include any
> of the shared memory manipulations or library reloads that are done to
> match Unix behavior. But I think I prefer the way the patch does it.

You mean the version where EXEC_BACKEND is not defined should calculate
the end of fork_duration at the end of postmaster_child_launch()?

> Can the current timestamp be recorded right at the beginning of
> SubPostmasterMain(), to avoid counting the time setting up GUCs and
> reading the variables file, or do we have to wait?

We actually don't have the startup data until after we
read_backend_variables(), so I did it as soon as I could after that.

You are right that this will include timing from some extra steps.
But, some of these steps are overhead unique to the slow process of
"forking" a backend when EXEC_BACKEND is defined anyway, so it
probably makes sense for them to be included in the timing of "fork"
for these backends.

> nit: conn_timing is currently declared in the "interrupts and crit
> section" part of miscadmin.h; should it be moved down to the
> general-purpose globals?

Whoops, it would help if I read comments and stuff. Thanks! Fixed in v3 in [1].

- Melanie

[1] 
https://www.postgresql.org/message-id/CAAKRu_YrNsA7-v5L9d318XZu%2BtPqcxp%2BctCGy2EGYrSt69ON%3DA%40mail.gmail.com




Re: RFC: Additional Directory for Extensions

2025-02-25 Thread Matheus Alcantara
Thanks for reviewing!

On Tue, Feb 25, 2025 at 9:45 AM Andrew Dunstan  wrote:
> I think your additions generally look good. We should be able to
> simplify this:
>
>
> +system_dir = psprintf("%s/extension", sharepath);
> +ecp = system_dir;
> +
> +if (strlen(Extension_control_path) == 0)
> +{
> +paths = lappend(paths, ecp);
> +}
>
Fixed on the attached v3.

-- 
Matheus Alcantara


v3-0001-extension_control_path.patch
Description: Binary data


Re: Statistics Import and Export

2025-02-25 Thread Corey Huinker
On Tue, Feb 25, 2025 at 1:22 PM Jeff Davis  wrote:

> On Mon, 2025-02-24 at 12:50 -0500, Tom Lane wrote:
> > Also, while working on the attached, I couldn't help forming the
> > opinion that we'd be better off to nuke pg_set_attribute_stats()
> > from orbit and require people to use pg_restore_attribute_stats().
>
> Attached a patch to do so. The docs and tests required substantial
> rework, but I think it's for the better now that we aren't trying to do
> in-place updates.
>
> Regards,
> Jeff Davis
>
>
All the C code changes make sense to me. Though as an aside, we're going to
run into the parameter-ordering problem when it comes to
pg_clear_attribute_stats, but that's a (read: my) problem for a later patch.

Documentation:

+ The currently-supported relation statistics are
+ relpages with a value of type
+ integer, reltuples with a value of
+ type real, and relallvisible with a
+ value of type integer.

Could we make this a bullet-list? Same for the required attribute stats and
optional attribute stats. I think it would be more eye-catching and useful
to people skimming to recall the name of a parameter, which is probably
what most people will do after they've read it once to get the core
concepts.


Question:

Do we want to re-compact the oids we consumed in pg_proc.dat?


Test cases:

We're ripping out a lot of regression tests here. Some of them obviously
have no possible pg_restore_* analogs, such as explicitly set NULL values
vs omitting the param entirely, but some others may not, especially the
ones that test required arg-pairs.

Specifically missing are:

* regclass not found
* attribute is system column
* scalars can't have mcelem
* mcelem / mcelem freqs mismatch (parts 1 and 2)
* scalars can't have elem_count_histogram
* cannot set most_common_elems for range type

I'm less worried about all the tests of successful import calls, as the
pg_upgrade TAP tests kick those tires pretty well.

I'm also ok with losing the copies from test to test_clone, those are also
covered well by the TAP tests.

I'd feel better if we adapted the above tests from set-tests to
restore-tests, as the TAP suite doesn't really cover intentionally bad
stats.


Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Nathan Bossart
On Tue, Feb 25, 2025 at 11:02:40AM -0500, Melanie Plageman wrote:
> This does however leave me with the question of how to handle the
> original question of whether or not to cap the proposed relallfrozen
> to the value of relallvisible when updating stats at the end of
> vacuum. The current code in heap_vacuum_rel() caps relallvisible to
> relpages, so capping relallfrozen to relallvisible would follow that
> pattern. However, the other places relallvisible is updated do no such
> capping (do_analyze_rel(), index_update_stats()). It doesn't seem like
> there is a good reason to do it one place and not the others. So, I
> suggest either removing all the caps and adding a WARNING or capping
> the value in all places. Because users can now manually update these
> values in pg_class, there wouldn't be a way to detect the difference
> between a bogus relallfrozen value due to VM corruption or a bogus
> value due to manual statistics intervention. This led me to think that
> a WARNING and no cap would be more effective for heap_vacuum_rel().

I'm currently leaning towards the "remove all caps" idea.  But I'm not sure
I totally understand the need for a WARNING.  What do we expect users to do
with that information?  If it's intended to alert them of possible
corruption, then IMHO a WARNING may be too gentle.  I guess we could warn
and suggest a way to fix the value with the new statistics manipulation
functions if it's not that big of a deal, but at that point we might as
well just cap it on our own again.  If we don't really expect users to have
to do anything about it, then isn't it just adding unnecessary log noise?
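
To make the two options concrete, in heap_vacuum_rel() terms they would
look roughly like the sketch below (one or the other, not both; the
relallfrozen handling is from the proposed patch, not committed code):

	new_rel_allvisible = visibilitymap_count(rel, &new_rel_allfrozen);
	if (new_rel_allvisible > new_rel_pages)
		new_rel_allvisible = new_rel_pages;

	/* Option A: cap, mirroring the existing relallvisible treatment */
	if (new_rel_allfrozen > new_rel_allvisible)
		new_rel_allfrozen = new_rel_allvisible;

	/* Option B: no cap, just flag a possibly corrupt visibility map */
	if (new_rel_allfrozen > new_rel_allvisible)
		elog(WARNING, "relallfrozen (%u) exceeds relallvisible (%u)",
			 new_rel_allfrozen, new_rel_allvisible);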

-- 
nathan




Re: Proposal - Reduce lock during first phase of VACUUM TRUNCATE from ACCESS EXCLUSIVE to EXCLUSIVE

2025-02-25 Thread Ramanathan
> Except that mid-transaction lock upgrades increase the risk of
> deadlock failures.

Thanks for the feedback. However, wouldn’t that risk already exist in the
current vacuum truncate process? As it stands, VACUUM TRUNCATE performs a
lock upgrade—from ShareUpdateExclusiveLock to AccessExclusiveLock—during
the actual truncate phase. This upgrade, while brief, also carries an
inherent risk of deadlocks.

The proposed change simply shifts part of the locking burden to the
backward scan phase by using an EXCLUSIVE lock instead of ACCESS EXCLUSIVE.
The idea is to reduce the lock's restrictiveness during the scan phase and
only escalate to ACCESS EXCLUSIVE for the fast truncation. Since we’re
already handling a similar upgrade in the current workflow, the risk
profile should be comparable. This can also mitigate extended outages on hot
standby replicas, since the primary does not release the lock based on
queries waiting on the hot standby.
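
A rough sketch of the proposed sequence, in lazy_truncate_heap() terms
(hypothetical: today both the backward scan and the truncation run under
the AccessExclusiveLock obtained via ConditionalLockRelation, and the
deadlock concern above applies to the upgrade step):

	if (!ConditionalLockRelation(vacrel->rel, ExclusiveLock))
		return;					/* don't wait; skip truncation this time */

	/* backward scan: readers are still permitted under ExclusiveLock */
	new_rel_pages = count_nondeletable_pages(vacrel, &lock_waiter_detected);

	/* brief upgrade for the truncation itself (the deadlock-prone step) */
	LockRelation(vacrel->rel, AccessExclusiveLock);
	RelationTruncate(vacrel->rel, new_rel_pages);
	UnlockRelation(vacrel->rel, AccessExclusiveLock);
	UnlockRelation(vacrel->rel, ExclusiveLock);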

Looking forward to your thoughts.

Best regards,
Ram

On Mon, 17 Feb 2025 at 20:50, Tom Lane  wrote:

> Ramanathan  writes:
> > I propose modifying the use of an EXCLUSIVE lock during the backward scan
> > phase, then upgrading that lock to ACCESS EXCLUSIVE only for the actual
> > truncation phase. Since the truncation phase should be relatively quick,
> > the impact of the ACCESS EXCLUSIVE lock should be minimal.
>
> Except that mid-transaction lock upgrades increase the risk of
> deadlock failures.
>
> regards, tom lane
>


Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Robert Treat
On Tue, Feb 25, 2025 at 11:32 AM Nathan Bossart
 wrote:
> On Tue, Feb 25, 2025 at 11:02:40AM -0500, Melanie Plageman wrote:
> > This does however leave me with the question of how to handle the
> > original question of whether or not to cap the proposed relallfrozen
> > to the value of relallvisible when updating stats at the end of
> > vacuum. The current code in heap_vacuum_rel() caps relallvisible to
> > relpages, so capping relallfrozen to relallvisible would follow that
> > pattern. However, the other places relallvisible is updated do no such
> > capping (do_analyze_rel(), index_update_stats()). It doesn't seem like
> > there is a good reason to do it one place and not the others. So, I
> > suggest either removing all the caps and adding a WARNING or capping
> > the value in all places. Because users can now manually update these
> > values in pg_class, there wouldn't be a way to detect the difference
> > between a bogus relallfrozen value due to VM corruption or a bogus
> > value due to manual statistics intervention. This led me to think that
> > a WARNING and no cap would be more effective for heap_vacuum_rel().
>
> I'm currently leaning towards the "remove all caps" idea.  But I'm not sure
> I totally understand the need for a WARNING.  What do we expect users to do
> with that information?  If it's intended to alert them of possible
> corruption, then IMHO a WARNING may be too gentle.  I guess we could warn
> and suggest a way to fix the value with the new statistics manipulation
> functions if it's not that big of a deal, but at that point we might as
> well just cap it on our own again.  If we don't really expect users to have
> to do anything about it, then isn't it just adding unnecessary log noise?
>

If the end user is manipulating numbers to test some theory, I think
it's valuable feedback that they have probably bent the system too
far, because we are now seeing numbers that don't make sense. If they
aren't mucking with the system, then it's valuable feedback that they
may have an underlying system problem that could be about to get
worse.

Robert Treat
https://xzilla.net




Re: Adjusting hash join memory limit to handle batch explosion

2025-02-25 Thread James Hunter
On Tue, Feb 25, 2025 at 9:39 AM Tomas Vondra  wrote:
>
> On 2/25/25 17:30, James Hunter wrote:
> > On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra  wrote:

> > -- OK, but the customer *didn't* set their workmem to 32 MB. (If they
> > had, we wouldn't need this patch -- but we *do* need this patch, which
> > means the customer hasn't set their workmem high enough.) Why not?
> > Well, because if they set it to 32 MB, they'd run OOM!
> >
>
> Not sure I follow the reasoning here :-( If the query completed with a
> lower work_mem value, it should complete with work_mem = 32MB, because
> that reduces the amount of memory needed. But yes, it's possible they
> hit OOM in both cases, it's an attempt to reduce the impact.

Yes, your patch is a Pareto improvement, because it means we use less
working memory than we would otherwise.

> > -- So we are (secretly!) increasing the customer's workmem to 32 MB,
> > but only for this particular Hash Join. The customer can't increase it
> > to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just
> > for this Hash Join, in the hopes that by doing so we'll avoid running
> > OOM... which is good; but we don't *tell* the customer we've done
> > this, and we just hope that the customer actually has 64 MB (= 2x
> > workmem) free (because, if they don't, they'll run OOM anyway).
> >
>
> Right. This is meant to be a best-effort mitigation for rare cases.
>
> Maybe we should track/report it somehow, though. I mean, if 1% of hash
> joins need this, you're probably fine. If 99% hash joins hit it, you
> probably really need a higher work_mem value because the hashed relation
> is just too large.
>
> But you have a point - maybe we should track/report this somewhere.
> First step would be to make the total memory usage better visible in
> explain (it's not obvious it does not include the per-batch metadata).

Right -- my point is that mitigation is good, but tracking /
visibility is also necessary.

> > All of this is to say that this patch illustrates the need for
> > something like proposal [1], which allows PostgreSQL to set workmem
> > limits on individual execution nodes, based on the optimizer's memory
> > estimates. In the above patch, we're blindly making things better,
> > without knowing whether we've made them good enough. (The customer is
> > less likely to run OOM using 64 MB instead of 136 MB, but OOM is still
> > possible since their workmem limit is 8 MB!)
> >
> > In v.next of my patchset at [1] (should be done by end of day today) I
> > will deal with the case discussed above by:
> >  ...

> I'm not opposed to doing something like this, but I'm not quite sure how
> could it help the cases I meant to address with my patch, where we plan
> with low nbatch value, and then it explodes as execution time.

Your patch addresses two cases: (1) where we plan with high nbatch
value; and (2) where we plan with low nbatch value, and then it
explodes at execution time.

Case (2) can be solved only by taking some action at runtime, and that
action is always best effort (because if we really don't have enough
extra memory free, we have to run OOM). Once a query has started
running, we have fewer options.

Case (1) can be solved in various ways, because it occurs before we
started running the query. For example, we can:

1. Delay starting execution of the query until enough memory becomes
available; or
2. Take memory away from other execution nodes to give to this Hash Join.

But these case (1) solutions require access to the actual
working-memory calculation. That's all I'm saying -- by tracking our
"best-effort" decision, we make it possible to address case (1).

(Your patch solves case (2), as well as it can be solved, by giving
memory to this Hash Join at runtime, and hoping that no one else was
using it. It's hard to improve on that, because PG execution nodes
don't, in general, have the ability to give up memory after they've
started running. But we can do better for case (1), if other
components can basically see the results of your formula.)

Thanks,
James




Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Nathan Bossart
On Tue, Feb 25, 2025 at 12:36:40PM -0500, Robert Treat wrote:
> On Tue, Feb 25, 2025 at 11:32 AM Nathan Bossart
>  wrote:
>> I'm currently leaning towards the "remove all caps" idea.  But I'm not sure
>> I totally understand the need for a WARNING.  What do we expect users to do
>> with that information?  If it's intended to alert them of possible
>> corruption, then IMHO a WARNING may be too gentle.  I guess we could warn
>> and suggest a way to fix the value with the new statistics manipulation
>> functions if it's not that big of a deal, but at that point we might as
>> well just cap it on our own again.  If we don't really expect users to have
>> to do anything about it, then isn't it just adding unnecessary log noise?
> 
> If the end user is manipulating numbers to test some theory, I think
> it's valuable feedback that they have probably bent the system too
> far, because we are now seeing numbers that don't make sense.

If we do that in other cases, then that seems reasonable.  But it feels
weird to me to carve out just this one inaccuracy.  For example, do we warn
if someone sets reltuples too high for relpages, or if they set relpages
too low for reltuples?  I'm not seeing why the relallfrozen <=
relallvisible <= relpages case is especially important to uphold.  If we
were consistent about enforcing these kinds of invariants everywhere, then
I think I would be more on board with capping the values and/or ERROR-ing
when folks tried to set bogus ones.  But if the only outcome of bogus
values is that you might get weird plans or autovacuum might prioritize the
table differently, then I'm not sure I see the point.  You can accomplish
both of those things with totally valid values already.

> If they
> aren't mucking with the system, then it's valuable feedback that they
> may have an underlying system problem that could be about to get
> worse.

Maybe a WARNING is as much as we can realistically do in this case, but I
think it could be easily missed in the server logs.  I dunno, it just feels
wrong to me to deal with potential corruption by gently notifying the user
and proceeding normally.  I guess that's what we do already elsewhere,
though (e.g., for relfrozenxid/relminmxid in vac_update_relstats()).

-- 
nathan




Re: Restrict copying of invalidated replication slots

2025-02-25 Thread Masahiko Sawada
On Mon, Feb 24, 2025 at 10:09 PM Shlok Kyal  wrote:
>
> On Tue, 25 Feb 2025 at 01:03, Masahiko Sawada  wrote:
> >
> > I've checked if this issue exists also on v15 or older, but IIUC it
> > doesn't exist, fortunately. Here is the summary:
> >
> > Scenario-1: the source gets invalidated before the copy function
> > fetches its contents for the first time. In this case, since the
> > source slot's restart_lsn is already an invalid LSN it raises an error
> > appropriately. In v15, we have only one slot invalidation reason, WAL
> > removal, therefore we always reset the slot's restart_lsn to
> > InvalidXlogRecPtr.
> >
> > Scenario-2: the source gets invalidated before the copied slot is
> > created (i.e., between first content copy and
> > create_logical/physical_replication_slot()). In this case, the copied
> > slot could have a valid restart_lsn value that however might point to
> > a WAL segment that might have already been removed. However, since
> > copy_restart_lsn will be an invalid LSN (=0), it's caught by the
> > following if condition:
> >
> > if (copy_restart_lsn < src_restart_lsn ||
> > src_islogical != copy_islogical ||
> > strcmp(copy_name, NameStr(*src_name)) != 0)
> > ereport(ERROR,
> > (errmsg("could not copy replication slot \"%s\"",
> > NameStr(*src_name)),
> >  errdetail("The source replication slot was
> > modified incompatibly during the copy operation.")));
> >
> > Scenario-3: the source gets invalidated after creating the copied slot
> > (i.e. after create_logical/physical_replication_slot()). In this case,
> > > since the newly copied slot has the same restart_lsn as the source
> > slot, both slots are invalidated.
> >
> > If the above analysis is right, I think the patches are mostly ready.
> > I've made some changes to the patches:
> >
> > - removed src_isinvalidated variable as it's not necessarily necessary.
> > - updated the commit message.
> >
> > Please review them. If there are no further comments on these patches,
> > I'm going to push them.
> >
> I have verified the above scenarios and I confirm the behaviour described.
>
> I have a small doubt.
> In PG_17 and PG_16 we can invalidate physical slots only for
> 'wal_removed' case [1]. And copying of such slot already give error
> 'cannot copy a replication slot that doesn't reserve WAL'. So, in PG17
> and PG16 should we check for invalidated source slot only if we are
> copying logical slots?

Although your analysis is correct, I believe we should retain the
validation check. Even though invalidated physical slots in PostgreSQL
16 and 17 always have an invalid restart_lsn, maintaining this check
would be harmless. Furthermore, I prefer to maintain consistency in
the codebase across back branches and the master branch rather than
introducing variations.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Proposal: Limitations of palloc inside checkpointer

2025-02-25 Thread Ekaterina Sokolova

Hi, hackers!

Historically, the checkpointer process has used palloc() in the 
AbsorbSyncRequests() function. Therefore, the checkpointer does not 
expect to receive a request array larger than 1 GB.


We encountered a case where the database went into recovery; after 
applying all WAL, the checkpointer process generated an "invalid memory 
alloc request size" error and entered a loop. But it is quite legitimate 
for recovery to produce such a large allocation request.


A simple solution to this problem is to use palloc_extended() instead of 
palloc(). But is it safe to allow the checkpointer to allocate so much 
memory at once?


I have a proposal to update this memory allocation, but I need your ideas 
and advice on how to do it in an appropriate way. One idea is to replace 
the array with a list of arrays, so that memory is allocated in chunks 
(see the sketch below). A worse idea would be to process a temporary 
array without locking.
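
A minimal sketch of the chunked variant, as it might look inside
checkpointer.c (the function name and chunk size are hypothetical;
CheckpointerShmem, CheckpointerRequest and RememberSyncRequest() are the
existing pieces there):

#define ABSORB_CHUNK	65536	/* requests per slice, well under 1 GB */

static void
AbsorbSyncRequestsChunked(void)
{
	CheckpointerRequest *chunk;
	int			n;

	if (!AmCheckpointerProcess())
		return;

	chunk = palloc(ABSORB_CHUNK * sizeof(CheckpointerRequest));

	do
	{
		LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);

		n = Min(CheckpointerShmem->num_requests, ABSORB_CHUNK);
		memcpy(chunk, CheckpointerShmem->requests,
			   n * sizeof(CheckpointerRequest));
		CheckpointerShmem->num_requests -= n;
		memmove(CheckpointerShmem->requests,
				CheckpointerShmem->requests + n,
				CheckpointerShmem->num_requests * sizeof(CheckpointerRequest));

		LWLockRelease(CheckpointerCommLock);

		/* absorb outside the lock, as AbsorbSyncRequests() does today */
		for (int i = 0; i < n; i++)
			RememberSyncRequest(&chunk[i].ftag, chunk[i].type);
	} while (n == ABSORB_CHUNK);

	pfree(chunk);
}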


I would be glad to hear your ideas and suggestions about this topic. 
Have a nice day!


--
Ekaterina Sokolova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Nathan Bossart
On Tue, Feb 25, 2025 at 01:52:28PM -0500, Robert Haas wrote:
> Given that users could manually update the catalog, we have to be able
> to tolerate bad data in the catalogs without the world ending. If that
> code has to exist anyway, then it's not mandatory to cap. On the other
> hand, there's no great virtue in refusing to correct data that we know
> to be wrong. Unless there is some other consideration which makes one
> way better than the other, this feels like author's choice.

Maybe the most conservative choice is to simply follow the example of
surrounding code.  If it's careful to cap relallvisible to relpages, also
have it cap relallfrozen.  If not, don't.  *shrug*

In any case, I don't want to hold up this patch on this relatively minor
point.  This seems like something we could pretty easily change in the
future if needed.

-- 
nathan




Re: Redact user password on pg_stat_statements

2025-02-25 Thread Greg Sabino Mullane
On Tue, Feb 25, 2025 at 10:12 AM Sami Imseih  wrote:

> > What about a more general solution, such as a flag to turn off logging
> of ALTER ROLE statements completely?
>
> IMO, flags for a specific type of utility statement seems way too much for
> pg_stat_statements, and this will also not completely prevent leaking plain
> text passwords from all ways that CREATE/ALTER ROLE could be run, i.e. DO
> blocks, inside functions/procs with track=all.
>

Well sure, but best effort is better than no effort at all. Preventing
CREATE/ALTER will catch 99% of items, and as I advocated, there really is
no reason for them to be in pg_stat_statements in the first place.


> The clients that set passwords could simply turn off track_utility on a
> user/transaction level while they are performing the action with
> sensitive data.
>

Good point, but that relies on the client to do the right thing, and
requires two extra steps.


Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Greg Sabino Mullane
On Tue, Feb 25, 2025 at 11:03 AM Melanie Plageman 
wrote:

> Because users can now manually update these values in pg_class, there
> wouldn't be a way to detect the difference
> between a bogus relallfrozen value due to VM corruption or a bogus value
> due to manual statistics intervention.


Er..you had me until this. If manual monkeying of the system catalogs leads
to a "bogus" error that resembles a real one, then sow the wind, and reap
the whirlwind. I don't think that should be a consideration here.

-- 
Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


A small correction to doc and comment of FSM for indexes

2025-02-25 Thread Alex Friedman

Hi,

This patch fixes a couple of small inaccuracies in the doc and the comment for 
FSM about index handling.

1. In the doc for pg_freespacemap, it currently says:


For indexes, what is tracked is entirely-unused pages, rather than free space 
within pages. Therefore, the values are not meaningful, just whether a page is 
full or empty.


However, as what is tracked is entirely-unused pages, the values mean whether a page is 
"in-use or empty", rather than "full or empty".


2. In indexfsm.c the header comment says:


 * This is similar to the FSM used for heap, in freespace.c, but instead
 * of tracking the amount of free space on pages, we only track whether
 * pages are completely free or in-use. We use the same FSM implementation
 * as for heaps, using BLCKSZ - 1 to denote used pages, and 0 for unused.


However, in the code we see that used pages are marked with 0:


/*
 * RecordUsedIndexPage - mark a page as used in the FSM
 */
void
RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
{
RecordPageWithFreeSpace(rel, usedBlock, 0);
}


And free pages are marked with BLCKSZ - 1:


/*
 * RecordFreeIndexPage - mark a page as free in the FSM
 */
void
RecordFreeIndexPage(Relation rel, BlockNumber freeBlock)
{
RecordPageWithFreeSpace(rel, freeBlock, BLCKSZ - 1);
}


And so, this patch also fixes the comment's "using BLCKSZ - 1 to denote used pages, and 0 for 
unused" to be "using 0 to denote used pages, and BLCKSZ - 1 for unused".

While these changes are minor, I've seen how this can cause a bit of confusion, 
and it would be good to clarify it.


Best regards,

Alex FriedmanFrom a1b78438343fca053aa0014687eaba34d5e160e0 Mon Sep 17 00:00:00 2001
From: Alex Friedman 
Date: Tue, 25 Feb 2025 19:12:53 +0200
Subject: [PATCH v1] A small correction to doc and comment of FSM for indexes.

---
 doc/src/sgml/pgfreespacemap.sgml | 2 +-
 src/backend/storage/freespace/indexfsm.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/pgfreespacemap.sgml b/doc/src/sgml/pgfreespacemap.sgml
index 829ad60f32f..3774a9f8c6b 100644
--- a/doc/src/sgml/pgfreespacemap.sgml
+++ b/doc/src/sgml/pgfreespacemap.sgml
@@ -67,7 +67,7 @@
   
For indexes, what is tracked is entirely-unused pages, rather than free
space within pages.  Therefore, the values are not meaningful, just
-   whether a page is full or empty.
+   whether a page is in-use or empty.
   
  
 
diff --git a/src/backend/storage/freespace/indexfsm.c 
b/src/backend/storage/freespace/indexfsm.c
index 1fc263892a7..3cd2437599d 100644
--- a/src/backend/storage/freespace/indexfsm.c
+++ b/src/backend/storage/freespace/indexfsm.c
@@ -16,7 +16,7 @@
  * This is similar to the FSM used for heap, in freespace.c, but instead
  * of tracking the amount of free space on pages, we only track whether
  * pages are completely free or in-use. We use the same FSM implementation
- * as for heaps, using BLCKSZ - 1 to denote used pages, and 0 for unused.
+ * as for heaps, using 0 to denote used pages, and BLCKSZ - 1 for unused.
  *
  *-
  */
-- 
2.41.0



Re: Statistics Import and Export

2025-02-25 Thread Jeff Davis
On Mon, 2025-02-24 at 12:50 -0500, Tom Lane wrote:
> Also, while working on the attached, I couldn't help forming the
> opinion that we'd be better off to nuke pg_set_attribute_stats()
> from orbit and require people to use pg_restore_attribute_stats().

Attached a patch to do so. The docs and tests required substantial
rework, but I think it's for the better now that we aren't trying to do
in-place updates.

Regards,
Jeff Davis

From ea413ee48b10299530bafc3102395285b5ea8ce3 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Mon, 24 Feb 2025 17:24:05 -0800
Subject: [PATCH v1] Remove redundant pg_set_*_stats() variants.

After commit f3dae2ae58, the primary purpose of separating the
pg_set_*_stats() from the pg_restore_*_stats() variants was
eliminated.

Leave pg_restore_relation_stats() and pg_restore_attribute_stats(),
which satisfy both purposes, and remove pg_set_relation_stats() and
pg_set_attribute_stats().

Discussion: https://postgr.es/m/1457469.1740419...@sss.pgh.pa.us
---
 doc/src/sgml/func.sgml | 254 +++
 src/backend/catalog/system_functions.sql   |  32 -
 src/backend/statistics/attribute_stats.c   |  98 +--
 src/backend/statistics/relation_stats.c|  24 +-
 src/backend/statistics/stat_utils.c|  30 +-
 src/include/catalog/pg_proc.dat|  14 -
 src/include/statistics/stat_utils.h|   8 +-
 src/test/regress/expected/stats_import.out | 832 +
 src/test/regress/sql/stats_import.sql  | 648 +---
 9 files changed, 209 insertions(+), 1731 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f0ccb751106..12206e0cfc6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -30180,66 +30180,6 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
  
 
  
-  
-   
-
- 
-  pg_set_relation_stats
- 
- pg_set_relation_stats (
- relation regclass
- , relpages integer
- , reltuples real
- , relallvisible integer )
- void
-
-
- Updates relation-level statistics for the given relation to the
- specified values. The parameters correspond to columns in pg_class. Unspecified
- or NULL values leave the setting unchanged.
-
-
- Ordinarily, these statistics are collected automatically or updated
- as a part of  or , so it's not necessary to call this
- function. However, it may be useful when testing the effects of
- statistics on the planner to understand or anticipate plan changes.
-
-
- The caller must have the MAINTAIN privilege on
- the table or be the owner of the database.
-
-
- The value of relpages must be greater than
- or equal to -1,
- reltuples must be greater than or equal to
- -1.0, and relallvisible
- must be greater than or equal to 0.
-
-   
-  
-
-  
-   
-
- 
-  pg_clear_relation_stats
- 
- pg_clear_relation_stats ( relation regclass )
- void
-
-
- Clears table-level statistics for the given relation, as though the
- table was newly created.
-
-
- The caller must have the MAINTAIN privilege on
- the table or be the owner of the database.
-
-   
-  
-
   

 
@@ -30248,26 +30188,25 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
 pg_restore_relation_stats (
 VARIADIC kwargs "any" )
 boolean
-
-
- Similar to pg_set_relation_stats(), but intended
- for bulk restore of relation statistics. The tracked statistics may
- change from version to version, so the primary purpose of this
- function is to maintain a consistent function signature to avoid
- errors when restoring statistics from previous versions.
-
+   
 
- Arguments are passed as pairs of argname
- and argvalue, where
- argname corresponds to a named argument in
- pg_set_relation_stats() and
- argvalue is of the corresponding type.
+ Updates table-level statistics.  Ordinarily, these statistics are
+ collected automatically or updated as a part of VACUUM or ANALYZE, so it's not
+ necessary to call this function.  However, it is useful after a
+ restore to enable the optimizer to choose better plans if
+ ANALYZE has not been run yet.
 
 
- Additionally, this function supports argument name
- version of type integer, which
- specifies the version from which the statistics originated, improving
- interpretation of older statistics.
+ The tracked statistics may change from version to version, so
+ ar
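
For illustration, a minimal sketch of the pair-wise kwargs convention described in the patched documentation above, with a hypothetical table name and made-up statistic values (the 'version' pair identifies the server version the statistics came from):

SELECT pg_restore_relation_stats(
        'version', 160000,
        'relation', 'mytable'::regclass,
        'relpages', 10::integer,
        'reltuples', 1000.0::real,
        'relallvisible', 8::integer);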

Re: Trigger more frequent autovacuums of heavy insert tables

2025-02-25 Thread Robert Haas
On Tue, Feb 25, 2025 at 11:03 AM Melanie Plageman
 wrote:
> This does however leave me with the question of how to handle the
> original question of whether or not to cap the proposed relallfrozen
> to the value of relallvisible when updating stats at the end of
> vacuum. The current code in heap_vacuum_rel() caps relallvisible to
> relpages, so capping relallfrozen to relallvisible would follow that
> pattern. However, the other places relallvisible is updated do no such
> capping (do_analyze_rel(), index_update_stats()). It doesn't seem like
> there is a good reason to do it one place and not the others. So, I
> suggest either removing all the caps and adding a WARNING or capping
> the value in all places. Because users can now manually update these
> values in pg_class, there wouldn't be a way to detect the difference
> between a bogus relallfrozen value due to VM corruption or a bogus
> value due to manual statistics intervention. This led me to think that
> a WARNING and no cap would be more effective for heap_vacuum_rel().

I mean, does it really make any difference one way or the other?

Given that users could manually update the catalog, we have to be able
to tolerate bad data in the catalogs without the world ending. If that
code has to exist anyway, then it's not mandatory to cap. On the other
hand, there's no great virtue in refusing to correct data that we know
to be wrong. Unless there is some other consideration which makes one
way better than the other, this feels like author's choice.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
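
For illustration, the kind of bogus value under discussion is easy to spot from SQL today for relallvisible; a minimal sanity-check sketch (the proposed relallfrozen column would get an analogous check against relallvisible):

-- Pages counted all-visible can never legitimately exceed the table's
-- total page count; any rows returned indicate suspect statistics.
SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relpages >= 0            -- skip never-vacuumed/analyzed entries
  AND relallvisible > relpages;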




Re: Securing PostgreSQL for rootless containers

2025-02-25 Thread Yogesh Sharma

On 2/24/25 14:51, Yogesh Sharma wrote:


This patch has effect on current use of socket unless systemd socket 
are used. Code is also guarded when postgres is not compiled with 
systemd flag.


I meant to say


This patch has **no** effect on the current use of sockets unless systemd
sockets are used. The code is also guarded when postgres is not compiled
with the systemd flag.



--
Kind Regards,
Yogesh Sharma
Open Source Enthusiast and Advocate
Member:
- PostgreSQL Contributors Team (AWS) https://aws.amazon.com
- PG Infra https://www.postgresql.org/about/governance/sysadmin/
- PG RPMs  https://yum.postgresql.org/contact/





Relaxing constraints on BitmapAnd eligibility?

2025-02-25 Thread Dmytro Astapov
Hi!

I've been investigating why postgres does not do BitmapAnd of two
well-suited indexes, and reading indxpath.c

In my case, there is a table (d date, col1 int, col2 int) -- types not
really important -- and there are two indices, on (col1, d) and (col2, d).

For queries that do WHERE d>=X AND col1=Y AND col2=Z, postgres will never
BitmapAnd those two indices, because both indexes include (d) and we have a
condition on (d). Here is a full example, which could also be seen here:
https://www.db-fiddle.com/f/uPLx5bRtDEoZw3Dx4kkwKh/0:

begin;

CREATE TABLE test_table (
d date,
col1 int,
col2 int
);

INSERT INTO test_table (d, col1, col2)
SELECT
d.date,
c1.val as col1,
c2.val as col2
FROM
generate_series(
'2023-01-01'::date,
'2023-12-31'::date,
'1 day'::interval
) as d(date),
generate_series(1, 1000) as c1(val),
generate_series(1, 1000) as c2(val)
WHERE
random() < 0.001;

create index on test_table(col1,d);
create index on test_table(col2,d);

-- This uses BitmapAnd
explain select * from test_table where col1=123 and col2=321;

-- This does not use BitmapAnd, even though it could!
explain select * from test_table where col1=123 and col2=321 and d >=
'2023-05-05';

I checked that BitmapAnd is rejected by this line in choose_bitmap_and:

    if (bms_overlap(pathinfo->clauseids, clauseidsofar))
        continue;           /* consider it redundant */

There is a comment on choose_bitmap_and that explains the rationale of this
check, but reading it I can't help but feel that what the comment
describes is this condition:

    if (bms_is_subset(pathinfo->clauseids, clauseidsofar))
        continue;           /* consider it redundant */
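
A sketch of the reasoning, using the sets from the example above:

/*
 * With indexes on (col1, d) and (col2, d): after choosing the first
 * path, clauseidsofar covers {d, col1}; the second path's clauseids
 * is {d, col2}.  bms_overlap() is true (both contain d), so today the
 * second path is rejected, whereas bms_is_subset() is false, so the
 * proposed test would still consider the combination.
 */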

And indeed, in my (admittedly not super-extensive) testing, changing
bms_overlap to bms_is_subset leads to better, faster execution plans.
Is it possible that this condition could thus be relaxed?

Even if I am wrong and the condition absolutely should be bms_overlap, I
feel that this restriction is very hard to discover. Perhaps
https://www.postgresql.org/docs/current/indexes-bitmap-scans.html should
mention that compound indexes that have columns in common will never be
combined?

Best regards, Dmytro
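
As a possible workaround under the current rule (a sketch, assuming it is
acceptable to maintain a third index), a single multicolumn index can serve
all three conditions and sidestep BitmapAnd entirely:

-- hypothetical covering index: equality columns first, range column last
create index on test_table(col1, col2, d);
explain select * from test_table where col1=123 and col2=321 and d >=
'2023-05-05';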

