Re: A wrong comment about search_indexed_tlist_for_var

2024-06-10 Thread Richard Guo
On Thu, May 16, 2024 at 8:42 PM Robert Haas  wrote:
> You don't need to wait for the next CommitFest to fix a comment (or a
> bug). And, indeed, it's better if you do this before we branch.

Patch pushed and the CF entry closed.  Thank you for the suggestion.

Thanks
Richard




Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-06-10 Thread Ashutosh Bapat
On Thu, Jun 6, 2024 at 10:00 PM Robert Haas  wrote:

> On Wed, Jun 5, 2024 at 3:48 AM Ashutosh Bapat
>  wrote:
> > Here's planning time measurements.
> >  num_joins | master (ms) | patched (ms) | change in planning time (ms) | change in planning time (%)
> > -----------+-------------+--------------+------------------------------+-----------------------------
> >          2 |      187.86 |       177.27 |                        10.59 |                        5.64%
> >          3 |      721.81 |       758.80 |                       -36.99 |                       -5.12%
> >          4 |     2239.87 |      2236.19 |                         3.68 |                        0.16%
> >          5 |     6830.86 |      7027.76 |                      -196.90 |                       -2.88%
>
> I find these results concerning. Why should the planning time
> sometimes go up? And why should it go up for odd numbers of joinrels
> and down for even numbers of joinrels? I wonder if these results are
> really correct, and if they are, I wonder what could account for such
> odd behavior
>

This was just one instance of measurements; if I run the experiment multiple
times, the results and the patterns vary. Usually I have found planning time
to vary within 5% for regular tables and within 9% for partitioned tables
with a large number of partitions. Below are measurements with the experiment
repeated multiple times. For a given number of partitioned tables (each with
1000 partitions) being joined, planning time is measured 10 consecutive times,
and for that set of 10 runs we calculate the average and standard deviation of
planning time. Ten such sets are sampled, which means planning time is sampled
100 times in total with and without the patch respectively. Measurements for
master and patched are reported in the attached spreadsheet.
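For reference, a minimal sketch of the kind of measurement loop used; the
tables and the join query below are assumptions for illustration, not the
actual benchmark script:

    -- Hedged sketch only: the real benchmark joins several partitioned tables
    -- with 1000 partitions each; this shows the general shape of one sample.
    SET enable_partitionwise_join = on;

    CREATE TABLE p1 (a int, b int) PARTITION BY RANGE (a);
    CREATE TABLE p1_0 PARTITION OF p1 FOR VALUES FROM (0) TO (1000);
    CREATE TABLE p2 (a int, b int) PARTITION BY RANGE (a);
    CREATE TABLE p2_0 PARTITION OF p2 FOR VALUES FROM (0) TO (1000);

    -- EXPLAIN (SUMMARY ON) reports "Planning Time"; running this repeatedly
    -- and averaging the reported values yields one set of samples.
    EXPLAIN (SUMMARY ON) SELECT * FROM p1 JOIN p2 USING (a);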

Observations from the spreadsheet:
1. Differences in planning time between the master and patched versions are
within the standard deviation, and the difference goes both ways, indicating
that the patch doesn't affect planning time adversely.
2. If we just look at the 5-way join numbers, there are more positive values
in column E, sometimes higher than the standard deviation, indicating that the
patch improves planning time when a higher number of partitioned tables is
being joined.
3. If we just look at the 5-way join numbers, the patch seems to reduce the
standard deviation (column B vs column D), indicating that the planning time
is more stable and predictable with the patched version.
4. There is no bias towards even or odd numbers of partitioned tables being
joined.

Looking at the memory savings numbers reported earlier, the patch improves
memory consumption of partitionwise join without adversely affecting
planning time. It doesn't affect the code paths which do not use
partitionwise join. That's an overall net positive impact for relatively
little code churn.

-- 
Best Wishes,
Ashutosh Bapat


child_bms_pwj_planning_time.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-10 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 03:39:34PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 06:29:17AM +, Bertrand Drouvot wrote:
> > Thanks for the report! I think it makes sense to add it to the list of known
> > failures.
> > 
> > One way to deal with those corner cases could be to make use of injection points
> > around places where RUNNING_XACTS is emitted, thoughts?
> 
> Ah.  You mean to force a wait in the code path generating the standby
> snapshots for the sake of this test?

Yeah.

>  That's interesting, I like it.

Great, will look at it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-10 Thread Masahiko Sawada
On Mon, Jun 10, 2024 at 3:14 PM Masahiko Sawada  wrote:
>
> On Fri, Jun 7, 2024 at 7:30 PM Amit Kapila  wrote:
> >
> > On Fri, Jun 7, 2024 at 7:55 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Jun 6, 2024 at 6:40 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > > To achieve this, we can allow sequences to be copied during
> > > > > > the initial CREATE SUBSCRIPTION command similar to what we do for
> > > > > > tables. And then later by new/existing command, we re-copy the 
> > > > > > already
> > > > > > existing sequences on the subscriber.
> > > > > >
> > > > > > The options for the new command could be:
> > > > > > Alter Subscription ... Refresh Sequences
> > > > > > Alter Subscription ... Replicate Sequences
> > > > > >
> > > > > > In the second option, we need to introduce a new keyword Replicate.
> > > > > > Can you think of any better option?
> > > > >
> > > > > Another idea is doing that using options. For example,
> > > > >
> > > > > For initial sequences synchronization:
> > > > >
> > > > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true);
> > > > >
> > > >
> > > > How will it interact with the existing copy_data option? So copy_data
> > > > will become equivalent to copy_table_data, right?
> > >
> > > Right.
> > >
> > > >
> > > > > For re-copy (or update) sequences:
> > > > >
> > > > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = 
> > > > > true);
> > > > >
> > > >
> > > > Similar to the previous point it can be slightly confusing w.r.t
> > > > copy_data. And would copy_sequence here mean that it would copy
> > > > sequence values of both pre-existing and newly added sequences, if so,
> > > > that would make it behave differently than copy_data?  The other
> > > > possibility in this direction would be to introduce an option like
> > > > replicate_all_sequences/copy_all_sequences which indicates a copy of
> > > > both pre-existing and new sequences, if any.
> > >
> > > Copying sequence data works differently than replicating table data
> > > (initial data copy and logical replication). So I thought the
> > > copy_sequence option (or whatever better name) always does both
> > > updating pre-existing sequences and adding new sequences. REFRESH
> > > PUBLICATION updates the tables to be subscribed, so we also update or
> > > add sequences associated to these tables.
> > >
> >
> > Are you imagining the behavior for sequences associated with tables
> > differently than the ones defined by the CREATE SEQUENCE .. command? I
> > was thinking that users would associate sequences with publications
> > similar to what we do for tables for both cases. For example, they
> > need to explicitly mention the sequences they want to replicate by
> > commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE
> > PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
> > SEQUENCES IN SCHEMA sch1;
> >
> > In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
> > should copy both the explicitly defined sequences and sequences
> > defined with the tables. Do you think a different variant for just
> > copying sequences implicitly associated with tables (say for identity
> > columns)?
>
> Oh, I was thinking that your proposal was to copy literally all
> sequences by REPLICA/REFRESH SEQUENCE command. But it seems to make
> sense to explicitly specify the sequences they want to replicate. It
> also means that they can create a publication that has only sequences.
> In this case, even if they create a subscription for that publication,
> we don't launch any apply workers for that subscription. Right?
>
> Also, given that the main use case (at least as the first step) is
> version upgrade, do we really need to support SEQUENCES IN SCHEMA and
> even FOR SEQUENCE?

Also, I guess that specifying individual sequences might not be easy
to use for users in some cases. For sequences owned by a column of a
table, users might want to specify them altogether, rather than
separately. For example, CREATE PUBLICATION ... FOR TABLE tab1 WITH
SEQUENCES means to add the table tab1 and its sequences to the
publication. For other sequences (i.e., not owned by any tables),
users might want to specify them individually.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Proposal to add page headers to SLRU pages

2024-06-10 Thread Bertrand Drouvot
Hi,

On Tue, Mar 19, 2024 at 06:48:33AM +, Li, Yong wrote:
> 
> Unfortunately, the test requires a setup of two different versions of PG. I am
> not aware of an existing test infrastructure which can run automated tests
> using two PGs. I did manually verify the output of pg_upgrade.

I think there is something in t/002_pg_upgrade.pl (see src/bin/pg_upgrade/TESTING)
that could be used to run automated tests using an old and a current version.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Historic snapshot doesn't track txns committed in BUILDING_SNAPSHOT state

2024-06-10 Thread cca5507
Hello hackers, I found that we currently don't track txns committed in
BUILDING_SNAPSHOT state because of the code in xact_decode():
/*
 * If the snapshot isn't yet fully built, we cannot decode anything, so
 * bail out.
 */
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
return;


This can cause a txn to take an incorrect historic snapshot and result in an
interruption of logical replication. Consider the following scenario:
(pub)create table t1 (id int primary key);
(pub)insert into t1 values (1);
(pub)create publication pub for table t1;
(sub)create table t1 (id int primary key);
(pub)begin; insert into t1 values (2); (txn1 in session1)
(sub)create subscription sub connection 'hostaddr=127.0.0.1 port=5432 user=xxx 
dbname=postgres' publication pub; (pub will switch to BUILDING_SNAPSHOT state 
soon)
(pub)begin; insert into t1 values (3); (txn2 in session2)
(pub)create table t2 (id int primary key); (session3)
(pub)commit; (commit txn1, and pub will switch to FULL_SNAPSHOT state soon)
(pub)begin; insert into t2 values (1); (txn3 in session3)
(pub)commit; (commit txn2, and pub will switch to CONSISTENT state soon)
(pub)commit; (commit txn3; replaying txn3 will fail because its snapshot
cannot see table t2)


The output of pub's log:
ERROR:  could not map filenumber "base/5/16395" to relation OID


Is this a bug? Should we also track the txns committed in BUILDING_SNAPSHOT 
state?


--
Regards,
ChangAo Chen

Re: Logical Replication of sequences

2024-06-10 Thread Amul Sul
On Sat, Jun 8, 2024 at 6:43 PM vignesh C  wrote:

> On Wed, 5 Jun 2024 at 14:11, Amit Kapila  wrote:
> [...]
> A new catalog table, pg_subscription_seq, has been introduced for
> mapping subscriptions to sequences. Additionally, the sequence LSN
> (Log Sequence Number) is stored, facilitating determination of
> sequence changes occurring before or after the returned sequence
> state.
>

Can't it be done using pg_depend? It seems a bit excessive unless I'm missing
something. How do you track sequence mapping with the publication?

Regards,
Amul


Re: Historic snapshot doesn't track txns committed in BUILDING_SNAPSHOT state

2024-06-10 Thread Michael Paquier
On Sun, Jun 09, 2024 at 11:21:52PM +0800, cca5507 wrote:
> Hello hackers, I found that we currently don't track txns committed in
> BUILDING_SNAPSHOT state because of the code in xact_decode():
>   /*
>* If the snapshot isn't yet fully built, we cannot decode anything, so
>* bail out.
>*/
>   if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT)
>   return;
> 
> The output of pub's log:
> ERROR:  could not map filenumber "base/5/16395" to relation OID
> 
> Is this a bug? Should we also track the txns committed in BUILDING_SNAPSHOT 
> state?

Clearly, this is not an error you should be able to see as a user.  So
yes, that's something that needs to be fixed.
--
Michael


signature.asc
Description: PGP signature


Re: Proposal to add page headers to SLRU pages

2024-06-10 Thread Michael Paquier
On Mon, Jun 10, 2024 at 07:19:56AM +, Bertrand Drouvot wrote:
> On Tue, Mar 19, 2024 at 06:48:33AM +, Li, Yong wrote:
>> Unfortunately, the test requires a setup of two different versions of PG. I am
>> not aware of an existing test infrastructure which can run automated tests
>> using two PGs. I did manually verify the output of pg_upgrade.
> 
> I think there is something in t/002_pg_upgrade.pl (see src/bin/pg_upgrade/TESTING)
> that could be used to run automated tests using an old and a current version.

Cluster.pm relies on install_path for stuff, where it is possible to
create tests with multiple nodes pointing to different installation
paths.  This allows mixing nodes with different build options, or just
different major versions like pg_upgrade's perl tests.
--
Michael


signature.asc
Description: PGP signature


Re: relfilenode statistics

2024-06-10 Thread Bertrand Drouvot
Hi,

On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 11:17 PM Andres Freund  wrote:
> > If we just want to keep prior stats upon arelation rewrite, we can just copy
> > the stats from the old relfilenode.  Or we can decide that those stats don't
> > really make sense anymore, and start from scratch.
> 
> I think we need to think carefully about what we want the user
> experience to be here. "Per-relfilenode stats" could mean "sometimes I
> don't know the relation OID so I want to use the relfilenumber
> instead, without changing the user experience" or it could mean "some
> of these stats actually properly pertain to the relfilenode rather
> than the relation so I want to associate them with the right object
> and that will affect how the user sees things." We need to decide
> which it is. If it's the former, then we need to examine whether the
> goal of hiding the distinction between relfilenode stats and relation
> stats from the user is in fact feasible. If it's the latter, then we
> need to make sure the whole patch reflects that design, which would
> include e.g. NOT copying stats from the old to the new relfilenode,
> and which would also include documenting the behavior in a way that
> will be understandable to users.

Thanks for sharing your thoughts!

Let's take the current heap_blks_read as an example: it currently survives
a relation rewrite and I guess we don't want to change the existing user
experience for it.

Now say we want to add "heap_blks_written" (like in this POC patch): then I
think it makes sense for the user to 1) query this new stat from the same
place as the existing heap_blks_read, i.e. pg_statio_all_tables, and 2) have
the same experience as far as the relation rewrite is concerned (keep the
previous stats).

To achieve the rewrite behavior we could:

1) copy the stats from the OLD relfilenode to the relation (like in the POC patch)
2) copy the stats from the OLD relfilenode to the NEW one (could be in a
dedicated field)

The PRO of 1) is that the behavior is consistent with the current heap_blks_read
and that the user could still see the current relfilenode stats (through a new
API) if they want to.

> In my experience, the worst thing you can do in cases like this is be
> somewhere in the middle. Then you tend to end up with stuff like: the
> difference isn't supposed to be something that the user knows or cares
> about, except that they do have to know and care because you haven't
> thoroughly covered up the deception, and often they have to reverse
> engineer the behavior because you didn't document what was really
> happening because you imagined that they wouldn't notice.

My idea was to move all that is in pg_statio_all_tables to relfilenode stats
and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure
the user can still retrieve the stats from pg_statio_all_tables in such a way
that it survives a rewrite, 3) provide dedicated APIs to retrieve
relfilenode stats but only for the current relfilenode, 4) document this
behavior. This is what the POC patch is doing for heap_blks_written (would
need to do the same for heap_blks_read and friends) except for the documentation
part. What do you think?
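As a concrete illustration of point 1) above, a hedged sketch of how the stat
would be consumed from the user's point of view; note that heap_blks_written
is only the POC's proposed column, not something that exists today:

    -- heap_blks_read exists today in pg_statio_all_tables; heap_blks_written
    -- is only a proposed addition from the POC patch.
    SELECT relname,
           heap_blks_read
           -- , heap_blks_written   -- hypothetical, per the POC patch
    FROM pg_statio_all_tables
    WHERE relname = 'my_table';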

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Conflict Detection and Resolution

2024-06-10 Thread Amit Kapila
On Fri, Jun 7, 2024 at 6:08 PM Tomas Vondra
 wrote:
>
> On 5/27/24 07:48, shveta malik wrote:
> > On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
> >  wrote:
> >>
> >> Which architecture are you aiming for? Here you talk about multiple
> >> providers, but the wiki page mentions active-active. I'm not sure how
> >> much this matters, but it might.
> >
> > Currently, we are working for multi providers case but ideally it
> > should work for active-active also. During further discussion and
> > implementation phase, if we find that, there are cases which will not
> > work in straight-forward way for active-active, then our primary focus
> > will remain to first implement it for multiple providers architecture.
> >
> >>
> >> Also, what kind of consistency you expect from this? Because none of
> >> these simple conflict resolution methods can give you the regular
> >> consistency models we're used to, AFAICS.
> >
> > Can you please explain a little bit more on this.
> >
>
> I was referring to the well established consistency models / isolation
> levels, e.g. READ COMMITTED or SNAPSHOT ISOLATION. This determines what
> guarantees the application developer can expect, what anomalies can
> happen, etc.
>
> I don't think any such isolation level can be implemented with a simple
> conflict resolution methods like last-update-wins etc. For example,
> consider an active-active where both nodes do
>
>   UPDATE accounts SET balance=balance+1000 WHERE id=1
>
> This will inevitably lead to a conflict, and while the last-update-wins
> resolves this "consistently" on both nodes (e.g. ending with the same
> result), it's essentially a lost update.
>

The idea for solving such conflicts is to use the delta-apply technique, where
the delta from both sides is applied to the respective columns. We plan to
target this as a separate patch. Now, if basic conflict resolution and delta
apply can't both go into one release, we shall document such cases clearly to
avoid misuse of the feature.
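To make the intended difference concrete, a back-of-the-envelope sketch with
assumed numbers (illustrative only, not the patch's implementation):

    -- Assume accounts.balance = 100 on both nodes, and both nodes concurrently run:
    --   UPDATE accounts SET balance = balance + 1000 WHERE id = 1;
    -- last-update-wins: one node's new value (1100) overwrites the other's,
    --   final balance = 1100                      (one increment is lost)
    -- delta apply: the remote delta (+1000) is applied on top of the local value,
    --   final balance = 100 + 1000 + 1000 = 2100  (both increments preserved)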

> This is a very simplistic example of course, I recall there are various
> more complex examples involving foreign keys, multi-table transactions,
> constraints, etc. But in principle it's a manifestation of the same
> inherent limitation of conflict detection and resolution etc.
>
> Similarly, I believe this affects not just active-active, but also the
> case where one node aggregates data from multiple publishers. Maybe not
> to the same extent / it might be fine for that use case,
>

I am not sure how much of a problem it is for a general logical replication
solution, but we do intend to work on solving such problems in a step-wise
manner. Trying to attempt everything in one patch doesn't seem advisable to me.

>
 but you said
> the end goal is to use this for active-active. So I'm wondering what's
> the plan, there.
>

I think at this stage we are not ready for active-active because, leaving
aside this feature, we need many other features, like replication of all
commands/objects (DDL replication, replication of large objects, etc.), global
sequences, and some sort of global two-phase transaction management for data
consistency. So, it would be better to consider the logical replication cases
now, intending to extend them to active-active once we have the other required
pieces.

> If I'm writing an application for active-active using this conflict
> handling, what assumptions can I make? Can I just do stuff as if on
> a single node, or do I need to be super conscious about the zillion ways
> things can misbehave in a distributed system?
>
> My personal opinion is that the closer this will be to the regular
> consistency levels, the better. If past experience taught me anything,
> it's very hard to predict how distributed systems with eventual
> consistency behave, and even harder to actually test the application in
> such environment.
>

I don't think this can in any way enable users to start writing
applications for active-active workloads. For something like what you
are saying, we probably need a global transaction manager (or a global
two_pc) as well to allow transactions to behave as they do on a single
node, or to achieve similar consistency levels. With such transaction
management, we can allow transactions to commit on a node only when doing
so doesn't lead to a conflict on the peer node.

> In any case, if there are any differences compared to the usual
> behavior, it needs to be very clearly explained in the docs.
>

I agree that docs should be clear about the cases that this can and
can't support.

> >>
> >> How is this going to deal with the fact that commit LSN and timestamps
> >> may not correlate perfectly? That is, commits may happen with LSN1 <
> >> LSN2 but with T1 > T2.
> >
> > Are you pointing to the issue where a session/txn has taken
> > 'xactStopTimestamp' timestamp earlier but is delayed to insert record
> > in XLOG, while another session/txn which has taken timestamp slightly
> > later succeeded to insert the record IN XLOG sooner than the session1,

Re: Wrong results with grouping sets

2024-06-10 Thread Richard Guo
On Wed, Jun 5, 2024 at 5:42 PM Richard Guo  wrote:
> Hence here is the v7 patchset.  I've also added detailed commit messages
> for the two patches.

This patchset does not apply any more.  Here is a new rebase.

While at it, I added more checks for 'root->group_rtindex', and also
added a new test case to verify that we generate window_pathkeys
correctly with grouping sets.
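For context, a sketch of the query shape such a test would exercise
(illustrative only; this is not the exact test case added by the patch):

    -- A window function evaluated on top of grouping sets: the window
    -- pathkeys must account for columns that grouping sets can make nullable.
    SELECT a, b, sum(c) AS total,
           rank() OVER (PARTITION BY a ORDER BY b) AS rnk
    FROM t
    GROUP BY GROUPING SETS ((a, b), (a));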

Thanks
Richard


v8-0001-Introduce-a-RTE-for-the-grouping-step.patch
Description: Binary data


v8-0002-Mark-expressions-nullable-by-grouping-sets.patch
Description: Binary data


Re: Logical Replication of sequences

2024-06-10 Thread Amit Kapila
On Mon, Jun 10, 2024 at 12:43 PM Masahiko Sawada  wrote:
>
> On Mon, Jun 10, 2024 at 3:14 PM Masahiko Sawada  wrote:
> >
> > On Fri, Jun 7, 2024 at 7:30 PM Amit Kapila  wrote:
> > >
> > >
> > > Are you imagining the behavior for sequences associated with tables
> > > differently than the ones defined by the CREATE SEQUENCE .. command? I
> > > was thinking that users would associate sequences with publications
> > > similar to what we do for tables for both cases. For example, they
> > > need to explicitly mention the sequences they want to replicate by
> > > commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE
> > > PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
> > > SEQUENCES IN SCHEMA sch1;
> > >
> > > In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
> > > should copy both the explicitly defined sequences and sequences
> > > defined with the tables. Do you think a different variant for just
> > > copying sequences implicitly associated with tables (say for identity
> > > columns)?
> >
> > Oh, I was thinking that your proposal was to copy literally all
> > sequences by REPLICA/REFRESH SEQUENCE command.
> >

I am trying to keep the behavior as close to tables as possible.

> > But it seems to make
> > sense to explicitly specify the sequences they want to replicate. It
> > also means that they can create a publication that has only sequences.
> > In this case, even if they create a subscription for that publication,
> > we don't launch any apply workers for that subscription. Right?
> >

Right, good point. I had not thought about this.

> > Also, given that the main use case (at least as the first step) is
> > version upgrade, do we really need to support SEQUENCES IN SCHEMA and
> > even FOR SEQUENCE?
>

At the very least, we can split the patch to move these variants to a
separate patch. Once the main patch is finalized, we can try to
evaluate the remaining variants separately.

> Also, I guess that specifying individual sequences might not be easy
> to use for users in some cases. For sequences owned by a column of a
> table, users might want to specify them altogether, rather than
> separately. For example, CREATE PUBLICATION ... FOR TABLE tab1 WITH
> SEQUENCES means to add the table tab1 and its sequences to the
> publication. For other sequences (i.e., not owned by any tables),
> users might want to specify them individually.
>

Yeah, or we can have a syntax like CREATE PUBLICATION ... FOR TABLE
tab1 INCLUDE SEQUENCES.  Normally, we use the WITH clause for options
(For example, CREATE SUBSCRIPTION ... WITH (streaming=...)).
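For readability, here are the syntax variants floated so far side by side;
all of these are proposals under discussion and none of them exists yet:

    -- Proposed syntax only; none of these forms are implemented.
    CREATE PUBLICATION pub1 FOR SEQUENCE s1, s2;
    CREATE PUBLICATION pub2 FOR ALL SEQUENCES;
    CREATE PUBLICATION pub3 FOR SEQUENCES IN SCHEMA sch1;
    CREATE PUBLICATION pub4 FOR TABLE tab1 WITH SEQUENCES;    -- earlier suggestion
    CREATE PUBLICATION pub5 FOR TABLE tab1 INCLUDE SEQUENCES; -- alternative above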

-- 
With Regards,
Amit Kapila.




Re: Format the code in xact_decode

2024-06-10 Thread Ashutosh Bapat
The changes make the code in the case XLOG_XACT_INVALIDATIONS block
syntactically consistent with the other block. LGTM.

On Sun, Jun 9, 2024 at 6:57 PM cca5507  wrote:

> Hello hackers, just as the title says:
> 1. Remove redundant parentheses.
> 2. Adjust the position of the break statement.
> --
> Regards,
> ChangAo Chen
>


-- 
Best Wishes,
Ashutosh Bapat


Re: AIX support

2024-06-10 Thread Laurenz Albe
On Fri, 2024-06-07 at 16:30 +, Srirama Kucherlapati wrote:
> Hi Team, We are pursuing to trim the changes wrt AIX. As of now we trimmed
> the changes with respect to XLC and currently with trimmed changes the
> buildfarm script passed (build and all the regression tests)
> The XLC changes were trimmed only in the below file
>     modified: configure
>     modified: configure.ac

Did you forget an attachment?

Yours,
Laurenz Albe




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-10 Thread Shlok Kyal
On Thu, 6 Jun 2024 at 11:49, Kyotaro Horiguchi  wrote:
>
> At Thu, 6 Jun 2024 12:49:45 +1000, Peter Smith  wrote 
> in
> > Hi, I have reproduced this multiple times now.
> >
> > I confirmed the initial post/steps from Alexander. i.e. The test
> > script provided [1] gets itself into a state where function
> > ReadPageInternal (called by XLogDecodeNextRecord and commented "Wait
> > for the next page to become available") constantly returns
> > XLREAD_FAIL. Ultimately the test times out because WalSndLoop() loops
> > forever, since it never calls WalSndDone() to exit the walsender
> > process.
>
> Thanks for the repro; I believe I understand what's happening here.
>
> During server shutdown, the latter half of the last continuation
> record may fail to be flushed. This is similar to what is described in
> the commit message of commit ff9f111bce. While shutting down,
> WalSndLoop() waits for XLogSendLogical() to consume WAL up to
> flushPtr, but in this case, the last record cannot complete without
> the continuation part starting from flushPtr, which is
> missing. However, in such cases, xlogreader.missingContrecPtr is set
> to the beginning of the missing part, but something similar to
>
> So, I believe the attached small patch fixes the behavior. I haven't
> come up with a good test script for this issue. Something like
> 026_overwrite_contrecord.pl might work, but this situation seems a bit
> more complex than what it handles.
>
> Versions back to 10 should suffer from the same issue and the same
> patch will be applicable without significant changes.

I tested the changes for PG 12 to master as we do not support prior versions.
The patch applied successfully for master and PG 16. I ran the test
provided in [1] multiple times and it ran successfully each time.
The patch did not apply on PG 15. I did a similar change for PG 15 and
created a patch.  I ran the test multiple times and it was successful
every time.
The patch did not apply on PG 14 down to PG 12. I made a similar change in
each of those branches, but the tests did not pass on them.

I have attached a patch which applies successfully on the PG 15 branch.

[1]: 
https://www.postgresql.org/message-id/f15d665f-4cd1-4894-037c-afdbe3692...@gmail.com


Thanks and Regards,
Shlok Kyal


v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s.patch
Description: Binary data


v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s-PG-15.patch
Description: Binary data


Re: Format the code in xact_decode

2024-06-10 Thread cca5507
Thank you for the reply! I have created a new patch entry in the commitfest:
Format the code in xact_decode (postgresql.org)


--
Regards,
ChangAo Chen

Re: Add support to TLS 1.3 cipher suites and curves lists

2024-06-10 Thread Daniel Gustafsson
> On 7 Jun 2024, at 19:14, Jacob Champion  
> wrote:

> - Could you separate the two features into two patches? That would
> make it easier for reviewers. (They can still share the same thread
> and CF entry.)

+1, please do.

> - The "curve" APIs have been renamed "group" in newer OpenSSLs for a
> while now, and we should probably use those if possible.

Agreed.  While not deprecated per se the curve API is considered obsolete and
is just aliased to the group API (OpenSSL using both the term obsolete and
deprecated to mean the same thing but with very different mechanics is quite
confusing).

> - I think parsing apart the groups list to check NIDs manually could
> lead to false negatives. From a docs skim, 3.0 allows providers to add
> their own group names, and 3.3 now supports question marks in the
> string to allow graceful fallbacks.

Parsing the list will likely risk false negatives as you say, but from skimming
the code there doesn't seem to be a good error message from SSL_set1_groups_list
to indicate whether list items were invalid (unless all of them were).  Maybe
calling the associated _get function to check the number of set groups could be
used to verify what happened?

Regarding the ciphersuites portion of the patch: I'm not particularly thrilled
about having one GUC for TLSv1.2 ciphers and another for TLSv1.3 ciphersuites;
users not all that familiar with TLS will likely find it confusing to figure out
what to do.

In which way is this feature needed, since this can be achieved with the config
directive "Ciphersuites" in openssl.conf IIUC?

If we add this, I think we should keep it blank and, if so, skip setting it at
all, falling back on the OpenSSL defaults.  The below default for the GUC does
not match the OpenSSL default, and I think we are better off trusting them on
this.

+   "TLS_AES_256_GCM_SHA384:TLS_AES_128_GCM_SHA256",

--
Daniel Gustafsson





list_free in addRangeTableEntryForJoin

2024-06-10 Thread Ilia Evdokimov

Hi Hackers

I have identified a potential memory leak in the
`addRangeTableEntryForJoin()` function. The second parameter of
`addRangeTableEntryForJoin()`, `colnames`, is a `List*` that is
concatenated with another `List*`, `eref->colnames`, using
`list_concat()`. We need to pass only the last `numaliases` elements of
the list, for which we use `list_copy_tail`. That function creates a
copy of the tail of `colnames`, so it is the copy that gets concatenated,
while the original `colnames` list remains allocated and is never freed.


To address this issue, I have invoked `list_free(colnames)` afterwards. 
If anyone is aware of where the list is being freed or has any 
suggestions for improvement, I would greatly appreciate your input.


Best Regards,

Ilia Evdokimov,

TantorLabs LCC
From cd7aa7ac5904501085a980944dc3c18f42721c06 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov 
Date: Mon, 10 Jun 2024 13:10:31 +0300
Subject: [PATCH] After concatenation two lists where the second one is from
 list_copy_tail do not free it

---
 src/backend/parser/parse_relation.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 2f64eaf0e3..8230cf27cc 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -2258,8 +2258,11 @@ addRangeTableEntryForJoin(ParseState *pstate,
 
 	/* fill in any unspecified alias columns */
 	if (numaliases < list_length(colnames))
+	{
 		eref->colnames = list_concat(eref->colnames,
 	 list_copy_tail(colnames, numaliases));
+		list_free(colnames);
+	}
 
 	if (numaliases > list_length(colnames))
 		ereport(ERROR,
-- 
2.34.1



Re: Conflict Detection and Resolution

2024-06-10 Thread shveta malik
On Fri, Jun 7, 2024 at 6:08 PM Tomas Vondra
 wrote:
>
> >>>
> >>>  UPDATE
> >>> 
> >>>
> >>> Conflict Detection Method:
> >>> 
> >>> Origin conflict detection: The ‘origin’ info is used to detect
> >>> conflict which can be obtained from commit-timestamp generated for
> >>> incoming txn at the source node. To compare remote’s origin with the
> >>> local’s origin, we must have origin information for local txns as well
> >>> which can be obtained from commit-timestamp after enabling
> >>> ‘track_commit_timestamp’ locally.
> >>> The one drawback here is the ‘origin’ information cannot be obtained
> >>> once the row is frozen and the commit-timestamp info is removed by
> >>> vacuum. For a frozen row, conflicts cannot be raised, and thus the
> >>> incoming changes will be applied in all the cases.
> >>>
> >>> Conflict Types:
> >>> 
> >>> a) update_differ: The origin of an incoming update's key row differs
> >>> from the local row i.e.; the row has already been updated locally or
> >>> by different nodes.
> >>> b) update_missing: The row with the same value as that incoming
> >>> update's key does not exist. Remote is trying to update a row which
> >>> does not exist locally.
> >>> c) update_deleted: The row with the same value as that incoming
> >>> update's key does not exist. The row is already deleted. This conflict
> >>> type is generated only if the deleted row is still detectable i.e., it
> >>> is not removed by VACUUM yet. If the row is removed by VACUUM already,
> >>> it cannot detect this conflict. It will detect it as update_missing
> >>> and will follow the default or configured resolver of update_missing
> >>> itself.
> >>>
> >>
> >> I don't understand the why should update_missing or update_deleted be
> >> different, especially considering it's not detected reliably. And also
> >> that even if we happen to find the row the associated TOAST data may
> >> have already been removed. So why would this matter?
> >
> > Here, we are trying to tackle the case where the row is 'recently'
> > deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
> > want to opt for a different resolution in such a case as against the
> > one where the corresponding row was not even present in the first
> > place. The case where the row was deleted long back may not fall into
> > this category as there are higher chances that they have been removed
> > by vacuum and can be considered equivalent to the update_ missing
> > case.
> >
>
> My point is that if we can't detect the difference reliably, it's not
> very useful. Consider this example:
>
> Node A:
>
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> The "correct" order of received messages on a third node is T1-T3-T2.
> But we may also see T1-T2-T3 and T3-T1-T2, e.g. due to network issues
> and so on. For T1-T2-T3 the right decision is to discard the update,
> while for T3-T1-T2 it's to either wait for the INSERT or wait for the
> insert to arrive.
>
> But if we misdetect the situation, we either end up with a row that
> shouldn't be there, or losing an update.

Doesn't the above example indicate that 'update_deleted' should also
be considered a necessary conflict type?  Please see the possibilities
of conflicts in all three cases:


The "correct" order of receiving messages on node C (as suggested
above) is T1-T3-T2   (case1)
--
T1 will insert the row.
T3 will have an update_differ conflict; latest_timestamp_wins or apply
will apply it, while earliest_timestamp_wins or skip will skip it.
T2 will delete the row (irrespective of whether the update happened or not).
End Result: No Data.

T1-T2-T3
--
T1 will insert the row.
T2 will delete the row.
T3 will have an update_deleted conflict. If it is 'update_deleted', the
chances are that the resolver set here is 'skip' (the default is also
'skip' in this case).

If vacuum has deleted that row (or if we don't support
'update_deleted' conflict), it will be 'update_missing' conflict. In
that case, the user may end up inserting the row if resolver chosen is
in favor of apply (which seems an obvious choice for 'update_missing'
conflict; default is also 'apply_or_skip').

End result:
Row inserted with 'update_missing'.
Row correctly skipped with 'update_deleted' (assuming the obvious
choice seems to be 'skip' for update_deleted case).

So it seems that with the 'update_deleted' conflict, there are higher
chances of opting for the right decision here (which is to discard the
update), as 'update_deleted' conveys correct info to the user.  The
'update_missing' conflict OTOH does not convey correct info, and the user
may end up inserting the data by choosing apply-favoring resolvers for
'update_missing'. Again, we get the benefit of 'update_deleted' for
*recently* deleted rows only.

T3-T1-T2
--
T3 may end up inserting the record if the resolver is in favor of
'apply' and all the

Re: Conflict Detection and Resolution

2024-06-10 Thread shveta malik
On Fri, Jun 7, 2024 at 6:10 PM Tomas Vondra
 wrote:
>
> >>> I don't understand the why should update_missing or update_deleted be
> >>> different, especially considering it's not detected reliably. And also
> >>> that even if we happen to find the row the associated TOAST data may
> >>> have already been removed. So why would this matter?
> >>
> >> Here, we are trying to tackle the case where the row is 'recently'
> >> deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
> >> want to opt for a different resolution in such a case as against the
> >> one where the corresponding row was not even present in the first
> >> place. The case where the row was deleted long back may not fall into
> >> this category as there are higher chances that they have been removed
> >> by vacuum and can be considered equivalent to the update_ missing
> >> case.
> >>
> >> Regarding "TOAST column" for deleted row cases, we may need to dig
> >> more. Thanks for bringing this case. Let me analyze more here.
> >>
> > I tested a simple case with a table with one TOAST column and found
> > that when a tuple with a TOAST column is deleted, both the tuple and
> > corresponding pg_toast entries are marked as ‘deleted’ (dead) but not
> > removed immediately. The main tuple and respective pg_toast entry are
> > permanently deleted only during vacuum. First, the main table’s dead
> > tuples are vacuumed, followed by the secondary TOAST relation ones (if
> > available).
> > Please let us know if you have a specific scenario in mind where the
> > TOAST column data is deleted immediately upon ‘delete’ operation,
> > rather than during vacuum, which we are missing.
> >
>
> I'm pretty sure you can vacuum the TOAST table directly, which means
> you'll end up with a deleted tuple with TOAST pointers, but with the
> TOAST entries already gone.
>

It is true that for a deleted row, its toast entries can be vacuumed
earlier than the original/parent row, but we do not need to be
concerned about that to raise 'update_deleted'. To raise an
'update_deleted' conflict, it is sufficient to know that the row has
been deleted and not yet vacuumed, regardless of the presence or
absence of its toast entries. Once this is determined, we need to
build the tuple from the remote data and apply it (provided the resolver
is configured that way). If the tuple cannot be fully constructed from the remote
data, the apply operation will either be skipped or an error will be
raised, depending on whether the user has chosen the apply_or_skip or
apply_or_error option.

In cases where the table has toast columns but the remote data does
not include the toast-column entry (when the toast column is
unmodified and not part of the replica identity), the resolution for
'update_deleted' will be no worse than for 'update_missing'. That is,
in both cases we cannot construct the full tuple, and thus the
operation either needs to be skipped or an error needs to be raised.

thanks
Shveta




Re: confirmed flush lsn seems to be move backward in certain error cases

2024-06-10 Thread Amit Kapila
On Tue, Feb 20, 2024 at 12:35 PM vignesh C  wrote:
>
> On Sat, 17 Feb 2024 at 12:03, Amit Kapila  wrote:
> >
> >
> > @@ -1839,7 +1839,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
> >
> >   SpinLockAcquire(&MyReplicationSlot->mutex);
> >
> > - MyReplicationSlot->data.confirmed_flush = lsn;
> > + if (lsn > MyReplicationSlot->data.confirmed_flush)
> > + MyReplicationSlot->data.confirmed_flush = lsn;
> >
> >   /* if we're past the location required for bumping xmin, do so */
> >   if (MyReplicationSlot->candidate_xmin_lsn != InvalidXLogRecPtr &&
> > @@ -1904,7 +1905,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
> >   else
> >   {
> >   SpinLockAcquire(&MyReplicationSlot->mutex);
> > - MyReplicationSlot->data.confirmed_flush = lsn;
> > + if (lsn > MyReplicationSlot->data.confirmed_flush)
> > + MyReplicationSlot->data.confirmed_flush = lsn;
> >
> > BTW, from which code path does it update the prior value of
> > confirmed_flush?
>
> The confirmed_flush is getting set in the else condition for this scenario.
>
> If it is through the else check, then can we see if
> > it may change the confirm_flush to the prior position via the first
> > code path? I am asking because in the first code path, we can even
> > flush the re-treated value of confirm_flush LSN.
>
> I was not able to find any scenario to set a prior position with the
> first code path. I tried various scenarios like adding delay in
> walsender, add delay in apply worker, restart the instances and with
> various DML operations. It was always setting it to either to the same
> value as previous or greater value.
>

Fair enough. This means that in the prior versions, it was never
possible to move the confirmed_flush LSN in the slot to a backward
position on disk. So, moving it backward temporarily (in memory)
shouldn't create any problem. I would prefer your
Assert_confirmed_flush_will_always_not_be_less_than_last_saved_confirmed_flush.patch
to fix this issue.

Thoughts?

--
With Regards,
Amit Kapila.




Re: Logical Replication of sequences

2024-06-10 Thread vignesh C
On Mon, 10 Jun 2024 at 12:24, Amul Sul  wrote:
>
>
>
> On Sat, Jun 8, 2024 at 6:43 PM vignesh C  wrote:
>>
>> On Wed, 5 Jun 2024 at 14:11, Amit Kapila  wrote:
>> [...]
>> A new catalog table, pg_subscription_seq, has been introduced for
>> mapping subscriptions to sequences. Additionally, the sequence LSN
>> (Log Sequence Number) is stored, facilitating determination of
>> sequence changes occurring before or after the returned sequence
>> state.
>
>
> Can't it be done using pg_depend? It seems a bit excessive unless I'm missing
> something.

We'll require the LSN because the sequence LSN informs the user that
the sequence has been synchronized up to that LSN in pg_subscription_seq.
Since we are not supporting incremental sync, the user will be able to
identify whether they should run refresh sequences or not by comparing the
LSN in pg_subscription_seq with the LSN of the sequence (using the added
pg_sequence_state) on the publisher.  Also, this parallels our
implementation for pg_subscription_seq and will aid in expanding to
a) incremental synchronization and b) utilizing workers for
synchronization using sequence states if necessary.

How do you track sequence mapping with the publication?

In the publisher we use pg_publication_rel and
pg_publication_namespace for mapping the sequences with the
publication.

Regards,
Vignesh




Re: Conflict Detection and Resolution

2024-06-10 Thread Tomas Vondra
On 6/10/24 10:54, Amit Kapila wrote:
> On Fri, Jun 7, 2024 at 6:08 PM Tomas Vondra
>  wrote:
>>
>> On 5/27/24 07:48, shveta malik wrote:
>>> On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
>>>  wrote:

 Which architecture are you aiming for? Here you talk about multiple
 providers, but the wiki page mentions active-active. I'm not sure how
 much this matters, but it might.
>>>
>>> Currently, we are working for multi providers case but ideally it
>>> should work for active-active also. During further discussion and
>>> implementation phase, if we find that, there are cases which will not
>>> work in straight-forward way for active-active, then our primary focus
>>> will remain to first implement it for multiple providers architecture.
>>>

 Also, what kind of consistency you expect from this? Because none of
 these simple conflict resolution methods can give you the regular
 consistency models we're used to, AFAICS.
>>>
>>> Can you please explain a little bit more on this.
>>>
>>
>> I was referring to the well established consistency models / isolation
>> levels, e.g. READ COMMITTED or SNAPSHOT ISOLATION. This determines what
>> guarantees the application developer can expect, what anomalies can
>> happen, etc.
>>
>> I don't think any such isolation level can be implemented with a simple
>> conflict resolution methods like last-update-wins etc. For example,
>> consider an active-active where both nodes do
>>
>>   UPDATE accounts SET balance=balance+1000 WHERE id=1
>>
>> This will inevitably lead to a conflict, and while the last-update-wins
>> resolves this "consistently" on both nodes (e.g. ending with the same
>> result), it's essentially a lost update.
>>
> 
> The idea to solve such conflicts is using the delta apply technique
> where the delta from both sides will be applied to the respective
> columns. We do plan to target this as a separate patch. Now, if the
> basic conflict resolution and delta apply both can't go in one
> release, we shall document such cases clearly to avoid misuse of the
> feature.
> 

Perhaps, but it's not like having delta conflict resolution (or even
CRDT as a more generic variant) would lead to a regular consistency
model in a distributed system. At least I don't think it can achieve
that, because of the asynchronicity.

Consider a table with a "CHECK (amount < 1000)" constraint, and an update
that sets (amount = amount + 900) on two nodes. AFAIK there's no way to
reconcile this using delta (or any other) conflict resolution.
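A concrete sketch of that scenario (illustrative table and numbers only):

    CREATE TABLE accounts (id int PRIMARY KEY,
                           amount numeric CHECK (amount < 1000));
    INSERT INTO accounts VALUES (1, 50);
    -- Both nodes concurrently execute:
    --   UPDATE accounts SET amount = amount + 900 WHERE id = 1;
    -- Each node locally ends up with 950, which passes the CHECK, but applying
    -- the remote delta as well would give 50 + 900 + 900 = 1850 and violate it.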

Which does not mean we should not have some form of conflict resolution,
as long as we know what the goal is. I simply don't want to spend time
working on this, add a lot of complex code, and then realize it doesn't
give us a consistency model that makes sense.

Which leads me back to my original question - what is the consistency
model this you expect to get from this (possibly when combined with some
other pieces?)?

>> This is a very simplistic example of course, I recall there are various
>> more complex examples involving foreign keys, multi-table transactions,
>> constraints, etc. But in principle it's a manifestation of the same
>> inherent limitation of conflict detection and resolution etc.
>>
>> Similarly, I believe this affects not just active-active, but also the
>> case where one node aggregates data from multiple publishers. Maybe not
>> to the same extent / it might be fine for that use case,
>>
> 
> I am not sure how much it is a problem for general logical replication
> solution but we do intend to work on solving such problems in
> step-wise manner. Trying to attempt everything in one patch doesn't
> seem advisable to me.
> 

I didn't say it needs to be done in one patch. I asked for someone to
explain what is the goal - consistency model observed by the users.

>>
>  but you said
>> the end goal is to use this for active-active. So I'm wondering what's
>> the plan, there.
>>
> 
> I think at this stage we are not ready for active-active because
> leaving aside this feature we need many other features like
> replication of all commands/objects (DDL replication, replicate large
> objects, etc.), Global sequences, some sort of global two_phase
> transaction management for data consistency, etc. So, it would be
> better to consider logical replication cases intending to extend it
> for active-active when we have other required pieces.
> 

We're not ready for active-active, sure. And I'm not saying a conflict
resolution would make us ready. The question is what consistency model
we'd like to get from the active-active, and whether conflict resolution
can get us there ...

As for the other missing bits (DDL replication, large objects, global
sequences), I think those are somewhat independent of the question I'm
asking. And some of the stuff is also somewhat optional - for example I
think it'd be fine to not support large objects or global sequences.

>> If I'm writing an application for active-active using th

Re: list_free in addRangeTableEntryForJoin

2024-06-10 Thread Ranier Vilela
On Mon, Jun 10, 2024 at 07:35, Ilia Evdokimov <
ilya.evdoki...@tantorlabs.com> wrote:

> Hi Hackers
>
> I have identified a potential memory leak in the
> `addRangeTableEntryForJoin()` function. The second parameter of
> `addRangeTableEntryForJoin()`, `colnames`, is a `List*` that is
> concatenated with another `List*`, `eref->colnames`, using
> `list_concat()`. We need to pass only the last `numaliases` elements of
> the list, for which we use `list_copy_tail`. This function creates a
> copy of the `colnames` list, resulting in `colnames` pointing to the
> current list that will not be freed. Consequently, a new list is already
> concatenated.
>
> To address this issue, I have invoked `list_free(colnames)` afterwards.
> If anyone is aware of where the list is being freed or has any
> suggestions for improvement, I would greatly appreciate your input.
>
list_copy_tail actually makes a new copy.
But callers of addRangeTableEntryForJoin expect to handle a list or NIL,
so if we free the memory, shouldn't the variable *colnames* also be set
to NIL?

best regards,
Ranier Vilela


Re: Conflict Detection and Resolution

2024-06-10 Thread Tomas Vondra



On 6/10/24 12:56, shveta malik wrote:
> On Fri, Jun 7, 2024 at 6:08 PM Tomas Vondra
>  wrote:
>>
>
>  UPDATE
> 
>
> Conflict Detection Method:
> 
> Origin conflict detection: The ‘origin’ info is used to detect
> conflict which can be obtained from commit-timestamp generated for
> incoming txn at the source node. To compare remote’s origin with the
> local’s origin, we must have origin information for local txns as well
> which can be obtained from commit-timestamp after enabling
> ‘track_commit_timestamp’ locally.
> The one drawback here is the ‘origin’ information cannot be obtained
> once the row is frozen and the commit-timestamp info is removed by
> vacuum. For a frozen row, conflicts cannot be raised, and thus the
> incoming changes will be applied in all the cases.
>
> Conflict Types:
> 
> a) update_differ: The origin of an incoming update's key row differs
> from the local row i.e.; the row has already been updated locally or
> by different nodes.
> b) update_missing: The row with the same value as that incoming
> update's key does not exist. Remote is trying to update a row which
> does not exist locally.
> c) update_deleted: The row with the same value as that incoming
> update's key does not exist. The row is already deleted. This conflict
> type is generated only if the deleted row is still detectable i.e., it
> is not removed by VACUUM yet. If the row is removed by VACUUM already,
> it cannot detect this conflict. It will detect it as update_missing
> and will follow the default or configured resolver of update_missing
> itself.
>

 I don't understand the why should update_missing or update_deleted be
 different, especially considering it's not detected reliably. And also
 that even if we happen to find the row the associated TOAST data may
 have already been removed. So why would this matter?
>>>
>>> Here, we are trying to tackle the case where the row is 'recently'
>>> deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
>>> want to opt for a different resolution in such a case as against the
>>> one where the corresponding row was not even present in the first
>>> place. The case where the row was deleted long back may not fall into
>>> this category as there are higher chances that they have been removed
>>> by vacuum and can be considered equivalent to the update_ missing
>>> case.
>>>
>>
>> My point is that if we can't detect the difference reliably, it's not
>> very useful. Consider this example:
>>
>> Node A:
>>
>>   T1: INSERT INTO t (id, value) VALUES (1,1);
>>
>>   T2: DELETE FROM t WHERE id = 1;
>>
>> Node B:
>>
>>   T3: UPDATE t SET value = 2 WHERE id = 1;
>>
>> The "correct" order of received messages on a third node is T1-T3-T2.
>> But we may also see T1-T2-T3 and T3-T1-T2, e.g. due to network issues
>> and so on. For T1-T2-T3 the right decision is to discard the update,
>> while for T3-T1-T2 it's to either wait for the INSERT or wait for the
>> insert to arrive.
>>
>> But if we misdetect the situation, we either end up with a row that
>> shouldn't be there, or losing an update.
> 
> Doesn't the above example indicate that 'update_deleted' should also
> be considered a necessary conflict type?  Please see the possibilities
> of conflicts in all three cases:
> 
> 
> The "correct" order of receiving messages on node C (as suggested
> above) is T1-T3-T2   (case1)
> --
> T1 will insert the row.
> T3 will have update_differ conflict; latest_timestamp wins or apply
> will apply it. earliest_timestamp_wins or skip will skip it.
> T2 will delete the row (irrespective of whether the update happened or not).
> End Result: No Data.
> 
> T1-T2-T3
> --
> T1 will insert the row.
> T2 will delete the row.
> T3 will have conflict update_deleted. If it is 'update_deleted', the
> chances are that the resolver set here is to 'skip' (default is also
> 'skip' in this case).
> 
> If vacuum has deleted that row (or if we don't support
> 'update_deleted' conflict), it will be 'update_missing' conflict. In
> that case, the user may end up inserting the row if resolver chosen is
> in favor of apply (which seems an obvious choice for 'update_missing'
> conflict; default is also 'apply_or_skip').
> 
> End result:
> Row inserted with 'update_missing'.
> Row correctly skipped with 'update_deleted' (assuming the obvious
> choice seems to be 'skip' for update_deleted case).
> 
> So it seems that with 'update_deleted' conflict, there are higher
> chances of opting for right decision here (which is to discard the
> update), as 'update_deleted' conveys correct info to the user.  The
> 'update_missing' OTOH does not convey correct info and user may end up
> inserting the data by choosing apply favoring resolvers for
> 'update_missing'. Again, we get benefit of 'update_d

Re: list_free in addRangeTableEntryForJoin

2024-06-10 Thread Ilia Evdokimov
> But callers of addRangeTableEntryForJoin expect to handle a list or NIL,
> so if we free the memory
I've thoroughly reviewed all callers of the 
`addRangeTableEntryForJoin()` function and confirmed that the list is 
not used after this function is called. Since 
`addRangeTableEntryForJoin()` is the last function to use this list, it 
would be more efficient to free the `List` at the point of its declaration.


I'll attach new patch where I free the lists.

Regards,

Ilia Evdokimov,

Tantor Labs LCC
From 853c5bc854bcc03e791cf32cc8cad7b257ec558f Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov 
Date: Mon, 10 Jun 2024 15:09:12 +0300
Subject: [PATCH] After concatenation two lists where the second one is from
 list_copy_tail do not free it

---
 src/backend/parser/analyze.c  | 2 ++
 src/backend/parser/parse_clause.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 28fed9d87f..c0708a6e3e 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -1943,6 +1943,8 @@ transformSetOperationStmt(ParseState *pstate, SelectStmt *stmt)
 	if (pstate->p_hasAggs || qry->groupClause || qry->groupingSets || qry->havingQual)
 		parseCheckAggregates(pstate, qry);
 
+	list_free(targetnames);
+
 	return qry;
 }
 
diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index 8118036495..25f8c84a9f 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -1620,6 +1620,8 @@ transformFromClauseItem(ParseState *pstate, Node *n,
 		*top_nsitem = nsitem;
 		*namespace = lappend(my_namespace, nsitem);
 
+		list_free(res_colnames);
+
 		return (Node *) j;
 	}
 	else
-- 
2.34.1



Re: list_free in addRangeTableEntryForJoin

2024-06-10 Thread Ranier Vilela
Em seg., 10 de jun. de 2024 às 09:11, Ilia Evdokimov <
ilya.evdoki...@tantorlabs.com> escreveu:

>  >But callers of addRangeTableEntryForJoin(), expects to handle a list
> or NIL, if we free the memory
> I've thoroughly reviewed all callers of the
> `addRangeTableEntryForJoin()` function and confirmed that the list is
> not used after this function is called. Since
> `addRangeTableEntryForJoin()` is the last function to use this list, it
> would be more efficient to free the `List` at the point of its declaration.
>
> I'll attach new patch where I free the lists.
>
Seems like a better style.

Now you need to analyze whether the memory in question is managed by a
memory context and released as a whole block,
which would make this explicit free useless.

best regards,
Ranier Vilela


Re: Fix grammar oddities in comments

2024-06-10 Thread James Coleman
On Wed, Jun 5, 2024 at 5:34 AM David Rowley  wrote:
>
> On Sun, 2 Jun 2024 at 10:08, James Coleman  wrote:
> > See attached for a small patch fixing some typos and grammatical
> > errors in a couple of comments.
>
> Thanks. I pushed this after messing with the comments a bit more.

Thanks!

> > Side note: It's not clear to me what "Vars of higher levels don't
> > matter here" means in this context (or how that claim is justified),
> > but I haven't changed that part of the comment opting to simply
> > resolve the clear mistakes in the wording here.
>
> It just means Vars with varlevelsup >= 2 don't matter. It only cares
> about Vars with varlevelsup==1, i.e. Vars of the sub-query's direct
> parent.

Yes, I understood the content, but I didn't see any justification
provided, which is what I'd hope for in a comment like this (the why,
not simply the what).

Anyway, thanks again for reviewing and committing.

James Coleman




Re: confirmed flush lsn seems to be move backward in certain error cases

2024-06-10 Thread Shlok Kyal
On Mon, 10 Jun 2024 at 16:39, Amit Kapila  wrote:
>
> On Tue, Feb 20, 2024 at 12:35 PM vignesh C  wrote:
> >
> > On Sat, 17 Feb 2024 at 12:03, Amit Kapila  wrote:
> > >
> > >
> > > @@ -1839,7 +1839,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
> > >
> > >   SpinLockAcquire(&MyReplicationSlot->mutex);
> > >
> > > - MyReplicationSlot->data.confirmed_flush = lsn;
> > > + if (lsn > MyReplicationSlot->data.confirmed_flush)
> > > + MyReplicationSlot->data.confirmed_flush = lsn;
> > >
> > >   /* if we're past the location required for bumping xmin, do so */
> > >   if (MyReplicationSlot->candidate_xmin_lsn != InvalidXLogRecPtr &&
> > > @@ -1904,7 +1905,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
> > >   else
> > >   {
> > >   SpinLockAcquire(&MyReplicationSlot->mutex);
> > > - MyReplicationSlot->data.confirmed_flush = lsn;
> > > + if (lsn > MyReplicationSlot->data.confirmed_flush)
> > > + MyReplicationSlot->data.confirmed_flush = lsn;
> > >
> > > BTW, from which code path does it update the prior value of
> > > confirmed_flush?
> >
> > The confirmed_flush is getting set in the else condition for this scenario.
> >
> > If it is through the else check, then can we see if
> > > it may change the confirm_flush to the prior position via the first
> > > code path? I am asking because in the first code path, we can even
> > > flush the re-treated value of confirm_flush LSN.
> >
> > I was not able to find any scenario to set a prior position with the
> > first code path. I tried various scenarios like adding delay in
> > walsender, add delay in apply worker, restart the instances and with
> > various DML operations. It was always setting it to either to the same
> > value as previous or greater value.
> >
>
> Fair enough. This means that in the prior versions, it was never
> possible to move confirmed_flush LSN in the slot to a backward
> position on the disk. So, moving it backward temporarily (in the
> memory) shouldn't create any problem. I would prefer your
> Assert_confirmed_flush_will_always_not_be_less_than_last_saved_confirmed_flush.patch
> to fix this issue.
>
> Thoughts?

I was able to reproduce the issue with the test script provided in
[1]: out of 10 runs of the script, the issue reproduced 4 times. I also
tested the patch
Assert_confirmed_flush_will_always_not_be_less_than_last_saved_confirmed_flush.patch,
and it resolves the issue: with the patch applied, the issue did not
reproduce in 20 runs of the script.

[1]: 
https://www.postgresql.org/message-id/CALDaNm3hgow2%2BoEov5jBk4iYP5eQrUCF1yZtW7%2BdV3J__p4KLQ%40mail.gmail.com

Thanks and Regards,
Shlok Kyal




Non-text mode for pg_dumpall

2024-06-10 Thread Andrew Dunstan



Tom and Nathan opined recently that providing for non-text mode for 
pg_dumpall would be a Good Thing (TM). Not having it has been a 
long-standing complaint, so I've decided to give it a go.


I think we would need to restrict it to directory mode, at least to 
begin with. I would have a toc.dat with a different magic block (say 
"PGGLO" instead of "PGDMP") containing the global entries (roles, 
tablespaces, databases). Then for each database there would be a 
subdirectory (named for its toc entry) with a standard directory mode 
dump for that database. These could be generated in parallel (possibly 
by pg_dumpall calling pg_dump for each database). pg_restore on 
detecting a global-type toc.dat would restore the globals and then each 
of the databases (again possibly in parallel).
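
To make that concrete, here is a rough C sketch of how a reader might tell 
the two archive kinds apart by their magic block. The "PGGLO" tag is only 
the proposal above; none of this exists in pg_dump yet.

```c
#include <stdio.h>
#include <string.h>

typedef enum ArchiveKind
{
    ARCHIVE_UNKNOWN,
    ARCHIVE_DATABASE,           /* classic "PGDMP" per-database archive */
    ARCHIVE_GLOBAL              /* hypothetical "PGGLO" pg_dumpall archive */
} ArchiveKind;

/* Peek at the first bytes of a toc.dat file and classify it. */
static ArchiveKind
sniff_archive_kind(const char *toc_path)
{
    char        magic[5];
    FILE       *fp = fopen(toc_path, "rb");
    ArchiveKind kind = ARCHIVE_UNKNOWN;

    if (fp == NULL)
        return ARCHIVE_UNKNOWN;
    if (fread(magic, 1, sizeof(magic), fp) == sizeof(magic))
    {
        if (memcmp(magic, "PGDMP", 5) == 0)
            kind = ARCHIVE_DATABASE;
        else if (memcmp(magic, "PGGLO", 5) == 0)
            kind = ARCHIVE_GLOBAL;
    }
    fclose(fp);
    return kind;
}
```

On seeing the global kind, pg_restore would restore the globals and then 
recurse into the per-database subdirectories.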


I'm sure there are many wrinkles I haven't thought of, but I don't see 
any insurmountable obstacles, just a significant amount of code.


Barring the unforeseen, my aim is to have a preliminary patch by the 
September CF.


Following that I would turn my attention to using it in pg_upgrade.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Postgresql OOM

2024-06-10 Thread Radu Radutiu
>
>
> FWIW, it can be useful to configure the OS with strict memory overcommit.
> That
> causes postgres to fail more gracefully, because the OOM killer won't be
> invoked.
>

In the current setup the database is used as an embedded DB, with the
application sharing the same host as the database. Setting strict memory
overcommit would affect the other application, but configuring LimitAS for the
postgresql systemd service should work.
Does this mean that the fact that this query uses more than 100x the
configured work_mem is not considered a bug? Should I create a bug report?


Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-06-10 Thread Robert Haas
On Mon, Jun 10, 2024 at 3:09 AM Ashutosh Bapat
 wrote:
> This is just one instance of measurements. If I run the experiment multiple 
> times the results and the patterns will vary. Usually I have found planning 
> time to vary within 5% for regular tables and within 9% for partitioned 
> tables with a large number of partitions. Below are measurements with the 
> experiment repeated multiple times. For a given number of partitioned tables 
> (each with 1000 partitions) being joined, planning time is measured 10 
> consecutive times. For this set of 10 runs we calculate average and standard 
> deviation of planning time. Such 10 sets are sampled. This means planning 
> time is sampled 100 times in total with and without patch respectively. 
> Measurements with master and patched are reported in the attached excel sheet.

Well, this is fine then I guess, but if the original results weren't
stable enough for people to draw conclusions from, then it's better
not to post them, and instead do this work to get results that are
stable before posting.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: list_free in addRangeTableEntryForJoin

2024-06-10 Thread Ilia Evdokimov
>Now you need to analyze whether the memory in question is not managed 
by a Context

I've already analyzed. Let's explain details:


1. analyze.c
1718: List* targetnames;
1815: targetnames = NIL;
1848: targetnames = lappend(targetnames, makeString(colName));
1871: addRangeTableEntryForJoin(...);
=> list_free(targetnames);

2. parse_clause.c
1163: List* res_colnames;
1289: res_colnames = NIL;
1324: foreach(col, res_colnames);
1396: res_colnames = lappend(res_colnames, lfirst(ucol));
1520, 1525: extractRemainingColumns(...);
   |
    290: *res_colnames = lappend(*res_colnames, lfirst(lc));
1543: addRangeTableEntryForJoin(...);
=> list_free(res_colnames);


As you can see, there are no other pointers working with this block of 
memory, and all operations above are either read-only or append nodes to 
the lists. If I am mistaken, please correct me.
Furthermore, I will call `list_free` immediately after 
`addRangeTableEntryForJoin()`. The new patch is attached.


Thanks for reviewing. I'm looking forward to new suggestions.

Regards,

Ilia Evdokimov,

Tantor Labs LCC
From b545770d57ac3ca137e9ad97c004576e77213648 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov 
Date: Mon, 10 Jun 2024 16:39:14 +0300
Subject: [PATCH] After concatenation two lists where the second one is from
 list_copy_tail do not free it

---
 src/backend/parser/analyze.c  | 2 ++
 src/backend/parser/parse_clause.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 28fed9d87f..50c47a64ed 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -1880,6 +1880,8 @@ transformSetOperationStmt(ParseState *pstate, SelectStmt *stmt)
 		NULL,
 		false);
 
+   list_free(targetnames);
+
 	sv_namespace = pstate->p_namespace;
 	pstate->p_namespace = NIL;
 
diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index 8118036495..2fa6c03be7 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -1552,6 +1552,8 @@ transformFromClauseItem(ParseState *pstate, Node *n,
 		   j->alias,
 		   true);
 
+   list_free(res_colnames);
+
 		/* Verify that we correctly predicted the join's RT index */
 		Assert(j->rtindex == nsitem->p_rtindex);
 		/* Cross-check number of columns, too */
-- 
2.34.1



Re: confirmed flush lsn seems to be move backward in certain error cases

2024-06-10 Thread vignesh C
On Mon, 10 Jun 2024 at 16:38, Amit Kapila  wrote:
>
> On Tue, Feb 20, 2024 at 12:35 PM vignesh C  wrote:
> >
> > On Sat, 17 Feb 2024 at 12:03, Amit Kapila  wrote:
> > >
> > >
> > > @@ -1839,7 +1839,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
> > >
> > >   SpinLockAcquire(&MyReplicationSlot->mutex);
> > >
> > > - MyReplicationSlot->data.confirmed_flush = lsn;
> > > + if (lsn > MyReplicationSlot->data.confirmed_flush)
> > > + MyReplicationSlot->data.confirmed_flush = lsn;
> > >
> > >   /* if we're past the location required for bumping xmin, do so */
> > >   if (MyReplicationSlot->candidate_xmin_lsn != InvalidXLogRecPtr &&
> > > @@ -1904,7 +1905,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
> > >   else
> > >   {
> > >   SpinLockAcquire(&MyReplicationSlot->mutex);
> > > - MyReplicationSlot->data.confirmed_flush = lsn;
> > > + if (lsn > MyReplicationSlot->data.confirmed_flush)
> > > + MyReplicationSlot->data.confirmed_flush = lsn;
> > >
> > > BTW, from which code path does it update the prior value of
> > > confirmed_flush?
> >
> > The confirmed_flush is getting set in the else condition for this scenario.
> >
> > If it is through the else check, then can we see if
> > > it may change the confirm_flush to the prior position via the first
> > > code path? I am asking because in the first code path, we can even
> > > flush the re-treated value of confirm_flush LSN.
> >
> > I was not able to find any scenario to set a prior position with the
> > first code path. I tried various scenarios like adding delay in
> > walsender, add delay in apply worker, restart the instances and with
> > various DML operations. It was always setting it to either to the same
> > value as previous or greater value.
> >
>
> Fair enough. This means that in the prior versions, it was never
> possible to move confirmed_flush LSN in the slot to a backward
> position on the disk. So, moving it backward temporarily (in the
> memory) shouldn't create any problem. I would prefer your
> Assert_confirmed_flush_will_always_not_be_less_than_last_saved_confirmed_flush.patch
> to fix this issue.

I have re-verified this by running the tests in a loop 150
times and found it to be working fine. Also, the patch applies neatly,
there were no pgindent issues, and all the regression/TAP tests run were
successful.

Regards,
Vignesh




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-10 Thread Shlok Kyal
On Mon, 10 Jun 2024 at 15:10, Shlok Kyal  wrote:
>
> On Thu, 6 Jun 2024 at 11:49, Kyotaro Horiguchi  
> wrote:
> >
> > At Thu, 6 Jun 2024 12:49:45 +1000, Peter Smith  
> > wrote in
> > > Hi, I have reproduced this multiple times now.
> > >
> > > I confirmed the initial post/steps from Alexander. i.e. The test
> > > script provided [1] gets itself into a state where function
> > > ReadPageInternal (called by XLogDecodeNextRecord and commented "Wait
> > > for the next page to become available") constantly returns
> > > XLREAD_FAIL. Ultimately the test times out because WalSndLoop() loops
> > > forever, since it never calls WalSndDone() to exit the walsender
> > > process.
> >
> > Thanks for the repro; I believe I understand what's happening here.
> >
> > During server shutdown, the latter half of the last continuation
> > record may fail to be flushed. This is similar to what is described in
> > the commit message of commit ff9f111bce. While shutting down,
> > WalSndLoop() waits for XLogSendLogical() to consume WAL up to
> > flushPtr, but in this case, the last record cannot complete without
> > the continuation part starting from flushPtr, which is
> > missing. However, in such cases, xlogreader.missingContrecPtr is set
> > to the beginning of the missing part, but something similar to
> >
> > So, I believe the attached small patch fixes the behavior. I haven't
> > come up with a good test script for this issue. Something like
> > 026_overwrite_contrecord.pl might work, but this situation seems a bit
> > more complex than what it handles.
> >
> > Versions back to 10 should suffer from the same issue and the same
> > patch will be applicable without significant changes.
>
> I tested the changes for PG 12 to master as we do not support prior versions.
> The patch applied successfully for master and PG 16. I ran the test
> provided in [1] multiple times and it ran successfully each time.
> The patch did not apply on PG 15. I did a similar change for PG 15 and
> created a patch.  I ran the test multiple times and it was successful
> every time.
> The patch did not apply on PG 14 to PG 12. I did a similar change in
> each branch. But the tests did not pass in each branch.
I had, by mistake, applied the wrong changes for PG 14 to PG 12. I tested again
for all versions and the test ran successfully for all of them down to
PG 12.
I have also attached the patch which applies to PG 14 through PG 12.

Sorry for the inconvenience.

Thanks and Regards,
Shlok Kyal


v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s.patch
Description: Binary data


v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s-PG-15.patch
Description: Binary data


v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s-PG-14.patch
Description: Binary data


Re: list_free in addRangeTableEntryForJoin

2024-06-10 Thread Ranier Vilela
Em seg., 10 de jun. de 2024 às 10:45, Ilia Evdokimov <
ilya.evdoki...@tantorlabs.com> escreveu:

>  >Now you need to analyze whether the memory in question is not managed
> by a Context
> I've already analyzed. Let's explain details:
>
>
> 1. analyze.c
> 1718: List* targetnames;
> 1815: targetnames = NIL;
> 1848: targetnames = lappend(targetnames, makeString(colName));
> 1871: addRangeTableEntryForJoin(...);
> => list_free(targetnames);
>
> 2. parse_clause.c
> 1163: List* res_colnames;
> 1289: res_colnames = NIL;
> 1324: foreach(col, res_colnames);
> 1396: res_colnames = lappend(res_colnames, lfirst(ucol));
> 1520, 1525: extractRemainingColumns(...);
> |
>  290: *res_colnames = lappend(*res_colnames, lfirst(lc));
> 1543: addRangeTableEntryForJoin(...);
> => list_free(res_colnames);
>
>
> As you can see, there are no other pointers working with this block of
> memory,

Can you see this line?
sortnscolumns = (ParseNamespaceColumn *)
palloc0

All allocations in the backend are made under memory context managers.
Resource leaks mainly occur with TopTransactionContext.


> and all operations above are either read-only or append nodes to
> the lists. If I am mistaken, please correct me.
> Furthermore, I will call `list_free` immediately after
> `addRangeTableEntryForJoin()`. The new patch is attached.
>
This style is not recommended.
It prevents any use of colnames after calling addRangeTableEntryForJoin.

Better to free at the end of the function, as in 0002.

Tip: 0001, 0002, 0003 numbering is for different patches;
v1, v2, v3 are new versions of the same patch.

best regards,
Ranier Vilela


Re: Historic snapshot doesn't track txns committed in BUILDING_SNAPSHOT state

2024-06-10 Thread cca5507
Thank you for the reply! I am trying to fix it. This patch (which passes check-world) will 
track txns
committed in the BUILDING_SNAPSHOT state and can fix this bug.


--
Regards,
ChangAo Chen

v1-0001-Track-transactions-committed-in-BUILDING_SNAPSHOT.patch
Description: Binary data


Re: Non-text mode for pg_dumpall

2024-06-10 Thread Nathan Bossart
On Mon, Jun 10, 2024 at 08:58:49AM -0400, Andrew Dunstan wrote:
> Tom and Nathan opined recently that providing for non-text mode for
> pg_dumpall would be a Good Thing (TM). Not having it has been a
> long-standing complaint, so I've decided to give it a go.

Thank you!

> I think we would need to restrict it to directory mode, at least to begin
> with. I would have a toc.dat with a different magic block (say "PGGLO"
> instead of "PGDMP") containing the global entries (roles, tablespaces,
> databases). Then for each database there would be a subdirectory (named for
> its toc entry) with a standard directory mode dump for that database. These
> could be generated in parallel (possibly by pg_dumpall calling pg_dump for
> each database). pg_restore on detecting a global type toc.data would restore
> the globals and then each of the databases (again possibly in parallel).

I'm curious why we couldn't also support the "custom" format.

> Following that I would turn my attention to using it in pg_upgrade.

+1

-- 
nathan




Re: Proposal to include --exclude-extension Flag in pg_dump

2024-06-10 Thread Dean Rasheed
On Sat, 8 Jun 2024 at 19:39, Ayush Vatsa  wrote:
>
> > Attached is a patch for the --filter docs, covering the omissions I can see.
> Thanks Dean for working on this.
> I have reviewed the changes and they look good to me.
>

Thanks for checking. I have committed this now.

Regards,
Dean




Re: Non-text mode for pg_dumpall

2024-06-10 Thread Andrew Dunstan



On 2024-06-10 Mo 10:14, Nathan Bossart wrote:

On Mon, Jun 10, 2024 at 08:58:49AM -0400, Andrew Dunstan wrote:

Tom and Nathan opined recently that providing for non-text mode for
pg_dumpall would be a Good Thing (TM). Not having it has been a
long-standing complaint, so I've decided to give it a go.

Thank you!


I think we would need to restrict it to directory mode, at least to begin
with. I would have a toc.dat with a different magic block (say "PGGLO"
instead of "PGDMP") containing the global entries (roles, tablespaces,
databases). Then for each database there would be a subdirectory (named for
its toc entry) with a standard directory mode dump for that database. These
could be generated in parallel (possibly by pg_dumpall calling pg_dump for
each database). pg_restore on detecting a global type toc.data would restore
the globals and then each of the databases (again possibly in parallel).

I'm curious why we couldn't also support the "custom" format.



We could, but the housekeeping would be a bit harder. We'd need to keep 
pointers to the offsets of the per-database TOCs (I don't want to have a 
single per-cluster TOC). And we can't produce it in parallel, so I'd 
rather start with something we can produce in parallel.
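
For illustration only, the kind of bookkeeping that implies might look 
something like this hypothetical structure; nothing like it exists in 
pg_dump today:

```c
/*
 * Hypothetical sketch: what a per-database entry in a single custom-format
 * pg_dumpall archive might need to record so that pg_restore can seek to
 * each embedded per-database TOC.
 */
typedef struct GlobalTocEntry
{
    char       *dbname;         /* database this entry describes */
    char       *create_sql;     /* CREATE DATABASE command for it */
    pgoff_t     toc_offset;     /* file offset of the embedded per-db TOC */
    pgoff_t     toc_length;     /* length of that TOC, to skip past it */
} GlobalTocEntry;
```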



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Non-text mode for pg_dumpall

2024-06-10 Thread Magnus Hagander
On Mon, Jun 10, 2024 at 4:14 PM Nathan Bossart 
wrote:

> On Mon, Jun 10, 2024 at 08:58:49AM -0400, Andrew Dunstan wrote:
> > Tom and Nathan opined recently that providing for non-text mode for
> > pg_dumpall would be a Good Thing (TM). Not having it has been a
> > long-standing complaint, so I've decided to give it a go.
>
> Thank you!
>

Indeed, this has been quite annoying!


> I think we would need to restrict it to directory mode, at least to begin
> > with. I would have a toc.dat with a different magic block (say "PGGLO"
> > instead of "PGDMP") containing the global entries (roles, tablespaces,
> > databases). Then for each database there would be a subdirectory (named
> for
> > its toc entry) with a standard directory mode dump for that database.
> These
> > could be generated in parallel (possibly by pg_dumpall calling pg_dump
> for
> > each database). pg_restore on detecting a global type toc.data would
> restore
> > the globals and then each of the databases (again possibly in parallel).
>
> I'm curious why we couldn't also support the "custom" format.
>

Or maybe even a combo - a directory of custom format files? Plus that one
special file being globals? I'd say that's what most use cases I've seen
would prefer.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Non-text mode for pg_dumpall

2024-06-10 Thread Nathan Bossart
On Mon, Jun 10, 2024 at 10:51:42AM -0400, Andrew Dunstan wrote:
> On 2024-06-10 Mo 10:14, Nathan Bossart wrote:
>> I'm curious why we couldn't also support the "custom" format.
> 
> We could, but the housekeeping would be a bit harder. We'd need to keep
> pointers to the offsets of the per-database TOCs (I don't want to have a
> single per-cluster TOC). And we can't produce it in parallel, so I'd rather
> start with something we can produce in parallel.

Got it.

-- 
nathan




Re: Non-text mode for pg_dumpall

2024-06-10 Thread Nathan Bossart
On Mon, Jun 10, 2024 at 04:52:06PM +0200, Magnus Hagander wrote:
> On Mon, Jun 10, 2024 at 4:14 PM Nathan Bossart 
> wrote:
>> I'm curious why we couldn't also support the "custom" format.
> 
> Or maybe even a combo - a directory of custom format files? Plus that one
> special file being globals? I'd say that's what most use cases I've seen
> would prefer.

Is there a particular advantage to that approach as opposed to just using
"directory" mode for everything?  I know pg_upgrade uses "custom" mode for
each of the databases, so a combo approach would be a closer match to the
existing behavior, but that doesn't strike me as an especially strong
reason to keep doing it that way.

-- 
nathan




Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Nathan Bossart
On Mon, Jun 10, 2024 at 06:05:13AM +, Bertrand Drouvot wrote:
> During the last pgconf.dev I attended Robert's presentation about autovacuum 
> and
> it made me remember of an idea I had some time ago: $SUBJECT

This sounds like useful information to me.  I wonder if we should also
surface the effective cost limit for each autovacuum worker.

> Currently one can change [autovacuum_]vacuum_cost_delay and
> [auto vacuum]vacuum_cost_limit but has no reliable way to measure the impact 
> of
> the changes on the vacuum duration: one could observe the vacuum duration
> variation but the correlation to the changes is not accurate (as many others
> factors could impact the vacuum duration (load on the system, i/o 
> latency,...)).

IIUC you'd need to get information from both pg_stat_progress_vacuum and
pg_stat_activity in order to know what percentage of time was being spent
in cost delay.  Is that how you'd expect for this to be used in practice?

>   pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
>   pg_usleep(msec * 1000);
>   pgstat_report_wait_end();
> + /* Report the amount of time we slept */
> + if (VacuumSharedCostBalance != NULL)
> + 
> pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> + else
> + 
> pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);

Hm.  Should we measure the actual time spent sleeping, or is a rough
estimate good enough?  I believe pg_usleep() might return early (e.g., if
the process is signaled) or late, so this field could end up being
inaccurate, although probably not by much.  If we're okay with millisecond
granularity, my first instinct is that what you've proposed is fine, but I
figured I'd bring it up anyway.
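
For illustration, if we ever wanted the measured value instead of the
requested one, a minimal sketch using the existing instr_time.h helpers could
look like the following (msec is the requested delay from the surrounding
vacuum-delay code, and PROGRESS_VACUUM_TIME_DELAYED is the counter proposed
in this thread):

```c
/* Needs "portability/instr_time.h"; illustrative sketch only. */
instr_time  delay_start,
            delay_elapsed;

INSTR_TIME_SET_CURRENT(delay_start);
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
pg_usleep(msec * 1000);
pgstat_report_wait_end();
INSTR_TIME_SET_CURRENT(delay_elapsed);
INSTR_TIME_SUBTRACT(delay_elapsed, delay_start);

/* Report the measured (not requested) delay, in milliseconds. */
pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
                           (int64) INSTR_TIME_GET_MILLISEC(delay_elapsed));
```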

-- 
nathan




Re: Non-text mode for pg_dumpall

2024-06-10 Thread Magnus Hagander
On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart 
wrote:

> On Mon, Jun 10, 2024 at 04:52:06PM +0200, Magnus Hagander wrote:
> > On Mon, Jun 10, 2024 at 4:14 PM Nathan Bossart  >
> > wrote:
> >> I'm curious why we couldn't also support the "custom" format.
> >
> > Or maybe even a combo - a directory of custom format files? Plus that one
> > special file being globals? I'd say that's what most use cases I've seen
> > would prefer.
>
> Is there a particular advantage to that approach as opposed to just using
> "directory" mode for everything?  I know pg_upgrade uses "custom" mode for
> each of the databases, so a combo approach would be a closer match to the
> existing behavior, but that doesn't strike me as an especially strong
> reason to keep doing it that way.
>

A gazillion files to deal with? Much easier to work with individual custom
files if you're moving databases around and things like that.
Much easier to monitor eg sizes/dates if you're using it for backups.

It's not things that are make-it-or-break-it or anything, but there are
some smaller things that definitely can be useful.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: list_free in addRangeTableEntryForJoin

2024-06-10 Thread Tom Lane
Ilia Evdokimov  writes:
> I have identified a potential memory leak in the 
> `addRangeTableEntryForJoin()` function. The second parameter of 
> `addRangeTableEntryForJoin()`, `colnames`, is a `List*` that is 
> concatenated with another `List*`, `eref->colnames`, using 
> `list_concat()`. We need to pass only the last `numaliases` elements of 
> the list, for which we use `list_copy_tail`. This function creates a 
> copy of the `colnames` list, resulting in `colnames` pointing to the 
> current list that will not be freed. Consequently, a new list is already 
> concatenated.

> To address this issue, I have invoked `list_free(colnames)` afterwards. 

Sadly, I think this is basically a waste of development effort.
The parser, like the planner and rewriter and many other Postgres
subsystems, leaks tons of small allocations like this.  That's
*by design*, and *it's okay*, because we run these steps in short-
lived memory contexts that will be reclaimed in toto as soon as
the useful output data structures are no longer needed.  It's not
worth the sort of intellectual effort you've put in in this thread
to prove whether individual small structures are no longer needed.
Plus, in many cases that isn't obvious, and/or it'd be notationally
messy to reclaim things explicitly at a suitable point.

If we were talking about a potentially-very-large data structure,
or one that we might create very many instances of during one
parsing pass, then it might be worth the trouble to free explicitly;
but I don't see that concern applying here.

You might find src/backend/utils/mmgr/README to be worth reading.
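
For anyone newer to the codebase, the pattern being described is roughly this
(a minimal sketch, not taken from any particular caller):

```c
/* Allocations made here are reclaimed wholesale when the context goes away. */
MemoryContext parse_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                "parse scratch",
                                                ALLOCSET_DEFAULT_SIZES);
MemoryContext oldcxt = MemoryContextSwitchTo(parse_cxt);

List       *colnames = list_make2(makeString("a"), makeString("b"));

Assert(list_length(colnames) == 2);

/* ... build whatever transient lists are needed; no explicit list_free() ... */

MemoryContextSwitchTo(oldcxt);
MemoryContextDelete(parse_cxt);     /* frees colnames and every other palloc */
```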

regards, tom lane




Re: Non-text mode for pg_dumpall

2024-06-10 Thread Nathan Bossart
On Mon, Jun 10, 2024 at 05:45:19PM +0200, Magnus Hagander wrote:
> On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart 
> wrote:
>> Is there a particular advantage to that approach as opposed to just using
>> "directory" mode for everything?  I know pg_upgrade uses "custom" mode for
>> each of the databases, so a combo approach would be a closer match to the
>> existing behavior, but that doesn't strike me as an especially strong
>> reason to keep doing it that way.
> 
> A gazillion files to deal with? Much easier to work with individual custom
> files if you're moving databases around and things like that.
> Much easier to monitor eg sizes/dates if you're using it for backups.
> 
> It's not things that are make-it-or-break-it or anything, but there are
> some smaller things that definitely can be useful.

Makes sense, thanks for elaborating.

-- 
nathan




Re: Non-text mode for pg_dumpall

2024-06-10 Thread Tom Lane
Magnus Hagander  writes:
> On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart 
> wrote:
>> Is there a particular advantage to that approach as opposed to just using
>> "directory" mode for everything?

> A gazillion files to deal with? Much easier to work with individual custom
> files if you're moving databases around and things like that.
> Much easier to monitor eg sizes/dates if you're using it for backups.

You can always tar up the directory tree after-the-fact if you want
one file.  Sure, that step's not parallelized, but I think we'd need
some non-parallelized copying to create such a file anyway.

regards, tom lane




Re: Wrong security context for deferred triggers?

2024-06-10 Thread Laurenz Albe
On Sat, 2024-06-08 at 17:36 -0400, Joseph Koshakow wrote:
> I see that this patch is marked as ready for review, so I thought I
> would attempt to review it. This is my first review, so please take it
> with a grain of salt.

Thank you.  Your review is valuable and much to the point.

> 
> It sounds like the crux of your argument is that the current behavior
> is that triggers are executed with the role and security context of the
> session at the time of execution. Instead, the trigger should be
> executed with the role and security context of the session at the time
> time of queuing (i.e. the same context as the action that triggered the
> trigger). While I understand that the current behavior can be
> surprising in some scenarios, it's not clear to me why this behavior is
> wrong.

Since neither the documentation nor any source comment covers this case,
it is to some extent a matter of taste whether the current behavior is
correct or not.  An alternative to my patch would be to document the
current behavior rather than change it.

Like you, I was surprised by the current behavior.  There is a design
principle that PostgreSQL tries to follow, called the "Principle of
least astonishment".  Things should behave like a moderately skilled
user would expect them to.  In my opinion, the current behavior violates
that principle.  Tomas seems to agree with that point of view.

On the other hand, changing current behavior carries the risk of
backward compatibility problems.  I don't know how much of a problem
that would be, but I have the impression that few people use deferred
triggers in combination with SET ROLE or SECURITY DEFINER functions,
which makes the problem rather exotic, so hopefully the pain caused by
the compatibility break will be moderate.

I didn't find this strange behavior myself: it was one of our customers
who uses security definer functions for data modifications and has
problems with the current behavior, and I am trying to improve the
situation on their behalf.

> It seems that the whole point of deferring a trigger to commit
> time is that the context that the trigger is executed in is different
> than the context that it was triggered in. Tables may have changed,
> permissions may have changed, session configuration variables may have
> changed, roles may have changed, etc. So why should the executing role
> be treated differently and restored to the value at the time of
> triggering. Perhaps you can expand on why you feel that the current
> behavior is wrong?

True, somebody could change permissions or parameters between the
DML statement and COMMIT, but that feels like external influences to me.
Those changes would be explicit.

But I feel that the database user that runs the trigger should be the
same user that ran the triggering SQL statement.  Even though I cannot
put my hand on a case where changing this user would constitute a real
security problem, it feels wrong.

I am aware that that is rather squishy argumentation, but I have no
better one.  Both my and Thomas' gut reaction seems to have been "the
current behavior is wrong".

> 
> I find these examples to be surprising, but not necessarily wrong
> (as per my comment above). If someone wants the triggers to be executed
> as the triggering role, then they can run `SET CONSTRAINTS ALL
> IMMEDIATE`. If deferring a trigger to commit time and executing it as
> the triggering role is desirable, then maybe we should add a modifier
> to triggers that can control this behavior. Something like
> `SECURITY INVOKER | SECURITY TRIGGERER` (modeled after the modifiers in
> `CREATE FUNCTION`) that control which role is used.

SECURITY INVOKER and SECURITY TRIGGERER seem too confusing.  I would say
that the triggerer is the one who invokes the trigger...

It would have to be a switch like EXECUTE DEFERRED TRIGGER AS INVOKER|COMMITTER
or so, but I think that special SQL syntax for this exotic corner case
is a little too much.  And then: is there anybody who feels that the current
behavior is desirable?

> Isaac Morland wrote:
> > This looks to me like another reason that triggers should run as the
> > trigger owner. Which role owns the trigger won’t change as a result of
> > constraints being deferred or not, or any role setting done during the
> > transaction, including that relating to security definer functions.
> 
> +1, this approach would remove all of the surprising/wrong behavior and
> in my opinion is more obvious. I'd like to add some more reasons why
> this behavior makes sense: [...]
> 
> However, I do worry that this is too much of a breaking change and
> fundamentally changes how triggers work.

Yes.  This might be the right thing to do if we designed triggers as a
new feature, but changing the behavior like that would certainly break
a lot of cases.

People who want that behavior use a SECURITY DEFINER trigger function.

> 
> I skimmed the code and haven't looked at in depth. Whichever direction
> we go, I think it's worth updating the

Re: Non-text mode for pg_dumpall

2024-06-10 Thread Andrew Dunstan



On 2024-06-10 Mo 12:21, Tom Lane wrote:

Magnus Hagander  writes:

On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart 
wrote:

Is there a particular advantage to that approach as opposed to just using
"directory" mode for everything?

A gazillion files to deal with? Much easier to work with individual custom
files if you're moving databases around and things like that.
Much easier to monitor eg sizes/dates if you're using it for backups.

You can always tar up the directory tree after-the-fact if you want
one file.  Sure, that step's not parallelized, but I think we'd need
some non-parallelized copying to create such a file anyway.





Yeah.

I think I can probably allow for Magnus' suggestion fairly easily, but 
if I have to choose I'm going to go for the format that can be produced 
with the maximum parallelism.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 06:05:13AM +, Bertrand Drouvot wrote:
> > During the last pgconf.dev I attended Robert's presentation about 
> > autovacuum and
> > it made me remember of an idea I had some time ago: $SUBJECT
> 
> This sounds like useful information to me.

Thanks for looking at it!

> I wonder if we should also
> surface the effective cost limit for each autovacuum worker.

I'm not sure about it, as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true, as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling of pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit could be misleading IMHO.

> > Currently one can change [autovacuum_]vacuum_cost_delay and
> > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the 
> > impact of
> > the changes on the vacuum duration: one could observe the vacuum duration
> > variation but the correlation to the changes is not accurate (as many others
> > factors could impact the vacuum duration (load on the system, i/o 
> > latency,...)).
> 
> IIUC you'd need to get information from both pg_stat_progress_vacuum and
> pg_stat_activity in order to know what percentage of time was being spent
> in cost delay.  Is that how you'd expect for this to be used in practice?

Yeah, one could use a query such as:

select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p 
JOIN pg_stat_activity a using (pid)

for example. Would it be worth providing an example somewhere in the docs?

> > pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
> > pg_usleep(msec * 1000);
> > pgstat_report_wait_end();
> > +   /* Report the amount of time we slept */
> > +   if (VacuumSharedCostBalance != NULL)
> > +   
> > pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> > +   else
> > +   
> > pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> 
> Hm.  Should we measure the actual time spent sleeping, or is a rough
> estimate good enough?  I believe pg_usleep() might return early (e.g., if
> the process is signaled) or late, so this field could end up being
> inaccurate, although probably not by much.  If we're okay with millisecond
> granularity, my first instinct is that what you've proposed is fine, but I
> figured I'd bring it up anyway.

Thanks for bringing that up! I had the same thought when writing the code and
came to the same conclusion. I think that's a good enough estimation, especially
during a long-running vacuum (which is probably the case where users care the
most).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




libpq contention due to gss even when not using gss

2024-06-10 Thread Andres Freund
Hi,

To investigate a report of both postgres and pgbouncer having issues when a
lot of new connections are established, I used pgbench -C.  Oddly, on an
early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
only when using TCP, not with unix sockets.

c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 
host=127.0.0.1 user=test dbname=postgres password=fake'

host=127.0.0.1:   16465
host=127.0.0.1,gssencmode=disable 20860
host=/tmp:49286

Note that the server does *not* support gss, yet gss has a substantial
performance impact.

Obviously the connection rates here are absurdly high and, outside of badly written
applications, likely never practically relevant. However, the number of cores
in systems are going up, and this quite possibly will become relevant in more
realistic scenarios (lock contention kicks in earlier the more cores you
have).

And it doesn't seem great that something as rarely used as gss introduces
overhead to very common paths.

Here's a bottom-up profile:

-   32.10%  pgbench  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 32.09% queued_spin_lock_slowpath
  - 16.15% futex_wake
   do_futex
   __x64_sys_futex
   do_syscall_64
 - entry_SYSCALL_64_after_hwframe
- 16.15% __GI___lll_lock_wake
   - __GI___pthread_mutex_unlock_usercnt
  - 5.12% gssint_select_mech_type
 - 4.36% gss_inquire_attrs_for_mech
- 2.85% gss_indicate_mechs
   - gss_indicate_mechs_by_attrs
  - 1.58% gss_acquire_cred_from
   gss_acquire_cred
   pg_GSS_have_cred_cache
   select_next_encryption_method (inlined)
   init_allowed_encryption_methods (inlined)
   PQconnectPoll
   pqConnectDBStart (inlined)
   PQconnectStartParams
   PQconnectdbParams
   doConnect





Clearly the contention originates outside of our code, but is triggered by
doing pg_GSS_have_cred_cache() every time a connection is established.

Greetings,

Andres Freund




Re: Remove dependence on integer wrapping

2024-06-10 Thread Andres Freund
Hi,

On 2024-06-09 21:57:54 -0400, Tom Lane wrote:
> BTW, while I approve of trying to get rid of our need for -fwrapv,
> I'm quite scared of actually doing it.

I think that's a quite fair concern. One potentially relevant datapoint is
that we actually don't have -fwrapv equivalent on all platforms, and I don't
recall a lot of complaints from windows users. Of course it's quite possible
that they'd never notice...

I think this is a good argument for enabling -ftrapv in development
builds. That gives us at least a *chance* of seeing these issues.


> Whatever cases you may have discovered by running the regression tests will
> be at best the tip of the iceberg.  Is there any chance of using static
> analysis to find all the places of concern?

The last time I tried removing -fwrapv both gcc and clang found quite a few
issues. I think I fixed most of those though, so presumably the issue that
remain are ones less easily found by simple static analysis.

I wonder if coverity would find more if we built without -fwrapv.


GCC has -Wstrict-overflow=n, which at least tells us where the compiler
optimizes based on the assumption that there cannot be overflow.  It does
generate a bunch of noise, but I think it's almost exclusively due to our
MemSet() macro.
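
For readers following along: most fixes in this area boil down to replacing
raw arithmetic with the overflow-checking helpers in src/include/common/int.h,
along these lines (a generic sketch, not any specific committed change):

```c
#include "postgres.h"
#include "common/int.h"

/* Add two int32s, erroring out instead of relying on wrap-around. */
static int32
add_int32_checked(int32 a, int32 b)
{
    int32       result;

    if (pg_add_s32_overflow(a, b, &result))
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("integer out of range")));
    return result;
}
```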

Greetings,

Andres Freund




RFC: adding pytest as a supported test framework

2024-06-10 Thread Jacob Champion
Hi all,

For the v18 cycle, I would like to try to get pytest [1] in as a
supported test driver, in addition to the current offerings.

(I'm tempted to end the email there.)

We had an unconference session at PGConf.dev [2] around this topic.
There seemed to be a number of nodding heads and some growing
momentum. So here's a thread to try to build wider consensus. If you
have a competing or complementary test proposal in mind, heads up!

== Problem Statement(s) ==

1. We'd like to rerun a failing test by itself.

2. It'd be helpful to see _what_ failed without poring over logs.

These two got the most nodding heads of the points I presented. (#1
received tongue-in-cheek applause.) I think most modern test
frameworks are going to give you these things, but unfortunately we
don't have them.

Additionally,

3. Many would like to use modern developer tooling during testing
(language servers! autocomplete! debuggers! type checking!) and we
can't right now.

4. It'd be great to split apart client-side tests from server-side
tests. Driving Postgres via psql all the time is fine for acceptance
testing, but it becomes a big problem when you need to test how
clients talk to servers with incompatible feature sets, or how a peer
behaves when talking to something buggy.

5. Personally, I want to implement security features test-first (for
high code coverage and other benefits), and our Perl-based tests are
usually too cumbersome for that.

== Why pytest? ==

From the small and biased sample at the unconference session, it looks
like a number of people have independently settled on pytest in their
own projects. In my opinion, pytest occupies a nice space where it
solves some of the above problems for us, and it gives us plenty of
tools to solve the other problems without too much pain.

Problem 1 (rerun failing tests): One architectural roadblock to this
in our Test::More suite is that tests depend on setup that's done by
previous tests. pytest allows you to declare each test's setup
requirements via pytest fixtures, letting the test runner build up the
world exactly as it needs to be for a single isolated test. These
fixtures may be given a "scope" so that multiple tests may share the
same setup for performance or other reasons.

Problem 2 (seeing what failed): pytest does this via assertion
introspection and very detailed failure reporting. If you haven't seen
this before, take a look at the pytest homepage [1]; there's an
example of a full log.

Problem 3 (modern tooling): We get this from Python's very active
developer base.

Problems 4 (splitting client and server tests) and 5 (making it easier
to write tests first) aren't really Python- or pytest-specific, but I
have done both quite happily in my OAuth work [3], and I've since
adapted that suite multiple times to develop and test other proposals
on this list, like LDAP/SCRAM, client encryption, direct SSL, and
compression.

Python's standard library has lots of power by itself, with very good
documentation. And virtualenvs and better package tooling have made it
much easier, IMO, to avoid the XKCD dependency tangle [4] of the
2010s. When it comes to third-party packages, which I think we're
probably going to want in moderation, we would still need to discuss
supply chain safety. Python is not as mature here as, say, Go.

== A Plan ==

Even if everyone were on board immediately, there's a lot of work to
do. I'd like to add pytest in a more probationary status, so we can
iron out the inevitable wrinkles. My proposal would be:

1. Commit bare-bones support in our Meson setup for running pytest, so
everyone can kick the tires independently.
2. Add a test for something that we can't currently exercise.
3. Port a test from a place where the maintenance is terrible, to see
if we can improve it.

If we hate it by that point, no harm done; tear it back out. Otherwise
we keep rolling forward.

Thoughts? Suggestions?

Thanks,
--Jacob

[1] https://docs.pytest.org/
[2] 
https://wiki.postgresql.org/wiki/PGConf.dev_2024_Developer_Unconference#New_testing_frameworks
[3] https://github.com/jchampio/pg-pytest-suite
[4] https://xkcd.com/1987/




Re: Proposal: Document ABI Compatibility

2024-06-10 Thread David E. Wheeler
On Jun 4, 2024, at 03:18, Peter Eisentraut  wrote:

> This could possibly be avoided by renaming the symbol in backbranches. Maybe 
> something like
> 
> #define InitResultRelInfo InitResultRelInfo2
> 
> Then you'd get a specific error message when loading the module, rather than 
> a crash.

That sounds more useful, yes. Is that a practice the project would consider 
adopting?

There’s also oracle_fdw@d137d15[1], which says:

> An API break in PostgreSQL 10.4 and 9.6.9 makes it impossible
> to use these versions: the "extract_actual_join_clauses" function
> gained an additional parameter.

The 10.4 commit is 68fab04, and it does indeed add a new function:

``` patch
--- a/src/include/optimizer/restrictinfo.h
+++ b/src/include/optimizer/restrictinfo.h
@@ -36,6 +36,7 @@ extern List *get_actual_clauses(List *restrictinfo_list);
 extern List *extract_actual_clauses(List *restrictinfo_list,
   bool pseudoconstant);
 extern void extract_actual_join_clauses(List *restrictinfo_list,
+   Relids joinrelids,
List **joinquals,
List **otherquals);
 extern bool join_clause_is_movable_to(RestrictInfo *rinfo, RelOptInfo 
*baserel);
```

I wonder if that sort of change could be avoided in backpatches, maybe by 
adding an `extract_actual_join_clauses_compat` function and using that 
internally instead?
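
Something like the following, purely as a hypothetical sketch of that idea 
(this is not what was committed, and what the old entry point should pass 
for the new argument is exactly the awkward part):

```c
/* New-signature function used by in-core callers in the back branch. */
extern void extract_actual_join_clauses_compat(List *restrictinfo_list,
                                               Relids joinrelids,
                                               List **joinquals,
                                               List **otherquals);

/* Old symbol keeps its pre-10.4 signature, so compiled extensions still load. */
void
extract_actual_join_clauses(List *restrictinfo_list,
                            List **joinquals,
                            List **otherquals)
{
    /* Placeholder argument; choosing a correct value here is the hard part. */
    extract_actual_join_clauses_compat(restrictinfo_list, NULL,
                                       joinquals, otherquals);
}
```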

Or, to David C’s point, perhaps it would be better to say there are some 
categories of APIs that are not subject to any guarantees in minor releases?

Best,

David

[1]: 
https://github.com/laurenz/oracle_fdw/commit/d137d15edca8c67df1e5a01f417f4833b028
[2]: 
https://github.com/postgres/postgres/commit/68fab04f7c2a07c5308e3d2957198ccd7a80ebc5#diff-bb6fa74cb115e19684092f0938131cd5d99b26fa2d49480f7ea7f28e937a7fb4






Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Nathan Bossart
On Mon, Jun 10, 2024 at 05:48:22PM +, Bertrand Drouvot wrote:
> On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
>> I wonder if we should also
>> surface the effective cost limit for each autovacuum worker.
> 
> I'm not sure about it as I think that it could be misleading: one could query
> pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is
> due to _this_ cost_limit. But that's not necessary true as the cost_limit 
> could
> have changed multiple times since the vacuum started. So, unless there is
> frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and
> the cost_limit could be misleadind IMHO.

Well, that's true for the delay, too, right (at least as of commit
7d71d3d)?

-- 
nathan




Re: RFC: adding pytest as a supported test framework

2024-06-10 Thread Alexander Korotkov
Hi!

On Mon, Jun 10, 2024 at 9:46 PM Jacob Champion
 wrote:
> Thoughts? Suggestions?

Thank you for working on this.
Do you think you could re-use something from testgres[1] package?

Links.
1. https://github.com/postgrespro/testgres

--
Regards,
Alexander Korotkov
Supabase




Re: Things I don't like about \du's "Attributes" column

2024-06-10 Thread Pavel Luzanov

On 10.06.2024 09:25, Kyotaro Horiguchi wrote:


I guess that in English, when written as "'Login' = 'yes/no'", it can
be easily understood. However, in Japanese, "'ログイン' = 'はい/いいえ'"
looks somewhat awkward and is a bit difficult to understand at a
glance. "'ログイン' = '可/不可'" (equivalent to "Login is
'can/cannot'") sounds more natural in Japanese, but it was rejected
upthread, and I also don't like 'can/cannot'. To give further
candidates, "allowed/not allowed" or "granted/denied" can be
mentioned, and they would be easier to translate, at least to
Japanese. However, is there a higher likelihood that 'granted/denied'
will be misunderstood as referring to database permissions?


Thank you for looking into this, translation is important.

What do you think about the following options?

1. Try to find a more appropriate name for the column.
Maybe "Can login?" is better suited for yes/no and Japanese translation?

2. Show the value only for true, for example "Granted" as you suggested.
Do not show the "false" value at all. This will be consistent
with the "Attributes" column, which shows only enabled values.

I would prefer the first option and look for the best name for the column.
The second option can also be implemented if we choose a value for 'true'.

BTW, I went through all the \d* commands and looked at how columns with
logical values are displayed. There are two approaches: yes/no and t/f.

yes/no
\dAc "Default?"
\dc "Default?"
\dC "Implicit?"
\dO "Deterministic?"

t/f
\dL "Trusted", "Internal language"
\dRp "All tables", "Inserts" "Updates" "Deletes" "Truncates" "Via root"
\dRs "Enabled", "Binary", "Disable on error", "Password required", "Run as owner?", 
"Failover"


Likewise, "'Valid until' = 'infinity'" (equivalent to "'有効期限' = '
無限'") also sounds awkward. Maybe that's the same in English. I guess
that 'unbounded' or 'indefinite' sounds better, and their Japanese
translation '無期限' also sounds natural. However, I'm not sure we
want to go to that extent in transforming the table.


'infinity' is a value in the table like any other date.
As far as I understand, it is not translatable.
So you'll see '有効期限' = 'infinity'.


But this can be implemented using the following expression:

   case when rolvaliduntil = 'infinity' or rolvaliduntil is null then
     'unbounded' -- translatable value
   else
     rolvaliduntil::pg_catalog.text
   end

Or we can hide 'infinity':

   case when rolvaliduntil = 'infinity' then
     null
   else
     rolvaliduntil
   end

This is a little bit better, but I don't like either. We will not be able to 
distinguish between null and infinity values in the table. After all, I 
think 'infinity' is a rare case for "Valid until". What is the reason to 
set 'Valid until' = 'infinity' if the password is unlimited by default? Therefore, 
my opinion here is to leave "infinity" as is, but I am open to better 
alternatives.


--
Pavel Luzanov
Postgres Professional:https://postgrespro.com


ODBC Source Downloads Missing

2024-06-10 Thread Mark Hill
Is there an issue with the ODBC Source downloads today?

The source download URL isn't working:  
https://www.postgresql.org/ftp/odbc/versions/src/

Thanks, Mark


Re: Non-text mode for pg_dumpall

2024-06-10 Thread Magnus Hagander
On Mon, Jun 10, 2024 at 6:21 PM Tom Lane  wrote:

> Magnus Hagander  writes:
> > On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart  >
> > wrote:
> >> Is there a particular advantage to that approach as opposed to just
> using
> >> "directory" mode for everything?
>
> > A gazillion files to deal with? Much easier to work with individual
> custom
> > files if you're moving databases around and things like that.
> > Much easier to monitor eg sizes/dates if you're using it for backups.
>
> You can always tar up the directory tree after-the-fact if you want
> one file.  Sure, that step's not parallelized, but I think we'd need
> some non-parallelized copying to create such a file anyway.
>

That would require double the disk space.

But you can also just run pg_dump manually on each database and a
pg_dumpall -g like people are doing today -- I thought this whole thing was
about making it more convenient :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Proposal: Document ABI Compatibility

2024-06-10 Thread Andres Freund
Hi,

On 2024-06-10 15:05:32 -0400, David E. Wheeler wrote:
> > An API break in PostgreSQL 10.4 and 9.6.9 makes it impossible
> > to use these versions: the "extract_actual_join_clauses" function
> > gained an additional parameter.
> 
> The 10.4 commit is 68fab04, and it does indeed add a new function:

That's 6 years ago, not sure we can really learn that much from that.

And it's not like it's actually impossible, #ifdefs aren't great, but they are
better than nothing.


> ``` patch
> --- a/src/include/optimizer/restrictinfo.h
> +++ b/src/include/optimizer/restrictinfo.h
> @@ -36,6 +36,7 @@ extern List *get_actual_clauses(List *restrictinfo_list);
>  extern List *extract_actual_clauses(List *restrictinfo_list,
>bool pseudoconstant);
>  extern void extract_actual_join_clauses(List *restrictinfo_list,
> +   Relids joinrelids,
> List **joinquals,
> List **otherquals);
>  extern bool join_clause_is_movable_to(RestrictInfo *rinfo, RelOptInfo 
> *baserel);
> ```
> 
> I wonder if that sort of change could be avoided in backpatches, maybe by 
> adding and using a `extract_actual_join_clauses_compat` function and using 
> that internally instead?
> 
> Or, to David C’s point, perhaps it would be better to say there are some 
> categories of APIs that are not subject to any guarantees in minor releases?

I'm honestly very dubious that this is a good point to introduce a bunch of
formalism. It's already a lot of work to maintain them; if we make it even
harder, we'll end up with more fixes not being backported, because it's not worth
the pain.

To be blunt, the number of examples raised here doesn't seem to indicate that
this is an area where we need to invest additional resources. We are already
severely constrained as a project by committer bandwidth, there are plenty
other things that seem more important to focus on.

Greetings,

Andres Freund




Re: RFC: adding pytest as a supported test framework

2024-06-10 Thread Andres Freund
Hi,


Just for context for the rest of the email: I think we desperately need to move
off Perl for tests. The infrastructure around our testing is basically
unmaintained, and just about nobody that started doing dev stuff in the last 10
years learned Perl.


On 2024-06-10 11:46:00 -0700, Jacob Champion wrote:
> 4. It'd be great to split apart client-side tests from server-side
> tests. Driving Postgres via psql all the time is fine for acceptance
> testing, but it becomes a big problem when you need to test how
> clients talk to servers with incompatible feature sets, or how a peer
> behaves when talking to something buggy.

That seems orthogonal to using pytest vs something else?


> == Why pytest? ==
> 
> From the small and biased sample at the unconference session, it looks
> like a number of people have independently settled on pytest in their
> own projects. In my opinion, pytest occupies a nice space where it
> solves some of the above problems for us, and it gives us plenty of
> tools to solve the other problems without too much pain.

We might be able to alleviate that by simply abstracting it away, but I found
pytest's testrunner pretty painful. Oodles of options that are not very well
documented and that often don't work because they are very specific to some
situations, without that being explained.


> Problem 1 (rerun failing tests): One architectural roadblock to this
> in our Test::More suite is that tests depend on setup that's done by
> previous tests. pytest allows you to declare each test's setup
> requirements via pytest fixtures, letting the test runner build up the
> world exactly as it needs to be for a single isolated test. These
> fixtures may be given a "scope" so that multiple tests may share the
> same setup for performance or other reasons.

OTOH, that's quite likely to increase overall test times very
significantly. Yes, sometimes that can be avoided with careful use of various
features, but often that's hard, and IME is rarely done rigorously.


> Problem 2 (seeing what failed): pytest does this via assertion
> introspection and very detailed failure reporting. If you haven't seen
> this before, take a look at the pytest homepage [1]; there's an
> example of a full log.

That's not really different than what the perl tap test stuff allows. We
indeed are bad at utilizing it, but I'm not sure that switching languages will
change that.

I think part of the problem is that the information about what precisely
failed is often much harder to collect when testing multiple servers
interacting than when doing localized unit tests.

I think we ought to invest a bunch in improving that, I'd hope that a lot of
that work would be largely independent of the language the tests are written
in.


> Python's standard library has lots of power by itself, with very good
> documentation. And virtualenvs and better package tooling have made it
> much easier, IMO, to avoid the XKCD dependency tangle [4] of the
> 2010s.

Ugh, I think this is actually python's weakest area. There are about a dozen
package managers and "python distributions" that are at best half compatible,
and the documentation situation around this is *awful*.


> When it comes to third-party packages, which I think we're
> probably going to want in moderation, we would still need to discuss
> supply chain safety. Python is not as mature here as, say, Go.

What external dependencies are you imagining?



> == A Plan ==
> 
> Even if everyone were on board immediately, there's a lot of work to
> do. I'd like to add pytest in a more probationary status, so we can
> iron out the inevitable wrinkles. My proposal would be:
> 
> 1. Commit bare-bones support in our Meson setup for running pytest, so
> everyone can kick the tires independently.
> 2. Add a test for something that we can't currently exercise.
> 3. Port a test from a place where the maintenance is terrible, to see
> if we can improve it.
> 
> If we hate it by that point, no harm done; tear it back out. Otherwise
> we keep rolling forward.

I think somewhere between 1 and 4 a *substantial* amount of work would be
required to provide a bunch of the infrastructure that Cluster.pm etc
provide. Otherwise we'll end up with a lot of copy pasted code between tests.

Greetings,

Andres Freund




Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Imseih (AWS), Sami
>> This sounds like useful information to me.

> Thanks for looking at it!

The VacuumDelay wait event is the only visibility available to
gauge the cost_delay. Having this information
advertised by pg_stat_progress_vacuum as is being proposed
is much better. However, I also think that the
"number of times" the vacuum went into delay will be needed
as well. Both values will be useful to tune cost_delay and cost_limit.

It may also make sense to accumulate the total_time in delay
and the number of times delayed in a cumulative statistics [0]
view to allow a user to trend this information over time.
I don't think this info fits in any of the existing views, i.e.
pg_stat_database, so maybe a new view for cumulative
vacuum stats may be needed. This is likely a separate
discussion, but calling it out here.
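
As a purely illustrative sketch of what such a cumulative view could look
like (every name below is invented for the sake of this discussion; nothing
like it exists today):

-- hypothetical cumulative vacuum statistics, one row per table
SELECT datname,
       relid,
       total_delay_time,   -- accumulated cost-delay sleep time
       delay_count,        -- number of times a worker slept
       last_autovacuum
FROM pg_stat_vacuum_history;  -- invented name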

>> IIUC you'd need to get information from both pg_stat_progress_vacuum and
>> pg_stat_activity in order to know what percentage of time was being spent
>> in cost delay.  Is that how you'd expect for this to be used in practice?

> Yeah, one could use a query such as:

> select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p 
> JOIN pg_stat_activity a using (pid)

Maybe all progress views should just expose the
"beentry->st_activity_start_timestamp"
to let the user know when the current operation began.
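
For illustration, the query quoted above could be extended along these lines
to estimate the fraction of the vacuum's elapsed time spent sleeping (a
sketch only; it assumes the proposed time_delayed column is in milliseconds):

SELECT p.pid, p.datname, p.relid, p.phase,
       p.time_delayed,
       round(100.0 * p.time_delayed /
             nullif(extract(epoch FROM (now() - a.xact_start)) * 1000, 0),
             1) AS pct_time_delayed
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid);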


Regards,

Sami Imseih
Amazon Web Services (AWS)


[0] https://www.postgresql.org/docs/current/monitoring-stats.html






Re: RFC: adding pytest as a supported test framework

2024-06-10 Thread Andrew Dunstan


On 2024-06-10 Mo 16:04, Andres Freund wrote:

Hi,


Just for context for the rest the email: I think we desperately need to move
off perl for tests. The infrastructure around our testing is basically
unmaintained and just about nobody that started doing dev stuff in the last 10
years learned perl.



Andres,

I get that you don't like perl. But it's hard for me to take this 
terribly seriously. "desperately" seems like massive overstatement at 
best. As for what up and coming developers learn, they mostly don't 
learn C either, and that's far more critical to what we do.


I'm not sure what part of the testing infrastructure you think is 
unmaintained. For example, the last release of Test::Simple was all the 
way back on April 25.


Maybe there are some technical superiorities about what Jacob is 
proposing, enough for us to add it to our armory. I'll keep an open mind 
on that.


But let's not throw the baby out with the bathwater. Quite apart from 
anything else, a wholesale rework of the test infrastructure would make 
backpatching more painful.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Improve the granularity of PQsocketPoll's timeout parameter?

2024-06-10 Thread Tom Lane
In [1] Dominique Devienne complained that PQsocketPoll would be
far more useful to him if it had better-than-one-second timeout
resolution.  I initially pushed back on that on the grounds that
post-beta1 is a bit late to be redefining public APIs.  Which it is,
but if we don't fix it now then we'll be stuck supporting that API
indefinitely.  And it's not like one-second resolution is great
for our internal usage either --- for example, I see that psql
is now doing

end_time = time(NULL) + 1;
rc = PQsocketPoll(sock, forRead, !forRead, end_time);

which claims to be waiting one second, but actually it's waiting
somewhere between 0 and 1 second.  So I thought I'd look into
whether we can still change it without too much pain, and I think
we can.

The $64 question is how to represent the end_time if not as time_t.
The only alternative POSIX offers AFAIK is gettimeofday's "struct
timeval", which is painful to compute with and I don't think it's
native on Windows.  What I suggest is that we use int64 microseconds
since the epoch, which is the same idea as the backend's TimestampTz
except I think we'd better use the Unix epoch not 2000-01-01.
Then converting code is just a matter of changing variable types
and adding some zeroes to constants.
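
For instance, the psql fragment quoted above would become something like the
following (a sketch only; the current-time helper is the one added by the
attached patch, so treat its exact name as provisional, and the variables are
as in that snippet):

end_time_us = PQgetCurrentTimeUSec() + 1000000LL;   /* 1 second from now */
rc = PQsocketPoll(sock, forRead, !forRead, end_time_us);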

The next question is how to spell "int64" in libpq-fe.h.  As a
client-exposed header, the portability constraints on it are pretty
stringent, so even in 2024 I'm loath to make it depend on <stdint.h>;
and certainly depending on our internal int64 typedef won't do.
What I did in the attached is to write "long long int", which is
required to be at least 64 bits by C99.  Other opinions are possible
of course.

Lastly, we need a way to get current time in this form.  My first
draft of the attached patch had the callers calling gettimeofday
and doing arithmetic from that, but it seems a lot better to provide
a function that just parallels time(2).

BTW, I think this removes the need for libpq-fe.h to #include <time.h>,
but I didn't remove that because it seems likely that some callers are
indirectly relying on it to be present.  Removing it wouldn't gain
very much anyway.

Thoughts?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/CAFCRh-8hf%3D7V8UoF63aLxSkeFmXX8-1O5tRxHL61Pngb7V9rcw%40mail.gmail.com

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 80179f9978..8c0ea444ad 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -268,30 +268,50 @@ PGconn *PQsetdb(char *pghost,
   
nonblocking connection
Poll a connection's underlying socket descriptor retrieved with .
+   The primary use of this function is iterating through the connection
+   sequence described in the documentation of
+   .
 
-int PQsocketPoll(int sock, int forRead, int forWrite, time_t end_time);
+int PQsocketPoll(int sock, int forRead, int forWrite,
+ long long int end_time_us);
 
   
 
   
-   This function sets up polling of a file descriptor. The underlying function is either
+   This function performs polling of a file descriptor, optionally with
+   a timeout. The underlying function is either
poll(2) or select(2), depending on platform
-   support. The primary use of this function is iterating through the connection sequence
-   described in the documentation of . If
-   forRead is specified, the function waits for the socket to be ready
-   for reading. If forWrite is specified, the function waits for the
-   socket to be ready for write. See POLLIN and POLLOUT
+   support.
+   If forRead is nonzero, the
+   function will terminate when the socket is ready for
+   reading. If forWrite is nonzero,
+   the function will terminate when the
+   socket is ready for writing.
+   See POLLIN and POLLOUT
from poll(2), or readfds and
-   writefds from select(2) for more information. If
-   end_time is not -1, it specifies the time at which
-   this function should stop waiting for the condition to be met.
+   writefds from select(2) for more information.
+  
+
+  
+   The timeout is specified by end_time_us, which
+   is the time to stop waiting expressed as a number of microseconds since
+   the Unix epoch (that is, time_t times 1 million).
+   Timeout is infinite if end_time_us
+   is -1.  Timeout is immediate (no blocking) if
+   end_time is 0 (or indeed, any time before now).
+   Timeout values can be calculated conveniently by adding the desired
+   number of microseconds to the result of
+   .
+   Note that the underlying system calls may have less than microsecond
+   precision, so that the actual delay may be imprecise.
   
 
   
The function returns a value greater than 0 if the specified condition
is met, 0 if a timeout occurred, or -1 if an error
occurred. The error can be retrieved by check

Re: RFC: adding pytest as a supported test framework

2024-06-10 Thread Jelte Fennema-Nio
On Mon, 10 Jun 2024 at 20:46, Jacob Champion
 wrote:
> For the v18 cycle, I would like to try to get pytest [1] in as a
> supported test driver, in addition to the current offerings.

Huge +1 from me (but I'm definitely biased here)

> Thoughts? Suggestions?

I think the most important thing is that we make it easy for people to
use this thing, and use it in a "correct" way. I have met very few
people that actually like writing tests, so I think it's very
important to make the barrier to do so as low as possible.

For the PgBouncer repo I created my own pytest-based test suite more than
~1.5 years ago now. I tried to make it as easy as possible to write
tests there, and it has worked out quite well imho. I don't think it
makes sense to copy all things I did there verbatim, because some of
it is quite specific to testing PgBouncer. But I do think there's
quite a few things that could probably be copied (or at least inspire
what you do). Some examples:

1. helpers to easily run shell commands, most importantly setting
check=True by default[1] (a rough sketch of this one follows the links below)
2. helper to get a free tcp port[2]
3. helper to check if the log contains a specific string[3]
4. automatically show PG logs on test failure[4]
5. helpers to easily run sql commands (psycopg interface isn't very
user friendly imho for the common case)[5]
6. startup/teardown cleanup logic[6]

[1]: 
https://github.com/pgbouncer/pgbouncer/blob/3f791020fb017c570fcd2db390600a353f1cba0c/test/utils.py#L83-L131
[2]: 
https://github.com/pgbouncer/pgbouncer/blob/3f791020fb017c570fcd2db390600a353f1cba0c/test/utils.py#L210-L233
[3]: 
https://github.com/pgbouncer/pgbouncer/blob/3f791020fb017c570fcd2db390600a353f1cba0c/test/utils.py#L1125-L1143
[4]: 
https://github.com/pgbouncer/pgbouncer/blob/3f791020fb017c570fcd2db390600a353f1cba0c/test/utils.py#L1075-L1103
[5]: 
https://github.com/pgbouncer/pgbouncer/blob/3f791020fb017c570fcd2db390600a353f1cba0c/test/utils.py#L326-L338
[6]: 
https://github.com/pgbouncer/pgbouncer/blob/3f791020fb017c570fcd2db390600a353f1cba0c/test/utils.py#L546-L642
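
As a rough sketch of helper (1) above (this is not PgBouncer's actual code,
just a minimal illustration of the idea):

import shlex
import subprocess

def run(command, *, check=True, shell=False, **kwargs):
    """Run a shell command, raising on a non-zero exit unless check=False."""
    args = command if shell else shlex.split(command)
    return subprocess.run(args, check=check, shell=shell,
                          capture_output=True, text=True, **kwargs)

# usage: run("pg_ctl -D ./pgdata status")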


On Mon, 10 Jun 2024 at 22:04, Andres Freund  wrote:
> > Problem 1 (rerun failing tests): One architectural roadblock to this
> > in our Test::More suite is that tests depend on setup that's done by
> > previous tests. pytest allows you to declare each test's setup
> > requirements via pytest fixtures, letting the test runner build up the
> > world exactly as it needs to be for a single isolated test. These
> > fixtures may be given a "scope" so that multiple tests may share the
> > same setup for performance or other reasons.
>
> OTOH, that's quite likely to increase overall test times very
> significantly. Yes, sometimes that can be avoided with careful use of various
> features, but often that's hard, and IME is rarely done rigorously.

You definitely want to cache things like initdb and "pg_ctl start".
But that's fairly easy to do with some startup/teardown logic. For
PgBouncer I create a dedicated schema for each test that needs to
create objects and automatically drop that schema at the end of the
test[6] (including any temporary objects outside of schemas like
users/replication slots). You can even choose not to clean up certain
large schemas if they are shared across multiple tests.

[6]: 
https://github.com/pgbouncer/pgbouncer/blob/3f791020fb017c570fcd2db390600a353f1cba0c/test/utils.py#L546-L642
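
A bare-bones sketch of that pattern with pytest fixtures (the cluster and SQL
helpers used here are assumed placeholders, not real APIs): a session-scoped
fixture pays the initdb/"pg_ctl start" cost once, while a per-test fixture
creates and drops a dedicated schema.

import pytest

@pytest.fixture(scope="session")
def pg(tmp_path_factory):
    cluster = start_cluster(tmp_path_factory.mktemp("pgdata"))  # assumed helper
    yield cluster
    cluster.stop()

@pytest.fixture
def test_schema(pg, request):
    name = request.node.name.replace("[", "_").replace("]", "_")
    pg.sql(f'CREATE SCHEMA "{name}"')        # assumed helper
    yield name
    pg.sql(f'DROP SCHEMA "{name}" CASCADE')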

> > Problem 2 (seeing what failed): pytest does this via assertion
> > introspection and very detailed failure reporting. If you haven't seen
> > this before, take a look at the pytest homepage [1]; there's an
> > example of a full log.
>
> That's not really different than what the perl tap test stuff allows. We
> indeed are bad at utilizing it, but I'm not sure that switching languages will
> change that.

It's not about allowing, it's about doing the thing that you want by
default. The following code

assert a == b

will show you the actual values of both a and b when the test fails,
instead of saying something like "false is not true". Of course you can
provide a message here too, like with Perl's ok function, but even
when you don't the output is helpful.
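
A tiny self-contained example of that introspection (nothing
Postgres-specific here):

def test_row_count():
    expected = 3
    actual = 2      # imagine this came from a query
    assert actual == expected
    # on failure pytest prints the values, e.g. "assert 2 == 3",
    # rather than a bare "false is not true"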

> I think part of the problem is that the information about what precisely
> failed is often much harder to collect when testing multiple servers
> interacting than when doing localized unit tests.
>
> I think we ought to invest a bunch in improving that, I'd hope that a lot of
> that work would be largely independent of the language the tests are written
> in.

Well, as you already noted no-one that started doing dev stuff in the
last 10 years knows Perl nor wants to learn it. So a large part of the
community tries to touch the current perl test suite as little as
possible. I personally haven't tried to improve anything about our
perl testing framework, even though I'm normally very much into
improving developer tooling.


> > Python's standard library has lots of power by itself, with very good
> > documentation. And virtualenvs and better packa

Re: RFC: adding pytest as a supported test framework

2024-06-10 Thread Jelte Fennema-Nio
On Mon, 10 Jun 2024 at 22:47, Andrew Dunstan  wrote:
> As for what up and coming developers learn, they mostly don't learn C either, 
> and that's far more critical to what we do.

I think many up and coming devs have at least touched C somewhere
(e.g. in university). And because it's more critical to the project
and also to many other low level projects, they don't mind learning it
so much if they don't know it yet. But I, for example, try to write as
few Perl tests as possible, because getting good at Perl has pretty
much no use to me outside of writing tests for postgres.

(I do personally think that official Rust support in Postgres would
probably be a good thing, but that is a whole other discussion that
I'd like to save for some other day)

> But let's not throw the baby out with the bathwater. Quite apart from 
> anything else, a wholesale rework of the test infrastructure would make 
> backpatching more painful.

Backporting test improvements to decrease backporting pain is
something we don't look badly upon afaict (Citus's test suite breaks
semi-regularly on minor PG version updates due to some slight output
changes introduced by e.g. an updated version of the isolationtester).




Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Robert Haas
On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
 wrote:
> Hm.  Should we measure the actual time spent sleeping, or is a rough
> estimate good enough?  I believe pg_usleep() might return early (e.g., if
> the process is signaled) or late, so this field could end up being
> inaccurate, although probably not by much.  If we're okay with millisecond
> granularity, my first instinct is that what you've proposed is fine, but I
> figured I'd bring it up anyway.

I bet you could also sleep for longer than planned, throwing the
numbers off in the other direction.

I'm always suspicious of this sort of thing. I tend to find nothing
gives me the right answer unless I assume that the actual sleep times
are randomly and systematically different from the intended sleep
times by arbitrarily large amounts. I think we should at least do
some testing: if we measure both the intended sleep time and the
actual sleep time, how close are they? Does it change if the system is
under crushing load (which might elongate sleeps) or if we spam
SIGUSR1 against the vacuum process (which might shorten them)?
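
A rough sketch of how such a measurement could look (not meant as backend
code as-is; it simply wraps pg_usleep() with the instr_time.h machinery):

#include "postgres.h"
#include "portability/instr_time.h"

static void
sleep_and_measure(double intended_msec,
                  double *total_intended_ms, double *total_actual_ms)
{
    instr_time  start_t;
    instr_time  end_t;

    INSTR_TIME_SET_CURRENT(start_t);
    pg_usleep((long) (intended_msec * 1000));
    INSTR_TIME_SET_CURRENT(end_t);
    INSTR_TIME_SUBTRACT(end_t, start_t);

    *total_intended_ms += intended_msec;
    *total_actual_ms += INSTR_TIME_GET_MILLISEC(end_t);
}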

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Allow logical failover slots to wait on synchronous replication

2024-06-10 Thread John H
Hi hackers,

Building on bf279ddd1c, this patch introduces a GUC
'standby_slot_names_from_syncrep' which allows logical failover slots
to wait for changes to have been synchronously replicated before sending
the decoded changes to logical subscribers.

The existing 'standby_slot_names' isn't great for users who are running
clusters with quorum-based synchronous replicas. For instance, if
the user has  synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a
bit tedious to have to reconfigure the standby_slot_names to set it to
the most updated 3 sync replicas whenever different sync replicas start
lagging. In the event that both GUCs are set, 'standby_slot_names' takes
precedence.
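
To make the intent concrete, a hedged configuration sketch (the GUC names are
as in the attached patch; the slot names are invented):

# quorum-based synchronous replication
synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'

# proposed: failover slots wait for whatever set of standbys satisfies the quorum
standby_slot_names_from_syncrep = 'true'

# instead of hand-maintaining a list that tracks the current quorum:
# standby_slot_names = 'slot_a,slot_b,slot_c'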

I did some very brief pgbench runs to compare the latency. Client instance
was running pgbench and 10 logical clients while the Postgres box hosted
the writer and 5 synchronous replicas.

There's a hit to TPS, which I'm thinking is due to more contention on the
SyncRepLock, and that scales with the number of logical walsenders. I'm
guessing we can avoid this if we introduce another set of
lsn[NUM_SYNC_REP_WAIT_MODE] and have the logical walsenders check
and wait on that instead but I wasn't sure if that's the right approach.

pgbench numbers:

// Empty standby_slot_names_from_syncrep
query mode: simple
number of clients: 8
number of threads: 8
maximum number of tries: 1
duration: 1800 s
number of transactions actually processed: 1720173
number of failed transactions: 0 (0.000%)
latency average = 8.371 ms
initial connection time = 7.963 ms
tps = 955.651025 (without initial connection time)

// standby_slot_names_from_syncrep = 'true'
scaling factor: 200
query mode: simple
number of clients: 8
number of threads: 8
maximum number of tries: 1
duration: 1800 s
number of transactions actually processed: 1630105
number of failed transactions: 0 (0.000%)
latency average = 8.834 ms
initial connection time = 7.670 ms
tps = 905.610230 (without initial connection time)

Thanks,

--
John Hsu - Amazon Web Services


0001-Introduce-a-new-guc-standby_slot_names_from_syncrep.patch
Description: Binary data


Re: CheckMyDatabase some error messages in two lines.

2024-06-10 Thread Michael Paquier
On Sun, Jun 09, 2024 at 10:12:53PM -0400, Tom Lane wrote:
> No doubt.  People have done it both ways in the past, but I think
> currently there's a weak consensus in favor of using one line for
> such messages even when it runs past 80 columns, mainly because
> that makes it easier to grep the source code for a message text.

I recall the same consensus here.  Greppability matters across the
board.

> But: I don't see too much value in changing this particular instance,
> because the line break is in a place where it would not likely cause
> you to miss finding the line.  You might grep for the first part of
> the string or the second part, but probably not for ", which is not".
> If the line break were in the middle of a phrase, there'd be more
> argument for collapsing it out.

Not sure these ones are worth it, either, so I'd let them be.
--
Michael


signature.asc
Description: PGP signature


Re: Improve the granularity of PQsocketPoll's timeout parameter?

2024-06-10 Thread Jeff Davis
On Mon, 2024-06-10 at 17:39 -0400, Tom Lane wrote:
> What I suggest is that we use int64 microseconds
> since the epoch, which is the same idea as the backend's TimestampTz
> except I think we'd better use the Unix epoch not 2000-01-01.
> Then converting code is just a matter of changing variable types
> and adding some zeroes to constants.

...

> Lastly, we need a way to get current time in this form.  My first
> draft of the attached patch had the callers calling gettimeofday
> and doing arithmetic from that, but it seems a lot better to provide
> a function that just parallels time(2).

I briefly skimmed the thread and didn't find the reason why the API
requires an absolute time.

My expectation would be for the last parameter to be a relative timeout
("wait up to X microseconds"). That avoids the annoyance of creating a
new definition of absolute time and exposing a new function to retrieve
it.

Regards,
Jeff Davis





Re: Improve the granularity of PQsocketPoll's timeout parameter?

2024-06-10 Thread Tom Lane
Jeff Davis  writes:
> I briefly skimmed the thread and didn't find the reason why the API
> requires an absolute time.

Because a common call pattern is to loop around PQsocketPoll calls.
In that scenario you generally want to nail down the timeout time
before starting the loop, not have it silently move forward after
any random event that breaks the current wait (EINTR for example).
pqSocketCheck and pqConnectDBComplete both rely on this.
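
A sketch of that pattern (helper names follow the proposed patch and may
differ in the final API; it assumes, per the draft docs, that errors are
reported via errno):

#include <errno.h>
#include "libpq-fe.h"

static int
wait_for_read(PGconn *conn, long long int timeout_us)
{
    int           sock = PQsocket(conn);
    long long int deadline = PQgetCurrentTimeUSec() + timeout_us;

    for (;;)
    {
        int rc = PQsocketPoll(sock, 1 /* forRead */, 0, deadline);

        if (rc < 0 && errno == EINTR)
            continue;       /* interrupted: retry with the same deadline */
        return rc;          /* >0 ready, 0 timeout, -1 error */
    }
}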

regards, tom lane




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-10 Thread Michael Paquier
On Thu, Jun 06, 2024 at 03:19:20PM +0900, Kyotaro Horiguchi wrote:
> During server shutdown, the latter half of the last continuation
> record may fail to be flushed. This is similar to what is described in
> the commit message of commit ff9f111bce. While shutting down,
> WalSndLoop() waits for XLogSendLogical() to consume WAL up to
> flushPtr, but in this case, the last record cannot complete without
> the continuation part starting from flushPtr, which is
> missing. However, in such cases, xlogreader.missingContrecPtr is set
> to the beginning of the missing part, but something similar to 

-/* If EndRecPtr is still past our flushPtr, it means we caught up. */
-if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+/*
+ * If EndRecPtr is still past our flushPtr, it means we caught up.  When
+ * the server is shutting down, the latter part of a continuation record
+ * may be missing.  If got_STOPPING is true, assume we are caught up if the
+ * last record is missing its continuation part at flushPtr.
+ */
+if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+(got_STOPPING &&
+ logical_decoding_ctx->reader->missingContrecPtr == flushPtr))

FWIW, I don't have a better idea than what you are proposing here.  We
just cannot receive more data past the page boundary in a shutdown
sequence in this context, so checking after the missingContrecPtr
is a good compromise to ensure that we don't remain stuck when
shutting down a logical WAL sender.  I'm surprised that we did not
hear about that more often on the lists, or perhaps we did but just
discarded it?

This is going to take some time to check across all the branches down
to v12 that this is stable, because this area of the code tends to
change slightly every year.  Well, that's my job.

> So, I believe the attached small patch fixes the behavior. I haven't
> come up with a good test script for this issue. Something like
> 026_overwrite_contrecord.pl might work, but this situation seems a bit
> more complex than what it handles.

Hmm.  Indeed you will not be able to reuse the same trick with the end
of a segment.  Still you should be able to get a rather stable test by
using the same tricks as 039_end_of_wal.pl to spawn a record across
multiple pages, no?  
--
Michael


signature.asc
Description: PGP signature


Re: Format the code in xact_decode

2024-06-10 Thread Michael Paquier
On Mon, Jun 10, 2024 at 06:03:40PM +0800, cca5507 wrote:
> Thank you for the reply!

No objections here, either.  

> I have added a new patch in the commitfest: Format the code in xact_decode
> (postgresql.org)

Thanks for tracking that.  For reference:
https://commitfest.postgresql.org/48/5028/
--
Michael


signature.asc
Description: PGP signature


Re: RFC: adding pytest as a supported test framework

2024-06-10 Thread Andres Freund
Hi,

On 2024-06-10 16:46:56 -0400, Andrew Dunstan wrote:
> 
> On 2024-06-10 Mo 16:04, Andres Freund wrote:
> > Hi,
> > 
> > 
> > Just for context for the rest the email: I think we desperately need to move
> > off perl for tests. The infrastructure around our testing is basically
> > unmaintained and just about nobody that started doing dev stuff in the last 
> > 10
> > years learned perl.

> Andres,
> 
> I get that you don't like perl.

I indeed don't particularly like perl - but that's really not the main
issue. I've already learned [some of] it. The main issue is that I've
also watched several newer folks try to write tests in it, and it was not
pretty.


> But it's hard for me to take this terribly seriously. "desperately" seems
> like massive overstatement at best.

Shrug.


> As for what up and coming developers learn, they mostly don't learn C
> either, and that's far more critical to what we do.

C is a lot more useful to them than perl. And it's actually far more
widely known these days than perl. C does teach you some reasonably
low-level-ish understanding of hardware. There are gazillions of programs
written in C that we'll have to maintain for decades. I don't think that's
comparably true for perl.


> I'm not sure what part of the testing infrastructure you think is
> unmaintained. For example, the last release of Test::Simple was all the way
> back on April 25.

IPC::Run is quite buggy and basically just maintained by Noah these days.

Greetings,

Andres Freund




Re: ON ERROR in json_query and the like

2024-06-10 Thread jian he
On Tue, May 28, 2024 at 5:29 PM Markus Winand  wrote:
>
> Hi!
>
> I’ve noticed two “surprising” (to me) behaviors related to
> the “ON ERROR” clause of the new JSON query functions in 17beta1.
>
> 1. JSON parsing errors are not subject to ON ERROR
>Apparently, the functions expect JSONB so that a cast is implied
>when providing TEXT. However, the errors during that cast are
>not subject to the ON ERROR clause.
>
>17beta1=# SELECT JSON_QUERY('invalid', '$' NULL ON ERROR);
>ERROR:  invalid input syntax for type json
>DETAIL:  Token "invalid" is invalid.
>CONTEXT:  JSON data, line 1: invalid
>
>Oracle DB and Db2 (LUW) both return NULL in that case.
>
>I had a look on the list archive to see if that is intentional but
>frankly speaking these functions came a long way. In case it is
>intentional it might be worth adding a note to the docs.
>

json_query ( context_item, path_expression);

`SELECT JSON_QUERY('invalid', '$' NULL ON ERROR);`
to make this return NULL, that means catching all the errors that
happen during context_item evaluation;
otherwise, it would not be consistent?

Currently context_item expressions can be quite arbitrary.
Consider the following examples.

create or replace function test(jsonb) returns jsonb as $$ begin raise
exception 'abort'; end $$ language plpgsql;
create or replace function test1(jsonb) returns jsonb as $$ begin
return $1; end $$ language plpgsql;
SELECT JSON_VALUE(test('1'), '$');
SELECT JSON_VALUE(test1('1'), '$');
SELECT JSON_VALUE((select '1'::jsonb), '$');
SELECT JSON_VALUE((with cte(s) as (select '1') select s::jsonb from cte), '$');
SELECT JSON_VALUE((with cte(s) as (select '1') select s::jsonb from
cte union all select s::jsonb from cte limit 1), '$');

Currently, I don't think we can make
SELECT JSON_VALUE(test('1'), '$' null on error);
return NULL.




Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-10 Thread Nathan Bossart
On Fri, Jun 07, 2024 at 11:10:25AM +0200, Matthias van de Meent wrote:
> My primary concern isn't the IO, but the O(shared_buffers) that we
> have to go through during a checkpoint. As I mentioned upthread, it is
> reasonably possible the new cluster is already setup with a good
> fraction of the old system's shared_buffers configured. Every
> checkpoint has to scan all those buffers, which IMV can get (much)
> more expensive than the IO overhead caused by the WAL_LOG strategy. It
> may be a baseless fear as I haven't done the performance benchmarks
> for this, but I wouldn't be surprised if shared_buffers=8GB would
> measurably impact the upgrade performance in the current patch (vs the
> default 128MB).

I did a handful of benchmarks on an r5.24xlarge that seem to prove your
point.  The following are the durations of the pg_restore step of
pg_upgrade:

* 10k empty databases, 128MB shared_buffers
  WAL_LOG:   1m 01s
  FILE_COPY: 0m 22s

* 10k empty databases, 100GB shared_buffers
  WAL_LOG:   2m 03s
  FILE_COPY: 5m 08s

* 2.5k databases with 10k tables each, 128MB shared_buffers
  WAL_LOG:   17m 20s
  FILE_COPY: 16m 44s

* 2.5k databases with 10k tables each, 100GB shared_buffers
  WAL_LOG:   16m 39s
  FILE_COPY: 15m 21s

I was surprised with the last result, but there's enough other stuff
happening during such a test that I hesitate to conclude much.

> I'll note that the documentation for upgrading with pg_upgrade has the
> step for updating postgresql.conf / postgresql.auto.conf only after
> pg_upgrade has run already, but that may not be how it's actually
> used: after all, we don't have full control in this process, the user
> is the one who provides the new cluster with initdb.

Good point.  I think it's clear that FILE_COPY is not necessarily a win in
all cases for pg_upgrade.

>>> If such a change were implemented (i.e. no checkpoints for FILE_COPY
>>> in binary upgrade, with a single manual checkpoint after restoring
>>> template1 in create_new_objects) I think most of my concerns with this
>>> patch would be alleviated.
>>
>> Yeah, I think that's a valid point. The second checkpoint is to ensure
>> that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for
>> binary upgrades, we don't need that guarantee because a checkpoint
>> will be performed during shutdown at the end of the upgrade anyway.
> 
> Indeed.

It looks like pg_dump always uses template0, so AFAICT we don't even need
the suggested manual checkpoint after restoring template1.

I do like the idea of skipping a bunch of unnecessary operations in binary
upgrade mode, since it'll help me in my goal of speeding up pg_upgrade.
But I'm a bit hesitant to get too fancy here and to introduce a bunch of
new "if (IsBinaryUpgrade)" checks if the gains in the field won't be all
that exciting.  However, we've already sprinkled such checks quite
liberally, so maybe I'm being too cautious...

-- 
nathan




Re: GUC names in messages

2024-06-10 Thread Peter Smith
On Tue, May 28, 2024 at 4:16 PM Peter Smith  wrote:
>
...
>
> The new GUC quoting patches are separated by different GUC types only
> to simplify my processing of them.
>
> v7-0001 = Add quotes for GUCs - bool
> v7-0002 = Add quotes for GUCs - int
> v7-0003 = Add quotes for GUCs - real
> v7-0004 = Add quotes for GUCs - string
> v7-0005 = Add quotes for GUCs - enum
>
> The other v7 patches are just carried forward unchanged from v6:
>
> v7-0006 = fix case for IntervalStyle
> v7-0007 = fix case for Datestyle
> v7-0008 = make common translatable message strings
>
> 

Hi,

Here is a new patch set v8*, which is the same as v7* but fixes an
error in v7-0008 detected by cfbot.

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v8-0001-Add-quotes-for-GUCs-bool.patch
Description: Binary data


v8-0005-Add-quotes-for-GUCs-enum.patch
Description: Binary data


v8-0003-Add-quotes-for-GUCs-real.patch
Description: Binary data


v8-0004-Add-quotes-for-GUCs-string.patch
Description: Binary data


v8-0002-Add-quotes-foir-GUCs-int.patch
Description: Binary data


v8-0007-GUC-names-fix-case-datestyle.patch
Description: Binary data


v8-0006-GUC-names-fix-case-intervalstyle.patch
Description: Binary data


v8-0008-GUC-names-make-common-translatable-messages.patch
Description: Binary data


Re: race condition in pg_class

2024-06-10 Thread Noah Misch
On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier  wrote:
> > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > > It's not this patch set's fault, but I'm not very pleased to see that
> > > the injection point wait events have been shoehorned into the
> > > "Extension" category - which they are not - instead of being a new
> > > wait_event_type. That would have avoided the ugly wait-event naming
> > > pattern, inconsistent with everything else, introduced by
> > > inplace050-tests-inj-v1.patch.
> >
> > Not sure to agree with that.  The set of core backend APIs supporting
> > injection points have nothing to do with wait events.  The library
> > attached to one or more injection points *may* decide to use a wait
> > event like what the wait/wakeup calls in modules/injection_points do,
> > but that's entirely optional.  These rely on custom wait events,
> > plugged into the Extension category as the code run is itself in an
> > extension.  I am not arguing against the point that it may be
> > interesting to plug in custom wait event categories, but the current
> > design of wait events makes that much harder than what core is
> > currently able to handle, and I am not sure that this brings much at
> > the end as long as the wait event strings can be customized.
> >
> > I've voiced upthread concerns over the naming enforced by the patch
> > and the way it plugs the namings into the isolation functions, by the
> > way.
> 
> I think the core code should provide an "Injection Point" wait event
> type and let extensions add specific wait events there, just like you
> did for "Extension".

Michael, could you accept the core code offering that, or not?  If so, I am
content to implement that.  If not, for injection point wait events, I have
just one priority.  The isolation tester already detects lmgr locks without
the test writer teaching it about each lock individually.  I want it to have
that same capability for injection points.  Do you think we can find something
everyone can accept, having that property?  These wait events show up in tests
only, and I'm happy to make the cosmetics be anything compatible with that
detection ability.




Re: Allow logical failover slots to wait on synchronous replication

2024-06-10 Thread Nathan Bossart
On Mon, Jun 10, 2024 at 03:51:05PM -0700, John H wrote:
> The existing 'standby_slot_names' isn't great for users who are running
> clusters with quorum-based synchronous replicas. For instance, if
> the user has  synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a
> bit tedious to have to reconfigure the standby_slot_names to set it to
> the most updated 3 sync replicas whenever different sync replicas start
> lagging. In the event that both GUCs are set, 'standby_slot_names' takes
> precedence.

Hm.  IIUC you'd essentially need to set standby_slot_names to "A,B,C,D,E"
to get the desired behavior today.  That might ordinarily be okay, but it
could cause logical replication to be held back unnecessarily if one of the
replicas falls behind for whatever reason.  A way to tie standby_slot_names
to synchronous replication instead does seem like it would be useful in
this case.

-- 
nathan




walsender.c fileheader comment

2024-06-10 Thread Peter Smith
Hi,

I was reading the walsender.c fileheader comment while studying
another thread. I think if there is logical replication in progress
then the PROCSIG_WALSND_INIT_STOPPING handler will *always* switch to
a "stopping" state: e.g.,

/*
 * Handle PROCSIG_WALSND_INIT_STOPPING signal.
 */
void
HandleWalSndInitStopping(void)
{
Assert(am_walsender);

/*
* If replication has not yet started, die like with SIGTERM. If
* replication is active, only set a flag and wake up the main loop. It
* will send any outstanding WAL, wait for it to be replicated to the
* standby, and then exit gracefully.
*/
if (!replication_active)
kill(MyProcPid, SIGTERM);
else
got_STOPPING = true;
}

~~~

But the walsender.c fileheader comment seems to be saying something
slightly different. IIUC, some minor rewording of the comment is
needed so it describes the code better.

HEAD
...
 * shutdown, if logical replication is in progress all existing WAL records
 * are processed followed by a shutdown.  Otherwise this causes the walsender
 * to switch to the "stopping" state. In this state, the walsender will reject
 * any further replication commands. The checkpointer begins the shutdown
 ...

SUGGESTION
.. shutdown. If logical replication is in progress, the walsender
switches to a "stopping" state. In this state, the walsender will
reject any further replication commands - but all existing WAL records
are processed - followed by a shutdown.

~~~

I attached a patch for the above-suggested change.

Thoughts?

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-walsender-fileheader-comment.patch
Description: Binary data


Re: Logical Replication of sequences

2024-06-10 Thread vignesh C
On Mon, 10 Jun 2024 at 14:48, Amit Kapila  wrote:
>
> On Mon, Jun 10, 2024 at 12:43 PM Masahiko Sawada  
> wrote:
> >
> > On Mon, Jun 10, 2024 at 3:14 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Jun 7, 2024 at 7:30 PM Amit Kapila  
> > > wrote:
> > > >
> > > >
> > > > Are you imagining the behavior for sequences associated with tables
> > > > differently than the ones defined by the CREATE SEQUENCE .. command? I
> > > > was thinking that users would associate sequences with publications
> > > > similar to what we do for tables for both cases. For example, they
> > > > need to explicitly mention the sequences they want to replicate by
> > > > commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE
> > > > PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
> > > > SEQUENCES IN SCHEMA sch1;
> > > >
> > > > In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
> > > > should copy both the explicitly defined sequences and sequences
> > > > defined with the tables. Do you think a different variant for just
> > > > copying sequences implicitly associated with tables (say for identity
> > > > columns)?
> > >
> > > Oh, I was thinking that your proposal was to copy literally all
> > > sequences by REPLICA/REFRESH SEQUENCE command.
> > >
>
> I am trying to keep the behavior as close to tables as possible.
>
> > > But it seems to make
> > > sense to explicitly specify the sequences they want to replicate. It
> > > also means that they can create a publication that has only sequences.
> > > In this case, even if they create a subscription for that publication,
> > > we don't launch any apply workers for that subscription. Right?
> > >
>
> Right, good point. I had not thought about this.
>
> > > Also, given that the main use case (at least as the first step) is
> > > version upgrade, do we really need to support SEQUENCES IN SCHEMA and
> > > even FOR SEQUENCE?
> >
>
> At the very least, we can split the patch to move these variants to a
> separate patch. Once the main patch is finalized, we can try to
> evaluate the remaining separately.

I engaged in an offline discussion with Amit about strategizing the
division of patches to facilitate the review process. We agreed on the
following split: The first patch will encompass the setting and
getting of sequence values (core sequence changes). The second patch
will cover all changes on the publisher side related to "FOR ALL
SEQUENCES." The third patch will address subscriber side changes aimed
at synchronizing "FOR ALL SEQUENCES" publications. The fourth patch
will focus on supporting "FOR SEQUENCE" publication. Lastly, the fifth
patch will introduce support for  "FOR ALL SEQUENCES IN SCHEMA"
publication.

I will work on this and share an updated patch for the same soon.

Regards,
Vignesh




Re: ODBC Source Downloads Missing

2024-06-10 Thread Kashif Zeeshan
Getting the same issue at my end; the error message is "The URL you
specified does not exist.".

On Tue, Jun 11, 2024 at 12:33 AM Mark Hill  wrote:

> Is there an issue with the ODBC Source downloads today?
>
> The source download URL isn’t working:
> https://www.postgresql.org/ftp/odbc/versions/src/
>
>
>
> Thanks, Mark
>


Re: Logical Replication of sequences

2024-06-10 Thread Amul Sul
On Mon, Jun 10, 2024 at 5:00 PM vignesh C  wrote:

> On Mon, 10 Jun 2024 at 12:24, Amul Sul  wrote:
> >
> >
> >
> > On Sat, Jun 8, 2024 at 6:43 PM vignesh C  wrote:
> >>
> >> On Wed, 5 Jun 2024 at 14:11, Amit Kapila 
> wrote:
> >> [...]
> >> A new catalog table, pg_subscription_seq, has been introduced for
> >> mapping subscriptions to sequences. Additionally, the sequence LSN
> >> (Log Sequence Number) is stored, facilitating determination of
> >> sequence changes occurring before or after the returned sequence
> >> state.
> >
> >
> > Can't it be done using pg_depend? It seems a bit excessive unless I'm
> missing
> > something.
>
> We'll require the lsn because the sequence LSN informs the user that
> it has been synchronized up to the LSN in pg_subscription_seq. Since
> we are not supporting incremental sync, the user will be able to
> identify if he should run refresh sequences or not by checking the lsn
> of the pg_subscription_seq and the lsn of the sequence(using
> pg_sequence_state added) in the publisher.  Also, this parallels our
> implementation for pg_subscription_seq and will aid in expanding for
> a) incremental synchronization and b) utilizing workers for
> synchronization using sequence states if necessary.
>
> How do you track sequence mapping with the publication?
>
> In the publisher we use pg_publication_rel and
> pg_publication_namespace for mapping the sequences with the
> publication.
>

Thanks for the explanation. I'm wondering what the complexity would be, if
we
wanted to do something similar on the subscriber side, i.e., tracking via
pg_subscription_rel.

Regards,
Amul


Re: Improve the granularity of PQsocketPoll's timeout parameter?

2024-06-10 Thread Jeff Davis
On Mon, 2024-06-10 at 19:57 -0400, Tom Lane wrote:
> Because a common call pattern is to loop around PQsocketPoll calls.
> In that scenario you generally want to nail down the timeout time
> before starting the loop, not have it silently move forward after
> any random event that breaks the current wait (EINTR for example).
> pqSocketCheck and pqConnectDBComplete both rely on this.

I agree it makes things easier for a caller following that pattern,
because it doesn't need to recalculate the timeout each time through
the loop.

But:

1. If your clock goes backwards, you can end up waiting for an
arbitrarily long time. To prevent that you need to do some
recalculation each time through the loop anyway.

2. Inventing a new absolute time type just for this single purpose
seems strange to me. Would it be useful in other places? Are we going
to define what kinds of operations/transformations are supported?

3. I can't recall another API that uses absolute time for a timeout;
are you aware of a precedent?

Regards,
Jeff Davis





Re: race condition in pg_class

2024-06-10 Thread Michael Paquier
On Mon, Jun 10, 2024 at 07:19:27PM -0700, Noah Misch wrote:
> On Fri, Jun 07, 2024 at 09:08:03AM -0400, Robert Haas wrote:
>> I think the core code should provide an "Injection Point" wait event
>> type and let extensions add specific wait events there, just like you
>> did for "Extension".
> 
> Michael, could you accept the core code offering that, or not?  If so, I am
> content to implement that.  If not, for injection point wait events, I have
> just one priority.  The isolation tester already detects lmgr locks without
> the test writer teaching it about each lock individually.  I want it to have
> that same capability for injection points. Do you think we can find something
> everyone can accept, having that property?  These wait events show up in tests
> only, and I'm happy to make the cosmetics be anything compatible with that
> detection ability.

Adding a wait event class for injection point is an interesting
suggestion that would simplify the detection in the isolation function
quite a bit.  Are you sure that this is something that would be fit
for v17 material?  TBH, I am not sure.

At the end, the test coverage has the highest priority and the bugs
you are addressing are complex enough that isolation tests of this
level are a necessity, so I don't object to what
inplace050-tests-inj-v2.patch introduces with the naming dependency
for the time being on HEAD.  I'll just adapt and live with that
depending on what I deal with, while trying to improve HEAD later on.

I'm still wondering if there is something that could be more elegant
than a dedicated class for injection points, but I cannot think of
something that would be better for isolation tests off the top of my head.
If there is something I can think of, I'll just go and implement it :)
--
Michael


signature.asc
Description: PGP signature


Re: Improve the granularity of PQsocketPoll's timeout parameter?

2024-06-10 Thread Tom Lane
Jeff Davis  writes:
> I agree it makes things easier for a caller following that pattern,
> because it doesn't need to recalculate the timeout each time through
> the loop.

> But:

> 1. If your clock goes backwards, you can end up waiting for an
> arbitrarily long time. To prevent that you need to do some
> recalculation each time through the loop anyway.

[ shrug... ] The only reason this has come up is that f5e4dedfa
exposed what was previously just libpq-private code.  Given that
that code has operated in this way for a couple of decades with
approximately zero trouble reports, I'm not very interested in
re-litigating its theory of operation.  The more so if you don't
have a concrete better alternative to propose.

> 2. Inventing a new absolute time type just for this single purpose
> seems strange to me. Would it be useful in other places? Are we going
> to define what kinds of operations/transformations are supported?

I'm not that thrilled with inventing a new time type just for this,
either.  However, time_t is not very fit for purpose, so do you
have a different suggestion?

We could make it a bit nicer-looking by wrapping "long long int"
in a typedef, but that's only cosmetic.

> 3. I can't recall another API that uses absolute time for a timeout;
> are you aware of a precedent?

The other thing that I've seen done is for select(2) to treat the
timeout as an in/out parameter, decrementing it by the amount of
time slept.  I hope you'll agree that that's a monstrous kluge.

regards, tom lane




Trying out read streams in pgvector (an extension)

2024-06-10 Thread Thomas Munro
Hi,

I was looking around for an exotic index type to try the experience of
streamifying an extension, ie out-of-core code.  I am totally new to
pgvector, but since everyone keeps talking about it, I could not avoid
picking up some basic facts in the pgconf.dev hallway track, and
understood that its scans have some degree of known-order access
predictability, and then also some degree of fuzzy-predictable
order-not-yet-determined access too.  It's also quite random in the
I/O sense.

Here's a toy to streamify the known-order part.  I think for the fuzzy
part that links those parts together, maybe there is some way to guess
when it's a reasonable time to speculatively prefetch the lowest order
stuff in the pairing heap, and then deal with it if you're wrong, but
I didn't try that...

Someone involved in that project mentioned that it's probably not a
great topic to research in practice, because real world users of HNSW
use fully cached ie prewarmed indexes, because the performance is so
bad otherwise.  (Though maybe that argument is a little circular...).
So although this patch clearly speeds up cold HNSW searches to a
degree controlled by effective_io_concurrency, I'll probably look for
something else.  Suggestions for interesting index types to look at
streamifying are very welcome!

Hmm.  If that's really true about HNSW though, then there may still be
an opportunity to do automatic memory prefetching[1].  But then in the
case of index building, "stream" is NULL in this patch anyway.  It
surely must also be possible to find some good places to put
profitable explicit pg_mem_prefetch() calls given the predictability
and the need to get only ~60ns ahead for that usage.  I didn't look
into that because I was trying to prove things about read_stream.c,
not get involved in another project :-D

Here ends my science experiment report, which I'm dropping here just
in case others see useful ideas here.  The main thing I learned about
the read stream API is that it'd be nice to be able to reset the
stream but preserve the distance (something that came up on the
streaming sequential scan thread for a different reason), to deal with
cases where look-ahead opportunities come in bursts but you want a
longer lived stream than I used here.  That is the reason the patch
creates and destroys temporary streams in a loop; doh.  It also
provides an interesting case study for what speculative random
look-ahead support might need to look like.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKNUMnqubrrz8pRBdEM8vHeSCZcNq7iqERmkt6zPtpA3g%40mail.gmail.com

=== setup 

create extension vector;

create or replace function random_vector(dimensions int)
returns vector language sql
begin atomic;
  select array_agg(random())::vector
from generate_series(1, dimensions);
end;

create table t (id serial, embedding vector(6));

insert into t (embedding)
select random_vector(6)
  from generate_series(1, 100);

set maintenance_work_mem = '2GB';

create index on t using hnsw(embedding vector_l2_ops);

=== test of a hot search, assuming repeated ===

select embedding <-> '[0.5,0.5,0.5,0.5,0.5,0.5]'::vector
  from t
 where embedding <-> '[0.5,0.5,0.5,0.5,0.5,0.5]'::vector < 0.2
 order by 1 limit 20;

=== test of a cold search, assuming empty caches ===

create or replace function test()
returns void
language plpgsql as
$$
declare
  my_vec vector(6) := random_vector(6);
begin
  perform embedding <-> my_vec
 from t
where embedding <-> my_vec < 0.2
order by 1 limit 20;
end;
$$;

select test();

(Make sure you remember to set effective_io_concurrency to an
interesting number if you want to generate a lot of overlapping
fadvise calls.)
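
For example (the value is only illustrative):

SET effective_io_concurrency = 64;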


0001-XXX-toy-use-of-read-stream-API.patch
Description: Binary data


Re: Logical Replication of sequences

2024-06-10 Thread vignesh C
On Tue, 11 Jun 2024 at 09:41, Amul Sul  wrote:
>
> On Mon, Jun 10, 2024 at 5:00 PM vignesh C  wrote:
>>
>> On Mon, 10 Jun 2024 at 12:24, Amul Sul  wrote:
>> >
>> >
>> >
>> > On Sat, Jun 8, 2024 at 6:43 PM vignesh C  wrote:
>> >>
>> >> On Wed, 5 Jun 2024 at 14:11, Amit Kapila  wrote:
>> >> [...]
>> >> A new catalog table, pg_subscription_seq, has been introduced for
>> >> mapping subscriptions to sequences. Additionally, the sequence LSN
>> >> (Log Sequence Number) is stored, facilitating determination of
>> >> sequence changes occurring before or after the returned sequence
>> >> state.
>> >
>> >
>> > Can't it be done using pg_depend? It seems a bit excessive unless I'm 
>> > missing
>> > something.
>>
>> We'll require the lsn because the sequence LSN informs the user that
>> it has been synchronized up to the LSN in pg_subscription_seq. Since
>> we are not supporting incremental sync, the user will be able to
>> identify if he should run refresh sequences or not by checking the lsn
>> of the pg_subscription_seq and the lsn of the sequence(using
>> pg_sequence_state added) in the publisher.  Also, this parallels our
>> implementation for pg_subscription_seq and will aid in expanding for
>> a) incremental synchronization and b) utilizing workers for
>> synchronization using sequence states if necessary.
>>
>> How do you track sequence mapping with the publication?
>>
>> In the publisher we use pg_publication_rel and
>> pg_publication_namespace for mapping the sequences with the
>> publication.
>
>
> Thanks for the explanation. I'm wondering what the complexity would be, if we
> wanted to do something similar on the subscriber side, i.e., tracking via
> pg_subscription_rel.

Because we won't utilize sync workers to synchronize the sequence, and
the sequence won't necessitate sync states like init, sync,
finishedcopy, syncdone, ready, etc., initially, I considered keeping
the sequences separate. However, I'm ok with using pg_subscription_rel
as it could potentially help in enhancing incremental synchronization
and parallelizing later on.

Regards,
Vignesh




Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

2024-06-10 Thread Amit Kapila
On Thu, Jun 6, 2024 at 11:49 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 6 Jun 2024 12:49:45 +1000, Peter Smith  wrote 
> in
> > Hi, I have reproduced this multiple times now.
> >
> > I confirmed the initial post/steps from Alexander. i.e. The test
> > script provided [1] gets itself into a state where function
> > ReadPageInternal (called by XLogDecodeNextRecord and commented "Wait
> > for the next page to become available") constantly returns
> > XLREAD_FAIL. Ultimately the test times out because WalSndLoop() loops
> > forever, since it never calls WalSndDone() to exit the walsender
> > process.
>
> Thanks for the repro; I believe I understand what's happening here.
>
> During server shutdown, the latter half of the last continuation
> record may fail to be flushed. This is similar to what is described in
> the commit message of commit ff9f111bce. While shutting down,
> WalSndLoop() waits for XLogSendLogical() to consume WAL up to
> flushPtr, but in this case, the last record cannot complete without
> the continuation part starting from flushPtr, which is
> missing.

Sorry, it is not clear to me why we failed to flush the last
continuation record in the logical walsender. I see that we try to flush
the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is
that not sufficient?

-- 
With Regards,
Amit Kapila.




Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Bertrand Drouvot
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 05:48:22PM +, Bertrand Drouvot wrote:
> > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
> >> I wonder if we should also
> >> surface the effective cost limit for each autovacuum worker.
> > 
> > I'm not sure about it as I think that it could be misleading: one could 
> > query
> > pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is
> > due to _this_ cost_limit. But that's not necessarily true as the cost_limit 
> > could
> > have changed multiple times since the vacuum started. So, unless there is
> > frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed 
> > and
> > the cost_limit could be misleadind IMHO.
> 
> Well, that's true for the delay, too, right (at least as of commit
> 7d71d3d)?

Yeah right, but the patch exposes the total amount of time the vacuum has
been delayed (not the cost_delay per se), which does not sound misleading to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Issue with the PRNG used by Postgres

2024-06-10 Thread Andrey M. Borodin
FWIW, yesterday we had one more reproduction of the stuck spinlock panic, which
does not seem to be an actual stuck spinlock.

I don’t see any valuable diagnostic information. The reproduction happened on a
hot standby. There’s a message in the logs on the primary at the same time, but
it does not seem to be related:
"process 3918804 acquired ShareLock on transaction 909261926 after 2716.594 ms"
PostgreSQL 14.11
The VM with this node does not seem heavily loaded; according to monitoring
there were just 2 busy backends before the panic shutdown.


> On 16 Apr 2024, at 20:54, Andres Freund  wrote:
> 
> Hi,
> 
> On 2024-04-15 10:54:16 -0400, Robert Haas wrote:
>> On Fri, Apr 12, 2024 at 3:33 PM Andres Freund  wrote:
>>> Here's a patch implementing this approach. I confirmed that before we 
>>> trigger
>>> the stuck spinlock logic very quickly and after we don't. However, if most
>>> sleeps are interrupted, it can delay the stuck spinlock detection a good
>>> bit. But that seems much better than triggering it too quickly.
>> 
>> +1 for doing something about this. I'm not sure if it goes far enough,
>> but it definitely seems much better than doing nothing.
> 
> One thing I started to be worried about is whether a patch ought to prevent
> the timeout used by perform_spin_delay() from increasing when
> interrupted. Otherwise a few signals can trigger quite long waits.
> 
> But as I can't quite see a way to make this accurate in the backbranches, I
> suspect something like what I posted is still a good first version.
> 


What kind of inaccuracy do you see?
The code in perform_spin_delay() does not seem to be much different across
REL_11_STABLE..REL_12_STABLE.
The only difference I see is how the random number is generated.

Thanks!


Best regards, Andrey Borodin.



Re: Improve the granularity of PQsocketPoll's timeout parameter?

2024-06-10 Thread Jeff Davis
On Tue, 2024-06-11 at 00:52 -0400, Tom Lane wrote:
> 
> I'm not that thrilled with inventing a new time type just for this,
> either.  However, time_t is not very fit for purpose, so do you
> have a different suggestion?

No, I don't have a great alternative, so I don't object to your
solutions for f5e4dedfa8.

Regards,
Jeff Davis





Re: Sort functions with specialized comparators

2024-06-10 Thread Stepan Neretin
On Sat, Jun 8, 2024 at 1:50 AM Stepan Neretin  wrote:

> Hello all.
>
> I am interested in the proposed patch and would like to propose some
> additional changes that would complement it. My changes would introduce
> similar optimizations when working with a list of integers or object
> identifiers. Additionally, my patch includes an extension for benchmarking,
> which shows an average speedup of 30-40%.
>
> postgres=# SELECT bench_oid_sort(100);
>                                             bench_oid_sort
> ------------------------------------------------------------------------------------------------------------
>  Time taken by list_sort: 116990848 ns, Time taken by list_oid_sort: 80446640 ns, Percentage difference: 31.24%
> (1 row)
>
> postgres=# SELECT bench_int_sort(100);
>                                             bench_int_sort
> ------------------------------------------------------------------------------------------------------------
>  Time taken by list_sort: 118168506 ns, Time taken by list_int_sort: 80523373 ns, Percentage difference: 31.86%
> (1 row)
>
> What do you think about these changes?
>
> Best regards, Stepan Neretin.
>
> On Fri, Jun 7, 2024 at 11:08 PM Andrey M. Borodin 
> wrote:
>
>> Hi!
>>
>> In a thread about sorting comparators[0] Andres noted that we have
>> infrastructure to help the compiler optimize sorting. PFA attached PoC
>> implementation. I've checked that it indeed works on the benchmark from
>> that thread.
>>
>> postgres=# CREATE TABLE arrays_to_sort AS
>>SELECT array_shuffle(a) arr
>>FROM
>>(SELECT ARRAY(SELECT generate_series(1, 100)) a),
>>generate_series(1, 10);
>>
>> postgres=# SELECT (sort(arr))[1] FROM arrays_to_sort; -- original
>> Time: 990.199 ms
>> postgres=# SELECT (sort(arr))[1] FROM arrays_to_sort; -- patched
>> Time: 696.156 ms
>>
>> The benefit seems to be of the same order of magnitude, with a roughly 30% speedup.
>>
>> There's plenty of sorting by TransactionId, BlockNumber, OffsetNumber,
>> Oid etc. But these sorting routines never show up in perf top or something
>> like that.
>>
>> Seems like in most cases we do not spend much time in sorting. But
>> specialization does not cost us much either, only some CPU cycles of the
>> compiler. I think we can further improve the speedup by converting the inline
>> comparator to a value extractor: more compilers will see what is actually
>> going on. But I have no proof of this reasoning.
>>
>> What do you think?
>>
>>
>> Best regards, Andrey Borodin.
>>
>> [0]
>> https://www.postgresql.org/message-id/flat/20240209184014.sobshkcsfjix6u4r%40awork3.anarazel.de#fc23df2cf314bef35095b632380b4a59
>
>
Hello all.

I have decided to explore more areas in which I can optimize and have added
two new benchmarks. Do you have any thoughts on this?

postgres=# select bench_int16_sort(100);
                                           bench_int16_sort
---------------------------------------------------------------------------------------------------------
 Time taken by usual sort: 66354981 ns, Time taken by optimized sort: 52151523 ns, Percentage difference: 21.41%
(1 row)

postgres=# select bench_float8_sort(100);
                                           bench_float8_sort
----------------------------------------------------------------------------------------------------------
 Time taken by usual sort: 121475231 ns, Time taken by optimized sort: 74458545 ns, Percentage difference: 38.70%
(1 row)

postgres=#

Best regards, Stepan Neretin.
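
For readers who have not looked at the sorting template before, a minimal
specialization built on src/include/lib/sort_template.h looks roughly like
this; the following is illustrative only and not necessarily how the attached
patch defines sort_int_16_arr:

/* Illustrative sketch, not taken from the attached patch. */
#define ST_SORT sort_int_16_arr
#define ST_ELEMENT_TYPE int16
#define ST_COMPARE(a, b) (*(a) - *(b))	/* int16 promotes to int, no overflow */
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

/* ...after which callers simply do: sort_int_16_arr(array, nelems); */

The point of the template is that the comparator is known at compile time, so
the compiler can inline it instead of calling through qsort's function pointer.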
From a3c03ee1b4f6d94d6bf2dc8700c30349271ef9b3 Mon Sep 17 00:00:00 2001
From: Stepan Neretin 
Date: Tue, 11 Jun 2024 12:02:48 +0700
Subject: [PATCH v42 08/12] Refactor sorting of attribute numbers in
 pg_publication.c and statscmds.c

This commit refactors the sorting of attribute numbers in two modules:
pg_publication.c and statscmds.c. Instead of using the qsort function
with a custom compare function, it utilizes the recently introduced
sort_int_16_arr function, which offers enhanced performance for sorting
int16 arrays.

  Details:
- Replaced qsort calls with sort_int_16_arr for sorting attribute numbers.
- Improved efficiency by utilizing the sorting template for int16 arrays.

This change ensures consistency in sorting methods and enhances sorting
performance in relevant modules.
---
 src/backend/catalog/pg_publication.c | 2 +-
 src/backend/commands/statscmds.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 8518582b76..196f696c1e 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -540,7 +540,7 @@ publication_translate_columns(Relation targetrel, List *columns,
 	}
 
 	/* Be tidy, so that the catalog representation is always sorted */
-	qsort(attarray, n, sizeof(AttrNumber), compare_int16);
+	sort_int_16_arr(attarray, n);
 
 	*natts = n;
 	*attrs = attarray;

Re: relfilenode statistics

2024-06-10 Thread Kyotaro Horiguchi
At Mon, 10 Jun 2024 08:09:56 +, Bertrand Drouvot 
 wrote in 
> Hi,
> 
> On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote:
> > On Thu, Jun 6, 2024 at 11:17 PM Andres Freund  wrote:
> > > If we just want to keep prior stats upon a relation rewrite, we can just 
> > > copy
> > > the stats from the old relfilenode.  Or we can decide that those stats 
> > > don't
> > > really make sense anymore, and start from scratch.
> > 
> > I think we need to think carefully about what we want the user
> > experience to be here. "Per-relfilenode stats" could mean "sometimes I
> > don't know the relation OID so I want to use the relfilenumber
> > instead, without changing the user experience" or it could mean "some
> > of these stats actually properly pertain to the relfilenode rather
> > than the relation so I want to associate them with the right object
> > and that will affect how the user sees things." We need to decide
> > which it is. If it's the former, then we need to examine whether the
> > goal of hiding the distinction between relfilenode stats and relation
> > stats from the user is in fact feasible. If it's the latter, then we
> > need to make sure the whole patch reflects that design, which would
> > include e.g. NOT copying stats from the old to the new relfilenode,
> > and which would also include documenting the behavior in a way that
> > will be understandable to users.
> 
> Thanks for sharing your thoughts!
> 
> Let's take the current heap_blks_read as an example: it currently survives
> a relation rewrite and I guess we don't want to change the existing user
> experience for it.
> 
> Now say we want to add "heap_blks_written" (like in this POC patch); then I
> think it makes sense for the user 1) to query this new stat from the same place
> as the existing heap_blks_read (pg_statio_all_tables) and 2) to have the same
> experience as far as the relation rewrite is concerned (keep the previous stats).
> 
> To achieve the rewrite behavior we could:
> 
> 1) copy the stats from the OLD relfilenode to the relation (like in the POC 
> patch)
> 2) copy the stats from the OLD relfilenode to the NEW one (could be in a 
> dedicated
> field)
> 
> The PROs of 1) are that the behavior is consistent with the current
> heap_blks_read and that the user could still see the current relfilenode
> stats (through a new API) if he wants to.
> 
> > In my experience, the worst thing you can do in cases like this is be
> > somewhere in the middle. Then you tend to end up with stuff like: the
> > difference isn't supposed to be something that the user knows or cares
> > about, except that they do have to know and care because you haven't
> > thoroughly covered up the deception, and often they have to reverse
> > engineer the behavior because you didn't document what was really
> > happening because you imagined that they wouldn't notice.
> 
> My idea was to move all that is in pg_statio_all_tables to relfilenode stats
> and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) 
> ensure
> the user can still retrieve the stats from pg_statio_all_tables in such a way
> that it survives a rewrite, 3) provide dedicated APIs to retrieve
> relfilenode stats but only for the current relfilenode, 4) document this
> behavior. This is what the POC patch is doing for heap_blks_written (would
> need to do the same for heap_blks_read and friends) except for the 
> documentation
> part. What do you think?

In my opinion, it is certainly strange that bufmgr is aware of
relation kinds, but introducing relfilenode stats to avoid this skew
doesn't seem to be the best way, as it invites inconclusive arguments
like the one raised above. The fact that we transfer counters from old
relfilenodes to new ones indicates that we are not really interested
in counts by relfilenode. If that's the case, wouldn't it be simpler
to call pgstat_count_relation_buffer_read() from bufmgr.c and then
branch according to relkind within that function? If you're concerned
about the additional branch, some ingenuity may be needed.
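
A rough sketch of that idea follows; pgstat_count_relation_buffer_read() is
the name suggested above and does not exist in the tree, and the two counters
it dispatches to are equally hypothetical:

/*
 * Sketch only: bufmgr.c stays ignorant of relation kinds and calls a single
 * counting function; the relkind branch lives in pgstat code instead.
 */
static void
pgstat_count_relation_buffer_read(Relation rel)
{
	if (rel->rd_rel->relkind == RELKIND_INDEX)
		pgstat_count_index_blks_read(rel);	/* hypothetical counter */
	else
		pgstat_count_heap_blks_read(rel);	/* hypothetical counter */
}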

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 08:12:46PM +, Imseih (AWS), Sami wrote:
> >> This sounds like useful information to me.
> 
> > Thanks for looking at it!
> 
> The VacuumDelay wait event is the only visibility currently available for
> gauging the cost delay. Having this information advertised by
> pg_stat_progress_vacuum, as is being proposed, is much better.

Thanks for looking at it!

> However, I also think that the "number of times" the vacuum went into delay
> will be needed as well. Both values will be useful for tuning cost_delay and
> cost_limit.

Yeah, I think that's a good idea. With v1 one could figure out how many times
the delay has been triggered, but that does not work anymore if: 1) cost_delay
changed during the vacuum run or 2) the patch changes the way time_delayed is
measured/reported (i.e., reports the actual wait time rather than the
theoretical time that v1 reports).
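
For illustration, with v1's theoretical accounting and a cost_delay that never
changes during the run, one could approximate the count like this. It is only
a sketch: it assumes the proposed time_delayed column is reported in
milliseconds and that vacuum_cost_delay stayed at 2ms for the whole run.

-- 2.0 stands for a vacuum_cost_delay of 2ms held constant; both that and the
-- unit of time_delayed are assumptions, and cases 1) and 2) above are exactly
-- where this estimate falls apart.
SELECT pid, relid::regclass, time_delayed,
       round(time_delayed / 2.0) AS approx_number_of_delays
FROM pg_stat_progress_vacuum;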

> 
> It may also make sense to accumulate the total_time in delay
> and the number of times delayed in a cumulative statistics [0]
> view to allow a user to trend this information overtime.
> I don't think this info fits in any of the existing views, i.e.
> pg_stat_database, so maybe a new view for cumulative
> vacuum stats may be needed. This is likely a separate
> discussion, but calling it out here.

+1

> >> IIUC you'd need to get information from both pg_stat_progress_vacuum and
> >> pg_stat_activity in order to know what percentage of time was being spent
> >> in cost delay.  Is that how you'd expect for this to be used in practice?
> 
> > Yeah, one could use a query such as:
> 
> > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p 
> > JOIN pg_stat_activity a using (pid)
> 
> Maybe all progress views should just expose
> "beentry->st_activity_start_timestamp"
> to let the user know when the current operation began.

Yeah, maybe. I think this is likely a separate discussion too; thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve the granularity of PQsocketPoll's timeout parameter?

2024-06-10 Thread Dominique Devienne
On Mon, Jun 10, 2024 at 11:39 PM Tom Lane  wrote:
> The next question is how to spell "int64" in libpq-fe.h.

Hi. Out of curiosity, I grep'd for it in my 16.1 libpq:

[ddevienne@marsu include]$ grep 'long long' *.h
ecpg_config.h:/* Define to 1 if the system has the type `long long int'. */
ecpg_config.h:/* Define to 1 if `long long int' works and is 64 bits. */
pg_config.h:/* The normal alignment of `long long int', in bytes. */
pg_config.h:/* Define to 1 if `long long int' works and is 64 bits. */
pgtypes_interval.h:typedef long long int int64;

And the relevant snippet of pgtypes_interval.h is:

#ifdef HAVE_LONG_INT_64
#ifndef HAVE_INT64
typedef long int int64;
#endif
#elif defined(HAVE_LONG_LONG_INT_64)
#ifndef HAVE_INT64
typedef long long int int64;
#endif
#else
/* neither HAVE_LONG_INT_64 nor HAVE_LONG_LONG_INT_64 */
#error must have a working 64-bit integer datatype
#endif

Given this precedent, can't the same be done?

And if a 64-bit integer is too troublesome, why not just two 32-bit
parameters instead?
Either a (time_t + int usec), microsecond offset, clamped to [0, 1M),
or (int sec + int usec)?

I'm fine with any portable solution that allows sub-second timeouts, TBH.
Just thinking aloud here. Thanks, --DD
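
For concreteness, the two-parameter variant floated above might be declared
roughly as follows; this is purely illustrative and not an existing or
proposed libpq declaration:

/* Same shape as today's PQsocketPoll(), but with the timeout split into
 * whole seconds plus a microsecond offset clamped to [0, 1000000). */
extern int PQsocketPoll(int sock, int forRead, int forWrite,
                        time_t end_time_sec, int end_time_usec);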