Re: Synchronizing slots from primary to standby

2024-02-22 Thread Bertrand Drouvot
Hi,

On Fri, Feb 23, 2024 at 09:43:48AM +0530, shveta malik wrote:
> On Fri, Feb 23, 2024 at 8:35 AM shveta malik  wrote:
> >
> > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot
> >  wrote:
> > >
> > > Suppose that in synchronize_slots() the query would be:
> > >
> > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
> > > " restart_lsn, catalog_xmin, two_phase, failover,"
> > > " database, conflict_reason"
> > > " FROM pg_catalog.pg_replication_slots"
> > > " WHERE failover and NOT temporary and 1 = 1";
> > >
> > > Then my comment is to rewrite it to:
> > >
> > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
> > > " restart_lsn, catalog_xmin, two_phase, failover,"
> > > " database, conflict_reason"
> > > " FROM pg_catalog.pg_replication_slots"
> > > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 
> > > 1";
> > >
> > > to ensure the operator "=" is coming from the pg_catalog schema.
> > >
> >
> > Thanks for the details, but slot-sync does not use SPI calls, it uses
> > libpqrcv calls. So is this change needed?
> 
> Additionally, I would like to have a better understanding of why it's
> necessary and whether it addresses any potential security risks.

Because one could create, say, an "=" operator in their own schema, attach a
function to it that does undesired stuff, and change the search_path for the
database the sync slot worker connects to.

Then this new "=" operator would be used (instead of the pg_catalog.= one),
triggering the "undesired" function as superuser.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-02-22 Thread Bertrand Drouvot
Hi,

On Fri, Feb 23, 2024 at 08:35:44AM +0530, shveta malik wrote:
> On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot
>  wrote:
> >
> > Suppose that in synchronize_slots() the query would be:
> >
> > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
> > " restart_lsn, catalog_xmin, two_phase, failover,"
> > " database, conflict_reason"
> > " FROM pg_catalog.pg_replication_slots"
> > " WHERE failover and NOT temporary and 1 = 1";
> >
> > Then my comment is to rewrite it to:
> >
> > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
> > " restart_lsn, catalog_xmin, two_phase, failover,"
> > " database, conflict_reason"
> > " FROM pg_catalog.pg_replication_slots"
> > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1";
> >
> > to ensure the operator "=" is coming from the pg_catalog schema.
> >
> 
> Thanks for the details, but slot-sync does not use SPI calls, it uses
> libpqrcv calls.

Sorry for the confusion, I meant to say "remote SQL calls".

> So is this change needed?

The example I provided is a "fake" one (as currently the "=" operator is not
used in the const char *query in synchronize_slots()). So there is currently
nothing to change here. I just want to highlight that if we are using (or will
use) operators in the remote SQL calls then we should ensure they are coming
from the pg_catalog schema (as in the example provided above).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-22 Thread Dilip Kumar
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera  wrote:
>
> On 2024-Feb-07, Dilip Kumar wrote:
>
> > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera  
> > wrote:
>
> > > Sure, but is that really what we want?
> >
> > So your question is do we want these buffers to be in multiple of
> > SLRU_BANK_SIZE?  Maybe we can have the last bank to be partial, I
> > don't think it should create any problem logically.  I mean we can
> > look again in the patch to see if we have made any such assumptions
> > but that should be fairly easy to fix, then maybe if we are going in
> > this way we should get rid of the check_slru_buffers() function as
> > well.
>
> Not really, I just don't think the macro should be in slru.h.

Okay

> Another thing I've been thinking is that perhaps it would be useful to
> make the banks smaller, when the total number of buffers is small.  For
> example, if you have 16 or 32 buffers, it's not really clear to me that
> it makes sense to have just 1 bank or 2 banks.  It might be more
> sensible to have 4 banks with 4 or 8 buffers instead.  That should make
> the algorithm scale down as well as up ...

It might be helpful to have small-size banks when SLRU buffers are set
to a very low value and we are only accessing a couple of pages at a
time (i.e. no buffer replacement), because in such cases most of the
contention will be on the SLRU bank lock. Although I am not sure how
practical such a use case would be: if someone is using multi-xact very
heavily or creating frequent subtransaction overflow, wouldn't they set
this buffer limit to some big enough value anyway? Doing this would also
cost us some of the patch's simplicity: instead of using the simple
SLRU_BANK_SIZE macro we would need to compute the bank size and store it
in SlruShared. Maybe that's not that bad.
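
For illustration only, here is a rough sketch of what that computation could
look like. Nothing below is from the patch: choose_bank_size() is a made-up
name and the SLRU_BANK_SIZE value is assumed.

#define SLRU_BANK_SIZE 16		/* assumed value, for illustration only */

/*
 * Hypothetical sketch: shrink the bank size when the total number of
 * buffers is small, so that we still end up with a handful of banks
 * (and bank locks) instead of just one or two.  The result would have
 * to be stored in SlruShared rather than come from a compile-time macro.
 */
static int
choose_bank_size(int nbuffers)
{
	int			banksize = SLRU_BANK_SIZE;

	while (banksize > 4 && nbuffers / banksize < 4)
		banksize /= 2;

	return banksize;
}

With 16 or 32 configured buffers this would give 4 banks of 4 or 8 buffers,
matching the example upthread, while larger settings keep the full bank size.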

>
> I haven't done either of those things in the attached v19 version.  I
> did go over the comments once again and rewrote the parts I was unhappy
> with, including some existing ones.  I think it's OK now from that point
> of view ... at some point I thought about creating a separate README,
> but in the end I thought it not necessary.

Thanks, I will review those changes.

> I did add a bunch of Assert()s to make sure the locks that are supposed
> to be held are actually held.  This led me to testing the page status to
> be not EMPTY during SimpleLruWriteAll() before calling
> SlruInternalWritePage(), because the assert was firing.  The previous
> code is not really *buggy*, but to me it's weird to call WritePage() on
> a slot with no contents.

Okay. I mean, internally SlruInternalWritePage() will flush only if
the status is SLRU_PAGE_VALID, but it is better the way you have done it.

> Another change was in TransactionGroupUpdateXidStatus: the original code
> had the leader doing pg_atomic_read_u32(&procglobal->clogGroupFirst) to
> know which bank to lock.  I changed it to simply be the page used by the
> leader process; this doesn't need an atomic read, and should be the same
> page anyway.  (If it isn't, it's no big deal).  But what's more: even if
> we do read ->clogGroupFirst at that point, there's no guarantee that
> this is going to be exactly for the same process that ends up being the
> first in the list, because since we have not set it to INVALID by the
> time we grab the bank lock, it is quite possible for more processes to
> add themselves to the list.

Yeah, this looks better

> I realized all this while rewriting the comments in a way that would let
> me understand what was going on ... so IMO the effort was worthwhile.

+1

I will review and do some more testing early next week and share my feedback.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-22 Thread Amit Kapila
On Wed, Feb 21, 2024 at 8:34 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 20 Feb 2024 19:56:10 +0530, Robert Haas  wrote 
> in
> > It seems like maybe somebody should look into why this is happening,
> > and perhaps fix it.
>
> GetConnection()@streamutil.c wants to ensure conninfo has a fallback
> database name ("replication"). However, the function seems to be
> ignoring the case where neither dbname nor connection string is given,
> which is the first case Kuroda-san raised. The second case is the
> intended behavior of the function.
>
> >   /* pg_recvlogical uses dbname only; others use connection_string 
> > only. */
> >   Assert(dbname == NULL || connection_string == NULL);
>
> And the function incorrectly assumes that the connection string
> requires "dbname=replication".
>
> >* Merge the connection info inputs given in form of connection 
> > string,
> >* options and default values (dbname=replication, replication=true, 
> > etc.)
>
> But the name is a pseudo database name only used by pg_hba.conf
> (maybe) , which cannot be used as an actual database name (without
> quotation marks, or unless it is actually created). The function
> should not add the fallback database name because the connection
> string for physical replication connection doesn't require the dbname
> parameter. (attached patch)
>

We do append dbname=replication even in libpqrcv_connect for .pgpass
lookup as mentioned in comments. See below:
libpqrcv_connect()
{

else
{
/*
* The database name is ignored by the server in replication mode,
* but specify "replication" for .pgpass lookup.
*/
keys[++i] = "dbname";
vals[i] = "replication";
}
...
}

I think as part of this effort we should just add dbname to
primary_conninfo written in postgresql.auto.conf file. As above, the
question is valid whether we should do it just for 17 or for prior
versions. Let's discuss more on that. I am not sure of the use case
for versions before 17 but commit cca97ce6a665 mentioned that some
middleware or proxies might however need to know the dbname to make
the correct routing decision for the connection. Does that apply here
as well? If so, we can do it and update the docs, otherwise, let's do
it just for 17.

-- 
With Regards,
Amit Kapila.




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-22 Thread Amit Kapila
On Wed, Feb 21, 2024 at 7:46 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > Just FYI - here is an extreme case. And note that I have applied proposed 
> > > patch.
> > >
> > > When `pg_basebackup -D data_N2 -R` is used:
> > > ```
> > > primary_conninfo = 'user=hayato ... dbname=hayato ...
> > > ```
> > >
> > > But when `pg_basebackup -d "" -D data_N2 -R` is used:
> > > ```
> > > primary_conninfo = 'user=hayato ... dbname=replication
> > > ```
> >
> > It seems like maybe somebody should look into why this is happening,
> > and perhaps fix it.
>
> I think this caused from below part [1] in GetConnection().
>
> If both dbname and connection_string are the NULL, we will enter the else part
> and NULL would be substituted - {"dbnmae", NULL} key-value pair is generated
> only here.
>
> Then, in PQconnectdbParams()->PQconnectStartParams->pqConnectOptions2(),
> the strange part would be found and replaced to the username [2].
>
> I think if both the connection string and the dbname are NULL, it should be
> considered as the physical replication connection. here is a patch to fix it.
>

When dbname is NULL or not given, it defaults to the username. This
follows the connection string spec; see "dbname: The database name.
Defaults to be the same as the user name..." at [1].
Your patch breaks that spec, so I don't think it is correct.

[1] - https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-CONNSTRING
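
For what it's worth, the defaulting is easy to observe from libpq directly.
A minimal sketch (the host and user values are made up; as long as the
conninfo string parses, PQdb() reports the resolved database name even if
the connection itself fails):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* illustrative conninfo with no dbname: libpq defaults it to the user */
	PGconn	   *conn = PQconnectdb("host=localhost user=hayato");

	printf("dbname = %s, user = %s\n", PQdb(conn), PQuser(conn));

	PQfinish(conn);
	return 0;
}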

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-02-22 Thread shveta malik
On Fri, Feb 23, 2024 at 10:06 AM Zhijie Hou (Fujitsu)
 wrote:
>
>
> I noticed one CFbot failure[1] which is because the tap-test doesn't wait for 
> the
> standby to catch up before promoting, thus the data inserted after promotion
> could not be replicated to the subscriber. Add a wait_for_replay_catchup to 
> fix it.
>
> Apart from this, I also adjusted some variable names in the tap-test to be
> consistent. And added back a mis-removed ProcessConfigFile call.
>
> [1] https://cirrus-ci.com/task/6126787437002752?logs=check_world#L312
>

Thanks for the patches. Had a quick look at v95_2, here are some
trivial comments:


slot.h:
-
1)
extern List *GetStandbySlotList(bool copy);
extern void WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn);
extern void FilterStandbySlots(XLogRecPtr wait_for_lsn,
     List **standby_slots);

The order is different from the one in slot.c

slot.c:
-
2)
warningfmt = _("replication slot \"%s\" specified in parameter \"%s\"
does not exist, ignoring");

GUC names should not have double quotes. Same in each warningfmt in
this function


3)
errmsg("replication slot \"%s\" specified in parameter \"%s\" does not
have active_pid",

Same here, double quotes around standby_slot_names should be removed


 walsender.c:
--
4)
* Used by logical decoding SQL functions that acquired slot with failover
* enabled.

To be consistent with other such comments in previous patches:
slot with failover enabled --> failover enabled slot

5) Wake up the logical walsender processes with failover-enabled slots

failover-enabled slots  --> failover enabled slots

postgresql.conf.sample:
--
6)
streaming replication standby server slot names that logical walsender
processes will wait for

Is it better to say it like this? (I leave this to your preference)

streaming replication standby server slot names for which logical
walsender processes will wait.

thanks
Shveta




Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Ashutosh Bapat
On Thu, Feb 22, 2024 at 8:35 PM Tom Lane  wrote:
>
> Peter Eisentraut  writes:
> > The problem is, we don't really have any end-to-end coverage of
>
> > dump
> > restore
> > dump again
> > compare the two dumps
>
> > with a database with lots of interesting objects in it.
>
> I'm very much against adding another full run of the core regression
> tests to support this.

This will be taken care of by Peter's latest idea of augmenting
existing 002_pg_upgrade.pl.

> But beyond the problem of not bloating the
> check-world test runtime, there is the question of what this would
> actually buy us.  I doubt that it is worth very much, because
> it would not detect bugs-of-omission in pg_dump.  As I remarked in
> the other thread, if pg_dump is blind to the existence of some
> feature or field, testing that the dumps compare equal will fail
> to reveal that it didn't restore that property.
>
> I'm not sure what we could do about that.  One could imagine writing
> some test infrastructure that dumps out the contents of the system
> catalogs directly, and comparing that instead of pg_dump output.
> But that'd be a lot of infrastructure to write and maintain ...
> and it's not real clear why it wouldn't *also* suffer from
> I-forgot-to-add-this hazards.

If a developer forgets to add logic to dump objects that their patch
adds, it's hard to detect it through testing alone in every possible
case. We need reviewers to take care of that. I don't think that's the
objective of this test case or of the pg_upgrade test either.

>
> On balance, I think there are good reasons that we've not added
> such a test, and I don't believe those tradeoffs have changed.
>

I am not aware of those reasons. Are they documented somewhere? Any
pointers to the previous discussion on this topic? Googling "pg_dump
regression pgsql-hackers" returns threads about performance
regressions.

On the flip side, the test I wrote reproduces the COMPRESSION/STORAGE
bug you reported, along with a few other bugs in that area which I will
report soon on that thread. I think that shows that we need such a
test.

-- 
Best Wishes,
Ashutosh Bapat




Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Ashutosh Bapat
On Thu, Feb 22, 2024 at 3:50 PM Peter Eisentraut  wrote:
>
> On 22.02.24 11:00, Daniel Gustafsson wrote:
> >> On 22 Feb 2024, at 10:55, Ashutosh Bapat  
> >> wrote:
> >> On Thu, Feb 22, 2024 at 3:03 PM Daniel Gustafsson  wrote:
> >
> >> Somebody looking for dump/restore tests wouldn't search
> >> src/bin/pg_upgrade, I think.
> >
> > Quite possibly not, but pg_upgrade is already today an important testsuite 
> > for
> > testing pg_dump in binary-upgrade mode so maybe more developers touching
> > pg_dump should?
>
> Yeah, I think attaching this to the existing pg_upgrade test would be a
> good idea.  Not only would it save test run time, it would probably also
> reduce code duplication.
>

That's more than one vote for adding the test to 002_pg_upgrade.pl.
Seems fine to me.

-- 
Best Wishes,
Ashutosh Bapat




Re: RFC: Logging plan of the running query

2024-02-22 Thread Robert Haas
On Thu, Feb 22, 2024 at 6:25 AM James Coleman  wrote:
> This is potentially a bit of a wild idea, but I wonder if having some
> kind of argument to CHECK_FOR_INTERRUPTS() signifying we're in
> "normal" as opposed to "critical" (using that word differently than
> the existing critical sections) would be worth it.

It's worth considering, but the definition of "normal" vs. "critical"
might be hard to pin down. Or, we might end up with a definition that
is specific to this particular case and not generalizable to others.
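
For what it's worth, a purely hypothetical sketch of what the idea could look
like, only to make it concrete. None of these names exist today except
InterruptPending; ProcessInterruptsAt() and the enum are made up and this is
not a proposal for an actual API:

#include <signal.h>

typedef enum InterruptLevel
{
	INTERRUPT_NORMAL,			/* extra work such as logging a plan is OK */
	INTERRUPT_CRITICAL			/* only honor cancel/terminate requests */
} InterruptLevel;

extern volatile sig_atomic_t InterruptPending;	/* exists today */
extern void ProcessInterruptsAt(InterruptLevel level);	/* hypothetical */

#define CHECK_FOR_INTERRUPTS_AT(level) \
	do { \
		if (InterruptPending) \
			ProcessInterruptsAt(level); \
	} while (0)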

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-22 Thread Ajin Cherian
On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий 
wrote:

> Dear All,
>
> I'd like to present and talk about a problem when 2PC transactions are
> applied quite slowly on a replica during logical replication. There is a
> master and a replica with established logical replication from the master
> to the replica with twophase = true. With some load level on the master,
> the replica starts to lag behind the master, and the lag will be
> increasing. We have to significantly decrease the load on the master to
> allow replica to complete the catchup. Such problem may create significant
> difficulties in the production. The problem appears at least on
> REL_16_STABLE branch.
>
> To reproduce the problem:
>
>- Setup logical replication from master to replica with subscription
>parameter twophase =  true.
>- Create some intermediate load on the master (use pgbench with custom
>sql with prepare+commit)
>- Optionally switch off the replica for some time (keep load on
>master).
>- Switch on the replica and wait until it reaches the master.
>
> The replica will never reach the master with even some low load on the
> master. If to remove the load, the replica will reach the master for much
> greater time, than expected. I tried the same for regular transactions, but
> such problem doesn't appear even with a decent load.
>
>
>
I tried this setup and I do see that the logical subscriber does reach the
master in a short time. I'm not sure what I'm missing. I stopped the
logical subscriber in between while pgbench was running and then started it
again and ran the following:
postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
 sent_lsn  | pg_current_wal_lsn
-----------+--------------------
 0/6793FA0 | 0/6793FA0 <=== caught up
(1 row)

My pgbench command:
pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5

my custom sql file:
cat test.sql
SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;

regards,
Ajin Cherian
Fujitsu Australia


Re: About a recently-added message

2024-02-22 Thread Peter Smith
On Thu, Feb 22, 2024 at 11:36 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 21 Feb 2024 14:57:42 +0530, Amit Kapila  
> wrote in
> > On Tue, Feb 20, 2024 at 3:21 PM shveta malik  wrote:
> > >
> > > okay, attached v2 patch with changed error msgs and double quotes
> > > around logical.
> > >
> >
> > Horiguchi-San, does this address all your concerns related to
> > translation with these new messages?
>
> Yes, I'm happy with all of the changes.  The proposed patch appears to
> cover all instances related to slotsync.c, and it looks fine to
> me. Thanks!
>
> I found that logica.c is also using the policy that I complained
> about, but it is a separate issue.
>
> ./logical.c 122:errmsg("logical decoding requires wal_level >= 
> logical")));
>

Hmm. I have a currently stalled patch set to simplify the quoting of
all the GUC names by using one rule. The consensus is to *always*
quote them. See [1]. And those patches are already addressing that
logical.c code mentioned above.

IMO it would be good if we could try to get that patch set approved to
fix everything in one go, instead of ad-hoc hunting for and fixing
cases one at a time.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPsf3NewXbsFKY88Qn1ON1_dMD6343MuWdMiiM2Ds9a_wA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: speed up a logical replica setup

2024-02-22 Thread Amit Kapila
On Fri, Feb 23, 2024 at 8:16 AM Euler Taveira  wrote:
>
> On Wed, Feb 21, 2024, at 5:00 AM, Shlok Kyal wrote:
>
> I found some issues and fixed those issues with top up patches
> v23-0012 and v23-0013
> 1.
> Suppose there is a cascade physical replication node1->node2->node3.
> Now if we run pg_createsubscriber with node1 as primary and node2 as
> standby, pg_createsubscriber will be successful but the connection
> between node2 and node3 will not be retained and log og node3 will
> give error:
> 2024-02-20 12:32:12.340 IST [277664] FATAL:  database system
> identifier differs between the primary and standby
> 2024-02-20 12:32:12.340 IST [277664] DETAIL:  The primary's identifier
> is 7337575856950914038, the standby's identifier is
> 7337575783125171076.
> 2024-02-20 12:32:12.341 IST [277491] LOG:  waiting for WAL to become
> available at 0/3000F10
>
> To fix this I am avoiding pg_createsubscriber to run if the standby
> node is primary to any other server.
> Made the change in v23-0012 patch
>
>
> IIRC we already discussed the cascading replication scenario. Of course,
> breaking a node is not good; that's why you proposed v23-0012. However,
> preventing pg_createsubscriber from running if there are standbys attached to it is
> also annoying. If you don't have access to these hosts you need to (a) kill
> the walsender (very fragile / unstable), (b) start with max_wal_senders = 0, or (c)
> add a firewall rule so that these hosts cannot establish a connection
> to the target server. I wouldn't like to include the patch as-is. IMO we need
> at least one message explaining the situation to the user, I mean, add a hint
> message.
>

Yeah, it could be a bit tricky for users to ensure that no follow-on
standby is present, but I think it is still better to give an error and
prohibit running pg_createsubscriber than to break the existing
replication. The possible solution, in this case, is to allow running
pg_basebackup via this tool or otherwise, and then let the user use it
to convert to a subscriber. It would be good to keep things simple for
the first version; then we can add options like --force in
subsequent versions.

-- 
With Regards,
Amit Kapila.




Re: LogwrtResult contended spinlock

2024-02-22 Thread Robert Haas
On Fri, Feb 23, 2024 at 1:18 AM Jeff Davis  wrote:
> I don't see the global non-shared variable as a huge problem, so if it
> serves a purpose then I'm fine keeping it. Perhaps we could make it a
> bit safer by using some wrapper functions.

I actually really hate these kinds of variables. I think they likely
mask various bugs (where the value might not be up to date where we
think it is) and anti-bugs (where we actually rely on the value being
out of date but we don't know that we do because the code is so
opaque). My vintage 2021 adventures in getting rid of the global
variable ThisTimeLineID and a few other global variables found some
actual but minor bugs, also found a bunch of confused code that didn't
really do what it thought it did, and took up a huge amount of my time
trying to analyze what was happening. I'm not prepared to recommend
what we should do in this particular case. I would like to believe
that the local copies can be eliminated somehow, but I haven't studied
the code or done benchmarking so I don't really know enough to guess
what sort of code changes would or would not be good enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Synchronizing slots from primary to standby

2024-02-22 Thread Zhijie Hou (Fujitsu)
On Friday, February 23, 2024 10:18 AM Zhijie Hou (Fujitsu) 
 wrote:
> >
> > Hi,
> >
> > Since the slotsync worker patch has been committed, I rebased the
> > remaining patches.
> > And here is the V95 patch set.
> >
> > Also, I fixed a bug in the current 0001 patch where the member of the
> > standby slot names list pointed to the freed memory after calling
> ProcessConfigFile().
> > Now, we will obtain a new list when we call ProcessConfigFile(). The
> > optimization to only get the new list when the names actually change
> > has been removed. I think this change is acceptable because
> > ProcessConfigFile is not a frequent occurrence.
> >
> > Additionally, I reordered the tests in
> > 040_standby_failover_slots_sync.pl. Now the new test will be conducted
> > after the sync slot test to prevent the risk of the logical slot
> > occasionally not catching up to the latest catalog_xmin and, as a result, 
> > not
> being able to be synced immediately.
> 
> There is one unexpected change in the previous version, sorry for that.
> Here is the correct version.

I noticed one CFbot failure[1] which is because the tap-test doesn't wait for 
the
standby to catch up before promoting, thus the data inserted after promotion
could not be replicated to the subscriber. Add a wait_for_replay_catchup to fix 
it.

Apart from this, I also adjusted some variable names in the tap-test to be
consistent. And added back a mis-removed ProcessConfigFile call.

[1] https://cirrus-ci.com/task/6126787437002752?logs=check_world#L312

Best Regards,
Hou zj


v96-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch
Description:  v96-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch


v96-0002-Document-the-steps-to-check-if-the-standby-is-re.patch
Description:  v96-0002-Document-the-steps-to-check-if-the-standby-is-re.patch


Re: Synchronizing slots from primary to standby

2024-02-22 Thread shveta malik
On Fri, Feb 23, 2024 at 8:35 AM shveta malik  wrote:
>
> On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot
>  wrote:
> >
> > Suppose that in synchronize_slots() the query would be:
> >
> > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
> > " restart_lsn, catalog_xmin, two_phase, failover,"
> > " database, conflict_reason"
> > " FROM pg_catalog.pg_replication_slots"
> > " WHERE failover and NOT temporary and 1 = 1";
> >
> > Then my comment is to rewrite it to:
> >
> > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
> > " restart_lsn, catalog_xmin, two_phase, failover,"
> > " database, conflict_reason"
> > " FROM pg_catalog.pg_replication_slots"
> > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1";
> >
> > to ensure the operator "=" is coming from the pg_catalog schema.
> >
>
> Thanks for the details, but slot-sync does not use SPI calls, it uses
> libpqrcv calls. So is this change needed?

Additionally, I would like to have a better understanding of why it's
necessary and whether it addresses any potential security risks.

thanks
Shveta




Re: Why is subscription/t/031_column_list.pl failing so much?

2024-02-22 Thread Amit Kapila
On Mon, Feb 19, 2024 at 2:19 PM Amit Kapila  wrote:
>
> On Thu, Feb 15, 2024 at 10:46 AM vignesh C  wrote:
> >
> > On Thu, 15 Feb 2024 at 08:42, Amit Kapila  wrote:
> > >
> >
> > > To resolve the BF
> > > failure, I still feel, we should just recreate the subscription. This
> > > is a pre-existing problem and we can track it via a separate patch
> > > with a test case targetting such failures.
> >
> > +1 for going with recreation of the subscription, the changes for this
> > are available at [1].
> > [1] - 
> > https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com
> >
>
> Tom, and others, does anyone still object to going ahead with an idea
> by just changing the test to recreate the subscription to silence BF
> failures for this test?
>

Seeing no objections, I have pushed the required test changes to silence the BF.

-- 
With Regards,
Amit Kapila.




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-22 Thread Amit Kapila
On Thu, Feb 22, 2024 at 6:59 PM Давыдов Виталий
 wrote:
>
> I'd like to present and talk about a problem when 2PC transactions are 
> applied quite slowly on a replica during logical replication. There is a 
> master and a replica with established logical replication from the master to 
> the replica with twophase = true. With some load level on the master, the 
> replica starts to lag behind the master, and the lag will be increasing. We 
> have to significantly decrease the load on the master to allow replica to 
> complete the catchup. Such problem may create significant difficulties in the 
> production. The problem appears at least on REL_16_STABLE branch.
>
> To reproduce the problem:
>
> Setup logical replication from master to replica with subscription parameter 
> twophase =  true.
> Create some intermediate load on the master (use pgbench with custom sql with 
> prepare+commit)
> Optionally switch off the replica for some time (keep load on master).
> Switch on the replica and wait until it reaches the master.
>
> The replica will never reach the master with even some low load on the 
> master. If to remove the load, the replica will reach the master for much 
> greater time, than expected. I tried the same for regular transactions, but 
> such problem doesn't appear even with a decent load.
>
> I think the main problem behind the poor 2PC catchup performance is the lack of
> asynchronous commit support for 2PC. For regular transactions asynchronous
> commit is used on the replica by default (subscription synchronous_commit =
> off). It allows the replication worker process on the replica to avoid fsync
> (XLogFlush) and to utilize 100% CPU (the background wal writer or
> checkpointer will do the fsync). I agree, 2PC is mostly used in multimaster
> configurations with two or more nodes that commit synchronously, but
> when a node is in catchup (i.e. not online in a multimaster cluster),
> asynchronous commit has to be used to speed up the catchup.
>

I don't see that we do anything specific for 2PC transactions to make them
behave differently than regular transactions with respect to the
synchronous_commit setting. What makes you think so? Can you pinpoint
the code you are referring to?

> There is another thing that affects the imbalance between master and
> replica performance. When the master executes requests from multiple
> clients, an fsync optimization takes place in XLogFlush. It allows
> decreasing the number of fsyncs when a number of parallel backends write
> to the WAL simultaneously. The replica applies received transactions
> sequentially in a single process, so that optimization does not apply.
>

Right, I think for this we need to implement parallel apply.

> I see some possible solutions:
>
> Implement asynchronous commit for 2PC transactions.
> Do some hacking with enableFsync when it is possible.
>

Can you be a bit more specific about what exactly you have in mind to
achieve the above solutions?

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-02-22 Thread shveta malik
On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot
 wrote:
>
> Suppose that in synchronize_slots() the query would be:
>
> const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
> " restart_lsn, catalog_xmin, two_phase, failover,"
> " database, conflict_reason"
> " FROM pg_catalog.pg_replication_slots"
> " WHERE failover and NOT temporary and 1 = 1";
>
> Then my comment is to rewrite it to:
>
> const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
> " restart_lsn, catalog_xmin, two_phase, failover,"
> " database, conflict_reason"
> " FROM pg_catalog.pg_replication_slots"
> " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1";
>
> to ensure the operator "=" is coming from the pg_catalog schema.
>

Thanks for the details, but slot-sync does not use SPI calls, it uses
libpqrcv calls. So is this change needed?

thanks
Shveta




Re: speed up a logical replica setup

2024-02-22 Thread Euler Taveira
On Wed, Feb 21, 2024, at 5:00 AM, Shlok Kyal wrote:
> I found some issues and fixed those issues with top up patches
> v23-0012 and v23-0013
> 1.
> Suppose there is a cascade physical replication node1->node2->node3.
> Now if we run pg_createsubscriber with node1 as primary and node2 as
> standby, pg_createsubscriber will be successful but the connection
> between node2 and node3 will not be retained and log og node3 will
> give error:
> 2024-02-20 12:32:12.340 IST [277664] FATAL:  database system
> identifier differs between the primary and standby
> 2024-02-20 12:32:12.340 IST [277664] DETAIL:  The primary's identifier
> is 7337575856950914038, the standby's identifier is
> 7337575783125171076.
> 2024-02-20 12:32:12.341 IST [277491] LOG:  waiting for WAL to become
> available at 0/3000F10
> 
> To fix this I am avoiding pg_createsubscriber to run if the standby
> node is primary to any other server.
> Made the change in v23-0012 patch

IIRC we already discussed the cascading replication scenario. Of course,
breaking a node is not good; that's why you proposed v23-0012. However,
preventing pg_createsubscriber from running if there are standbys attached to it is
also annoying. If you don't have access to these hosts you need to (a) kill
the walsender (very fragile / unstable), (b) start with max_wal_senders = 0, or (c)
add a firewall rule so that these hosts cannot establish a connection
to the target server. I wouldn't like to include the patch as-is. IMO we need
at least one message explaining the situation to the user, I mean, add a hint
message.  I'm resistant to a new option, but probably a --force option is an
answer. There is no test coverage for it. I adjusted this patch (didn't include
the --force option) and added a test case.

> 2.
> While checking 'max_replication_slots' in 'check_publisher' function,
> we are not considering the temporary slot in the check:
> +   if (max_repslots - cur_repslots < num_dbs)
> +   {
> +   pg_log_error("publisher requires %d replication slots, but
> only %d remain",
> +num_dbs, max_repslots - cur_repslots);
> +   pg_log_error_hint("Consider increasing max_replication_slots
> to at least %d.",
> + cur_repslots + num_dbs);
> +   return false;
> +   }
> Fixed this in v23-0013

Good catch!

Both are included in the next patch.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Add publisher and subscriber to glossary documentation.

2024-02-22 Thread Peter Smith
Here are some comments for patch v2.

==

1. There are whitespace problems

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:43:
trailing whitespace.
  A node where publication is defined for
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:45:
trailing whitespace.
  It replicates a set of changes from a table or a group of tables in
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:66:
trailing whitespace.
 A node where subscription is defined for
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:67:
trailing whitespace.
 logical replication.
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:68:
trailing whitespace.
 It subscribes to one or more publications on a publisher node and pulls
warning: squelched 2 whitespace errors
warning: 7 lines add whitespace errors.

~~~

2. Publisher node

+  
+   Publisher node
+   
+
+  A node where publication is defined for
+  logical
replication.
+  It replicates a set of changes from a table or a group of tables in
+  publication to the subscriber node.
+
+
+ For more information, see
+ .
+
+   
+  
+


2a.
I felt the term here should be "Publication node".

Indeed, there should also be additional glossary terms like
"Publisher" (i.e. it is the same as "Publication node") and
"Subscriber" (i.e. it is the same as "Subscription node"). These
definitions will then be consistent with node descriptions already in
sections 30.1 and 30.2


~

2b.
"node" should include a link to the glossary term. Same for any other
terms mentioned

~

2c.
/A node where publication is defined for/A node where a publication is
defined for/

~

2d.
"It replicates" is misleading because it is the PUBLICATION doing the
replicating, not the node.

IMO it will be better to include 2 more glossary terms "Publication"
and "Subscription" where you can say this kind of information. Then
the link "logical-replication-publication" also belongs under the
"Publication" term.


~~~

3.
+  
+   Subscriber node
+   
+
+ A node where subscription is defined for
+ logical replication.
+ It subscribes to one or more publications on a publisher node and pulls
+ a set of changes from a table or a group of tables in publications it
+ subscribes to.
+
+
+ For more information, see
+ .
+
+   
+  

All comments are similar to those above.

==

In summary, IMO it should be a bit more like below:

SUGGESTION (include all the necessary links etc)

Publisher
See "Publication node"

Publication
A publication replicates the changes of one or more tables to a
subscription. For more information, see Section 30.1

Publication node
A node where a publication is defined for logical replication.

Subscriber
See "Subscription node"

Subscription
A subscription receives the changes of one or more tables from the
publications it subscribes to. For more information, see Section 30.2

Subscription node
A node where a subscription is defined for logical replication.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Synchronizing slots from primary to standby

2024-02-22 Thread Zhijie Hou (Fujitsu)
On Friday, February 23, 2024 10:02 AM Zhijie Hou (Fujitsu) 
 wrote:
> 
> Hi,
> 
> Since the slotsync worker patch has been committed, I rebased the remaining
> patches.
> And here is the V95 patch set.
> 
> Also, I fixed a bug in the current 0001 patch where the member of the standby
> slot names list pointed to the freed memory after calling ProcessConfigFile().
> Now, we will obtain a new list when we call ProcessConfigFile(). The
> optimization to only get the new list when the names actually change has been
> removed. I think this change is acceptable because ProcessConfigFile is not a
> frequent occurrence.
> 
> Additionally, I reordered the tests in 040_standby_failover_slots_sync.pl. Now
> the new test will be conducted after the sync slot test to prevent the risk 
> of the
> logical slot occasionally not catching up to the latest catalog_xmin and, as a
> result, not being able to be synced immediately.

There is one unexpected change in the previous version, sorry for that.
Here is the correct version.

Best Regards,
Hou zj


v95_2-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
Description:  v95_2-0002-Document-the-steps-to-check-if-the-standby-is-r.patch


v95_2-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
Description:  v95_2-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch


RE: Synchronizing slots from primary to standby

2024-02-22 Thread Zhijie Hou (Fujitsu)
Hi,

Since the slotsync worker patch has been committed, I rebased the remaining 
patches.
And here is the V95 patch set.

Also, I fixed a bug in the current 0001 patch where the member of the standby
slot names list pointed to the freed memory after calling ProcessConfigFile().
Now, we will obtain a new list when we call ProcessConfigFile(). The
optimization to only get the new list when the names actually change has been
removed. I think this change is acceptable because ProcessConfigFile is not a
frequent occurrence.

Additionally, I reordered the tests in 040_standby_failover_slots_sync.pl. Now 
the new test
will be conducted after the sync slot test to prevent the risk of the logical
slot occasionally not catching up to the latest catalog_xmin and, as a result,
not being able to be synced immediately.

Best Regards,
Hou zj


v95-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
Description:  v95-0002-Document-the-steps-to-check-if-the-standby-is-r.patch


v95-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
Description:  v95-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch


Re: Improve readability by using designated initializers when possible

2024-02-22 Thread Jeff Davis
On Fri, 2024-02-23 at 01:35 +0100, Jelte Fennema-Nio wrote:
> On Thu, Feb 22, 2024, 23:46 Jeff Davis  wrote:
> > 
> > Am I missing something?
> 
> The main benefits it has are:

Sorry, I was unclear. I was asking a question about the reason the
ObjectClass and the object_classes[] array exist in the current code,
it wasn't a direct question about your patch.

ObjectClass is only used in a couple places, and I don't immediately
see why those places can't just use the OID of the class (like 
OperatorClassRelationId).

Regards,
Jeff Davis





Re: Improve readability by using designated initializers when possible

2024-02-22 Thread Jelte Fennema-Nio
On Thu, Feb 22, 2024, 23:46 Jeff Davis  wrote:

>
> Am I missing something?


The main benefits it has are:
1. The order of the array doesn't have to exactly match the order of the
enum for the arrays to contain the correct mapping.
2. Typos in the enum variant names are caught by the compiler because
actual symbols are used, not comments.
3. The left-to-right order reads more naturally imho for such key-value
pairs, e.g. OCLASS_PROC maps to ProcedureRelationId (see the sketch below).
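
A small standalone sketch of how points 1 and 2 play out for an enum-to-OID
mapping like object_classes[]. It is simplified: only a few entries, and the
catalog OIDs are written as literals rather than the *RelationId macros.

typedef enum ObjectClass
{
	OCLASS_CLASS,				/* pg_class */
	OCLASS_PROC,				/* pg_proc */
	OCLASS_TYPE					/* pg_type */
} ObjectClass;

/* entries need not be written in enum order, and a misspelled
 * OCLASS_* name is a compile error rather than a silent mismatch */
static const unsigned int object_classes[] = {
	[OCLASS_PROC] = 1255,		/* ProcedureRelationId */
	[OCLASS_CLASS] = 1259,		/* RelationRelationId */
	[OCLASS_TYPE] = 1247,		/* TypeRelationId */
};

With the positional form, reordering or extending the enum without touching
the array silently breaks the mapping.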


Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-02-22 Thread Tom Lane
I wrote:
> I think that this is a band-aid that's just masking an error in the
> rowmarking logic: it's not doing the right thing for appendrels
> made from UNION ALL subqueries.  I've not wrapped my head around
> exactly where it's going off the rails, but I feel like this ought
> to get fixed somewhere else.  Please hold off pushing your patch
> for now.

So after studying this for awhile, I see that the planner is emitting
a PlanRowMark that presumes that the UNION ALL subquery will be
scanned as though it's a base relation; but since we've converted it
to an appendrel, the executor just ignores that rowmark, and the wrong
things happen.  I think the really right fix would be to teach the
executor to honor such PlanRowMarks, by getting nodeAppend.c and
nodeMergeAppend.c to perform EPQ row substitution.  I wrote a draft
patch for that, attached, and it almost works but not quite.  The
trouble is that we're jamming the contents of the row identity Var
created for the rowmark into the output of the Append or MergeAppend,
and it doesn't necessarily match exactly.  In the test case you
created, the planner produces targetlists like

Output: src_1.val, src_1.id, ROW(src_1.id, src_1.val)

and as you can see the order of the columns doesn't match.
I can see three ways we might attack that:

1. Persuade the planner to build output tlists that always match
the row identity Var.  This seems undesirable, because the planner
might have intentionally elided columns that won't be read by the
upper parts of the plan.

2. Change generation of the ROW() expression so that it lists only
the values we're going to output, in the order we're going to
output them.  This would amount to saying that for UNION cases
the "identity" of a row need only consider columns used by the
plan, which feels a little odd but I can't think of a reason why
it wouldn't work.  I'm not sure how messy this'd be to implement
though, as the set of columns to be emitted isn't fully determined
until much later than where we currently expand the row identity
Vars into RowExprs.

3. Fix the executor to remap what it gets out of the ROW() into the
order of the subquery tlists.  This is probably do-able but I'm
not certain; it may be that the executor hasn't enough info.
We might need to teach the planner to produce a mapping projection
and attach it to the Append node, which carries some ABI risk (but
in the past we've gotten away with adding new fields to the ends
of plan nodes in the back branches).  Another objection is that
adding cycles to execution rather than planning might be a poor
tradeoff --- although if we only do the work when EPQ is invoked,
maybe it'd be the best way.

It might be that any of these things is too messy to be considered
for back-patching, and we ought to apply what you did in the
back branches.  I'd like a better fix in HEAD though.

regards, tom lane

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c7059e7528..113217e607 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -288,8 +288,65 @@ static TupleTableSlot *
 ExecAppend(PlanState *pstate)
 {
 	AppendState *node = castNode(AppendState, pstate);
+	EState	   *estate = node->ps.state;
 	TupleTableSlot *result;
 
+	if (estate->es_epq_active != NULL)
+	{
+		/*
+		 * We are inside an EvalPlanQual recheck.  If there is a relevant
+		 * rowmark for the append relation, return the test tuple if one is
+		 * available.
+		 */
+		EPQState   *epqstate = estate->es_epq_active;
+		int			scanrelid;
+
+		if (bms_get_singleton_member(castNode(Append, node->ps.plan)->apprelids,
+	 &scanrelid))
+		{
+			if (epqstate->relsubs_done[scanrelid - 1])
+			{
+/*
+ * Return empty slot, as either there is no EPQ tuple for this
+ * rel or we already returned it.
+ */
+TupleTableSlot *slot = node->ps.ps_ResultTupleSlot;
+
+return ExecClearTuple(slot);
+			}
+			else if (epqstate->relsubs_slot[scanrelid - 1] != NULL)
+			{
+/*
+ * Return replacement tuple provided by the EPQ caller.
+ */
+TupleTableSlot *slot = epqstate->relsubs_slot[scanrelid - 1];
+
+Assert(epqstate->relsubs_rowmark[scanrelid - 1] == NULL);
+
+/* Mark to remember that we shouldn't return it again */
+epqstate->relsubs_done[scanrelid - 1] = true;
+
+return slot;
+			}
+			else if (epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
+			{
+/*
+ * Fetch and return replacement tuple using a non-locking
+ * rowmark.
+ */
+TupleTableSlot *slot = node->ps.ps_ResultTupleSlot;
+
+/* Mark to remember that we shouldn't return more */
+epqstate->relsubs_done[scanrelid - 1] = true;
+
+if (!EvalPlanQualFetchRowMark(epqstate, scanrelid, slot))
+	return NULL;
+
+return slot;
+			}
+		}
+	}
+
 	/*
 	 * If this is the first call after Init or ReScan, we need to do the
 	 * initialization work.
@@ -405,6 +462,7 @@ ExecEndAppend(AppendState *node)

Re: Running the fdw test from the terminal crashes into the core-dump

2024-02-22 Thread Tom Lane
Alvaro Herrera  writes:
> After having pushed that, I wonder if we should document this.  It seems
> quite the minor thing, but I'm sure somebody will complain if we don't.

Yup, no doubt.

> I propose the attached.  (Extra context so that the full paragraph can
> be read from the comfort of your email program.)

This reads awkwardly to me.  I think it'd be better to construct it
so that DO NOTHING's requirement is stated exactly parallel to the other
three clause types, more or less as attached.

> (While at it, I found the placement of the previous-to-last sentence in
> that paragraph rather strange, so I moved it to the end.)

Agreed, and done in my version too.

BTW, if you read it without paying attention to markup, you'll notice
that we are saying things like

If you specify an insert action, you must have the INSERT
privilege on the target_table_name.

which is fairly nonsensical: we don't define privileges on the name
of something.  While I've not done anything about that here,
I wonder if we shouldn't just write "privilege on the target table"
without any special markup.

regards, tom lane

diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml
index 655f7dcc05..79ebff1033 100644
--- a/doc/src/sgml/ref/merge.sgml
+++ b/doc/src/sgml/ref/merge.sgml
@@ -104,12 +104,15 @@ DELETE
privilege on the target_table_name.
If you specify a delete action, you must have the DELETE
privilege on the target_table_name.
-   Privileges are tested once at statement start and are checked
-   whether or not particular WHEN clauses are executed.
-   You will require the SELECT privilege on any column(s)
+   If you specify a DO NOTHING action, you must have
+   the SELECT privilege on at least one column
+   of target_table_name.
+   You will also need SELECT privilege on any column(s)
of the data_source and
target_table_name referred to
in any condition or expression.
+   Privileges are tested once at statement start and are checked
+   whether or not particular WHEN clauses are executed.
   
 
   


Re: WIP Incremental JSON Parser

2024-02-22 Thread Andrew Dunstan



On 2024-02-22 Th 15:29, Jacob Champion wrote:

On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan  wrote:

Patch 5 in this series fixes those issues and
adjusts most of the tests to add some trailing junk to the pieces of
json, so we can be sure that this is done right.

This fixes the test failure for me, thanks! I've attached my current
mesonification diff, which just adds test_json_parser to the suite. It
relies on the PATH that's set up, which appears to include the build
directory for both VPATH builds and Meson.




OK, thanks, will add this in the next version.




Are there plans to fill out the test suite more? Since we should be
able to control all the initial conditions, it'd be good to get fairly
comprehensive coverage of the new code.




Well, it's tested (as we know) by the backup manifest tests. During 
development, I tested by making the regular parser use the 
non-recursive  parser (see FORCE_JSON_PSTACK). That doesn't test the 
incremental piece of it, but it does check that the rest of it is doing 
the right thing. We could probably extend the incremental test by making 
it output a stream of tokens and making sure that was correct.




As an aside, I find the behavior of need_escapes=false to be
completely counterintuitive. I know the previous code did this, but it
seems like the opposite of "provides unescaped strings" should be
"provides raw strings", not "all strings are now NULL".



Yes, we could possibly call it "need_strings"  or something like that.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Documentation to upgrade logical replication cluster

2024-02-22 Thread Peter Smith
Hi Vignesh, I have no further comments. Patch v9 LGTM.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Avoid stack frame setup in performance critical routines using tail calls

2024-02-22 Thread Andres Freund
Hi,

On 2024-02-23 00:46:26 +1300, David Rowley wrote:
> I've rebased the 0001 patch and gone over it again and made a few
> additional changes besides what I mentioned in my review.
>
> On Wed, 9 Aug 2023 at 20:44, David Rowley  wrote:
> > Here's a review of v2-0001:
> > 2. Why do you need to add the NULL check here?
> >
> >  #ifdef USE_VALGRIND
> > - if (method != MCTX_ALIGNED_REDIRECT_ID)
> > + if (ret != NULL && method != MCTX_ALIGNED_REDIRECT_ID)
> >   VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
> >  #endif
>
> I removed this NULL check as we're calling the realloc function with
> no flags, so it shouldn't return NULL as it'll error out from any OOM
> errors.

That was probably a copy-paste issue...


> > 4. It would be good to see some API documentation in the
> > MemoryContextMethods struct.  This adds a lot of responsibility onto
> > the context implementation without any extra documentation to explain
> > what, for example, palloc is responsible for and what the alloc
> > function needs to do itself.
>
> I've done that too.
>
> I also added header comments for MemoryContextAllocationFailure and
> MemoryContextSizeFailure and added some comments to explain in places
> like palloc() to warn people not to add checks after the 'alloc' call.
>
> The rebased patch is 0001 and all of my changes are in 0002.  I will
> rebase your original 0002 patch later.

Thanks!


> I think 0001 is much more important, as evident by the reported benchmarks
> on this thread.

I agree that it's good to tackle 0001 first.

I don't understand the benchmark point though. Your benchmark seems to suggest
that 0002 improves aset performance by *more* than 0001: for 8 byte aset
allocs:

time
master: 8.86
0001:   8.12
0002:   7.02

So 0001 reduces time by 0.92x and 0002 by 0.86x.


John's test shows basically no change for 0002 - which is unsurprising, as
0002 changes aset.c, but the test seems to solely exercise slab, as only
SlabAlloc() shows up in the profile. As 0002 only touches aset.c it couldn't
really have affected that test.


> In absence of anyone else looking at this, I think it's ready to go.
> If anyone is following along and wants to review or test it, please do
> so soon.

Makes sense!



> @@ -1061,6 +1072,16 @@ MemoryContextAlloc(MemoryContext context, Size size)
>
>   context->isReset = false;
>

For a moment this made me wonder if we could move the isReset handling into
the allocator slow paths as well - it's annoying to write that bit (and thus
dirty the cacheline) over and over. But it'd be somewhat awkward due to
pre-allocated blocks. So that'd be a larger change better done separately.

Greetings,

Andres Freund




Re: Improve readability by using designated initializers when possible

2024-02-22 Thread Jeff Davis
On Wed, 2024-02-21 at 16:03 +0100, Jelte Fennema-Nio wrote:
> Usage of designated initializers came up in:
> https://www.postgresql.org/message-id/flat/ZdWXhAt9Tz4d-lut%40paquier.xyz#9dc17e604e58569ad35643672bf74acc
> 
> This converts all arrays that I could find that could clearly benefit
> from this without any other code changes being necessary.

Looking at the object_classes array and the ObjectClass enum, I don't
quite understand the point. It seems like a way to write OCLASS_OPCLASS
instead of OperatorClassRelationId, and similar?

Am I missing something?

Regards,
Jeff Davis





Re: Add lookup table for replication slot invalidation causes

2024-02-22 Thread Peter Smith
On Thu, Feb 22, 2024 at 5:56 PM Michael Paquier  wrote:
>
> On Thu, Feb 22, 2024 at 05:30:08PM +1100, Peter Smith wrote:
> > Oops. Perhaps I meant more like below -- in any case, the point was
> > the same -- to ensure RS_INVAL_NONE is what returns if something
> > unexpected happens.
>
> You are right that this could be a bit confusing, even if we should
> never reach this state.  How about avoiding to return the index of the
> loop as result, as of the attached?  Would you find that cleaner?
> --

Hi, yes, it should never happen, but thanks for making the changes.

I would've just removed every local variable instead of adding more of
them. I also felt the iteration starting from RS_INVAL_NONE instead of
0 is asserting RS_INVAL_NONE must always be the first enum and can't
be rearranged. Probably it will never happen, but why require it?

--
ReplicationSlotInvalidationCause
GetSlotInvalidationCause(const char *conflict_reason)
{
	for (ReplicationSlotInvalidationCause cause = 0;
		 cause <= RS_INVAL_MAX_CAUSES; cause++)
		if (strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0)
			return cause;

	Assert(0);
	return RS_INVAL_NONE;
}
--

But maybe those nits are a matter of personal choice. Your patch code
addressed my main concern, so it LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

2024-02-22 Thread Michael Banck
Hi,

On Wed, Jan 24, 2024 at 02:50:52PM -0500, Kirk Wolak wrote:
> On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak  wrote:
> > On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson  wrote:
> >> > On 19 Jan 2024, at 23:09, Kirk Wolak  wrote:
> > Thank you, that made it possible to build and run...
> > UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
> > I am watching it already consuming 6% of my system memory.
> >
> Daniel,
>   In the previous email, I made note that once the JIT was enabled, the
> problem exists in 17Devel.
> I re-included my script, which forced the JIT to be used...
> 
>   I attached an updated script that forced the settings.
> But this is still leaking memory (outside of the
> pg_backend_memory_context() calls).
> Probably because it's at the LLVM level?  And it does NOT happen from
> planning/opening the query.  It appears I have to fetch the rows to
> see the problem.

I had a look at this (and blogged about it here [1]) and was also
wondering what was going on with 17devel and the recent back-branch
releases, because I could also reproduce those continuing memory leaks.

Adding some debug logging when llvm_inline_reset_caches() is called
solves the mystery: because you are calling a function, the fix that is in
17devel and the back-branch releases does not apply, and
llvm_inline_reset_caches() is only called after the function returns
(presumably because llvm_jit_context_in_use_count is greater than zero
while the function runs, so the call site of llvm_inline_reset_caches() is
never reached in the meantime).

If you instead run your SQL in a DO-loop (as in the blog post) and not
as a PL/PgSQL function, you should see that it no longer leaks. This
might be obvious to some (and Andres mentioned it in 
https://www.postgresql.org/message-id/20210421002056.gjd6rpe6toumiqd6%40alap3.anarazel.de)
but it took me a while to figure out/find.


Michael

[1] 
https://www.credativ.de/en/blog/postgresql-en/quick-benchmark-postgresql-2024q1-release-performance-improvements/




Re: speed up a logical replica setup

2024-02-22 Thread Euler Taveira
On Thu, Feb 22, 2024, at 9:43 AM, Hayato Kuroda (Fujitsu) wrote:
> > The possible solution would be
> > 1) allow to run pg_createsubscriber if standby is initially stopped .
> > I observed that pg_logical_createsubscriber also uses this approach.
> > 2) read GUCs via SHOW command and restore them when server restarts
> >

3. add a config-file option. That's similar to what pg_rewind does. I expect
that Debian-based installations will have this issue.

> I also prefer the first solution. 
> Another reason why the standby should be stopped is for backup purpose.
> Basically, the standby instance should be saved before running 
> pg_createsubscriber.
> An easiest way is hard-copy, and the postmaster should be stopped at that 
> time.
> I felt it is better that users can run the command immediately later the 
> copying.
> Thought?

It would not be a good idea if you want to keep postgresql.conf outside PGDATA.
I mean you would need extra steps that can be error prone (different settings
between the files).

Shlok, I didn't read your previous email carefully. :-/


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: WIP Incremental JSON Parser

2024-02-22 Thread Jacob Champion
On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan  wrote:
> Patch 5 in this series fixes those issues and
> adjusts most of the tests to add some trailing junk to the pieces of
> json, so we can be sure that this is done right.

This fixes the test failure for me, thanks! I've attached my current
mesonification diff, which just adds test_json_parser to the suite. It
relies on the PATH that's set up, which appears to include the build
directory for both VPATH builds and Meson.

Are there plans to fill out the test suite more? Since we should be
able to control all the initial conditions, it'd be good to get fairly
comprehensive coverage of the new code.

As an aside, I find the behavior of need_escapes=false to be
completely counterintuitive. I know the previous code did this, but it
seems like the opposite of "provides unescaped strings" should be
"provides raw strings", not "all strings are now NULL".

--Jacob
commit 590ea7bec167058340624313d98c72976fa89d7a
Author: Jacob Champion 
Date:   Wed Feb 21 06:36:55 2024 -0800

WIP: mesonify

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..e5c9bd10cf 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_dsm_registry')
 subdir('test_extensions')
 subdir('test_ginpostinglist')
 subdir('test_integerset')
+subdir('test_json_parser')
 subdir('test_lfind')
 subdir('test_misc')
 subdir('test_oat_hooks')
diff --git a/src/test/modules/test_json_parser/meson.build 
b/src/test/modules/test_json_parser/meson.build
new file mode 100644
index 00..a5bedce94e
--- /dev/null
+++ b/src/test/modules/test_json_parser/meson.build
@@ -0,0 +1,50 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+test_json_parser_incremental_sources = files(
+  'test_json_parser_incremental.c',
+)
+
+if host_system == 'windows'
+  test_json_parser_incremental_sources += rc_bin_gen.process(win32ver_rc, 
extra_args: [
+'--NAME', 'test_json_parser_incremental',
+'--FILEDESC', 'standalone json parser tester',
+  ])
+endif
+
+test_json_parser_incremental = executable('test_json_parser_incremental',
+  test_json_parser_incremental_sources,
+  dependencies: [frontend_code],
+  kwargs: default_bin_args + {
+'install': false,
+  },
+)
+
+test_json_parser_perf_sources = files(
+  'test_json_parser_perf.c',
+)
+
+if host_system == 'windows'
+  test_json_parser_perf_sources += rc_bin_gen.process(win32ver_rc, extra_args: 
[
+'--NAME', 'test_json_parser_perf',
+'--FILEDESC', 'standalone json parser tester',
+  ])
+endif
+
+test_json_parser_perf = executable('test_json_parser_perf',
+  test_json_parser_perf_sources,
+  dependencies: [frontend_code],
+  kwargs: default_bin_args + {
+'install': false,
+  },
+)
+
+tests += {
+  'name': 'test_json_parser',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+'tests': [
+  't/001_test_json_parser_incremental.pl',
+],
+  },
+}
diff --git 
a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl 
b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
index fc9718baf3..8eeb7f5b91 100644
--- a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
+++ b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
@@ -8,7 +8,7 @@ use FindBin;
 
 my $test_file = "$FindBin::RealBin/../tiny.json";
 
-my $exe = "$ENV{TESTDATADIR}/../test_json_parser_incremental";
+my $exe = "test_json_parser_incremental";
 
 my ($stdout, $stderr) = run_command( [$exe, $test_file] );
 


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-22 Thread Alvaro Herrera
On 2024-Feb-07, Dilip Kumar wrote:

> On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera  wrote:

> > Sure, but is that really what we want?
> 
> So your question is do we want these buffers to be in multiple of
> SLRU_BANK_SIZE?  Maybe we can have the last bank to be partial, I
> don't think it should create any problem logically.  I mean we can
> look again in the patch to see if we have made any such assumptions
> but that should be fairly easy to fix, then maybe if we are going in
> this way we should get rid of the check_slru_buffers() function as
> well.

Not really, I just don't think the macro should be in slru.h.

Another thing I've been thinking is that perhaps it would be useful to
make the banks smaller, when the total number of buffers is small.  For
example, if you have 16 or 32 buffers, it's not really clear to me that
it makes sense to have just 1 bank or 2 banks.  It might be more
sensible to have 4 banks with 4 or 8 buffers instead.  That should make
the algorithm scale down as well as up ...
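
As a purely illustrative sketch of that idea (the SLRU_BANK_SIZE value of 16
and the "at least ~4 banks, at least 4 buffers per bank" thresholds are
assumptions, not taken from the patch):

static int
choose_bank_size(int nbuffers)
{
	int			bank_size = 16;	/* hypothetical SLRU_BANK_SIZE */

	/* shrink until we get at least ~4 banks, but never below 4 buffers */
	while (bank_size > 4 && nbuffers / bank_size < 4)
		bank_size /= 2;

	return bank_size;
}

With that, 16 configured buffers would end up as 4 banks of 4, and 32 as
4 banks of 8, matching the numbers above.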

I haven't done either of those things in the attached v19 version.  I
did go over the comments once again and rewrote the parts I was unhappy
with, including some existing ones.  I think it's OK now from that point
of view ... at some point I thought about creating a separate README,
but in the end I thought it not necessary.

I did add a bunch of Assert()s to make sure the locks that are supposed
to be held are actually held.  This led me to testing the page status to
be not EMPTY during SimpleLruWriteAll() before calling
SlruInternalWritePage(), because the assert was firing.  The previous
code is not really *buggy*, but to me it's weird to call WritePage() on
a slot with no contents.
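
As an illustration only, the check described above might look roughly like
this inside the SimpleLruWriteAll() loop (field names follow the existing
SlruSharedData layout; this is a sketch, not the actual hunk):

	for (int slotno = 0; slotno < shared->num_slots; slotno++)
	{
		/* skip slots that hold no page at all */
		if (shared->page_status[slotno] != SLRU_PAGE_EMPTY)
			SlruInternalWritePage(ctl, slotno, &fdata);
	}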

Another change was in TransactionGroupUpdateXidStatus: the original code
had the leader doing pg_atomic_read_u32(>clogGroupFirst) to
know which bank to lock.  I changed it to simply be the page used by the
leader process; this doesn't need an atomic read, and should be the same
page anyway.  (If it isn't, it's no big deal).  But what's more: even if
we do read ->clogGroupFirst at that point, there's no guarantee that
this is going to be exactly for the same process that ends up being the
first in the list, because since we have not set it to INVALID by the
time we grab the bank lock, it is quite possible for more processes to
add themselves to the list.

I realized all this while rewriting the comments in a way that would let
me understand what was going on ... so IMO the effort was worthwhile.

Anyway, what I send now should be pretty much final, modulo the
change to the check_slru_buffers routines and documentation additions to
match pg_stat_slru to the new GUC names.

> > Another painful point is that pg_stat_slru uses internal names in the
> > data it outputs, which obviously do not match the new GUCs.
> 
> Yeah, that's true, but I think this could be explained somewhere not
> sure what is the right place for this.

Or we can change those names in the view.

> FYI, I have also repeated all the performance tests I performed in my
> first email[1], and I am seeing a similar gain.

Excellent, thanks for doing that.

> Some comments on v18 in my first pass of the review.
> 
> 1.
> @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
>   lsnindex = GetLSNIndex(slotno, xid);
>   *lsn = XactCtl->shared->group_lsn[lsnindex];
> 
> - LWLockRelease(XactSLRULock);
> + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno));
> 
> Maybe here we can add an assert before releasing the lock for a safety check
> 
> Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno)));

Hmm, I think this would just throw a warning or error "you don't hold
such lwlock", so it doesn't seem necessary.

> 2.
> + *
> + * XXX could we make the LSNs to be bank-based?
>   */
>   XLogRecPtr *group_lsn;
> 
> IMHO, the flush still happens at the page level so up to which LSN
> should be flush before flushing the particular clog page should also
> be maintained at the page level.

Yeah, this was a misguided thought, nevermind.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
 It does it in a really, really complicated way
 why does it need to be complicated?
 Because it's MakeMaker.
>From e1aabcbf5ce1417decfe24f513e5cfe8b6de77f2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 22 Feb 2024 18:42:56 +0100
Subject: [PATCH v19] Make SLRU buffer sizes configurable

Also, divide the slot array in banks, so that the LRU algorithm can be
made more scalable.

Also remove the centralized control lock for even better scalability.

Authors: Dilip Kumar, Andrey Borodin
---
 doc/src/sgml/config.sgml  | 139 +++
 src/backend/access/transam/clog.c | 235 
 src/backend/access/transam/commit_ts.c|  89 +++--
 src/backend/access/transam/multixact.c| 190 +++---
 src/backend/access/transam/slru.c | 345 +-
 

Re: Running the fdw test from the terminal crashes into the core-dump

2024-02-22 Thread Alvaro Herrera
On 2024-Feb-20, Tom Lane wrote:

> > So, this means we can fix this by simply requiring ACL_SELECT privileges
> > on a DO NOTHING action.  We don't need to request specific privileges on
> > any particular column (perminfo->selectedCols continues to be the empty
> > set) -- which means that any role that has privileges on *any* column
> > would get a pass.
> 
> LGTM.

Thanks for looking!

After having pushed that, I wonder if we should document this.  It seems
quite the minor thing, but I'm sure somebody will complain if we don't.
I propose the attached.  (Extra context so that the full paragraph can
be read from the comfort of your email program.)

(While at it, I found the placement of the previous-to-last sentence in
that paragraph rather strange, so I moved it to the end.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)
diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml
index 655f7dcc05..85938eda07 100644
--- a/doc/src/sgml/ref/merge.sgml
+++ b/doc/src/sgml/ref/merge.sgml
@@ -97,26 +97,29 @@ DELETE
   
There is no separate MERGE privilege.
If you specify an update action, you must have the
UPDATE privilege on the column(s)
of the target_table_name
that are referred to in the SET clause.
If you specify an insert action, you must have the INSERT
privilege on the target_table_name.
If you specify a delete action, you must have the DELETE
privilege on the target_table_name.
-   Privileges are tested once at statement start and are checked
-   whether or not particular WHEN clauses are executed.
You will require the SELECT privilege on any column(s)
of the data_source and
target_table_name referred to
-   in any condition or expression.
+   in any condition or expression;
+   in addition, if a DO NOTHING action is specified, you
+   will require the SELECT privilege on at least one column
+   of target_table_name.
+   Privileges are tested once at statement start and are checked
+   whether or not particular WHEN clauses are executed.
   
 
   
MERGE is not supported if the
target_table_name is a
materialized view, foreign table, or if it has any
rules defined on it.
   
  
 


Re: speed up a logical replica setup

2024-02-22 Thread 'Alvaro Herrera'
Hello,

On 2024-Feb-22, Hayato Kuroda (Fujitsu) wrote:

> Dear Alvaro,

> > Hmm, but doesn't this mean that the server will log an ugly message
> > that "client closed connection unexpectedly"?  I think it's nicer to
> > close the connection before terminating the process (especially
> > since the code for that is already written).
> 
> OK. So we should disconnect properly even if the process exits. I
> added the function call again. Note that PQclear() was not added
> because it is only related with the application.

Sounds about right, but I didn't verify the patches in detail.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)




Re: locked reads for atomics

2024-02-22 Thread Jeff Davis
On Thu, 2024-02-22 at 12:58 +0530, Bharath Rupireddy wrote:
> There's some immediate use for reads/writes with barrier semantics -

Is this mainly a convenience for safety/readability? Or is it faster in
some cases than doing an atomic access with separate memory barriers?
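
For context, a minimal sketch of the two spellings being compared;
pg_atomic_read_u32() and pg_memory_barrier() are existing primitives, while
the single-call helper name below is assumed from the proposal under
discussion:

#include "postgres.h"
#include "port/atomics.h"

/* open-coded: atomic read bracketed by explicit full barriers */
static inline uint32
read_u32_with_barriers(pg_atomic_uint32 *ptr)
{
	uint32		val;

	pg_memory_barrier();
	val = pg_atomic_read_u32(ptr);
	pg_memory_barrier();
	return val;
}

/*
 * versus a single call with the barrier semantics built in (name assumed):
 *
 *		val = pg_atomic_read_membarrier_u32(ptr);
 */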

Regards,
Jeff Davis





Re: LogwrtResult contended spinlock

2024-02-22 Thread Jeff Davis
On Thu, 2024-02-22 at 10:17 +0530, Robert Haas wrote:
> I think the problems tend to be worst when you have some bit of data
> that's being frequently modified by multiple backends. Every backend
> that wants to modify the value needs to steal the cache line, and
> eventually you spend most of your CPU time stealing cache lines from
> other sockets and not much of it doing any actual work. If you have a
> value that's just being read by a lot of backends without
> modification, I think the cache line can be shared in read only mode
> by all the CPUs and it's not too bad.

That makes sense. I guess they'd be on the same cache line as well,
which means a write to either will invalidate both.

Some places (XLogWrite, XLogInsertRecord, XLogSetAsyncXactLSN,
GetFlushRecPtr, GetXLogWriteRecPtr) already update it from shared
memory anyway, so those are non-issues.

The potential problem areas (unless I missed something) are:

  * AdvanceXLInsertBuffer reads it as an early check to see if a buffer
is already written out.

  * XLogFlush / XLogNeedsFlush use it for an early return

  * WALReadFromBuffers reads it for an error check (currently an
Assert, but we might want to make that an elog).

  * We are discussing adding a Copy pointer, which would be advanced by
WaitXLogInsertionsToFinish(), and if we do replication-before-flush, we
may want to be more eager about advancing it. That could cause an
issue, as well.

I don't see the global non-shared variable as a huge problem, so if it
serves a purpose then I'm fine keeping it. Perhaps we could make it a
bit safer by using some wrapper functions. Something like:

  bool
  IsWriteRecPtrAtLeast(XLogRecPtr recptr)
  {
      XLogRecPtr  writeptr;

      if (LogwrtResult.Write >= recptr)
          return true;
      writeptr = GetXLogWriteRecPtr();
      return (writeptr >= recptr);
  }

That would reduce the number of direct references to LogwrtResult,
callers would never see a stale value, and would avoid the cache miss
problem that you're concerned about. Not sure if they'd need to be
inline or not.

Regards,
Jeff Davis





Re: Unlinking Parallel Hash Join inner batch files sooner

2024-02-22 Thread Thomas Munro
On Thu, Feb 22, 2024 at 5:37 PM Andrei Lepikhov
 wrote:
> On 22/2/2024 06:42, Thomas Munro wrote:
> > extreme skew for one version of the problem, but even with zero/normal
> > skewness and perfect estimation of the number of partitions, if you

Sorry, I meant to write "but even with no duplicates" there (mention
of "normal" was brain fade).




re: Speeding up COPY TO for uuids and arrays

2024-02-22 Thread Ranier Vilela
Hi,

On 2024-02-17 17:48:23 +0100, Laurenz Albe wrote:
> As a test case, I created a table with 1 rows, each of which
> had an array of 1 uuids. The table resided in shared buffers.
Can you share the exact script used to create the table?

best regards,

Ranier Vilela


Re: Optimize planner memory consumption for huge arrays

2024-02-22 Thread Alena Rybakina

Hi!

On 20.02.2024 07:41, Andrei Lepikhov wrote:

On 20/2/2024 04:51, Tom Lane wrote:

Tomas Vondra  writes:

On 2/19/24 16:45, Tom Lane wrote:

Tomas Vondra  writes:
For example, I don't think we expect selectivity functions to 
allocate

long-lived objects, right? So maybe we could run them in a dedicated
memory context, and reset it aggressively (after each call).

Here's a quick and probably-incomplete implementation of that idea.
I've not tried to study its effects on memory consumption, just made
sure it passes check-world.
Thanks for the sketch. The trick with the planner_tmp_cxt_depth 
especially looks interesting.
I think we should design small memory contexts - one per scalable 
direction of memory utilization, like selectivity or partitioning 
(appending ?).
My coding experience shows that short-lived GEQO memory context forces 
people to learn on Postgres internals more precisely :).


I think there was a problem in your patch where you freed the memory 
of a variable in the eqsel_internal function, because we have a case 
where that variable was deleted by reference in the 
eval_const_expressions_mutator function (this happens only for T_SubPlan and 
T_AlternativeSubPlan node types).


This query just causes an error in your case:

create table a (id bigint, c1 bigint, primary key(id));
create table b (a_id bigint, b_id bigint, b2 bigint, primary key(a_id, 
b_id));

explain select id
    from a, b
    where id = a_id
  and b2 = (select  min(b2)
    from    b
    where   id = a_id);
drop table a;
drop table b;

We can return a copy of the variable or not release the memory of this 
variable.


I attached two patches: the first one removes your memory cleanup, and the 
other one returns a copy of the variable.


The corrections were authored not only by me but also by Daniil Anisimov.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/src/backend/optimizer/util/clauses.c 
b/src/backend/optimizer/util/clauses.c
index edc25d712e9..ac560b1605e 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -2915,7 +2915,7 @@ eval_const_expressions_mutator(Node *node,
 * XXX should we ereport() here instead?  Probably this 
routine
 * should never be invoked after SubPlan creation.
 */
-   return node;
+   return CopyObject(node);
case T_RelabelType:
{
RelabelType *relabel = (RelabelType *) node;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cea777e9d40..e5bad75ec1c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -281,6 +281,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate)
selec = var_eq_non_const(, operator, collation, other,
 varonleft, 
negate);
 
+   pfree(other);
ReleaseVariableStats(vardata);
 
return selec;
@@ -1961,15 +1962,15 @@ scalararraysel(PlannerInfo *root,
{
List   *args;
Selectivity s2;
-
-   args = list_make2(leftop,
- 
makeConst(nominal_element_type,
-   
-1,
-   
nominal_element_collation,
-   
elmlen,
-   
elem_values[i],
-   
elem_nulls[i],
-   
elmbyval));
+   Const *c = makeConst(nominal_element_type,
+   -1,
+   
nominal_element_collation,
+   elmlen,
+   elem_values[i],
+   elem_nulls[i],
+   elmbyval);
+
+   args = list_make2(leftop, c);
if (is_join_clause)
s2 = 
DatumGetFloat8(FunctionCall5Coll(,

  clause->inputcollid,
@@ -1985,7 +1986,8 @@ scalararraysel(PlannerInfo *root,

  

how to read table options during smgropen()

2024-02-22 Thread Dima Rybakov (Tlt)
Dear pgsql hackers,

I am developing custom storage for pgsql tables. I am using md* functions
and smgrsw[] structure to switch between different magnetic disk
access methods.

I want to add some custom options when a table is created:
psql# create table t(...) with (my_option='value');

And thus I want to set "reln->smgr_which" conditionally during smgropen().
If my_option='value', I would use another smgr_which.

I am really stuck at this point.

smgr.c:
SMgrRelation
smgropen(RelFileNode rnode, BackendId backend)
{
...
	if (HasOption(rnode, "my_option", "value"))	/* << how to implement this check? */
	{
		reln->smgr_which = 1;	/* new access method */
	}
	else
	{
		reln->smgr_which = 0;	/* old access method */
	}
...
}


The question is: can I read table options while the table is
identified by "RelFileNode rnode"?

The only available information is
typedef struct RelFileNode
{
  Oid spcNode; /* tablespace */
  Oid dbNode; /* database */
  Oid relNode; /* relation */
} RelFileNode;

But there are no table options available directly from this structure.
What is the best way to implement HasOption(rnode, "my_option","value")

Thank you in advance for any ideas.
Sincerely,
Dmitry R


Re: Transaction timeout

2024-02-22 Thread Andrey M. Borodin


> On 19 Feb 2024, at 15:17, Japin Li  wrote:
> 
> 
> +1

PFA patch set of 4 patches:
1. remove all potential flaky tests. BTW recently we had a bingo when 3 of them 
failed together [0]
2-3. waiting injection points patchset by Michael Paquier, intact v2 from 
nearby thread.
4. prototype of simple TAP tests for timeouts.

I did not add a test for statement_timeout, because it still has good coverage 
in isolation tests, but I added a test for idle_session_timeout.
Maybe these tests could be implemented with NOTICE injection points (not 
requiring steps 2-3), but I'm afraid that they might be flaky too: a FATALed 
connection might not send the information necessary for test success (we would 
see something like "PQconsumeInput failed: server closed the connection 
unexpectedly" as in [1]).


Best regards, Andrey Borodin.

[0] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-02-20%2010%3A20%3A13
[1] 
https://www.postgresql.org/message-id/flat/CAAhFRxiQsRs2Eq5kCo9nXE3HTugsAAJdSQSmxncivebAxdmBjQ%40mail.gmail.com



0001-Remove-flacky-isolation-tests-for-timeouts.patch
Description: Binary data


0004-Add-timeouts-TAP-tests.patch
Description: Binary data


0003-Add-regression-test-for-restart-points-during-promot.patch
Description: Binary data


0002-injection_points-Add-routines-to-wait-and-wake-proce.patch
Description: Binary data


Re: Experiments with Postgres and SSL

2024-02-22 Thread Heikki Linnakangas

On 22/02/2024 01:43, Matthias van de Meent wrote:

On Wed, 10 Jan 2024 at 09:31, Heikki Linnakangas  wrote:

4. The number of combinations of sslmode, gssencmode and sslnegotiation
settings is scary. And we have very few tests for them.


Yeah, it's not great. We could easily automate this better though. I
mean, can't we run the tests using a "cube" configuration, i.e. test
every combination of parameters? We would use a mapping function of
(psql connection parameter values -> expectations), which would be
along the lines of the attached pl testfile. I feel it's a bit more
approachable than the lists of manual option configurations, and makes
it a bit easier to program the logic of which connection security
option we should have used to connect.
The attached file would be a drop-in replacement; it's tested to work
with SSL only - without GSS - because I've been having issues getting
GSS working on my machine.


+1 testing all combinations. I don't think the 'mapper' function 
approach in your version is much better than the original though. Maybe 
it would be better with just one 'mapper' function that contains all the 
rules, along the lines of: (This isn't valid perl, just pseudo-code)


sub expected_outcome
{
my ($user, $sslmode, $negotiation, $gssmode) = @_;

my @possible_outcomes = { 'plain', 'ssl', 'gss' }

delete $possible_outcomes{'plain'} if $sslmode eq 'require';
delete $possible_outcomes{'ssl'} if $sslmode eq 'disable';

delete $possible_outcomes{'plain'} if $user eq 'ssluser';
delete $possible_outcomes{'plain'} if $user eq 'ssluser';

if $sslmode eq 'allow' {
# move 'plain' before 'ssl' in the list
}
if $sslmode eq 'prefer' {
# move 'ssl' before 'plain' in the list
}

# more rules here


# If there are no outcomes left in $possible_outcomes, return 'fail'
# If there's exactly one outcome left, return that.
# If there's more, return the first one.
}


Or maybe a table that lists all the combinations and the expected 
outcome. Something like this:


nossluser   nogssuser   ssluser gssuser 
sslmode=require fail...
sslmode=prefer  plain
sslmode=disable plain


The problem is that there are more than two dimensions. So maybe an 
exhaustive list like this:


usersslmode gssmode outcome

nossluser   require disable fail
nossluser   prefer  disable plain
nossluser   disable disable plain
ssluser require disable ssl
...


I'm just throwing around ideas here, can you experiment with different 
approaches and see what looks best?



ALPN


Does the TLS ALPN spec allow protocol versions in the protocol tag? It
would be very useful to detect clients with new capabilities at the
first connection, rather than having to wait for one round trip, and
would allow one avenue for changing the protocol version.


Looking at the list of registered ALPN tags [0], I can see "http/0.9", 
"http/1.0" and "http/1.1". I think we'd want to avoid changing the major 
protocol version in a way that would introduce a new roundtrip, though.


--
Heikki Linnakangas
Neon (https://neon.tech)





Potential issue in ecpg-informix decimal converting functions

2024-02-22 Thread a . imamov

Hi, everyone!

I found a potential bug in dectoint() and dectolong() functions from
informix.c. "Informix Compatibility Mode" doc chapter says that
ECPG_INFORMIX_NUM_OVERFLOW is returned if an overflow occurred. But
check this line in dectoint() or dectolong() (it is present in both):
if (ret == PGTYPES_NUM_OVERFLOW) - condition is always
false because PGTYPESnumeric_to_int() and PGTYPESnumeric_to_long()
functions return only 0 or -1. So ECPG_INFORMIX_NUM_OVERFLOW can never
be returned.

I think dectoint(), dectolong() and PGTYPESnumeric_to_int() functions
should be a little bit different like in proposing patch.
What do you think?

The flaw was caught with the help of the Svace static analyzer.
https://svace.pages.ispras.ru/svace-website/en/

Thank you!

diff --git a/doc/src/sgml/ecpg.sgml b/doc/src/sgml/ecpg.sgml
index 73351a9136..8390a61cf4 100644
--- a/doc/src/sgml/ecpg.sgml
+++ b/doc/src/sgml/ecpg.sgml
@@ -8882,7 +8882,7 @@ int dectodbl(decimal *np, double *dblp);
   dectoint
   

-Convert a variable to type decimal to an integer.
+Convert a variable of type decimal to an integer.
 
 int dectoint(decimal *np, int *ip);
 
@@ -8892,8 +8892,9 @@ int dectoint(decimal *np, int *ip);


 On success, 0 is returned and a negative value if the conversion
-failed. If an overflow occurred, ECPG_INFORMIX_NUM_OVERFLOW
-is returned.
+failed. If overflow or underflow occurred, the function returns
+ECPG_INFORMIX_NUM_OVERFLOW or
+ECPG_INFORMIX_NUM_UNDERFLOW respectively.


 Note that the ECPG implementation differs from the Informix
@@ -8908,7 +8909,7 @@ int dectoint(decimal *np, int *ip);
   dectolong
   

-Convert a variable to type decimal to a long integer.
+Convert a variable of type decimal to a long integer.
 
 int dectolong(decimal *np, long *lngp);
 
@@ -8918,8 +8919,9 @@ int dectolong(decimal *np, long *lngp);


 On success, 0 is returned and a negative value if the conversion
-failed. If an overflow occurred, ECPG_INFORMIX_NUM_OVERFLOW
-is returned.
+failed. If overflow or underflow occurred, the function returns
+ECPG_INFORMIX_NUM_OVERFLOW or
+ECPG_INFORMIX_NUM_UNDERFLOW respectively.


 Note that the ECPG implementation differs from the Informix
diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c
index 80d40aa3e0..73e84bc6e3 100644
--- a/src/interfaces/ecpg/compatlib/informix.c
+++ b/src/interfaces/ecpg/compatlib/informix.c
@@ -435,6 +435,7 @@ dectoint(decimal *np, int *ip)
 {
 	int			ret;
 	numeric*nres = PGTYPESnumeric_new();
+	int			errnum = 0;
 
 	if (nres == NULL)
 		return ECPG_INFORMIX_OUT_OF_MEMORY;
@@ -445,11 +446,18 @@ dectoint(decimal *np, int *ip)
 		return ECPG_INFORMIX_OUT_OF_MEMORY;
 	}
 
+	errno = 0;
 	ret = PGTYPESnumeric_to_int(nres, ip);
+	errnum = errno;
 	PGTYPESnumeric_free(nres);
 
-	if (ret == PGTYPES_NUM_OVERFLOW)
-		ret = ECPG_INFORMIX_NUM_OVERFLOW;
+	if (ret)
+	{
+		if (errnum == PGTYPES_NUM_UNDERFLOW)
+			ret = ECPG_INFORMIX_NUM_UNDERFLOW;
+		else if (errnum == PGTYPES_NUM_OVERFLOW)
+			ret = ECPG_INFORMIX_NUM_OVERFLOW;
+	}
 
 	return ret;
 }
@@ -459,6 +467,7 @@ dectolong(decimal *np, long *lngp)
 {
 	int			ret;
 	numeric*nres = PGTYPESnumeric_new();
+	int			errnum = 0;
 
 	if (nres == NULL)
 		return ECPG_INFORMIX_OUT_OF_MEMORY;
@@ -469,11 +478,18 @@ dectolong(decimal *np, long *lngp)
 		return ECPG_INFORMIX_OUT_OF_MEMORY;
 	}
 
+	errno = 0;
 	ret = PGTYPESnumeric_to_long(nres, lngp);
+	errnum = errno;
 	PGTYPESnumeric_free(nres);
 
-	if (ret == PGTYPES_NUM_OVERFLOW)
-		ret = ECPG_INFORMIX_NUM_OVERFLOW;
+	if (ret)
+	{
+		if (errnum == PGTYPES_NUM_UNDERFLOW)
+			ret = ECPG_INFORMIX_NUM_UNDERFLOW;
+		else if (errnum == PGTYPES_NUM_OVERFLOW)
+			ret = ECPG_INFORMIX_NUM_OVERFLOW;
+	}
 
 	return ret;
 }
diff --git a/src/interfaces/ecpg/pgtypeslib/numeric.c b/src/interfaces/ecpg/pgtypeslib/numeric.c
index 35e7b92da4..52b49e1ce9 100644
--- a/src/interfaces/ecpg/pgtypeslib/numeric.c
+++ b/src/interfaces/ecpg/pgtypeslib/numeric.c
@@ -1502,7 +1502,12 @@ PGTYPESnumeric_to_int(numeric *nv, int *ip)
 /* silence compilers that might complain about useless tests */
 #if SIZEOF_LONG > SIZEOF_INT
 
-	if (l < INT_MIN || l > INT_MAX)
+	if (l < INT_MIN)
+	{
+		errno = PGTYPES_NUM_UNDERFLOW;
+		return -1;
+	}
+	else if(l > INT_MAX)
 	{
 		errno = PGTYPES_NUM_OVERFLOW;
 		return -1;
diff --git a/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout b/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout
index 1f8675b3f3..71a5afa4a7 100644
--- a/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout
+++ b/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout
@@ -3,8 +3,8 @@
 (no errno set) - dec[0,3]: r: -1, *
 (no errno set) - dec[0,4]: r: -1, *
 dec[0,5]: r: 0, 0.00

Re: Sequence Access Methods, round two

2024-02-22 Thread Tomas Vondra
Hi Michael,

I took a quick look at this patch series, mostly to understand how it
works and how it might interact with the logical decoding patches
discussed in a nearby thread.

First, some general review comments:

0001
--

I think this bit in pg_proc.dat is not quite right:

  proallargtypes => '{regclass,bool,int8}', proargmodes => '{i,o,o}',
  proargnames => '{seqname,is_called,last_value}',

the first argument should not be "seqname" but rather "seqid".


0002, 0003

seems fine, cosmetic changes


0004
--

I don't understand this bit in AlterSequence:

last_value = newdataform->last_value;
is_called = newdataform->is_called;

UnlockReleaseBuffer(buf);

/* Check and set new values */
init_params(pstate, stmt->options, stmt->for_identity, false,
seqform, _value, _state, _called,
_seq_rewrite, _by);

Why set the values only to pass them to init_params(), which will just
overwrite them anyway? Or do I get this wrong?

Also, isn't "reset_state" just a different name for (original) log_cnt?


0005
--

I don't quite understand what "elt" stands for :-(

stmt->tableElts = NIL;

Do we need AT_AddColumnToSequence? It seems to work exactly like
AT_AddColumn. OTOH we do have AT_AddColumnToView too ...

Thinking about this code:

case T_CreateSeqStmt:
EventTriggerAlterTableStart(parsetree);
address = DefineSequence(pstate, (CreateSeqStmt *) parsetree);
/* stashed internally */
commandCollected = true;
EventTriggerAlterTableEnd();
break;

Does this actually make sense? I mean, are sequences really relations?
Or was that just a side effect of storing the state in a heap table
(which is more of an implementation detail)?


0006
--
no comment, just moving code


0007
--
I wonder why heap_create_with_catalog needs to do this (check that it's
a sequence):

if ((RELKIND_HAS_TABLE_AM(relkind) && relkind != RELKIND_TOASTVALUE) ||
relkind == RELKIND_SEQUENCE)

Presumably this is to handle sequences that use heap to store the state?
Maybe the comment should explain that. Also, will the other table AMs
need to do something similar, just in case some sequence happens to use
that table AM (which seems out of control of the table AM)?

I don't understand why DefineSequence need to copy the string:

stmt->accessMethod = seq->accessMethod ? pstrdup(seq->accessMethod)
: NULL;

RelationInitTableAccessMethod now does not need to handle sequences, or
rather should not be asked to handle sequences. Is there a risk we'd
pass a sequence to the function anyway? Maybe an assert / error would be
appropriate?

This bit in RelationBuildLocalRelation looks a bit weird ...

if (RELKIND_HAS_TABLE_AM(relkind))
RelationInitTableAccessMethod(rel);
else if (relkind == RELKIND_SEQUENCE)
RelationInitSequenceAccessMethod(rel);

It's not a fault of this patch, but shouldn't we now have something like
RELKIND_HAS_SEQUENCE_AM()?
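
Something along these lines, as a hypothetical sketch mirroring the existing
RELKIND_HAS_TABLE_AM() style (not part of the patch):

#define RELKIND_HAS_SEQUENCE_AM(relkind)  ((relkind) == RELKIND_SEQUENCE)

	if (RELKIND_HAS_TABLE_AM(relkind))
		RelationInitTableAccessMethod(rel);
	else if (RELKIND_HAS_SEQUENCE_AM(relkind))
		RelationInitSequenceAccessMethod(rel);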


0008-0010
---
no comment


logical decoding / replication

Now, regarding the logical decoding / replication, would introducing the
sequence AM interfere with that in some way? Either in general, or with
respect to the nearby patch.

That is, what would it take to support logical replication of sequences
with some custom sequence AM? I believe that requires (a) synchronizing
the initial value, and (b) decoding the sequence WAL and (c) apply the
decoded changes. I don't think the sequence AM breaks any of this, as
long as it allows selecting "current value", decoding the values from
WAL, sending them to the subscriber, etc.

I guess the decoding would be up to the RMGR, and this patch maintains
the 1:1 mapping of sequences to relfilenodes, right? (That is, CREATE
and ALTER SEQUENCE would still create a new relfilenode, which is rather
important to decide if a sequence change is transactional.)

It seems to me this does not change the non-transactional behavior of
sequences, right?


alternative sequence AMs
--
I understand one of the reasons for adding sequence AMs is to allow
stuff like global/distributed sequences, etc. But will people actually
use that?

For example, I believe Simon originally proposed this in 2016 because
the plan was to implement distributed sequences in BDR on top of it. But
I believe BDR ultimately went with a very different approach, not with
custom sequence AMs. So I'm a bit skeptical this being suitable for
other active-active systems ...

Especially when the general consensus seems to be that for active-active
systems it's much better to use e.g. UUID, because that does not require
any coordination between the nodes, etc.

I'm not claiming there are no use cases for sequence AMs, of course. For
example the PRNG-based sequences mentioned by Mattias seems interesting.
I don't know how widely useful that is, though, and if it's worth it
(considering they managed to 

Re: speed up a logical replica setup

2024-02-22 Thread vignesh C
On Thu, 22 Feb 2024 at 21:17, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Vignesh,
>
> > Few comments on the tests:
> > 1) If the dry run was successful because of some issue then the server
> > will be stopped so we can check for "pg_ctl status" if the server is
> > running otherwise the connection will fail in this case. Another way
> > would be to check if it does not have "postmaster was stopped"
> > messages in the stdout.
> > +
> > +# Check if node S is still a standby
> > +is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'),
> > +   't', 'standby is in recovery');
>
> Just to confirm - your suggestion is to add `pg_ctl status`, right? Added.

Yes, that is correct.

> > 2) Can we add verification of  "postmaster was stopped" messages in
> > the stdout for dry run without --databases testcase
> > +# pg_createsubscriber can run without --databases option
> > +command_ok(
> > +   [
> > +   'pg_createsubscriber', '--verbose',
> > +   '--dry-run', '--pgdata',
> > +   $node_s->data_dir, '--publisher-server',
> > +   $node_p->connstr('pg1'), '--subscriber-server',
> > +   $node_s->connstr('pg1')
> > +   ],
> > +   'run pg_createsubscriber without --databases');
> > +
>
> Hmm, in case of --dry-run, the server would be never shut down.
> See below part.
>
> ```
> if (!dry_run)
> stop_standby_server(pg_ctl_path, opt.subscriber_dir);
> ```

One way to differentiate whether the server was run in dry_run mode or
not is to check whether the server was stopped. That is, we can
check that the stdout does not contain a "postmaster was stopped" message.
Can we add validation based on this code:
+   if (action)
+   pg_log_info("postmaster was started");

Or another way is to check pg_ctl status to see that the server is not shut down.

> > 3) This message "target server must be running" seems to be wrong,
> > should it be cannot specify cascading replicating standby as standby
> > node(this is for v22-0011 patch :
> > +   'pg_createsubscriber', '--verbose', '--pgdata',
> > $node_c->data_dir,
> > +   '--publisher-server', $node_s->connstr('postgres'),
> > +   '--port', $node_c->port, '--socketdir', $node_c->host,
> > +   '--database', 'postgres'
> > ],
> > -   'primary server is in recovery');
> > +   1,
> > +   [qr//],
> > +   [qr/primary server cannot be in recovery/],
> > +   'target server must be running');
>
> Fixed.
>
> > 4) Should this be "Wait for subscriber to catch up"
> > +# Wait subscriber to catch up
> > +$node_s->wait_for_subscription_sync($node_p, $subnames[0]);
> > +$node_s->wait_for_subscription_sync($node_p, $subnames[1]);
>
> Fixed.
>
> > 5) Should this be 'Subscriptions has been created on all the specified
> > databases'
> > +);
> > +is($result, qq(2),
> > +   'Subscriptions has been created to all the specified databases'
> > +);
>
> Fixed, but "has" should be "have".
>
> > 6) Add test to verify current_user is not a member of
> > ROLE_PG_CREATE_SUBSCRIPTION, has no create permissions, has no
> > permissions to execution replication origin advance functions
> >
> > 7) Add tests to verify insufficient max_logical_replication_workers,
> > max_replication_slots and max_worker_processes for the subscription
> > node
> >
> > 8) Add tests to verify invalid configuration of  wal_level,
> > max_replication_slots and max_wal_senders for the publisher node
>
> Hmm, but adding these checks may increase the test time. we should
> measure the time and then decide.

We can check whether it takes significantly more time; if not, then
we can have these tests.

Regards,
Vignesh




Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-02-22 Thread Tom Lane
Dean Rasheed  writes:
> Attached is a patch that prevents UNION ALL subquery pullup in MERGE only.

I think that this is a band-aid that's just masking an error in the
rowmarking logic: it's not doing the right thing for appendrels
made from UNION ALL subqueries.  I've not wrapped my head around
exactly where it's going off the rails, but I feel like this ought
to get fixed somewhere else.  Please hold off pushing your patch
for now.

(The test case looks valuable though, thanks for working on that.)

regards, tom lane




RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

> Few comments on the tests:
> 1) If the dry run was successful because of some issue then the server
> will be stopped so we can check for "pg_ctl status" if the server is
> running otherwise the connection will fail in this case. Another way
> would be to check if it does not have "postmaster was stopped"
> messages in the stdout.
> +
> +# Check if node S is still a standby
> +is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'),
> +   't', 'standby is in recovery');

Just to confirm - your suggestion is to add `pg_ctl status`, right? Added.

> 2) Can we add verification of  "postmaster was stopped" messages in
> the stdout for dry run without --databases testcase
> +# pg_createsubscriber can run without --databases option
> +command_ok(
> +   [
> +   'pg_createsubscriber', '--verbose',
> +   '--dry-run', '--pgdata',
> +   $node_s->data_dir, '--publisher-server',
> +   $node_p->connstr('pg1'), '--subscriber-server',
> +   $node_s->connstr('pg1')
> +   ],
> +   'run pg_createsubscriber without --databases');
> +

Hmm, in the case of --dry-run, the server would never be shut down.
See below part.

```
if (!dry_run)
stop_standby_server(pg_ctl_path, opt.subscriber_dir);
```

> 3) This message "target server must be running" seems to be wrong,
> should it be cannot specify cascading replicating standby as standby
> node(this is for v22-0011 patch :
> +   'pg_createsubscriber', '--verbose', '--pgdata',
> $node_c->data_dir,
> +   '--publisher-server', $node_s->connstr('postgres'),
> +   '--port', $node_c->port, '--socketdir', $node_c->host,
> +   '--database', 'postgres'
> ],
> -   'primary server is in recovery');
> +   1,
> +   [qr//],
> +   [qr/primary server cannot be in recovery/],
> +   'target server must be running');

Fixed.

> 4) Should this be "Wait for subscriber to catch up"
> +# Wait subscriber to catch up
> +$node_s->wait_for_subscription_sync($node_p, $subnames[0]);
> +$node_s->wait_for_subscription_sync($node_p, $subnames[1]);

Fixed.

> 5) Should this be 'Subscriptions has been created on all the specified
> databases'
> +);
> +is($result, qq(2),
> +   'Subscriptions has been created to all the specified databases'
> +);

Fixed, but "has" should be "have".

> 6) Add test to verify current_user is not a member of
> ROLE_PG_CREATE_SUBSCRIPTION, has no create permissions, has no
> permissions to execution replication origin advance functions
> 
> 7) Add tests to verify insufficient max_logical_replication_workers,
> max_replication_slots and max_worker_processes for the subscription
> node
> 
> 8) Add tests to verify invalid configuration of  wal_level,
> max_replication_slots and max_wal_senders for the publisher node

Hmm, but adding these checks may increase the test time. We should
measure the time and then decide.

> 9) We can use the same node name in comment and for the variable
> +# Set up node P as primary
> +$node_p = PostgreSQL::Test::Cluster->new('node_p');
> +$node_p->init(allows_streaming => 'logical');
> +$node_p->start;

Fixed.

> 10) Similarly we can use node_f instead of F in the comments.
> +# Set up node F as about-to-fail node
> +# Force it to initialize a new cluster instead of copying a
> +# previously initdb'd cluster.
> +{
> +   local $ENV{'INITDB_TEMPLATE'} = undef;
> +
> +   $node_f = PostgreSQL::Test::Cluster->new('node_f');
> +   $node_f->init(allows_streaming => 'logical');
> +   $node_f->start;
>

Fixed. Also, a recent commit [1] allows running initdb forcibly, so I followed that.

The new patch is available in [2].

[1]: 
https://github.com/postgres/postgres/commit/ff9e1e764fcce9a34467d614611a34d4d2a91b50
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

> Few comments:
> 1) The below code can lead to assertion failure if the publisher is
> stopped while dropping the replication slot:
> +   if (primary_slot_name != NULL)
> +   {
> +   conn = connect_database(dbinfo[0].pubconninfo);
> +   if (conn != NULL)
> +   {
> +   drop_replication_slot(conn, [0],
> primary_slot_name);
> +   }
> +   else
> +   {
> +   pg_log_warning("could not drop replication
> slot \"%s\" on primary",
> +  primary_slot_name);
> +   pg_log_warning_hint("Drop this replication
> slot soon to avoid retention of WAL files.");
> +   }
> +   disconnect_database(conn);
> +   }
> 
> pg_createsubscriber: error: connection to database failed: connection
> to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or
> directory
> Is the server running locally and accepting connections on that socket?
> pg_createsubscriber: warning: could not drop replication slot
> "standby_1" on primary
> pg_createsubscriber: hint: Drop this replication slot soon to avoid
> retention of WAL files.
> pg_createsubscriber: pg_createsubscriber.c:432: disconnect_database:
> Assertion `conn != ((void *)0)' failed.
> Aborted (core dumped)
> 
> This is happening because we are calling disconnect_database in case
> of connection failure case too which has the following assert:
> +static void
> +disconnect_database(PGconn *conn)
> +{
> +   Assert(conn != NULL);
> +
> +   PQfinish(conn);
> +}

Right. disconnect_database() was moved into the if (conn != NULL) block.
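
For clarity, a sketch of the corrected control flow (the &dbinfo[0] argument
is reconstructed here, since the quoted snippet above lost it in transit):

	if (primary_slot_name != NULL)
	{
		conn = connect_database(dbinfo[0].pubconninfo);
		if (conn != NULL)
		{
			drop_replication_slot(conn, &dbinfo[0], primary_slot_name);
			disconnect_database(conn);	/* only called on a live connection */
		}
		else
		{
			pg_log_warning("could not drop replication slot \"%s\" on primary",
						   primary_slot_name);
			pg_log_warning_hint("Drop this replication slot soon to avoid retention of WAL files.");
		}
	}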

> 2) There is a CheckDataVersion function which does exactly this, will
> we be able to use this:
> +   /* Check standby server version */
> +   if ((ver_fd = fopen(versionfile, "r")) == NULL)
> +   pg_fatal("could not open file \"%s\" for reading: %m",
> versionfile);
> +
> +   /* Version number has to be the first line read */
> +   if (!fgets(rawline, sizeof(rawline), ver_fd))
> +   {
> +   if (!ferror(ver_fd))
> +   pg_fatal("unexpected empty file \"%s\"", versionfile);
> +   else
> +   pg_fatal("could not read file \"%s\": %m", 
> versionfile);
> +   }
> +
> +   /* Strip trailing newline and carriage return */
> +   (void) pg_strip_crlf(rawline);
> +
> +   if (strcmp(rawline, PG_MAJORVERSION) != 0)
> +   {
> +   pg_log_error("standby server is of wrong version");
> +   pg_log_error_detail("File \"%s\" contains \"%s\",
> which is not compatible with this program's version \"%s\".",
> +   versionfile,
> rawline, PG_MAJORVERSION);
> +   exit(1);
> +   }
> +
> +   fclose(ver_fd);


> 3) Should this be added to typedefs.list:
> +enum WaitPMResult
> +{
> +   POSTMASTER_READY,
> +   POSTMASTER_STILL_STARTING
> +};

But the comment from Peter E. [1] said the opposite, so I did not handle this.

> 4) pgCreateSubscriber should be mentioned after pg_controldata to keep
> the ordering consistency:
> diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
> index aa94f6adf6..c5edd244ef 100644
> --- a/doc/src/sgml/reference.sgml
> +++ b/doc/src/sgml/reference.sgml
> @@ -285,6 +285,7 @@
> 
> 
> 
> +   
> 

This has already been pointed out by Peter E.; I did not handle this.

> 5) Here pg_replication_slots should be pg_catalog.pg_replication_slots:
> +   if (primary_slot_name)
> +   {
> +   appendPQExpBuffer(str,
> + "SELECT 1 FROM
> pg_replication_slots "
> + "WHERE active AND
> slot_name = '%s'",
> + primary_slot_name);

Fixed.

> 6) Here pg_settings should be pg_catalog.pg_settings:
> +* - max_worker_processes >= 1 + number of dbs to be converted
> +
> *
> +*/
> +   res = PQexec(conn,
> +"SELECT setting FROM pg_settings
> WHERE name IN ("
> +"'max_logical_replication_workers', "
> +"'max_replication_slots', "
> +"'max_worker_processes', "
> +"'primary_slot_name') "
> +"ORDER BY name");

Fixed.
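
For reference, the pg_catalog-qualified forms would look like:

	appendPQExpBuffer(str,
					  "SELECT 1 FROM pg_catalog.pg_replication_slots "
					  "WHERE active AND slot_name = '%s'",
					  primary_slot_name);

	res = PQexec(conn,
				 "SELECT setting FROM pg_catalog.pg_settings WHERE name IN ("
				 "'max_logical_replication_workers', "
				 "'max_replication_slots', "
				 "'max_worker_processes', "
				 "'primary_slot_name') "
				 "ORDER BY name");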

The new version is available in [2].

[1]: 
https://www.postgresql.org/message-id/3ee79f2c-e8b3-4342-857c-a31b87e1afda%40eisentraut.org
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

> Few comments regarding the documentation:
> 1) max_replication_slots information seems to be present couple of times:
> 
> +
> + The target instance must have
> +  linkend="guc-max-replication-slots">max_replication_slots me>
> + and  linkend="guc-max-logical-replication-workers">max_logical_replica
> tion_workers
> + configured to a value greater than or equal to the number of target
> + databases.
> +
>
> +   
> +
> + The target instance must have
> +  linkend="guc-max-replication-slots">max_replication_slots me>
> + configured to a value greater than or equal to the number of target
> + databases and replication slots.
> +
> +   

Fixed.

> 2) Can we add an id to prerequisites and use it instead of referring
> to r1-app-pg_createsubscriber-1:
> - pg_createsubscriber checks if the
> given target data
> - directory has the same system identifier than the source data directory.
> - Since it uses the recovery process as one of the steps, it starts the
> - target server as a replica from the source server. If the system
> - identifier is not the same,
> pg_createsubscriber will
> - terminate with an error.
> + Checks the target can be converted.  In particular, things listed in
> + above section
> would be
> + checked.  If these are not met
> pg_createsubscriber
> + will terminate with an error.
>  

Changed.

> 3) The code also checks the following:
>  Verify if a PostgreSQL binary (progname) is available in the same
> directory as pg_createsubscriber.
> 
> But this is not present in the pre-requisites of documentation.

I think it is quite trivial, so I did not add it.

> 4) Here we mention that the target server should be stopped, but the
> same is not mentioned in prerequisites:
> +   Here is an example of using
> pg_createsubscriber.
> +   Before running the command, please make sure target server is stopped.
> +
> +$ pg_ctl -D /usr/local/pgsql/data
> stop
> +
> +

Oh, it is the opposite: it should NOT be stopped. Fixed.

> 5) If there is an error during any of the pg_createsubscriber
> operation like if create subscription fails, it might not be possible
> to rollback to the earlier state which had physical-standby
> replication. I felt we should document this and also add it to the
> console message like how we do in case of pg_upgrade.

Added.

The new version is available in [1].

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Alvaro,

> > 15.
> >
> > You said in case of failure, cleanups is not needed if the process exits 
> > soon [1].
> > But some functions call PQfinish() then exit(1) or pg_fatal(). Should we 
> > follow?
> 
> Hmm, but doesn't this mean that the server will log an ugly message that
> "client closed connection unexpectedly"?  I think it's nicer to close
> the connection before terminating the process (especially since the
> code for that is already written).

OK. So we should disconnect properly even if the process exits. I added the 
function call
again. Note that PQclear() was not added because it is only related with the 
application.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: btree: downlink right separator/HIKEY optimization

2024-02-22 Thread Matthias van de Meent
On Sat, 6 Jan 2024 at 16:40, vignesh C  wrote:
>
> CFBot shows the following compilation error at [1]:
> [16:56:22.153] FAILED:
> src/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj
> [...]
> ../src/backend/access/nbtree/nbtsearch.c
> [16:56:22.153] ../src/backend/access/nbtree/nbtsearch.c(112): error
> C2143: syntax error: missing ';' before 'type'
> [16:56:22.280] ../src/backend/access/nbtree/nbtsearch.c(112): warning
> C4091: ' ': ignored on left of 'int' when no variable is declared

I forgot to address this in the previous patch, so here's v3, which
fixes the issue.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patch
Description: Binary data


Re: Reducing output size of nodeToString

2024-02-22 Thread Matthias van de Meent
On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent
 wrote:
>
> On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
>  wrote:
> > Attached the updated version of the patch on top of 5497daf3, which
> > incorporates this last round of feedback.
>
> Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
> and an issue in the previous patchset: I attached one too many v3-0001
> from a previous patch I worked on.

... and now with a fix for not overwriting newly deserialized location
attributes with -1, which breaks test output for
READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant
changes since the patch of last Monday.

Sorry for the noise,

-Matthias


v7-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch
Description: Binary data


v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
Description: Binary data


v7-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
Description: Binary data


v7-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch
Description: Binary data


v7-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch
Description: Binary data


v7-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch
Description: Binary data


v7-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch
Description: Binary data


v7-0008-nodeToString-omit-serializing-0s-in-enum-typed-fi.patch
Description: Binary data


Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Tom Lane
Peter Eisentraut  writes:
> The problem is, we don't really have any end-to-end coverage of

> dump
> restore
> dump again
> compare the two dumps

> with a database with lots of interesting objects in it.

I'm very much against adding another full run of the core regression
tests to support this.  But beyond the problem of not bloating the
check-world test runtime, there is the question of what this would
actually buy us.  I doubt that it is worth very much, because
it would not detect bugs-of-omission in pg_dump.  As I remarked in
the other thread, if pg_dump is blind to the existence of some
feature or field, testing that the dumps compare equal will fail
to reveal that it didn't restore that property.

I'm not sure what we could do about that.  One could imagine writing
some test infrastructure that dumps out the contents of the system
catalogs directly, and comparing that instead of pg_dump output.
But that'd be a lot of infrastructure to write and maintain ...
and it's not real clear why it wouldn't *also* suffer from
I-forgot-to-add-this hazards.

On balance, I think there are good reasons that we've not added
such a test, and I don't believe those tradeoffs have changed.

regards, tom lane




Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c

2024-02-22 Thread Daniel Gustafsson
> On 6 Dec 2023, at 14:03, Nazir Bilal Yavuz  wrote:

> There is an ongoing thread [1] for adding missing SQL error codes to
> PANIC and FATAL error reports in xlogrecovery.c file. I did the same
> but for xlog.c and relcache.c files.

-   elog(PANIC, "space reserved for WAL record does not match what was 
written");
+   ereport(PANIC,
+   (errcode(ERRCODE_DATA_CORRUPTED),
+errmsg("space reserved for WAL record does not match 
what was written")));

elogs turned into ereports should use errmsg_internal() to keep the strings
from being translated.

-   elog(FATAL, "could not write init file");
+   ereport(FATAL,
+   (errcode_for_file_access(),
+errmsg("could not write init file")));

Is it worthwhile adding %m on these to get a little more help when debugging
errors that shouldn't happen?

-   elog(FATAL, "could not write init file");
+   ereport(FATAL,
+   (errcode_for_file_access(),

The extra parentheses are no longer needed. I don't know if we have a policy to
remove them when changing an ereport call, but we should at least not introduce
new ones.

-   elog(FATAL, "cannot read pg_class without having selected a database");
+   ereport(FATAL,
+   (errcode(ERRCODE_INTERNAL_ERROR),

ereport (and thus elog) already defaults to ERRCODE_INTERNAL_ERROR for ERROR or
higher, so unless there is a better errcode an elog() call is preferable here.

> I couldn't find a suitable error code for the "cache lookup failed for
> relation" error in relcache.c and this error comes up in many places.
> Would it be reasonable to create a new error code specifically for
> this?

We use ERRCODE_UNDEFINED_OBJECT for similar errors elsewhere, perhaps we can
use that for these as well?

--
Daniel Gustafsson





Re: Documentation to upgrade logical replication cluster

2024-02-22 Thread vignesh C
On Thu, 22 Feb 2024 at 09:35, Peter Smith  wrote:
>
> Here are some minor comments for patch v8-0001.
>
> ==
> doc/src/sgml/glossary.sgml
>
> 1.
> +   
> +
> + A set of publisher and subscriber instances with publisher instance
> + replicating changes to the subscriber instance.
> +
> +   
>
> /with publisher instance/with the publisher instance/

Modified

> ~~~
>
> 2.
> There are 2 SQL fragments but they are wrapped differently (see
> below). e.g. it is not clear why is the 2nd fragment wrapped since it
> is shorter than the 1st. OTOH, maybe you want the 1st fragment to
> wrap. Either way, consistency wrapping would be better.

Modified

Thanks for the comments, the attached v9 version patch has the changes
for the same.

I have added a commitfest entry for this:
https://commitfest.postgresql.org/47/4848/

Regards,
Vignesh
From 3ff1910ec221c4e5f2e1eb8291848528d6ec479a Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Tue, 30 Jan 2024 08:55:20 +0530
Subject: [PATCH v9] Documentation for upgrading logical replication cluster.

Documentation for upgrading logical replication cluster.
---
 doc/src/sgml/glossary.sgml|  10 +
 doc/src/sgml/logical-replication.sgml | 821 ++
 doc/src/sgml/ref/pgupgrade.sgml   | 131 +---
 3 files changed, 839 insertions(+), 123 deletions(-)

diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 8c2f11480d..45cea16e8f 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1068,6 +1068,16 @@

   
 
+  
+   Logical replication cluster
+   
+
+ A set of publisher and subscriber instances with the publisher instance
+ replicating changes to the subscriber instance.
+
+   
+  
+
   
Log record
 
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index ec2130669e..d1eebe60e8 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1926,6 +1926,827 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
 
  
 
+ 
+  Upgrade
+
+  
+   Migration of logical replication clusters
+   is possible only when all the members of the old logical replication
+   clusters are version 17.0 or later.
+  
+
+  
+   Prepare for publisher upgrades
+
+   
+ attempts to migrate logical
+slots. This helps avoid the need for manually defining the same
+logical slots on the new publisher. Migration of logical slots is
+only supported when the old cluster is version 17.0 or later.
+Logical slots on clusters before version 17.0 will silently be
+ignored.
+   
+
+   
+Before you start upgrading the publisher cluster, ensure that the
+subscription is temporarily disabled to avoid the subscriber connection
+failures during publisher upgrade, by executing
+ALTER SUBSCRIPTION ... DISABLE.
+Re-enable the subscription after the upgrade.
+   
+
+   
+The following prerequisites are required for pg_upgrade
+to be able to upgrade the logical slots. If these are not met an error
+will be reported.
+   
+
+   
+
+ 
+  The new cluster must have
+  wal_level as
+  logical.
+ 
+
+
+ 
+  The new cluster must have
+  max_replication_slots
+  configured to a value greater than or equal to the number of slots
+  present in the old cluster.
+ 
+
+
+ 
+  The output plugins referenced by the slots on the old cluster must be
+  installed in the new PostgreSQL executable
+  directory.
+ 
+
+
+ 
+  The old cluster must have replicated all the transactions and logical
+  decoding messages to subscribers.
+ 
+
+
+ 
+  All slots on the old cluster must be usable, i.e., there are no slots
+  whose
+  pg_replication_slots.conflict_reason
+  is not NULL, e.g.:
+
+postgres=# SELECT slot_name
+postgres-#   FROM pg_replication_slots
+postgres-#   WHERE slot_type = 'logical' AND conflict_reason IS NOT NULL;
+ slot_name
+---
+(0 rows)
+
+ 
+
+
+ 
+  The new cluster must not have permanent logical slots, i.e.,
+  there must be no slots listed with:
+
+SELECT count(*)
+  FROM pg_replication_slots
+  WHERE slot_type = 'logical' AND temporary IS false;
+
+ 
+
+   
+  
+
+  
+   Prepare for subscriber upgrades
+
+   
+Setup the 
+subscriber configurations in the new subscriber.
+   
+
+   
+pg_upgrade attempts to migrate subscription
+dependencies which includes the subscription's table information present in
+pg_subscription_rel
+system catalog and also the subscription's replication origin. This allows
+logical replication on the new subscriber to continue from where the
+old subscriber was up to. Migration of subscription dependencies is only
+supported when the old cluster is version 17.0 or later. Subscription
+dependencies on clusters before 

Re: btree: downlink right separator/HIKEY optimization

2024-02-22 Thread Matthias van de Meent
On Tue, 5 Dec 2023 at 08:43, Heikki Linnakangas  wrote:
>
> On 01/11/2023 00:08, Matthias van de Meent wrote:
> > By caching the right separator index tuple in _bt_search, we can
> > compare the downlink's right separator and the HIKEY, and when they
> > are equal (memcmp() == 0) we don't have to call _bt_compare - the
> > HIKEY is known to be larger than the scan key, because our key is
> > smaller than the right separator, and thus transitively also smaller
> > than the HIKEY because it contains the same data. As _bt_compare can
> > call expensive user-provided functions, this can be a large
> > performance boon, especially when there are only a small number of
> > column getting compared on each page (e.g. index tuples of many 100s
> > of bytes, or dynamic prefix truncation is enabled).
>
> What would be the worst case scenario for this? One situation where the
> memcmp() would not match is when there is a concurrent page split. I
> think it's OK to pessimize that case. Are there any other situations?

There is also concurrent page deletion which can cause downlinked
pages to get removed from the set of accessible pages, but that's
quite rare, too: arguably even more rare than page splits.

> When the memcmp() matches, I think this is almost certainly not slower
> than calling the datatype's comparison function.
>
> >   if (offnum < PageGetMaxOffsetNumber(page))
> > [...]
> >   else if (!P_RIGHTMOST(opaque))
> > [...]
> >   }
>
> This could use a one-line comment above this, something like "Remember
> the right separator of the downlink we follow, to speed up the next
> _bt_moveright call".

Done.

> Should there be an "else rightsep = NULL;" here? Is it possible that we
> follow the non-rightmost downlink on a higher level and rightmost
> downlink on next level? Concurrent page deletion?

While possible, the worst this could do is be less efficient in those
fringe cases: The cached right separator is a key that is known to
compare larger than the search key and thus always correct to use as
an optimization for "is this HIKEY larger than my search key", as long
as we don't clobber the data in that cache (which we don't).
Null-ing the argument, while not incorrect, could be argued to be
worse than useless here, as the only case where NULL may match an
actual highkey is on the rightmost page, which we already special-case
in _bt_moveright before hitting the 'compare the highkey' code.
Removal of the value would thus remove any chance of using the
optimization after hitting the rightmost page in a layer below.
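
To make that invariant concrete, here is a small standalone sketch of the fast
path being described (the name and signature are illustrative, not the actual
nbtree routines): when the cached right separator from the parent level is
byte-identical to this page's high key, the search key is already known to be
smaller, so the potentially expensive _bt_compare call can be skipped.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool
highkey_known_greater(const void *cached_rightsep, size_t cached_len,
                      const void *highkey, size_t highkey_len)
{
    /* a NULL cache (e.g. rightmost page) simply falls back to _bt_compare */
    return cached_rightsep != NULL &&
           cached_len == highkey_len &&
           memcmp(cached_rightsep, highkey, highkey_len) == 0;
}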

I've added a comment to explain this in an empty else block in the
attached version 2 of the patch.

> Please update the comment above _bt_moveright to describe the new
> argument. Perhaps the text from README should go there, this feels like
> a detail specific to _bt_search and _bt_moveright.

Done.

Thank you for the review.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v2-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patch
Description: Binary data


Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-22 Thread Давыдов Виталий

Dear All,
I'd like to present and discuss a problem where 2PC transactions are applied 
quite slowly on a replica during logical replication. There is a master and a 
replica with logical replication established from the master to the replica 
with twophase = true. Under some level of load on the master, the replica starts 
to lag behind, and the lag keeps increasing. We have to significantly decrease 
the load on the master to allow the replica to complete the catchup. Such a 
problem may create significant difficulties in production. The problem 
appears at least on the REL_16_STABLE branch.
To reproduce the problem:
 * Set up logical replication from master to replica with subscription parameter twophase = true.
 * Create some intermediate load on the master (use pgbench with a custom sql with prepare+commit).
 * Optionally switch off the replica for some time (keep the load on the master).
 * Switch on the replica and wait until it catches up with the master.
The replica will never catch up with the master even under some low load on the 
master. If the load is removed, the replica does catch up, but it takes much 
longer than expected. I tried the same for regular transactions, but such a 
problem doesn't appear even with a decent load.
I think the main cause of the bad 2PC catchup performance is the lack of 
asynchronous commit support for 2PC. For regular transactions, asynchronous 
commit is used on the replica by default (subscription synchronous_commit = off). 
It allows the replication worker process on the replica to avoid fsync 
(XLogFlush) and to utilize 100% CPU (the background wal writer or checkpointer 
will do the fsync). I agree that 2PC is mostly used in multimaster configurations with 
two or more nodes operating synchronously, but when a node is in 
catchup (the node is not online in the multimaster cluster), asynchronous commit has 
to be used to speed up the catchup.
There is another thing that contributes to the imbalance between master and replica 
performance. When the master executes requests from multiple clients, an 
fsync optimization takes place in XLogFlush: it decreases the number 
of fsyncs when a number of parallel backends write to the WAL 
simultaneously. The replica applies the received transactions sequentially in a single 
process, so this optimization does not apply.
I see some possible solutions:
 * Implement asynchronous commit for 2PC transactions.
 * Do some hacking with enableFsync where possible.
I think asynchronous commit support for 2PC transactions should significantly 
increase replica performance and help to solve this problem. I tried to 
implement it (like for regular transactions) but found another problem: the 2PC 
state is stored in WAL on prepare, and on commit we have to read the 2PC state back 
from WAL, but the read is delayed until the WAL is flushed by the background wal 
writer (the read LSN must be less than the flush LSN). Storing the 2PC state in 
shared memory (as proposed earlier) may help.

I used the following query to monitor the catchup progress on the master:

SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;

I used the following script for pgbench against the master:

SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;

What do you think?
 
With best regards,
Vitaly Davydov


Re: When extended query protocol ends?

2024-02-22 Thread Jelte Fennema-Nio
On Thu, 22 Feb 2024 at 13:01, Michael Zhilin  wrote:
> I would like to say this document states that "at completion... frontend 
> should issue a Sync message... causes the backend to close the current 
> transaction"
> It looks like the sense of wording is "to complete transaction" at the 
> eventual end of traffic, but not "to switch to single protocol".
> Otherwise, we can't use both protocols under same transaction that looks too 
> strict limitation.

Sync only closes the current transaction when you didn't explicitly
open one, i.e. when you're using an implicit transaction. If you
open an explicit transaction, then you can use both extended and
simple protocol messages in the same transaction. The way I understand
it is: multiple extended messages together form a single pipeline
(easier understood as a SQL statement), until a Sync is reached. So
Parse-Bind-Execute-Parse-Bind-Execute-Sync counts as a single unit
from Postgres's visibility/rollback behaviour standpoint.

And that's also where sending a Query instead of a Sync introduces a
problem: what is supposed to happen if something fails? Let's take the
simple example Bind-Execute-Query: if the Execute causes an error, is
the Query still executed or not? And if the Query fails, is the
Execute before it committed or rolled back?

That's why we have the Sync message: to clarify what happens when a
failure occurs somewhere in the pipeline. And only extended protocol
messages are allowed to be part of a pipeline: "Use of the extended
query protocol allows pipelining"
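
As a concrete illustration of that rule, here is a minimal libpq sketch
(connection string and queries are placeholders, error handling trimmed to the
bare minimum): the two PQsendQueryParams calls are sent as extended-query
messages, and PQpipelineSync emits the Sync that terminates the series before
the results are drained.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");  /* placeholder DSN */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    PQenterPipelineMode(conn);
    PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
    PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);
    PQpipelineSync(conn);       /* the Sync that closes this series */

    /* drain results up to and including the PGRES_PIPELINE_SYNC marker */
    for (;;)
    {
        res = PQgetResult(conn);
        if (res == NULL)
            continue;           /* NULL separates the queued queries' results */
        if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
        {
            PQclear(res);
            break;
        }
        PQclear(res);
    }

    PQexitPipelineMode(conn);
    PQfinish(conn);
    return 0;
}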




RE: Synchronizing slots from primary to standby

2024-02-22 Thread Zhijie Hou (Fujitsu)
On Thursday, February 22, 2024 8:41 PM Amit Kapila  
wrote:
> 
> On Thu, Feb 22, 2024 at 5:23 PM Amit Kapila 
> wrote:
> >
> > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot
> >  wrote:
> > >
> > > On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote:
> > > > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot
> > > >  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Some random comments about v92_001 (Sorry if it has already been
> > > > > discussed
> > > > > up-thread):
> > > >
> > > > Thanks for the feedback. The patch is pushed 15 minutes back.
> > >
> > > Yeah, saw that after I send the comments ;-)
> > >
> >
> > There is a BF failure. See
> >
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-22%2010%3A13%3A03.
> >
> > The initial analysis suggests that for some reason, the primary went
> > down after the slot sync worker was invoked the first time. See the
> > below in the primary's LOG:
> >
> 
> The reason is that the test failed waiting on below LOG:
> 
> ### Reloading node "standby1"
> # Running: pg_ctl -D
> /home/ec2-user/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_
> 040_standby_failover_slots_sync_standby1_data/pgdata
> reload
> server signaled
> timed out waiting for match: (?^:LOG:  slot sync worker started) at
> t/040_standby_failover_slots_sync.pl line 376.
> 
> Now, on standby, we see a LOG like 2024-02-22 10:57:35.432 UTC [2721638:1]
> LOG: 0: slot sync worker started. Even then the test failed, and the reason
> is that there is extra output before the actual message, which is due to
> log_error_verbosity = verbose in the config. I think the test's log matching
> code needs to be more robust here.

Agreed. Here is a small patch to change the message in wait_for_log so that it only
searches for the message part.

Best Regards,
Hou zj


0001-Make-recovery-test-pass-with-log_error_verbosity-ver.patch
Description:  0001-Make-recovery-test-pass-with-log_error_verbosity-ver.patch


RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Shlok,

> Hi,
> 
> I have reviewed the v21 patch. And found an issue.
> 
> Initially I started the standby server with a new postgresql.conf file
> (not the default postgresql.conf that is present in the instance).
> pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf"
> 
> And I have made 'max_replication_slots = 1' in new postgresql.conf and
> made  'max_replication_slots = 0' in the default postgresql.conf file.
> Now when we run pg_createsubscriber on standby we get error:
> pg_createsubscriber: error: could not set replication progress for the
> subscription "pg_createsubscriber_5_242843": ERROR:  cannot query or
> manipulate replication origin when max_replication_slots = 0
> NOTICE:  dropped replication slot "pg_createsubscriber_5_242843" on publisher
> pg_createsubscriber: error: could not drop publication
> "pg_createsubscriber_5" on database "postgres": ERROR:  publication
> "pg_createsubscriber_5" does not exist
> pg_createsubscriber: error: could not drop replication slot
> "pg_createsubscriber_5_242843" on database "postgres": ERROR:
> replication slot "pg_createsubscriber_5_242843" does not exist
> 
> I observed that when we run the pg_createsubscriber command, it will
> stop the standby instance (the non-default postgres configuration) and
> restart the standby instance which will now be started with default
> postgresql.conf, where the 'max_replication_slot = 0' and
> pg_createsubscriber will now fail with the error given above.
> I have added the script file with which we can reproduce this issue.
> Also similar issues can happen with other configurations such as port, etc.

Possible. So the issue is that GUC settings might be changed after the restart,
so the verification phase may not be enough. There are similar GUCs in [1], and
they may have similar issues. E.g., if "hba_file" was set when the server was
started, that file is no longer seen after the restart, so pg_createsubscriber
may not be able to connect anymore.

> The possible solution would be
> 1) allow to run pg_createsubscriber if standby is initially stopped .
> I observed that pg_logical_createsubscriber also uses this approach.
> 2) read GUCs via SHOW command and restore them when server restarts
>

I also prefer the first solution.
Another reason why the standby should be stopped is for backup purposes.
Basically, the standby instance should be saved before running
pg_createsubscriber. The easiest way is a hard copy, and the postmaster should
be stopped at that time. I feel it is better if users can run the command
immediately after the copying.
Thoughts?

[1]: https://www.postgresql.org/docs/current/storage-file-layout.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: Synchronizing slots from primary to standby

2024-02-22 Thread Amit Kapila
On Thu, Feb 22, 2024 at 5:23 PM Amit Kapila  wrote:
>
> On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot
>  wrote:
> >
> > On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote:
> > > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thanks!
> > > >
> > > > Some random comments about v92_001 (Sorry if it has already been 
> > > > discussed
> > > > up-thread):
> > >
> > > Thanks for the feedback. The patch is pushed 15 minutes back.
> >
> > Yeah, saw that after I send the comments ;-)
> >
>
> There is a BF failure. See
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-22%2010%3A13%3A03.
>
> The initial analysis suggests that for some reason, the primary went
> down after the slot sync worker was invoked the first time. See the
> below in the primary's LOG:
>

The reason is that the test failed waiting on below LOG:

### Reloading node "standby1"
# Running: pg_ctl -D
/home/ec2-user/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_040_standby_failover_slots_sync_standby1_data/pgdata
reload
server signaled
timed out waiting for match: (?^:LOG:  slot sync worker started) at
t/040_standby_failover_slots_sync.pl line 376.

Now, on standby, we see a LOG like 2024-02-22 10:57:35.432 UTC
[2721638:1] LOG: 0: slot sync worker started. Even then the test
failed, and the reason is that there is extra output before the actual
message, which is due to log_error_verbosity = verbose in the config. I
think the test's log matching code needs to be more robust here.

-- 
With Regards,
Amit Kapila.




Re: Reducing output size of nodeToString

2024-02-22 Thread Matthias van de Meent
On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
 wrote:
> Attached the updated version of the patch on top of 5497daf3, which
> incorporates this last round of feedback.

Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
and an issue in the previous patchset: I attached one too many v3-0001
from a previous patch I worked on.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v6-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
Description: Binary data


v6-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
Description: Binary data


v6-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch
Description: Binary data


v6-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch
Description: Binary data


v6-0008-nodeToString-omit-serializing-0s-in-enum-typed-fi.patch
Description: Binary data


v6-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch
Description: Binary data


v6-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch
Description: Binary data


v6-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch
Description: Binary data


Re: Porting PostgresSQL libraries for QNX710

2024-02-22 Thread Daniel Gustafsson
> On 22 Feb 2024, at 11:35, Rajith Rao .B(App Software)  
> wrote:

> I have been using the Qt IDE with C++ for database connection and query 
> execution, and unfortunately, I cannot share the code with you.

No worries, I have no intention to work on this.

> You mentioned that PostgreSQL support for QNX was removed starting from 
> version 8.2. Are there any alternative methods to port or build PostgreSQL 
> libraries for QNX 7.1.0?

There is no other way to build any software on a new architecture than rolling
up the sleeves and getting started.

I suggest looking at commits f55808828569, a1675649e402 and 6f84b2da75d3 in the
postgres repo as a starting point for research.  The QNX support which was
removed in 8.2 was targeting QNX4, so it may or may not be helpful.

--
Daniel Gustafsson





Re: When extended query protocol ends?

2024-02-22 Thread Michael Zhilin

Hi,

On 2/22/24 14:09, Jelte Fennema-Nio wrote:
>> Apparently, sending an extra message would increase the overhead of the
>> protocol, thus reducing the efficiency of the application.
>> What is the benefit of sending extra Sync?
>>
>> https://www.postgresql.org/docs/current/protocol-overview.html#PROTOCOL-MESSAGE-CONCEPTS
>> suggests that it is fine to mix both simple and extended messages
>> depending on the needs of the application.
>
> Yes, it's fine to mix and match extended and simple protocol. But the
> protocol docs quite clearly state that a sync is required before going
> back to the Simple protocol: "At completion of each series of
> extended-query messages, the frontend should issue a Sync message."
> https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY

I would like to say this document states that "at completion... frontend
should issue a Sync message... causes the backend to close the current
transaction".
It looks like the sense of the wording is "to complete the transaction" at the
eventual end of traffic, not "to switch to the simple protocol".
Otherwise, we could not use both protocols within the same transaction, which
looks like too strict a limitation.

> Terminating a sequence of extended messages with a Query message
> instead of a Sync message is definitely undefined behaviour.





--
Michael Zhilin
Postgres Professional
+7(925)3366270
https://www.postgrespro.ru


Re: Synchronizing slots from primary to standby

2024-02-22 Thread Amit Kapila
On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot
 wrote:
>
> On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote:
> > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot
> >  wrote:
> > >
> > > Hi,
> > >
> > > Thanks!
> > >
> > > Some random comments about v92_001 (Sorry if it has already been discussed
> > > up-thread):
> >
> > Thanks for the feedback. The patch is pushed 15 minutes back.
>
> Yeah, saw that after I send the comments ;-)
>

There is a BF failure. See
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-22%2010%3A13%3A03.

The initial analysis suggests that for some reason, the primary went
down after the slot sync worker was invoked the first time. See the
below in the primary's LOG:

2024-02-22 10:59:56.896 UTC [2721639:29] standby1_slotsync worker LOG:
 0: statement: SELECT slot_name, plugin, confirmed_flush_lsn,
restart_lsn, catalog_xmin, two_phase, failover, database,
conflict_reason FROM pg_catalog.pg_replication_slots WHERE failover
and NOT temporary
2024-02-22 10:59:56.896 UTC [2721639:30] standby1_slotsync worker
LOCATION:  exec_simple_query, postgres.c:1070
2024-02-22 11:00:26.967 UTC [2721639:31] standby1_slotsync worker LOG:
 0: statement: SELECT slot_name, plugin, confirmed_flush_lsn,
restart_lsn, catalog_xmin, two_phase, failover, database,
conflict_reason FROM pg_catalog.pg_replication_slots WHERE failover
and NOT temporary
2024-02-22 11:00:26.967 UTC [2721639:32] standby1_slotsync worker
LOCATION:  exec_simple_query, postgres.c:1070
2024-02-22 11:00:35.908 UTC [2721435:309] LOG:  0: received
immediate shutdown request
2024-02-22 11:00:35.908 UTC [2721435:310] LOCATION:
process_pm_shutdown_request, postmaster.c:2859
2024-02-22 11:00:35.911 UTC [2721435:311] LOG:  0: database system
is shut down
2024-02-22 11:00:35.911 UTC [2721435:312] LOCATION:  UnlinkLockFiles,
miscinit.c:1138


-- 
With Regards,
Amit Kapila.




Re: Avoid stack frame setup in performance critical routines using tail calls

2024-02-22 Thread David Rowley
I've rebased the 0001 patch and gone over it again and made a few
additional changes besides what I mentioned in my review.

On Wed, 9 Aug 2023 at 20:44, David Rowley  wrote:
> Here's a review of v2-0001:
> 2. Why do you need to add the NULL check here?
>
>  #ifdef USE_VALGRIND
> - if (method != MCTX_ALIGNED_REDIRECT_ID)
> + if (ret != NULL && method != MCTX_ALIGNED_REDIRECT_ID)
>   VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size);
>  #endif

I removed this NULL check as we're calling the realloc function with
no flags, so it shouldn't return NULL as it'll error out from any OOM
errors.

> 3.
>
> /*
> * XXX: Probably no need to check for huge allocations, we only support
> * one size? Which could theoretically be huge, but that'd not make
> * sense...
> */
>
> They can't be huge per Assert(fullChunkSize <= MEMORYCHUNK_MAX_VALUE)
> in SlabContextCreate().

I removed this comment and adjusted the comment just below that which
checks the 'size' matches the expected slab chunk size. i.e.

/*
* Make sure we only allow correct request size.  This doubles as the
* MemoryContextCheckSize check.
*/
if (unlikely(size != slab->chunkSize))


> 4. It would be good to see some API documentation in the
> MemoryContextMethods struct.  This adds a lot of responsibility onto
> the context implementation without any extra documentation to explain
> what, for example, palloc is responsible for and what the alloc
> function needs to do itself.

I've done that too.

I also added header comments for MemoryContextAllocationFailure and
MemoryContextSizeFailure and added some comments to explain in places
like palloc() to warn people not to add checks after the 'alloc' call.
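
For anyone following along, here is a self-contained sketch of the pattern (the
names are illustrative, not the real MemoryContext API): because the wrapper no
longer inspects the returned pointer, the compiler can emit a sibling/tail
call, and the allocator implementation is the one responsible for raising the
out-of-memory error.

#include <stddef.h>

typedef struct Allocator Allocator;

struct Allocator
{
    /* must not return NULL unless (flags & ALLOC_NO_OOM) is passed */
    void *(*alloc) (Allocator *context, size_t size, int flags);
};

#define ALLOC_NO_OOM 0x01

static inline void *
alloc_or_error(Allocator *context, size_t size)
{
    /* no NULL check after the call -- that would defeat the tail call */
    return context->alloc(context, size, 0);
}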

The rebased patch is 0001 and all of my changes are in 0002.  I will
rebase your original 0002 patch later.  I think 0001 is much more
important, as evident by the reported benchmarks on this thread.

In absence of anyone else looking at this, I think it's ready to go.
If anyone is following along and wants to review or test it, please do
so soon.

David
From 854751aeb1d0b917bf6e96c30e80f8480f96e883 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 18 Jul 2023 18:55:58 -0700
Subject: [PATCH v3 1/2] Optimize palloc() etc to allow sibling calls

The reason palloc() etc previously couldn't use sibling calls is that they did
check the returned value (to e.g. raise an error when the allocation
fails). Push the error handling down into the memory context implementation -
they can avoid performing the check at all in the hot code paths.
---
 src/backend/utils/mmgr/alignedalloc.c |  11 +-
 src/backend/utils/mmgr/aset.c |  23 ++--
 src/backend/utils/mmgr/generation.c   |  14 ++-
 src/backend/utils/mmgr/mcxt.c | 172 ++
 src/backend/utils/mmgr/slab.c |  12 +-
 src/include/nodes/memnodes.h  |   4 +-
 src/include/utils/memutils_internal.h |  29 +++--
 7 files changed, 101 insertions(+), 164 deletions(-)

diff --git a/src/backend/utils/mmgr/alignedalloc.c 
b/src/backend/utils/mmgr/alignedalloc.c
index 7204fe64ae..c266fb3dbb 100644
--- a/src/backend/utils/mmgr/alignedalloc.c
+++ b/src/backend/utils/mmgr/alignedalloc.c
@@ -57,7 +57,7 @@ AlignedAllocFree(void *pointer)
  * memory will be uninitialized.
  */
 void *
-AlignedAllocRealloc(void *pointer, Size size)
+AlignedAllocRealloc(void *pointer, Size size, int flags)
 {
MemoryChunk *redirchunk = PointerGetMemoryChunk(pointer);
Sizealignto;
@@ -97,14 +97,17 @@ AlignedAllocRealloc(void *pointer, Size size)
 #endif
 
ctx = GetMemoryChunkContext(unaligned);
-   newptr = MemoryContextAllocAligned(ctx, size, alignto, 0);
+   newptr = MemoryContextAllocAligned(ctx, size, alignto, flags);
 
/*
 * We may memcpy beyond the end of the original allocation request size,
 * so we must mark the entire allocation as defined.
 */
-   VALGRIND_MAKE_MEM_DEFINED(pointer, old_size);
-   memcpy(newptr, pointer, Min(size, old_size));
+   if (likely(newptr != NULL))
+   {
+   VALGRIND_MAKE_MEM_DEFINED(pointer, old_size);
+   memcpy(newptr, pointer, Min(size, old_size));
+   }
pfree(unaligned);
 
return newptr;
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 2f99fa9a2f..81c3120c2b 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -700,7 +700,7 @@ AllocSetDelete(MemoryContext context)
  * return space that is marked NOACCESS - AllocSetRealloc has to beware!
  */
 void *
-AllocSetAlloc(MemoryContext context, Size size)
+AllocSetAlloc(MemoryContext context, Size size, int flags)
 {
AllocSetset = (AllocSet) context;
AllocBlock  block;
@@ -717,6 +717,9 @@ AllocSetAlloc(MemoryContext context, Size size)
 */
if (size > set->allocChunkLimit)
{
+   /* check size, only allocation path where the limits could be 
hit */
+   

Re: When extended query protocol ends?

2024-02-22 Thread Jelte Fennema-Nio
On Thu, 22 Feb 2024 at 10:28, Vladimir Sitnikov
 wrote:
>
> >When splitting a multi insert statement you're going to duplicate some work
>
> I do not know how this could be made more efficient as I execute parse only 
> once, and then I send bind+exec+bind+exec
> without intermediate sync messages, so the data should flow nicely in TCP 
> packets.

I agree you cannot change that flow to be more efficient, but I meant
that your comparison was not fair:
1. Multi-insert vs multiple single inserts is actually executing
different queries
2. Going from Query -> Parse+Bind+Exec for the same query, only
changes protocol related things

> Here are some measurements regarding savepoints for simple vs extended
> Sure they are not very scientific, however, they show improvement for simple 
> protocol

Alright, those improvements are not huge, but I agree it's clear that
the extended protocol has some overhead. So probably you'd want to
keep using the simple protocol to send the SAVEPOINT query.

> Apparently, sending an extra message would increase the overhead of the 
> protocol, thus reducing the efficiency of the application.
> What is the benefit of sending extra Sync?
>
> https://www.postgresql.org/docs/current/protocol-overview.html#PROTOCOL-MESSAGE-CONCEPTS
> suggests that it is fine to mix both simple and extended messages
> depending on the needs of the application.

Yes, it's fine to mix and match extended and simple protocol. But the
protocol docs quite clearly state that a sync is required before going
back to the Simple protocol: "At completion of each series of
extended-query messages, the frontend should issue a Sync message."
https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY

Terminating a sequence of extended messages with a Query message
instead of a Sync message is definitely undefined behaviour.




Re: Synchronizing slots from primary to standby

2024-02-22 Thread Bertrand Drouvot
Hi,

On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote:
> On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > Thanks!
> >
> > Some random comments about v92_001 (Sorry if it has already been discussed
> > up-thread):
> 
> Thanks for the feedback. The patch is pushed 15 minutes back.

Yeah, saw that after I send the comments ;-)

> I will
> prepare a top-up patch for your comments.

Thanks!

> > 4 ===
> >
> > +   if (wal_level < WAL_LEVEL_LOGICAL)
> > +   {
> > +   ereport(ERROR,
> > +   errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +   errmsg("slot synchronization requires 
> > wal_level >= \"logical\""));
> > +   return false;
> > +   }
> >
> > I think the return is not needed here as it won't be reached due to the 
> > "ERROR".
> > Or should we use "elevel" instead of "ERROR"?
> 
> It was suggested to raise ERROR for wal_level validation, please see
> [1]. But yes, I will  remove the return value.

Yeah, thanks, ERROR makes sense here.

> > 5 ===
> >
> > +* operate as a superuser. This is safe because the slot sync 
> > worker does
> > +* not interact with user tables, eliminating the risk of executing
> > +* arbitrary code within triggers.
> >
> > Right. I did not check but if we are using operators in our remote SPI calls
> > then it would be worth to ensure they are coming from the pg_catalog schema?
> > Using something like "OPERATOR(pg_catalog.=)" using "=" as an example.
> 
> Can you please elaborate this one, I am not sure if I understood it.

Suppose that in synchronize_slots() the query would be:

const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
" restart_lsn, catalog_xmin, two_phase, failover,"
" database, conflict_reason"
" FROM pg_catalog.pg_replication_slots"
" WHERE failover and NOT temporary and 1 = 1";

Then my comment is to rewrite it to:

const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn,"
" restart_lsn, catalog_xmin, two_phase, failover,"
" database, conflict_reason"
" FROM pg_catalog.pg_replication_slots"
" WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1";

to ensure the operator "=" is coming from the pg_catalog schema.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-02-22 Thread Dean Rasheed
On Thu, 22 Feb 2024 at 03:46, zwj  wrote:
>
>   If I want to get the same results as Oracle, do I need to adjust the lock 
> behavior of the update and merge statements?
>   If I want to achieve the same results as Oracle, can I achieve exclusive 
> locking by adjusting update and merge?  Do you have any suggestions?
>

I think that trying to get the same results in Oracle and Postgres may
not always be possible. Each has their own (probably quite different)
implementation of these features, that simply may not be compatible.

In Postgres, MERGE aims to make UPDATE and DELETE actions behave in
the same way as standalone UPDATE and DELETE commands under concurrent
modifications. However, it does not attempt to prevent INSERT actions
from inserting duplicates.

In that context, the UNION ALL issue is a clear bug, and I'll aim to
get that patch committed and back-patched sometime in the next few
days, if there are no objections from other hackers.

However, the issue with INSERT actions inserting duplicates is a
design choice, rather than something that we regard as a bug. It's
possible that a future version of Postgres might improve MERGE,
providing some way round that issue, but there's no guarantee of that
ever happening. Similarly, it sounds like Oracle also sometimes allows
duplicates, as well as having other "bugs" like the one discussed in
[1], that may be difficult for them to fix within their
implementation.

In Postgres, if the target table is subject to concurrent inserts (or
primary key updates), it might be better to use INSERT ... ON CONFLICT
DO UPDATE [2] instead of MERGE. That would avoid inserting duplicates
(though I can't say how compatible that is with anything in Oracle).

Regards,
Dean

[1] 
https://www.postgresql.org/message-id/CAEZATCV_6t5E57q7HsWQBX6a5YOjN5o7K-HicZ8a73EPzfwo=a...@mail.gmail.com

[2] https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT




Re: About a recently-added message

2024-02-22 Thread Amit Kapila
On Thu, Feb 22, 2024 at 11:10 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 22 Feb 2024 10:51:07 +0530, Amit Kapila  
> wrote in
> > > Do you think some additional tests for the rest of the messages are
> > > worth the trouble?
> > >
> >
> > We have discussed this during development and didn't find it worth
> > adding tests for all misconfigured parameters. However, in the next
> > patch where we are planning to add a slot sync worker that will
> > automatically sync slots, we are adding a test for one more parameter.
> > I am not against adding tests for all the parameters but it didn't
> > appeal to add more test cycles for this.
>
> Thanks for the explanation. I'm fine with that.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-02-22 Thread vignesh C
On Mon, 19 Feb 2024 at 11:15, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear hackers,
>
> > Since it may be useful, I will post top-up patch on Monday, if there are no
> > updating.
>
> And here are top-up patches. Feel free to check and include.
>
> v22-0001: Same as v21-0001.
> === rebased patches ===
> v22-0002: Update docs per recent changes. Same as v20-0002.
> v22-0003: Add check versions of the target. Extracted from v20-0003.
> v22-0004: Remove -S option. Mostly same as v20-0009, but commit massage was
>   slightly changed.
> === Newbie ===
> V22-0005: Addressed my comments which seems to be trivial[1].
>   Comments #1, 3, 4, 8, 10, 14, 17 were addressed here.
> v22-0006: Consider the scenario when commands are failed after the recovery.
>   drop_subscription() is removed and some messages are added per [2].
> V22-0007: Revise server_is_in_recovery() per [1]. Comments #5, 6, 7, were 
> addressed here.
> V22-0008: Fix a strange report when physical_primary_slot is null. Per 
> comment #9 [1].
> V22-0009: Prohibit reuse publications when it has already existed. Per 
> comments #11 and 12 [1].
> V22-0010: Avoid to call PQclear()/PQfinish()/pg_free() if the process exits 
> soon. Per comment #15 [1].
> V22-0011: Update testcode. Per comments #17- [1].

Few comments on the tests:
1) If the dry run stopped the server because of some issue, we can check with
"pg_ctl status" whether the server is running; otherwise the connection here
will just fail. Another way would be to check that the stdout does not contain
"postmaster was stopped" messages.
+
+# Check if node S is still a standby
+is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'),
+   't', 'standby is in recovery');

2) Can we add verification of  "postmaster was stopped" messages in
the stdout for dry run without --databases testcase
+# pg_createsubscriber can run without --databases option
+command_ok(
+   [
+   'pg_createsubscriber', '--verbose',
+   '--dry-run', '--pgdata',
+   $node_s->data_dir, '--publisher-server',
+   $node_p->connstr('pg1'), '--subscriber-server',
+   $node_s->connstr('pg1')
+   ],
+   'run pg_createsubscriber without --databases');
+

3) This message "target server must be running" seems to be wrong; should it be
"cannot specify a cascading replicating standby as the standby node"? (This is
for the v22-0011 patch):
+   'pg_createsubscriber', '--verbose', '--pgdata',
$node_c->data_dir,
+   '--publisher-server', $node_s->connstr('postgres'),
+   '--port', $node_c->port, '--socketdir', $node_c->host,
+   '--database', 'postgres'
],
-   'primary server is in recovery');
+   1,
+   [qr//],
+   [qr/primary server cannot be in recovery/],
+   'target server must be running');

4) Should this be "Wait for subscriber to catch up"
+# Wait subscriber to catch up
+$node_s->wait_for_subscription_sync($node_p, $subnames[0]);
+$node_s->wait_for_subscription_sync($node_p, $subnames[1]);

5) Should this be 'Subscriptions has been created on all the specified
databases'
+);
+is($result, qq(2),
+   'Subscriptions has been created to all the specified databases'
+);

6) Add a test to verify that current_user is not a member of
ROLE_PG_CREATE_SUBSCRIPTION, has no create permissions, and has no
permissions to execute the replication origin advance functions

7) Add tests to verify insufficient max_logical_replication_workers,
max_replication_slots and max_worker_processes for the subscription
node

8) Add tests to verify invalid configuration of  wal_level,
max_replication_slots and max_wal_senders for the publisher node

9) We can use the same node name in comment and for the variable
+# Set up node P as primary
+$node_p = PostgreSQL::Test::Cluster->new('node_p');
+$node_p->init(allows_streaming => 'logical');
+$node_p->start;

10) Similarly we can use node_f instead of F in the comments.
+# Set up node F as about-to-fail node
+# Force it to initialize a new cluster instead of copying a
+# previously initdb'd cluster.
+{
+   local $ENV{'INITDB_TEMPLATE'} = undef;
+
+   $node_f = PostgreSQL::Test::Cluster->new('node_f');
+   $node_f->init(allows_streaming => 'logical');
+   $node_f->start;

Regards,
Vignesh




Re: Synchronizing slots from primary to standby

2024-02-22 Thread shveta malik
On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> Thanks!
>
> Some random comments about v92_001 (Sorry if it has already been discussed
> up-thread):

Thanks for the feedback. The patch is pushed 15 minutes back. I will
prepare a top-up patch for your comments.

> 1 ===
>
> +* We do not update the 'synced' column from true to false here
>
> Worth to mention from which system view the 'synced' column belongs to?

Sure, I will change it.

> 2 === (Nit)
>
> +#define MIN_WORKER_NAPTIME_MS  200
> +#define MAX_WORKER_NAPTIME_MS  3   /* 30s */
>
> [MIN|MAX]_SLOTSYNC_WORKER_NAPTIME_MS instead? It is used only in slotsync.c so
> more a Nit.

Okay, will change it,

> 3 ===
>
> res = walrcv_exec(wrconn, query, SLOTSYNC_COLUMN_COUNT, slotRow);
> -
> if (res->status != WALRCV_OK_TUPLES)
>
> Line removal intended?

I feel the current style is better, where we do not have a blank line between
the function call and the check of its return value.

> 4 ===
>
> +   if (wal_level < WAL_LEVEL_LOGICAL)
> +   {
> +   ereport(ERROR,
> +   errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +   errmsg("slot synchronization requires 
> wal_level >= \"logical\""));
> +   return false;
> +   }
>
> I think the return is not needed here as it won't be reached due to the 
> "ERROR".
> Or should we use "elevel" instead of "ERROR"?

It was suggested to raise ERROR for wal_level validation, please see
[1]. But yes, I will  remove the return value. Thanks for catching
this.

> 5 ===
>
> +* operate as a superuser. This is safe because the slot sync worker 
> does
> +* not interact with user tables, eliminating the risk of executing
> +* arbitrary code within triggers.
>
> Right. I did not check but if we are using operators in our remote SPI calls
> then it would be worth to ensure they are coming from the pg_catalog schema?
> Using something like "OPERATOR(pg_catalog.=)" using "=" as an example.

Can you please elaborate on this one? I am not sure I understood it.

[1]: 
https://www.postgresql.org/message-id/CAD21AoB2ipSzQb5-o5pEYKie4oTPJTsYR1ip9_wRVrF6HbBWDQ%40mail.gmail.com

thanks
Shveta




Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Peter Eisentraut

On 22.02.24 11:00, Daniel Gustafsson wrote:

On 22 Feb 2024, at 10:55, Ashutosh Bapat  wrote:
On Thu, Feb 22, 2024 at 3:03 PM Daniel Gustafsson  wrote:



Somebody looking for dump/restore tests wouldn't search
src/bin/pg_upgrade, I think.


Quite possibly not, but pg_upgrade is already today an important testsuite for
testing pg_dump in binary-upgrade mode so maybe more developers touching
pg_dump should?


Yeah, I think attaching this to the existing pg_upgrade test would be a 
good idea.  Not only would it save test run time, it would probably also 
reduce code duplication.






Re: Synchronizing slots from primary to standby

2024-02-22 Thread Bertrand Drouvot
Hi,

On Thu, Feb 22, 2024 at 12:16:34PM +0530, shveta malik wrote:
> On Thu, Feb 22, 2024 at 10:31 AM shveta malik  wrote:
> There was a recent commit 801792e to improve error messaging in
> slotsync.c which resulted in conflict. Thus rebased the patch. There
> is no new change in the patch attached

Thanks!

Some random comments about v92_001 (Sorry if it has already been discussed
up-thread):

1 ===

+* We do not update the 'synced' column from true to false here

Worth mentioning which system view the 'synced' column belongs to?

2 === (Nit)

+#define MIN_WORKER_NAPTIME_MS  200
+#define MAX_WORKER_NAPTIME_MS  3   /* 30s */

[MIN|MAX]_SLOTSYNC_WORKER_NAPTIME_MS instead? It is used only in slotsync.c,
so this is more of a nit.

3 ===

res = walrcv_exec(wrconn, query, SLOTSYNC_COLUMN_COUNT, slotRow);
-
if (res->status != WALRCV_OK_TUPLES)

Line removal intended?

4 ===

+   if (wal_level < WAL_LEVEL_LOGICAL)
+   {
+   ereport(ERROR,
+   errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+   errmsg("slot synchronization requires wal_level 
>= \"logical\""));
+   return false;
+   }

I think the return is not needed here as it won't be reached due to the "ERROR".
Or should we use "elevel" instead of "ERROR"?

5 ===

+* operate as a superuser. This is safe because the slot sync worker 
does
+* not interact with user tables, eliminating the risk of executing
+* arbitrary code within triggers.

Right. I did not check, but if we are using operators in our remote SPI calls,
then it would be worth ensuring they come from the pg_catalog schema,
using something like "OPERATOR(pg_catalog.=)", with "=" as an example.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Porting PostgresSQL libraries for QNX710

2024-02-22 Thread Daniel Gustafsson
> On 22 Feb 2024, at 06:12, Rajith Rao .B(App Software)  
> wrote:

> Hope my query is clear for you and expecting  a resolution for this.

There is no official port of libpq to QNX, so the short answer is that you're
on your own.  QNX support was removed in 8.2, so maybe looking at the code
before that happened might give some insights on how to get started?

--
Daniel Gustafsson





Porting PostgresSQL libraries for QNX710

2024-02-22 Thread Rajith Rao .B(App Software)

Hello,

I have been working on Ubuntu 22.04 LTS with PostgreSQL in my applications
and need to deploy that application on QNX 7.1.0.


I have a requirement to port PostgreSQL 12.18 to QNX 7.1; is it
possible to build/port the PostgreSQL libraries for the QNX 7.1 Intel and
Aarch64 architectures?


I hope my query is clear, and I am expecting a resolution for this.

Thanks & Regards,
Ranjith Rao.B



Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)

2024-02-22 Thread zwj
Thanks for your reply.


 I understand what you mean and have tried to correct this patch.
 According to the previous use case, the result obtained is as follows:



 id |   name   | year | xmax | xmin | ctid
----+----------+------+------+------+-------
  1 | liuwei   |   20 |    0 |  859 | (0,1)
  2 | zhangbin |   30 |  866 |  866 | (0,7)
  3 | fuguo    |   44 |  866 |  866 | (0,8)
  4 | yihe     |   33 |    0 |  865 | (0,6)
  4 | yihe     |   33 |    0 |  866 | (0,9)
(5 rows)


 At present, the row counts for 'id' 2 and 3 appear to be normal, but there
is duplicate data for 'id' 4.
 According to what you said, this is normal behaviour at this transaction
isolation level.

 But there is still a difference from the results in Oracle (which has no
duplicate data for 'id' 4).



 After that I tried several scenarios in Oracle and PG:
 1) session1: insert; session2: merge into. Duplicate data may also occur
(pg and oracle are consistent).
 2) session1: update + insert; session2: merge into. There will be no
duplicate data in oracle (pg has duplicate data).

 It looks like there is an exclusive lock between the update statement
and the merge statement in oracle. After the update and insert are committed,
the merge will proceed with locking and execution.
 (Of course, this is just my guess.)

 However, it seems that both PG and Oracle have no obvious issues, and 
their respective designs are reasonable.



 If I want to get the same results as Oracle, do I need to adjust the 
lock behavior of the update and merge statements?
 If I want to achieve the same results as Oracle, can I achieve exclusive 
locking by adjusting update and merge? Do you have any suggestions?



Regards,
wenjiang zhang



------------------ Original Message ------------------
From: "Dean Rasheed"

> https://www.postgresql.org/docs/current/transaction-iso.html
>
> Regards,
> Dean

Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

2024-02-22 Thread David Benjamin
Thanks for the very thorough comments! I've attached a new version of the
patch.

On Thu, Feb 15, 2024 at 4:17 PM Andres Freund  wrote:

> Hi,
>
> On 2024-02-11 13:19:00 -0500, David Benjamin wrote:
> > I've attached a patch for the master branch to fix up the custom BIOs
> used
> > by PostgreSQL, in light of the issues with the OpenSSL update recently.
> > While c82207a548db47623a2bfa2447babdaa630302b9 (switching from
> BIO_get_data
> > to BIO_get_app_data) resolved the immediate conflict, I don't think it
> > addressed the root cause, which is that PostgreSQL was mixing up two BIO
> > implementations, each of which expected to be driving the BIO.
>
> Yea, that's certainly not nice - and I think we've been somewhat lucky it
> hasn't caused more issues. There's some nasty possibilities, e.g.
> sock_ctrl()
> partially enabling ktls without our initialization having called
> ktls_enable(). Right now that just means ktls is unusable, but it's not
> hard
> to imagine accidentally ending up sending unencrypted data.
>

Yeah. Even if, say, the ktls bits work, given you all care enough about I/O
to have wanted to wrap the BIO, I assume you'd want to pick up those
features on your own terms, e.g. by implementing the BIO_CTRLs yourself.


> I've in the past looked into not using a custom BIO [1], but I have my
> doubts
> about that being a good idea. I think medium term we want to be able to do
> network IO asynchronously, which seems quite hard to do when using
> openssl's
> socket BIO.



> Once we've done that, we're free to use BIO_set_data. While
> BIO_set_app_data
> > works fine, I've reverted back to BIO_set_data because it's more
> commonly used.
> > app_data depends on OpenSSL's "ex_data" mechanism, which is a tad
> heavier under
> > the hood.
>
> At first I was a bit wary of that, because it requires us to bring back the
> fallback implementation. But you're right, it's noticeably heavier than
> BIO_get_data(), and we do call BIO_get_app_data() fairly frequently.
>

TBH, I doubt it makes any real difference perf-wise. But I think
BIO_get_data is a bit more common in this context.


> > That leaves ctrl. ctrl is a bunch of operations (it's ioctl). The only
> > operation that matters is BIO_CTRL_FLUSH, which is implemented as a
> no-op. All
> > other operations are unused. It's once again good that they're unused
> because
> > otherwise OpenSSL might mess with postgres's socket, break the
> assumptions
> > around interrupt handling, etc.
>
> How did you determine that only FLUSH is required? I didn't even really
> find
> documentation about what the intended semantics actually are.
>

The unhelpful answer is that my day job is working on BoringSSL, so I've
spent a lot of time with this mess. :-) But, yeah, it's not well-documented
at all. OpenSSL ends up calling BIO_flush at the end of each batch of
writes in libssl. TBH, I suspect that was less intentional and more an
emergent property of them internally layering a buffer BIO at one point in
the process, but it's long been part of the (undocumented) API contract.
Conversely, I don't think OpenSSL can possibly make libssl *require* a new
BIO_CTRL because they'd break every custom BIO anyone has ever used with
the library.
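
To sketch what that contract means for a custom BIO (illustrative only, not the
actual PostgreSQL implementation): the ctrl callback acknowledges
BIO_CTRL_FLUSH, since libssl issues BIO_flush() after each batch of writes, and
refuses everything else instead of forwarding it to the underlying socket.

#include <openssl/bio.h>

static long
custom_bio_ctrl(BIO *b, int cmd, long larg, void *parg)
{
    (void) b;
    (void) larg;
    (void) parg;

    switch (cmd)
    {
        case BIO_CTRL_FLUSH:
            return 1;           /* nothing is buffered, so flushing is a no-op */
        default:
            return 0;           /* unsupported control: leave the socket alone */
    }
}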


> E.g. should we implement BIO_CTRL_EOF? Sure, it wasn't really supported so
> far, because we never set it, but is that right?


Ah hmm, BIO_CTRL_EOF is... a bit of a mess. OpenSSL kind of messed things
up. So, up until recently, I would have said that BIO_CTRL_EOF was not part
of the interface here. OpenSSL 1.0.x did not implement it for sockets, and
the BIO_read API *already* had a way to signal EOF: did you return zero or
-1?

Then OpenSSL 1.1.x introduced size_t-based BIO_read_ex APIs. However, in
the process, they *forgot that EOF and error are different things* and made
it impossible to detect EOFs if you use BIO_read_ex! They never noticed
this, because they didn't actually convert their own code to their new API.
See this discussion, which alas ended with OpenSSL deciding to ignore the
problem and not even document their current interface.
https://github.com/openssl/openssl/issues/8208

Though they never responded, they seem to have tacitly settled using the
out-of-band BIO_eof function (which is BIO_CTRL_EOF) as the way to signal
EOF for BIO_read_ex. This is kind of fiddly, but is at least a well-defined
option. But the problem is no one's BIO_METHODs, including their own, are
read_ex-based. They all implement the old read callback. But someone might
call BIO_read_ex on a read-based BIO_METHOD. IMO, BIO_read_ex should be
responsible for translating between the two EOF conventions. For example,
it could automatically set a flag when the read callback returns 0 and then
make BIO_ctrl check the flag and automatically implement BIO_CTRL_EOF for
BIOs that don't do it themselves. Alas, OpenSSL did not do this and instead
they just made their built-in BIOs implement BIO_CTRL_EOF, even though this
new expectation is a compatibility 

Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Daniel Gustafsson
> On 22 Feb 2024, at 10:55, Ashutosh Bapat  wrote:
> On Thu, Feb 22, 2024 at 3:03 PM Daniel Gustafsson  wrote:

> Somebody looking for dump/restore tests wouldn't search
> src/bin/pg_upgrade, I think.

Quite possibly not, but pg_upgrade is already today an important testsuite for
testing pg_dump in binary-upgrade mode so maybe more developers touching
pg_dump should?

>> When upgrading to the same version, we could perhaps also use this to test a
>> scenario like: Dump A, restore into B, upgrade B into C, dump C and compare C
>> to A.
> 
> If comparison of C to A fails, we wouldn't know which step fails. I
> would rather compare outputs of each step separately.

To be clear, this wasn't intended to replace what you are proposing, but an
idea for using it to test *more* scenarios.

--
Daniel Gustafsson





Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Ashutosh Bapat
On Thu, Feb 22, 2024 at 3:03 PM Daniel Gustafsson  wrote:
>
> > On 22 Feb 2024, at 10:16, Peter Eisentraut  wrote:
>
> > We have somewhat relied on the pg_upgrade test to provide this testing, but 
> > we have recently discovered that the dumps in binary-upgrade mode are 
> > different enough to not test the normal dumps well.
> >
> > Yes, this test is a bit expensive.  We could save some time by doing the 
> > first dump at the end of the normal regress test and have the pg_dump test 
> > reuse that, but then that would make the regress test run a bit longer.  Is 
> > that a better tradeoff?
>
> Something this expensive seems like what PG_TEST_EXTRA is intended for, we
> already have important test suites there.

That's ok with me.

>
> But.  We know that the cluster has an interesting state when the pg_upgrade
> test starts, could we use that to make a dump/restore test before continuing
> with testing pg_upgrade?  It can be argued that pg_upgrade shouldn't be
> responsible for testing pg_dump, but it's already now a pretty important
> testcase for pg_dump in binary upgrade mode so it's not that far off.  If pg_dump
> has bugs then pg_upgrade risks subtly breaking.

Somebody looking for dump/restore tests wouldn't search
src/bin/pg_upgrade, I think. However, if more people think we should
just add this test to 002_pg_upgrade.pl, I am fine with it.

>
> When upgrading to the same version, we could perhaps also use this to test a
> scenario like: Dump A, restore into B, upgrade B into C, dump C and compare C
> to A.

If comparison of C to A fails, we wouldn't know which step fails. I
would rather compare outputs of each step separately.

-- 
Best Wishes,
Ashutosh Bapat




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-22 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Thu, 22 Feb 2024 15:44:16 +0900,
  Michael Paquier  wrote:

> I was comparing what you have here, and what's been attached by Andres
> at [1] and the top of the changes on my development branch at [2]
> (v3-0008, mostly).  And, it strikes me that there is no need to do any
> major changes in any of the callbacks proposed up to v13 and v14 in
> this thread, as all the changes proposed want to plug in more data
> into each StateData for COPY FROM and COPY TO, the best part being
> that v3-0008 can just reuse the proposed callbacks as-is.  v1-0001
> from Sutou-san would need one slight tweak in the per-row callback,
> still that's minor.

I think so too. But I thought that some minor conflicts would
happen between this and the v15, so I worked on this before
the v15.

We agreed that this optimization doesn't block v15: [1]
So we can work on the v15 without this optimization for now.

[1] 
https://www.postgresql.org/message-id/flat/20240219195351.5vy7cdl3wxia66kg%40awork3.anarazel.de#20f9677e074fb0f8c5bb3994ef059a15

> I have been spending more time on the patch to introduce the COPY
> APIs, leading me to the v15 attached, where I have replaced the
> previous attribute callbacks for the output representation and the
> reads with hardcoded routines that should be optimized by compilers,
> and I have done more profiling with -O2.

Thanks! I wanted to work on it but I haven't had enough time
for it in the past few days...

I've reviewed the v15.


> @@ -751,8 +751,9 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int 
> nbytes)
>   *
>   * NOTE: force_not_null option are not applied to the returned fields.
>   */
> -bool
> -NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
> +static bool

"inline" is missing here.

> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields,
> +   bool is_csv)
>  {
>   int fldct;


How about adding "is_csv" to CopyReadLine() and
CopyReadLineText() too?
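
To spell out why the inline-ness matters here: when the is_csv flag is passed
down as a literal through static inline helpers, the compiler can fold the
flag at each call site and drop the untaken branch, effectively generating
separate text and CSV code paths without a runtime test. A toy sketch of the
pattern (made-up names, nothing from copyfromparse.c):

#include <stdbool.h>

static inline bool
read_one_field(const char *buf, bool is_csv)
{
    if (is_csv)
        return buf[0] == '"';   /* stand-in for CSV-specific handling */
    return buf[0] == '\t';      /* stand-in for text-format handling */
}

static inline bool
read_line(const char *buf, bool is_csv)
{
    /* is_csv is a compile-time constant in each caller below */
    return read_one_field(buf, is_csv);
}

/* Each entry point passes a literal, so the is_csv branches in the inline
 * helpers fold away at -O2 and no flag is tested at run time. */
bool
next_fields_text(const char *buf)
{
    return read_line(buf, false);
}

bool
next_fields_csv(const char *buf)
{
    return read_line(buf, true);
}

The diff below applies the same idea to CopyReadLine()/CopyReadLineText():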


diff --git a/src/backend/commands/copyfromparse.c 
b/src/backend/commands/copyfromparse.c
index 25b8d4bc52..79fabecc69 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -150,8 +150,8 @@ static const char BinarySignature[11] = 
"PGCOPY\n\377\r\n\0";
 
 
 /* non-export function prototypes */
-static bool CopyReadLine(CopyFromState cstate);
-static bool CopyReadLineText(CopyFromState cstate);
+static inline bool CopyReadLine(CopyFromState cstate, bool is_csv);
+static inline bool CopyReadLineText(CopyFromState cstate, bool is_csv);
 static inline int CopyReadAttributesText(CopyFromState cstate);
 static inline int CopyReadAttributesCSV(CopyFromState cstate);
 static Datum CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
@@ -770,7 +770,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, 
int *nfields,
tupDesc = RelationGetDescr(cstate->rel);
 
cstate->cur_lineno++;
-   done = CopyReadLine(cstate);
+   done = CopyReadLine(cstate, is_csv);
 
if (cstate->opts.header_line == COPY_HEADER_MATCH)
{
@@ -823,7 +823,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, 
int *nfields,
cstate->cur_lineno++;
 
/* Actually read the line into memory here */
-   done = CopyReadLine(cstate);
+   done = CopyReadLine(cstate, is_csv);
 
/*
 * EOF at start of line means we're done.  If we see EOF after some
@@ -1133,8 +1133,8 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
  * by newline.  The terminating newline or EOF marker is not included
  * in the final value of line_buf.
  */
-static bool
-CopyReadLine(CopyFromState cstate)
+static inline bool
+CopyReadLine(CopyFromState cstate, bool is_csv)
 {
boolresult;
 
@@ -1142,7 +1142,7 @@ CopyReadLine(CopyFromState cstate)
cstate->line_buf_valid = false;
 
/* Parse data and transfer into line_buf */
-   result = CopyReadLineText(cstate);
+   result = CopyReadLineText(cstate, is_csv);
 
if (result)
{
@@ -1209,8 +1209,8 @@ CopyReadLine(CopyFromState cstate)
 /*
  * CopyReadLineText - inner loop of CopyReadLine for text mode
  */
-static bool
-CopyReadLineText(CopyFromState cstate)
+static inline bool
+CopyReadLineText(CopyFromState cstate, bool is_csv)
 {
char   *copy_input_buf;
int input_buf_ptr;
@@ -1226,7 +1226,7 @@ CopyReadLineText(CopyFromState cstate)
charquotec = '\0';
charescapec = '\0';
 
-   if (cstate->opts.csv_mode)
+   if (is_csv)
{
quotec = cstate->opts.quote[0];
escapec = cstate->opts.escape[0];
@@ -1306,7 +1306,7 @@ CopyReadLineText(CopyFromState cstate)
prev_raw_ptr = 

Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Daniel Gustafsson
> On 22 Feb 2024, at 10:16, Peter Eisentraut  wrote:

> We have somewhat relied on the pg_upgrade test to provide this testing, but 
> we have recently discovered that the dumps in binary-upgrade mode are 
> different enough to not test the normal dumps well.
> 
> Yes, this test is a bit expensive.  We could save some time by doing the 
> first dump at the end of the normal regress test and have the pg_dump test 
> reuse that, but then that would make the regress test run a bit longer.  Is 
> that a better tradeoff?

Something this expensive seems like what PG_TEST_EXTRA is intended for, we
already have important test suites there.

But.  We know that the cluster has an interesting state when the pg_upgrade
test starts, could we use that to make a dump/restore test before continuing
with testing pg_upgrade?  It can be argued that pg_upgrade shouldn't be
responsible for testing pg_dump, but it's already now a pretty important
testcase for pg_dump in binary upgrade mode so it's not that far off.  If pg_dump
has bugs then pg_upgrade risks subtly breaking.

When upgrading to the same version, we could perhaps also use this to test a
scenario like: Dump A, restore into B, upgrade B into C, dump C and compare C
to A.

--
Daniel Gustafsson





Re: When extended query protocol ends?

2024-02-22 Thread Vladimir Sitnikov
>When splitting a multi insert statement you're going to duplicate some work

I do not understand why I am going to duplicate some work.
I assume the database does its best to perform all the needed preparation
when processing the "parse" message,
and it should perform only the minimum required work when processing
bind+exec messages.

Unfortunately, it is not completely the case, so using bind+exec+bind+exec
is suboptimal even for trivial insert statements.

Please, take into account the following sequence:
One-time-only:
parse S1 as insert into table(id, name) values(?,?)

Some time later:
bind S1 ...
exec S1
bind S1 ...
exec S1
bind S1 ...
exec S1
bind S1 ...
exec S1
sync

I do not know how this could be made more efficient as I execute parse only
once, and then I send bind+exec+bind+exec
without intermediate sync messages, so the data should flow nicely in TCP
packets.
As I said above, the same flow for a multi-value insert beats the
bind+exec+bind+exec sequence at the cost of poorer reporting.
For instance, for a multi-value insert we can't tell how many rows were
generated for each statement.
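
For what it's worth, the same flow can also be produced from C with libpq's
pipeline mode: one Parse up front, then queued Bind+Execute pairs, and a
single trailing Sync. The sketch below assumes a made-up table tab(id, name)
and abbreviates error handling; it only illustrates the message sequence, it
is not a benchmark:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* connection settings from the environment */
    const char *rows[4][2] = {{"1", "a"}, {"2", "b"}, {"3", "c"}, {"4", "d"}};
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    /* one Parse message, done once */
    res = PQprepare(conn, "s1", "insert into tab(id, name) values($1, $2)",
                    2, NULL);
    PQclear(res);

    /* queue Bind+Execute pairs with no intermediate Sync */
    PQenterPipelineMode(conn);
    for (int i = 0; i < 4; i++)
        PQsendQueryPrepared(conn, "s1", 2, rows[i], NULL, NULL, 0);
    PQpipelineSync(conn);       /* the single Sync closing the batch */

    /* one PGresult plus a NULL terminator per queued Execute */
    for (int i = 0; i < 4; i++)
    {
        res = PQgetResult(conn);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert %d failed: %s", i, PQerrorMessage(conn));
        PQclear(res);
        res = PQgetResult(conn);    /* NULL separator for this statement */
    }
    res = PQgetResult(conn);        /* PGRES_PIPELINE_SYNC */
    PQclear(res);
    PQexitPipelineMode(conn);
    PQfinish(conn);
    return 0;
}

Per statement you still get an individual command status (and row count),
which is exactly the reporting that a single multi-value insert cannot give
you.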

---

Here are some measurements regarding savepoints for simple vs extended protocol.
Sure, they are not very scientific; however, they show an improvement for the
simple protocol.

$ cat svpnt

BEGIN;
SAVEPOINT PGJDBC_AUTOSAVE;
RELEASE SAVEPOINT PGJDBC_AUTOSAVE;
COMMIT;

$ pgbench -f svpnt --protocol=extended --time=10 --progress=1 -r

progress: 1.0 s, 4213.8 tps, lat 0.237 ms stddev 0.034, 0 failed
progress: 2.0 s, 4367.9 tps, lat 0.229 ms stddev 0.024, 0 failed
progress: 3.0 s, 4296.2 tps, lat 0.233 ms stddev 0.038, 0 failed
progress: 4.0 s, 4382.0 tps, lat 0.228 ms stddev 0.026, 0 failed
progress: 5.0 s, 4374.1 tps, lat 0.228 ms stddev 0.026, 0 failed
progress: 6.0 s, 4305.7 tps, lat 0.232 ms stddev 0.035, 0 failed
progress: 7.0 s, 4111.1 tps, lat 0.243 ms stddev 0.182, 0 failed
progress: 8.0 s, 4245.0 tps, lat 0.235 ms stddev 0.042, 0 failed
progress: 9.0 s, 4219.9 tps, lat 0.237 ms stddev 0.036, 0 failed
progress: 10.0 s, 4231.1 tps, lat 0.236 ms stddev 0.031, 0 failed
transaction type: svpnt
scaling factor: 1
query mode: extended
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 42748
number of failed transactions: 0 (0.000%)
latency average = 0.234 ms
latency stddev = 0.065 ms
initial connection time = 2.178 ms
tps = 4275.562760 (without initial connection time)
statement latencies in milliseconds and failures:
 0.058   0  BEGIN;
 0.058   0  SAVEPOINT PGJDBC_AUTOSAVE;
 0.058   0  RELEASE SAVEPOINT PGJDBC_AUTOSAVE;
 0.060   0  COMMIT;

$ pgbench -f svpnt --protocol=simple --time=10 --progress=1 -r

progress: 1.0 s, 4417.7 tps, lat 0.225 ms stddev 0.033, 0 failed
progress: 2.0 s, 4446.0 tps, lat 0.225 ms stddev 0.079, 0 failed
progress: 3.0 s, 4377.1 tps, lat 0.228 ms stddev 0.048, 0 failed
progress: 4.0 s, 4485.0 tps, lat 0.223 ms stddev 0.024, 0 failed
progress: 5.0 s, 4355.9 tps, lat 0.229 ms stddev 0.353, 0 failed
progress: 6.0 s, .3 tps, lat 0.225 ms stddev 0.035, 0 failed
progress: 7.0 s, 4530.7 tps, lat 0.220 ms stddev 0.020, 0 failed
progress: 8.0 s, 4431.1 tps, lat 0.225 ms stddev 0.022, 0 failed
progress: 9.0 s, 4497.1 tps, lat 0.222 ms stddev 0.027, 0 failed
progress: 10.0 s, 4507.0 tps, lat 0.222 ms stddev 0.024, 0 failed
transaction type: svpnt
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 44493
number of failed transactions: 0 (0.000%)
latency average = 0.224 ms
latency stddev = 0.116 ms
initial connection time = 2.690 ms
tps = 4450.372095 (without initial connection time)
statement latencies in milliseconds and failures:
 0.056   0  BEGIN;
 0.056   0  SAVEPOINT PGJDBC_AUTOSAVE;
 0.056   0  RELEASE SAVEPOINT PGJDBC_AUTOSAVE;
 0.057   0  COMMIT;

Vladimir


Re: A problem about partitionwise join

2024-02-22 Thread Ashutosh Bapat
On Wed, Feb 21, 2024 at 4:55 PM Richard Guo  wrote:
>
>
> On Tue, Aug 2, 2022 at 4:24 AM Jacob Champion  wrote:
>>
>> Once you think you've built up some community support and the patchset
>> is ready for review, you (or any interested party) can resurrect the
>> patch entry by visiting
>>
>> https://commitfest.postgresql.org/38/2266/
>>
>> and changing the status to "Needs Review", and then changing the
>> status again to "Move to next CF". (Don't forget the second step;
>> hopefully we will have streamlined this in the near future!)
>
>
> This patch was returned due to 'lack of interest'.  However, upon
> verification, it appears that the reported issue still exists, and the
> proposed fix in the thread remains valid.  Hence, resurrect this patch
> after rebasing it on master.  I've also written a detailed commit
> message which hopefully can help people review the changes more
> effectively.

The concept looks useful. The SQL statement added in the test looks
cooked up though (it outputs data that has the same value for two columns,
equal to the primary key of the other table - when would somebody do
that?). Is there some real-life example of this?

The patch uses restrictclauses as well as ECs. Tom has proposed to
make ECs work with outer joins sensibly. Has that happened? Can this
patch leverage that rather than having two loops?


-- 
Best Wishes,
Ashutosh Bapat




Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Peter Eisentraut

On 22.02.24 02:01, Michael Paquier wrote:

On Wed, Feb 21, 2024 at 12:18:45PM +0530, Ashutosh Bapat wrote:

Even with 1 and 2 the test is useful to detect dump/restore anomalies.
I think we should improve 3, but I don't have a good and simpler
solution. I didn't find any way to compare two given clusters in our
TAP test framework. Building it will be a lot of work. Not sure if
it's worth it.


+   my $rc =
+ system($ENV{PG_REGRESS}
+ . " $extra_opts "
+ . "--dlpath=\"$dlpath\" "
+ . "--bindir= "
+ . "--host="
+ . $node->host . " "
+ . "--port="
+ . $node->port . " "
+ . "--schedule=$srcdir/src/test/regress/parallel_schedule "
+ . "--max-concurrent-tests=20 "
+ . "--inputdir=\"$inputdir\" "
+ . "--outputdir=\"$outputdir\"");

I am not sure that it is a good idea to add a full regression test
cycle while we have already 027_stream_regress.pl that would be enough
to test some dump scenarios.  These are very expensive and easy to
notice even with a high level of parallelization of the tests.


The problem is, we don't really have any end-to-end coverage of

dump
restore
dump again
compare the two dumps

with a database with lots of interesting objects in it.

Note that each of these steps could fail.

We have somewhat relied on the pg_upgrade test to provide this testing, 
but we have recently discovered that the dumps in binary-upgrade mode 
are different enough to not test the normal dumps well.


Yes, this test is a bit expensive.  We could save some time by doing the 
first dump at the end of the normal regress test and have the pg_dump 
test reuse that, but then that would make the regress test run a bit 
longer.  Is that a better tradeoff?


I have done some timing tests:

master:

pg_dump check: 22s
pg_dump check -j8: 8s
check-world -j8:   2min44s

patched:

pg_dump check: 34s
pg_dump check -j8: 13s
check-world -j8:   2min46s

So overall it doesn't seem that bad.





Re: When extended query protocol ends?

2024-02-22 Thread Jelte Fennema-Nio
On Wed, 21 Feb 2024 at 17:07, Vladimir Sitnikov
 wrote:
> From many measurements we know that insert into table(id, name) 
> values(?,?),(?,?),(?,?) is much more efficient than
> sending individual bind-exec-bind-exec-bind-exec-sync messages like "insert 
> into table(id, name) values(?,?)"
> For instance, here are some measurements: 
> https://www.baeldung.com/spring-jdbc-batch-inserts#performance-comparisons
> Based on that measurements I assume there's a non-trivial per-message 
> overhead.

That's quite a different case. When splitting a multi insert statement
you're going to duplicate some work, e.g. executor initialization and
possibly even planning. But when replacing one Query packet with
Parse-Bind-Exec-Sync, these 4 packets are not duplicating such
expensive work. The only thing they should be doing extra is a bit of
packet parsing, which is very cheap.




Re: Test to dump and restore objects left behind by regression

2024-02-22 Thread Ashutosh Bapat
On Thu, Feb 22, 2024 at 6:32 AM Michael Paquier  wrote:
>
> On Wed, Feb 21, 2024 at 12:18:45PM +0530, Ashutosh Bapat wrote:
> > Even with 1 and 2 the test is useful to detect dump/restore anomalies.
> > I think we should improve 3, but I don't have a good and simpler
> > solution. I didn't find any way to compare two given clusters in our
> > TAP test framework. Building it will be a lot of work. Not sure if
> > it's worth it.
>
> +   my $rc =
> + system($ENV{PG_REGRESS}
> + . " $extra_opts "
> + . "--dlpath=\"$dlpath\" "
> + . "--bindir= "
> + . "--host="
> + . $node->host . " "
> + . "--port="
> + . $node->port . " "
> + . "--schedule=$srcdir/src/test/regress/parallel_schedule "
> + . "--max-concurrent-tests=20 "
> + . "--inputdir=\"$inputdir\" "
> + . "--outputdir=\"$outputdir\"");
>
> I am not sure that it is a good idea to add a full regression test
> cycle while we have already 027_stream_regress.pl that would be enough
> to test some dump scenarios.

That test *uses* pg_dump as a way to test whether the two clusters are
in sync. The test might change in the future to use some other method to
make sure the two clusters are consistent. Adding this test to
that one would make that change much harder.

It's not the dump but the restore that we are interested in here. No test
that runs PG_REGRESS also runs pg_restore in non-binary mode.

Also, we need to keep this test near the other pg_dump tests, not far from them.

> These are very expensive and easy to
> notice even with a high level of parallelization of the tests.

I agree, but I didn't find a suitable test to ride on.

-- 
Best Wishes,
Ashutosh Bapat




Re: Removing unneeded self joins

2024-02-22 Thread Andrei Lepikhov

On 21/2/2024 14:26, Richard Guo wrote:

I think the right fix for these issues is to introduce a new element
'sublevels_up' in ReplaceVarnoContext, and enhance replace_varno_walker
to 1) recurse into subselects with sublevels_up increased, and 2)
perform the replacement only when varlevelsup is equal to sublevels_up.

This code looks good. No idea how we lost it before.


While writing the fix, I noticed some outdated comments.  Such as in
remove_rel_from_query, the first for loop updates otherrel's attr_needed
as well as lateral_vars, but the comment only mentions attr_needed.  So
this patch also fixes some outdated comments.

Thanks, looks good.


While writing the test cases, I found that the test cases for SJE are
quite messy.  Below are what I have noticed:

* There are several test cases using catalog tables like pg_class,
pg_stats, pg_index, etc. for testing join removal.  I don't see a reason
why we need to use catalog tables, and I think this just raises the risk
of instability.

I see only one unusual query with the pg_class involved.


* In many test cases, a mix of uppercase and lowercase keywords is used
in one query.  I think it'd better to maintain consistency by using
either all uppercase or all lowercase keywords in one query.

I see an uppercase -> lowercase change:
select t1.*, t2.a as ax from sj t1 join sj t2
and a lowercase -> uppercase change in many other cases:
explain (costs off)
I guess it is a matter of taste, so I leave it to the committer's decision.
Technically, it's OK.


* In most situations, we verify the plan and the output of a query like:

explain (costs off)
select ...;
select ...;

The two select queries are supposed to be the same.  But in the SJE test
cases, I have noticed instances where the two select queries differ from
each other.

This patch also includes some cosmetic tweaks for SJE test cases.  It
does not change the test cases using catalog tables though.  I think
that should be a separate patch.
I can't assess the necessity of changing these dozens of lines of code 
because I follow another commenting style, but technically, it's still OK.


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Reduce useless changes before reassembly during logical replication

2024-02-22 Thread Andy Fan

Hi, 
>
> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable(accessing caches,
> transaction-related stuff) can add noticeable overhead as anyway we do
> that later in pgoutput_change(). I think I gave the same comment
> earlier as well but didn't see any satisfactory answer or performance
> data for successful cases to back this proposal.

I did some benchmarking yesterday at [1] and found it adds 20% CPU time.
Then I came up with a basic idea that I think deserves a share. The
"transaction related stuff" comes from the syscache/systable access (apart
from the historic snapshot), and the syscache is required in the following
situations:

1. relfilenode (from WAL) -> relid.
2. relid -> namespaceid (to check if the relid is a toast relation).
3. if toast, get its original relid.
4. access the data from pg_publication_tables.
5. see if the relid is a partition; if yes, we may get its root
relation.

Actually we already have a RelationSyncCache for #4, and it *only* needs
to access the syscache when replicate_valid is false. I think this case
should be rare, but the caller doesn't know that, so the caller must
prepare the transaction stuff in advance even though in most cases it is
not used. So I think we can get an optimization here.

That is what the attached patch does.

Author: yizhi.fzh 
Date:   Wed Feb 21 18:40:03 2024 +0800

Make get_rel_sync_entry less dependent on transaction state.

get_rel_sync_entry needs a transaction only when a replicate_valid = false
entry is found, which should be a rare case. However the caller can't
know whether an entry is valid, so they have to prepare the transaction state
before calling this function. Such preparation is expensive.

This patch makes get_rel_sync_entry manage the transaction state itself,
only when necessary, so the callers don't need to prepare it blindly.

Then coming to #1, actually we have RelfilenumberMapHash as a cache; when
the cache is hit (suppose this is the usual case), no transaction stuff is
needed.  I have two ideas then:

1. Optimize the cache hit situation like what we just did for
get_rel_sync_entry for all 5 kinds of data and only pay the
effort in the cache miss case. For the data for #2, #3, #5, all the keys
are relid, so I think a single HTAB should be OK (a rough sketch follows
below).

2. Add the content for #1, #2, #3, #5 to WAL when wal_level is set to
logical.

In either case, the changes to get_rel_sync_entry would be needed.
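
For idea #1, a relid-keyed cache could look roughly like the sketch below.
Nothing here is from the attached patch: the entry fields and names are made
up, and cache invalidation plus the actual syscache lookups are left out; it
only shows the shape of the thing.

#include "postgres.h"

#include "utils/hsearch.h"

typedef struct RelMetaCacheEntry
{
    Oid         relid;          /* hash key */
    bool        valid;
    bool        is_toast;
    Oid         toast_owner;    /* owning relation if is_toast */
    Oid         root_relid;     /* partition root, or relid itself */
} RelMetaCacheEntry;

static HTAB *RelMetaCache = NULL;

static RelMetaCacheEntry *
get_rel_meta(Oid relid)
{
    RelMetaCacheEntry *entry;
    bool        found;

    if (RelMetaCache == NULL)
    {
        HASHCTL     ctl;

        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(RelMetaCacheEntry);
        RelMetaCache = hash_create("rel meta cache", 128, &ctl,
                                   HASH_ELEM | HASH_BLOBS);
    }

    entry = (RelMetaCacheEntry *) hash_search(RelMetaCache, &relid,
                                              HASH_ENTER, &found);
    if (!found || !entry->valid)
    {
        /*
         * Only this slow path touches the syscache, so only here would we
         * need to make sure a transaction is in progress (e.g. check
         * IsTransactionState() and start one if needed), which is the same
         * trick the attached patch plays for get_rel_sync_entry.
         */
        /* ... fill is_toast, toast_owner, root_relid here ... */
        entry->valid = true;
    }
    return entry;
}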

> Note, users can
> configure to stream_in_progress transactions in which case they
> shouldn't see such a big problem.

People would see the changes spilled to disk, but the CPU cost in the
reorder buffer would still be paid, I think.

[1] https://www.postgresql.org/message-id/87o7cadqj3.fsf%40163.com

-- 
Best Regards
Andy Fan

>From 90b8df330df049bd1d7e881dc6e9b108c17b0924 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Wed, 21 Feb 2024 18:40:03 +0800
Subject: [PATCH v1 1/1] Make get_rel_sync_entry less dependent on transaction
 state.

get_rel_sync_entry needs a transaction only when a replicate_valid = false
entry is found, which should be a rare case. However the caller can't
know whether an entry is valid, so they have to prepare the transaction state
before calling this function. Such preparation is expensive.

This patch makes get_rel_sync_entry manage the transaction state itself,
only when necessary, so the callers don't need to prepare it blindly.
---
 src/backend/replication/pgoutput/pgoutput.c | 60 -
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 998f92d671..25e55590a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -13,6 +13,7 @@
 #include "postgres.h"
 
 #include "access/tupconvert.h"
+#include "access/relation.h"
 #include "catalog/partition.h"
 #include "catalog/pg_publication.h"
 #include "catalog/pg_publication_rel.h"
@@ -214,6 +215,11 @@ static void init_rel_sync_cache(MemoryContext cachectx);
 static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 			 Relation relation);
+static RelationSyncEntry *get_rel_sync_entry_by_relid(PGOutputData *data,
+	  Oid relid);
+static RelationSyncEntry *get_rel_sync_entry_internal(PGOutputData *data,
+	  Relation relation,
+	  Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 		  uint32 hashvalue);
@@ -1962,11 +1968,29 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
  */
 static RelationSyncEntry *
 get_rel_sync_entry(PGOutputData *data, Relation relation)
+{
+	return get_rel_sync_entry_internal(data, relation, InvalidOid);
+}
+
+static RelationSyncEntry *
+__attribute__ ((unused))

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-02-22 Thread Bertrand Drouvot
Hi,

On Wed, Feb 21, 2024 at 08:10:00PM +0530, Bharath Rupireddy wrote:
> On Wed, Feb 21, 2024 at 5:55 PM Bertrand Drouvot
>  wrote:
> > My initial thought was to put "conflict" value in this new field in case of
> > conflict (not to mention the conflict reason in it). With the current 
> > proposal
> > invalidation_reason could report the same as conflict_reason, which sounds 
> > weird
> > to me.
> >
> > Does that make sense to you to use "conflict" as value in 
> > "invalidation_reason"
> > when the slot has "conflict_reason" not NULL?
> 
> I'm thinking the other way around - how about we revert
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=007693f2a3ac2ac19affcb03ad43cdb36ccff5b5,
> that is, put in place "conflict" as a boolean and introduce
> invalidation_reason the text form. So, for logical slots, whenever the
> "conflict" column is true, the reason is found in invaldiation_reason
> column? How does it sound?

Yeah, I think that looks fine too. We would need more changes (like taking care of
ddd5f4f54a, for example).

CC'ing Amit, Hou-San and Shveta to get their point of view (as the ones behind
007693f2a3 and ddd5f4f54a).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Injection points: some tools to wait and wake

2024-02-22 Thread Bertrand Drouvot
Hi,

On Thu, Feb 22, 2024 at 12:02:01PM +0900, Michael Paquier wrote:
> On Wed, Feb 21, 2024 at 11:50:21AM +, Bertrand Drouvot wrote:
> > A few comments:
> > 
> > 1 ===
> > I think "up" is missing at several places in the patch where "wake" is used.
> > I could be wrong as non native english speaker though.
> 
> Patched up three places to be more consistent.

Thanks!

> > 5 ===
> > 
> > +PG_FUNCTION_INFO_V1(injection_points_wakeup);
> > +Datum
> > +injection_points_wakeup(PG_FUNCTION_ARGS)
> > 
> > Empty line missing before "Datum"?
> 
> I've used the same style as anywhere else.

Hmm, looking at src/test/regress/regress.c for example, I can see an empty
line between PG_FUNCTION_INFO_V1 and Datum for all the PG_FUNCTION_INFO_V1
declarations.

> > While at it, should we add a second injection wait point in
> > t/041_invalid_checkpoint_after_promote.pl to check that they are wake up
> > individually?
> 
> I'm not sure that's a good idea for the sake of this test, TBH, as
> that would provide coverage outside the original scope of the
> restartpoint/promote check.

Yeah, that was my doubt too.

> I have also looked at if it would be possible to get an isolation test
> out of that, but the arbitrary wait does not make that possible with
> the existing isolation APIs, see GetSafeSnapshotBlockingPids() used in
> pg_isolation_test_session_is_blocked().  isolation/README seems to be
> a bit off here, actually, mentioning pg_locks..  We could use some
> tricks with transactions or locks internally, but I'm not really
> tempted to make the wait callback more complicated for the sake of
> more coverage as I'd rather keep it generic for anybody who wants to
> control the order of events across processes.

Makes sense to me, let's keep it as it is.
> 
> Attaching a v3 for reference with everything that has been mentioned
> up to now.

Thanks!

Sorry, I missed these previously:

1 ===

+/* Maximum number of wait usable in injection points at once */

s/Maximum number of wait/Maximum number of waits/ ?

2 ===

+# Check the logs that the restart point has started on standby.  This is
+# optional, but let's be sure.
+my $log = slurp_file($node_standby->logfile, $logstart);
+my $checkpoint_start = 0;
+if ($log =~ m/restartpoint starting: immediate wait/)
+{
+   $checkpoint_start = 1;
+}
+is($checkpoint_start, 1, 'restartpoint has started');

what about?

ok( $node_standby->log_contains( "restartpoint starting: immediate wait", 
$logstart),
"restartpoint has started");

Except for the above, v3 looks good to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com