Re: Synchronizing slots from primary to standby
Hi, On Fri, Feb 23, 2024 at 09:43:48AM +0530, shveta malik wrote: > On Fri, Feb 23, 2024 at 8:35 AM shveta malik wrote: > > > > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > > wrote: > > > > > > Suppose that in synchronize_slots() the query would be: > > > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > > " restart_lsn, catalog_xmin, two_phase, failover," > > > " database, conflict_reason" > > > " FROM pg_catalog.pg_replication_slots" > > > " WHERE failover and NOT temporary and 1 = 1"; > > > > > > Then my comment is to rewrite it to: > > > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > > " restart_lsn, catalog_xmin, two_phase, failover," > > > " database, conflict_reason" > > > " FROM pg_catalog.pg_replication_slots" > > > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) > > > 1"; > > > > > > to ensure the operator "=" is coming from the pg_catalog schema. > > > > > > > Thanks for the details, but slot-sync does not use SPI calls, it uses > > libpqrcv calls. So is this change needed? > > Additionally, I would like to have a better understanding of why it's > necessary and whether it addresses any potential security risks. Because one could create say the "=" OPERATOR in their own schema, attach a function to it doing undesired stuff and change the search_path for the database the sync slot worker connects to. Then this new "=" operator would be used (instead of the pg_catalog.= one), triggering the "undesired" function as superuser. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi, On Fri, Feb 23, 2024 at 08:35:44AM +0530, shveta malik wrote: > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > wrote: > > > > Suppose that in synchronize_slots() the query would be: > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > " restart_lsn, catalog_xmin, two_phase, failover," > > " database, conflict_reason" > > " FROM pg_catalog.pg_replication_slots" > > " WHERE failover and NOT temporary and 1 = 1"; > > > > Then my comment is to rewrite it to: > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > " restart_lsn, catalog_xmin, two_phase, failover," > > " database, conflict_reason" > > " FROM pg_catalog.pg_replication_slots" > > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; > > > > to ensure the operator "=" is coming from the pg_catalog schema. > > > > Thanks for the details, but slot-sync does not use SPI calls, it uses > libpqrcv calls. Sorry for the confusion, I meant to say "remote SQL calls". > So is this change needed? The example I provided is a "fake" one (as currently the "=" operator is not used in the const char *query in synchronize_slots()). So there is currently nothing to change here. I just want to highlight that if we are using (or will use) operators in the remote SQL calls then we should ensure they are coming from the pg_catalog schema (as in the example provided above). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > wrote: > > > > Sure, but is that really what we want? > > > > So your question is do we want these buffers to be in multiple of > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > don't think it should create any problem logically. I mean we can > > look again in the patch to see if we have made any such assumptions > > but that should be fairly easy to fix, then maybe if we are going in > > this way we should get rid of the check_slru_buffers() function as > > well. > > Not really, I just don't think the macro should be in slru.h. Okay > Another thing I've been thinking is that perhaps it would be useful to > make the banks smaller, when the total number of buffers is small. For > example, if you have 16 or 32 buffers, it's not really clear to me that > it makes sense to have just 1 bank or 2 banks. It might be more > sensible to have 4 banks with 4 or 8 buffers instead. That should make > the algorithm scale down as well as up ... It might be helpful to have small-size banks when SLRU buffers are set to a very low value and we are only accessing a couple of pages at a time (i.e. no buffer replacement) because in such cases most of the contention will be on SLRU Bank lock. Although I am not sure how practical such a use case would be, I mean if someone is using multi-xact very heavily or creating frequent subtransaction overflow then wouldn't they should set this buffer limit to some big enough value? By doing this we would lose some simplicity of the patch I mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would need to compute this and store it in SlruShared. Maybe that's not that bad. > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones. I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. Thanks, I will review those changes. > I did add a bunch of Assert()s to make sure the locks that are supposed > to be held are actually held. This led me to testing the page status to > be not EMPTY during SimpleLruWriteAll() before calling > SlruInternalWritePage(), because the assert was firing. The previous > code is not really *buggy*, but to me it's weird to call WritePage() on > a slot with no contents. Okay, I mean internally SlruInternalWritePage() will flush only if the status is SLRU_PAGE_VALID, but it is better the way you have done. > Another change was in TransactionGroupUpdateXidStatus: the original code > had the leader doing pg_atomic_read_u32(>clogGroupFirst) to > know which bank to lock. I changed it to simply be the page used by the > leader process; this doesn't need an atomic read, and should be the same > page anyway. (If it isn't, it's no big deal). But what's more: even if > we do read ->clogGroupFirst at that point, there's no guarantee that > this is going to be exactly for the same process that ends up being the > first in the list, because since we have not set it to INVALID by the > time we grab the bank lock, it is quite possible for more processes to > add themselves to the list. Yeah, this looks better > I realized all this while rewriting the comments in a way that would let > me understand what was going on ... so IMO the effort was worthwhile. +1 I will review and do some more testing early next week and share my feedback. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Wed, Feb 21, 2024 at 8:34 AM Kyotaro Horiguchi wrote: > > At Tue, 20 Feb 2024 19:56:10 +0530, Robert Haas wrote > in > > It seems like maybe somebody should look into why this is happening, > > and perhaps fix it. > > GetConnection()@streamutil.c wants to ensure conninfo has a fallback > database name ("replication"). However, the function seems to be > ignoring the case where neither dbname nor connection string is given, > which is the first case Kuroda-san raised. The second case is the > intended behavior of the function. > > > /* pg_recvlogical uses dbname only; others use connection_string > > only. */ > > Assert(dbname == NULL || connection_string == NULL); > > And the function incorrectly assumes that the connection string > requires "dbname=replication". > > >* Merge the connection info inputs given in form of connection > > string, > >* options and default values (dbname=replication, replication=true, > > etc.) > > But the name is a pseudo database name only used by pg_hba.conf > (maybe) , which cannot be used as an actual database name (without > quotation marks, or unless it is actually created). The function > should not add the fallback database name because the connection > string for physical replication connection doesn't require the dbname > parameter. (attached patch) > We do append dbname=replication even in libpqrcv_connect for .pgpass lookup as mentioned in comments. See below: libpqrcv_connect() { else { /* * The database name is ignored by the server in replication mode, * but specify "replication" for .pgpass lookup. */ keys[++i] = "dbname"; vals[i] = "replication"; } ... } I think as part of this effort we should just add dbname to primary_conninfo written in postgresql.auto.conf file. As above, the question is valid whether we should do it just for 17 or for prior versions. Let's discuss more on that. I am not sure of the use case for versions before 17 but commit cca97ce6a665 mentioned that some middleware or proxies might however need to know the dbname to make the correct routing decision for the connection. Does that apply here as well? If so, we can do it and update the docs, otherwise, let's do it just for 17. -- With Regards, Amit Kapila.
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Wed, Feb 21, 2024 at 7:46 AM Hayato Kuroda (Fujitsu) wrote: > > > > Just FYI - here is an extreme case. And note that I have applied proposed > > > patch. > > > > > > When `pg_basebackup -D data_N2 -R` is used: > > > ``` > > > primary_conninfo = 'user=hayato ... dbname=hayato ... > > > ``` > > > > > > But when `pg_basebackup -d "" -D data_N2 -R` is used: > > > ``` > > > primary_conninfo = 'user=hayato ... dbname=replication > > > ``` > > > > It seems like maybe somebody should look into why this is happening, > > and perhaps fix it. > > I think this caused from below part [1] in GetConnection(). > > If both dbname and connection_string are the NULL, we will enter the else part > and NULL would be substituted - {"dbnmae", NULL} key-value pair is generated > only here. > > Then, in PQconnectdbParams()->PQconnectStartParams->pqConnectOptions2(), > the strange part would be found and replaced to the username [2]. > > I think if both the connection string and the dbname are NULL, it should be > considered as the physical replication connection. here is a patch to fix it. > When dbname is NULL or not given, it defaults to username. This follows the specs of the connection string. See (dbname # The database name. Defaults to be the same as the user name...) [1]. Your patch breaks that specs, so I don't think it is correct. [1] - https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-CONNSTRING -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
On Fri, Feb 23, 2024 at 10:06 AM Zhijie Hou (Fujitsu) wrote: > > > I noticed one CFbot failure[1] which is because the tap-test doesn't wait for > the > standby to catch up before promoting, thus the data inserted after promotion > could not be replicated to the subscriber. Add a wait_for_replay_catchup to > fix it. > > Apart from this, I also adjusted some variable names in the tap-test to be > consistent. And added back a mis-removed ProcessConfigFile call. > > [1] https://cirrus-ci.com/task/6126787437002752?logs=check_world#L312 > Thanks for the patches. Had a quick look at v95_2, here are some trivial comments: slot.h: - 1) extern List *GetStandbySlotList(bool copy); extern void WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn); extern void FilterStandbySlots(XLogRecPtr wait_for_lsn, List **standby_slots); The order is different from the one in slot.c slot.c: - 2) warningfmt = _("replication slot \"%s\" specified in parameter \"%s\" does not exist, ignoring"); GUC names should not have double quotes. Same in each warningfmt in this function 3) errmsg("replication slot \"%s\" specified in parameter \"%s\" does not have active_pid", Same here, double quotes around standby_slot_names should be removed walsender.c: -- 4) * Used by logical decoding SQL functions that acquired slot with failover * enabled. To be consistent with other such comments in previous patches: slot with failover enabled --> failover enabled slot 5) Wake up the logical walsender processes with failover-enabled slots failover-enabled slots --> failover enabled slots postgresql.conf.sample: -- 6) streaming replication standby server slot names that logical walsender processes will wait for Is it better to say it like this? (I leave this to your preference) streaming replication standby server slot names for which logical walsender processes will wait. thanks Shveta
Re: Test to dump and restore objects left behind by regression
On Thu, Feb 22, 2024 at 8:35 PM Tom Lane wrote: > > Peter Eisentraut writes: > > The problem is, we don't really have any end-to-end coverage of > > > dump > > restore > > dump again > > compare the two dumps > > > with a database with lots of interesting objects in it. > > I'm very much against adding another full run of the core regression > tests to support this. This will be taken care of by Peter's latest idea of augmenting existing 002_pg_upgrade.pl. > But beyond the problem of not bloating the > check-world test runtime, there is the question of what this would > actually buy us. I doubt that it is worth very much, because > it would not detect bugs-of-omission in pg_dump. As I remarked in > the other thread, if pg_dump is blind to the existence of some > feature or field, testing that the dumps compare equal will fail > to reveal that it didn't restore that property. > > I'm not sure what we could do about that. One could imagine writing > some test infrastructure that dumps out the contents of the system > catalogs directly, and comparing that instead of pg_dump output. > But that'd be a lot of infrastructure to write and maintain ... > and it's not real clear why it wouldn't *also* suffer from > I-forgot-to-add-this hazards. If a developer forgets to add logic to dump objects that their patch adds, it's hard to detect it, through testing alone, in every possible case. We need reviewers to take care of that. I don't think that's the objective of this test case or of pg_upgrade test either. > > On balance, I think there are good reasons that we've not added > such a test, and I don't believe those tradeoffs have changed. > I am not aware of those reasons. Are they documented somewhere? Any pointers to the previous discussion on this topic? Googling "pg_dump regression pgsql-hackers" returns threads about performance regressions. On the flip side, the test I wrote reproduces the COMPRESSION/STORAGE bug you reported along with a few other bugs in that area which I will report soon on that thread. I think, that shows that we need such a test. -- Best Wishes, Ashutosh Bapat
Re: Test to dump and restore objects left behind by regression
On Thu, Feb 22, 2024 at 3:50 PM Peter Eisentraut wrote: > > On 22.02.24 11:00, Daniel Gustafsson wrote: > >> On 22 Feb 2024, at 10:55, Ashutosh Bapat > >> wrote: > >> On Thu, Feb 22, 2024 at 3:03 PM Daniel Gustafsson wrote: > > > >> Somebody looking for dump/restore tests wouldn't search > >> src/bin/pg_upgrade, I think. > > > > Quite possibly not, but pg_upgrade is already today an important testsuite > > for > > testing pg_dump in binary-upgrade mode so maybe more developers touching > > pg_dump should? > > Yeah, I think attaching this to the existing pg_upgrade test would be a > good idea. Not only would it save test run time, it would probably also > reduce code duplication. > That's more than one vote for adding the test to 002_pg_ugprade.pl. Seems fine to me. -- Best Wishes, Ashutosh Bapat
Re: RFC: Logging plan of the running query
On Thu, Feb 22, 2024 at 6:25 AM James Coleman wrote: > This is potentially a bit of a wild idea, but I wonder if having some > kind of argument to CHECK_FOR_INTERRUPTS() signifying we're in > "normal" as opposed to "critical" (using that word differently than > the existing critical sections) would be worth it. It's worth considering, but the definition of "normal" vs. "critical" might be hard to pin down. Or, we might end up with a definition that is specific to this particular case and not generalizable to others. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Slow catchup of 2PC (twophase) transactions on replica in LR
On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий wrote: > Dear All, > > I'd like to present and talk about a problem when 2PC transactions are > applied quite slowly on a replica during logical replication. There is a > master and a replica with established logical replication from the master > to the replica with twophase = true. With some load level on the master, > the replica starts to lag behind the master, and the lag will be > increasing. We have to significantly decrease the load on the master to > allow replica to complete the catchup. Such problem may create significant > difficulties in the production. The problem appears at least on > REL_16_STABLE branch. > > To reproduce the problem: > >- Setup logical replication from master to replica with subscription >parameter twophase = true. >- Create some intermediate load on the master (use pgbench with custom >sql with prepare+commit) >- Optionally switch off the replica for some time (keep load on >master). >- Switch on the replica and wait until it reaches the master. > > The replica will never reach the master with even some low load on the > master. If to remove the load, the replica will reach the master for much > greater time, than expected. I tried the same for regular transactions, but > such problem doesn't appear even with a decent load. > > > I tried this setup and I do see that the logical subscriber does reach the master in a short time. I'm not sure what I'm missing. I stopped the logical subscriber in between while pgbench was running and then started it again and ran the following: postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication; sent_lsn | pg_current_wal_lsn ---+ 0/6793FA0 | 0/6793FA0 <=== caught up (1 row) My pgbench command: pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5 my custom sql file: cat test.sql SELECT md5(random()::text) as mygid \gset BEGIN; DELETE FROM test WHERE v = pg_backend_pid(); INSERT INTO test(v) SELECT pg_backend_pid(); PREPARE TRANSACTION $$:mygid$$; COMMIT PREPARED $$:mygid$$; regards, Ajin Cherian Fujitsu Australia
Re: About a recently-added message
On Thu, Feb 22, 2024 at 11:36 AM Kyotaro Horiguchi wrote: > > At Wed, 21 Feb 2024 14:57:42 +0530, Amit Kapila > wrote in > > On Tue, Feb 20, 2024 at 3:21 PM shveta malik wrote: > > > > > > okay, attached v2 patch with changed error msgs and double quotes > > > around logical. > > > > > > > Horiguchi-San, does this address all your concerns related to > > translation with these new messages? > > Yes, I'm happy with all of the changes. The proposed patch appears to > cover all instances related to slotsync.c, and it looks fine to > me. Thanks! > > I found that logica.c is also using the policy that I complained > about, but it is a separate issue. > > ./logical.c 122:errmsg("logical decoding requires wal_level >= > logical"))); > Hmm. I have a currently stalled patch-set to simplify the quoting of all the GUC names by using one rule. The consensus is to *always* quote them. See [1]. And those patches already are addressing that logical.c code mentioned above. IMO it would be good if we could try to get that patch set approved to fix everything in one go, instead of ad-hoc hunting for and fixing cases one at a time. == [1] https://www.postgresql.org/message-id/CAHut%2BPsf3NewXbsFKY88Qn1ON1_dMD6343MuWdMiiM2Ds9a_wA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: speed up a logical replica setup
On Fri, Feb 23, 2024 at 8:16 AM Euler Taveira wrote: > > On Wed, Feb 21, 2024, at 5:00 AM, Shlok Kyal wrote: > > I found some issues and fixed those issues with top up patches > v23-0012 and v23-0013 > 1. > Suppose there is a cascade physical replication node1->node2->node3. > Now if we run pg_createsubscriber with node1 as primary and node2 as > standby, pg_createsubscriber will be successful but the connection > between node2 and node3 will not be retained and log og node3 will > give error: > 2024-02-20 12:32:12.340 IST [277664] FATAL: database system > identifier differs between the primary and standby > 2024-02-20 12:32:12.340 IST [277664] DETAIL: The primary's identifier > is 7337575856950914038, the standby's identifier is > 7337575783125171076. > 2024-02-20 12:32:12.341 IST [277491] LOG: waiting for WAL to become > available at 0/3000F10 > > To fix this I am avoiding pg_createsubscriber to run if the standby > node is primary to any other server. > Made the change in v23-0012 patch > > > IIRC we already discussed the cascading replication scenario. Of course, > breaking a node is not good that's why you proposed v23-0012. However, > preventing pg_createsubscriber to run if there are standbys attached to it is > also annoying. If you don't access to these hosts you need to (a) kill > walsender (very fragile / unstable), (b) start with max_wal_senders = 0 or (3) > add a firewall rule to prevent that these hosts do not establish a connection > to the target server. I wouldn't like to include the patch as-is. IMO we need > at least one message explaining the situation to the user, I mean, add a hint > message. > Yeah, it could be a bit tricky for users to ensure that no follow-on standby is present but I think it is still better to give an error and prohibit running pg_createsubscriber than breaking the existing replication. The possible solution, in this case, is to allow running pg_basebackup via this tool or otherwise and then let the user use it to convert to a subscriber. It would be good to keep things simple for the first version then we can add such options like --force in subsequent versions. -- With Regards, Amit Kapila.
Re: LogwrtResult contended spinlock
On Fri, Feb 23, 2024 at 1:18 AM Jeff Davis wrote: > I don't see the global non-shared variable as a huge problem, so if it > serves a purpose then I'm fine keeping it. Perhaps we could make it a > bit safer by using some wrapper functions. I actually really hate these kinds of variables. I think they likely mask various bugs (where the value might not be up to date where we think it is) and anti-bugs (where we actually rely on the value being out of date but we don't know that we do because the code is so opaque). My vintage 2021 adventures in getting rid of the global variable ThisTimeLineID and a few other global variables found some actual but minor bugs, also found a bunch of confused code that didn't really do what it thought it did, and took up a huge amount of my time trying to analyze what was happening. I'm not prepared to recommend what we should do in this particular case. I would like to believe that the local copies can be eliminated somehow, but I haven't studied the code or done benchmarking so I don't really know enough to guess what sort of code changes would or would not be good enough. -- Robert Haas EDB: http://www.enterprisedb.com
RE: Synchronizing slots from primary to standby
On Friday, February 23, 2024 10:18 AM Zhijie Hou (Fujitsu) wrote: > > > > Hi, > > > > Since the slotsync worker patch has been committed, I rebased the > > remaining patches. > > And here is the V95 patch set. > > > > Also, I fixed a bug in the current 0001 patch where the member of the > > standby slot names list pointed to the freed memory after calling > ProcessConfigFile(). > > Now, we will obtain a new list when we call ProcessConfigFile(). The > > optimization to only get the new list when the names actually change > > has been removed. I think this change is acceptable because > > ProcessConfigFile is not a frequent occurrence. > > > > Additionally, I reordered the tests in > > 040_standby_failover_slots_sync.pl. Now the new test will be conducted > > after the sync slot test to prevent the risk of the logical slot > > occasionally not catching up to the latest catalog_xmin and, as a result, > > not > being able to be synced immediately. > > There is one unexpected change in the previous version, sorry for that. > Here is the correct version. I noticed one CFbot failure[1] which is because the tap-test doesn't wait for the standby to catch up before promoting, thus the data inserted after promotion could not be replicated to the subscriber. Add a wait_for_replay_catchup to fix it. Apart from this, I also adjusted some variable names in the tap-test to be consistent. And added back a mis-removed ProcessConfigFile call. [1] https://cirrus-ci.com/task/6126787437002752?logs=check_world#L312 Best Regards, Hou zj v96-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch Description: v96-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch v96-0002-Document-the-steps-to-check-if-the-standby-is-re.patch Description: v96-0002-Document-the-steps-to-check-if-the-standby-is-re.patch
Re: Synchronizing slots from primary to standby
On Fri, Feb 23, 2024 at 8:35 AM shveta malik wrote: > > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > wrote: > > > > Suppose that in synchronize_slots() the query would be: > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > " restart_lsn, catalog_xmin, two_phase, failover," > > " database, conflict_reason" > > " FROM pg_catalog.pg_replication_slots" > > " WHERE failover and NOT temporary and 1 = 1"; > > > > Then my comment is to rewrite it to: > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > " restart_lsn, catalog_xmin, two_phase, failover," > > " database, conflict_reason" > > " FROM pg_catalog.pg_replication_slots" > > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; > > > > to ensure the operator "=" is coming from the pg_catalog schema. > > > > Thanks for the details, but slot-sync does not use SPI calls, it uses > libpqrcv calls. So is this change needed? Additionally, I would like to have a better understanding of why it's necessary and whether it addresses any potential security risks. thanks Shveta
Re: Why is subscription/t/031_column_list.pl failing so much?
On Mon, Feb 19, 2024 at 2:19 PM Amit Kapila wrote: > > On Thu, Feb 15, 2024 at 10:46 AM vignesh C wrote: > > > > On Thu, 15 Feb 2024 at 08:42, Amit Kapila wrote: > > > > > > > > To resolve the BF > > > failure, I still feel, we should just recreate the subscription. This > > > is a pre-existing problem and we can track it via a separate patch > > > with a test case targetting such failures. > > > > +1 for going with recreation of the subscription, the changes for this > > are available at [1]. > > [1] - > > https://www.postgresql.org/message-id/CALDaNm1hLZW4H4Z61swK6WPW6pcNcjzXqw%3D6NqG7e-RMtkFaZA%40mail.gmail.com > > > > Tom, and others, does anyone still object to going ahead with an idea > by just changing the test to recreate the subscription to silence BF > failures for this test? > Seeing no objections, I have pushed the required test changes to silence the BF. -- With Regards, Amit Kapila.
Re: Slow catchup of 2PC (twophase) transactions on replica in LR
On Thu, Feb 22, 2024 at 6:59 PM Давыдов Виталий wrote: > > I'd like to present and talk about a problem when 2PC transactions are > applied quite slowly on a replica during logical replication. There is a > master and a replica with established logical replication from the master to > the replica with twophase = true. With some load level on the master, the > replica starts to lag behind the master, and the lag will be increasing. We > have to significantly decrease the load on the master to allow replica to > complete the catchup. Such problem may create significant difficulties in the > production. The problem appears at least on REL_16_STABLE branch. > > To reproduce the problem: > > Setup logical replication from master to replica with subscription parameter > twophase = true. > Create some intermediate load on the master (use pgbench with custom sql with > prepare+commit) > Optionally switch off the replica for some time (keep load on master). > Switch on the replica and wait until it reaches the master. > > The replica will never reach the master with even some low load on the > master. If to remove the load, the replica will reach the master for much > greater time, than expected. I tried the same for regular transactions, but > such problem doesn't appear even with a decent load. > > I think, the main proplem of 2PC catchup bad performance - the lack of > asynchronous commit support for 2PC. For regular transactions asynchronous > commit is used on the replica by default (subscrition sycnronous_commit = > off). It allows the replication worker process on the replica to avoid fsync > (XLogFLush) and to utilize 100% CPU (the background wal writer or > checkpointer will do fsync). I agree, 2PC are mostly used in multimaster > configurations with two or more nodes which are performed synchronously, but > when the node in catchup (node is not online in a multimaster cluster), > asynchronous commit have to be used to speedup the catchup. > I don't see we do anything specific for 2PC transactions to make them behave differently than regular transactions with respect to synchronous_commit setting. What makes you think so? Can you pin point the code you are referring to? > There is another thing that affects on the disbalance of the master and > replica performance. When the master executes requestes from multiple > clients, there is a fsync optimization takes place in XLogFlush. It allows to > decrease the number of fsync in case when a number of parallel backends write > to the WAL simultaneously. The replica applies received transactions in one > thread sequentially, such optimization is not applied. > Right, I think for this we need to implement parallel apply. > I see some possible solutions: > > Implement asyncronous commit for 2PC transactions. > Do some hacking with enableFsync when it is possible. > Can you be a bit more specific about what exactly you have in mind to achieve the above solutions? -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot wrote: > > Suppose that in synchronize_slots() the query would be: > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > " restart_lsn, catalog_xmin, two_phase, failover," > " database, conflict_reason" > " FROM pg_catalog.pg_replication_slots" > " WHERE failover and NOT temporary and 1 = 1"; > > Then my comment is to rewrite it to: > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > " restart_lsn, catalog_xmin, two_phase, failover," > " database, conflict_reason" > " FROM pg_catalog.pg_replication_slots" > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; > > to ensure the operator "=" is coming from the pg_catalog schema. > Thanks for the details, but slot-sync does not use SPI calls, it uses libpqrcv calls. So is this change needed? thanks Shveta
Re: speed up a logical replica setup
On Wed, Feb 21, 2024, at 5:00 AM, Shlok Kyal wrote: > I found some issues and fixed those issues with top up patches > v23-0012 and v23-0013 > 1. > Suppose there is a cascade physical replication node1->node2->node3. > Now if we run pg_createsubscriber with node1 as primary and node2 as > standby, pg_createsubscriber will be successful but the connection > between node2 and node3 will not be retained and log og node3 will > give error: > 2024-02-20 12:32:12.340 IST [277664] FATAL: database system > identifier differs between the primary and standby > 2024-02-20 12:32:12.340 IST [277664] DETAIL: The primary's identifier > is 7337575856950914038, the standby's identifier is > 7337575783125171076. > 2024-02-20 12:32:12.341 IST [277491] LOG: waiting for WAL to become > available at 0/3000F10 > > To fix this I am avoiding pg_createsubscriber to run if the standby > node is primary to any other server. > Made the change in v23-0012 patch IIRC we already discussed the cascading replication scenario. Of course, breaking a node is not good that's why you proposed v23-0012. However, preventing pg_createsubscriber to run if there are standbys attached to it is also annoying. If you don't access to these hosts you need to (a) kill walsender (very fragile / unstable), (b) start with max_wal_senders = 0 or (3) add a firewall rule to prevent that these hosts do not establish a connection to the target server. I wouldn't like to include the patch as-is. IMO we need at least one message explaining the situation to the user, I mean, add a hint message. I'm resistant to a new option but probably a --force option is an answer. There is no test coverage for it. I adjusted this patch (didn't include the --force option) and add a test case. > 2. > While checking 'max_replication_slots' in 'check_publisher' function, > we are not considering the temporary slot in the check: > + if (max_repslots - cur_repslots < num_dbs) > + { > + pg_log_error("publisher requires %d replication slots, but > only %d remain", > +num_dbs, max_repslots - cur_repslots); > + pg_log_error_hint("Consider increasing max_replication_slots > to at least %d.", > + cur_repslots + num_dbs); > + return false; > + } > Fixed this in v23-0013 Good catch! Both are included in the next patch. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Add publisher and subscriber to glossary documentation.
Here are some comments for patch v2. == 1. There are whitespace problems [postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:43: trailing whitespace. A node where publication is defined for ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:45: trailing whitespace. It replicates a set of changes from a table or a group of tables in ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:66: trailing whitespace. A node where subscription is defined for ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:67: trailing whitespace. logical replication. ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:68: trailing whitespace. It subscribes to one or more publications on a publisher node and pulls warning: squelched 2 whitespace errors warning: 7 lines add whitespace errors. ~~~ 2. Publisher node + + Publisher node + + + A node where publication is defined for + logical replication. + It replicates a set of changes from a table or a group of tables in + publication to the subscriber node. + + + For more information, see + . + + + + 2a. I felt the term here should be "Publication node". Indeed, there should also be additional glossary terms like "Publisher" (i.e. it is the same as "Publication node") and "Subscriber" (i.e. it is the same as "Subscription node"). These definitions will then be consistent with node descriptions already in sections 30.1 and 30.2 ~ 2b. "node" should include a link to the glossary term. Same for any other terms mentioned ~ 2c. /A node where publication is defined for/A node where a publication is defined for/ ~ 2d. "It replicates" is misleading because it is the PUBLICATION doing the replicating, not the node. IMO it will be better to include 2 more glossary terms "Publication" and "Subscription" where you can say this kind of information. Then the link "logical-replication-publication" also belongs under the "Publication" term. ~~~ 3. + + Subscriber node + + + A node where subscription is defined for + logical replication. + It subscribes to one or more publications on a publisher node and pulls + a set of changes from a table or a group of tables in publications it + subscribes to. + + + For more information, see + . + + + All comments are similar to those above. == In summary, IMO it should be a bit more like below: SUGGESTION (include all the necessary links etc) Publisher See "Publication node" Publication A publication replicates the changes of one or more tables to a subscription. For more information, see Section 30.1 Publication node A node where a publication is defined for logical replication. Subscriber See "Subscription node" Subscription A subscription receives the changes of one or more tables from the publications it subscribes to. For more information, see Section 30.2 Subscription node A node where a subscription is defined for logical replication. == Kind Regards, Peter Smith. Fujitsu Australia
RE: Synchronizing slots from primary to standby
On Friday, February 23, 2024 10:02 AM Zhijie Hou (Fujitsu) wrote: > > Hi, > > Since the slotsync worker patch has been committed, I rebased the remaining > patches. > And here is the V95 patch set. > > Also, I fixed a bug in the current 0001 patch where the member of the standby > slot names list pointed to the freed memory after calling ProcessConfigFile(). > Now, we will obtain a new list when we call ProcessConfigFile(). The > optimization to only get the new list when the names actually change has been > removed. I think this change is acceptable because ProcessConfigFile is not a > frequent occurrence. > > Additionally, I reordered the tests in 040_standby_failover_slots_sync.pl. Now > the new test will be conducted after the sync slot test to prevent the risk > of the > logical slot occasionally not catching up to the latest catalog_xmin and, as a > result, not being able to be synced immediately. There is one unexpected change in the previous version, sorry for that. Here is the correct version. Best Regards, Hou zj v95_2-0002-Document-the-steps-to-check-if-the-standby-is-r.patch Description: v95_2-0002-Document-the-steps-to-check-if-the-standby-is-r.patch v95_2-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch Description: v95_2-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
RE: Synchronizing slots from primary to standby
Hi, Since the slotsync worker patch has been committed, I rebased the remaining patches. And here is the V95 patch set. Also, I fixed a bug in the current 0001 patch where the member of the standby slot names list pointed to the freed memory after calling ProcessConfigFile(). Now, we will obtain a new list when we call ProcessConfigFile(). The optimization to only get the new list when the names actually change has been removed. I think this change is acceptable because ProcessConfigFile is not a frequent occurrence. Additionally, I reordered the tests in 040_standby_failover_slots_sync.pl. Now the new test will be conducted after the sync slot test to prevent the risk of the logical slot occasionally not catching up to the latest catalog_xmin and, as a result, not being able to be synced immediately. Best Regards, Hou zj v95-0002-Document-the-steps-to-check-if-the-standby-is-r.patch Description: v95-0002-Document-the-steps-to-check-if-the-standby-is-r.patch v95-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch Description: v95-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
Re: Improve readability by using designated initializers when possible
On Fri, 2024-02-23 at 01:35 +0100, Jelte Fennema-Nio wrote: > On Thu, Feb 22, 2024, 23:46 Jeff Davis wrote: > > > > Am I missing something? > > The main benefits it has are: Sorry, I was unclear. I was asking a question about the reason the ObjectClass and the object_classes[] array exist in the current code, it wasn't a direct question about your patch. ObjectClass is only used in a couple places, and I don't immediately see why those places can't just use the OID of the class (like OperatorClassRelationId). Regards, Jeff Davis
Re: Improve readability by using designated initializers when possible
On Thu, Feb 22, 2024, 23:46 Jeff Davis wrote: > > Am I missing something? The main benefits it has are: 1. The order of the array doesn't have to exactly match the order of the enum for the arrays to contain the correct mapping. 2. Typos in the enum variant names are caught by the compiler because actual symbols are used, not comments. 3. The left-to-right order reads more natural imho for such key-value pairs, e.g. OCLASS_PROC maps to ProcedureRelationId.
Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)
I wrote: > I think that this is a band-aid that's just masking an error in the > rowmarking logic: it's not doing the right thing for appendrels > made from UNION ALL subqueries. I've not wrapped my head around > exactly where it's going off the rails, but I feel like this ought > to get fixed somewhere else. Please hold off pushing your patch > for now. So after studying this for awhile, I see that the planner is emitting a PlanRowMark that presumes that the UNION ALL subquery will be scanned as though it's a base relation; but since we've converted it to an appendrel, the executor just ignores that rowmark, and the wrong things happen. I think the really right fix would be to teach the executor to honor such PlanRowMarks, by getting nodeAppend.c and nodeMergeAppend.c to perform EPQ row substitution. I wrote a draft patch for that, attached, and it almost works but not quite. The trouble is that we're jamming the contents of the row identity Var created for the rowmark into the output of the Append or MergeAppend, and it doesn't necessarily match exactly. In the test case you created, the planner produces targetlists like Output: src_1.val, src_1.id, ROW(src_1.id, src_1.val) and as you can see the order of the columns doesn't match. I can see three ways we might attack that: 1. Persuade the planner to build output tlists that always match the row identity Var. This seems undesirable, because the planner might have intentionally elided columns that won't be read by the upper parts of the plan. 2. Change generation of the ROW() expression so that it lists only the values we're going to output, in the order we're going to output them. This would amount to saying that for UNION cases the "identity" of a row need only consider columns used by the plan, which feels a little odd but I can't think of a reason why it wouldn't work. I'm not sure how messy this'd be to implement though, as the set of columns to be emitted isn't fully determined until much later than where we currently expand the row identity Vars into RowExprs. 3. Fix the executor to remap what it gets out of the ROW() into the order of the subquery tlists. This is probably do-able but I'm not certain; it may be that the executor hasn't enough info. We might need to teach the planner to produce a mapping projection and attach it to the Append node, which carries some ABI risk (but in the past we've gotten away with adding new fields to the ends of plan nodes in the back branches). Another objection is that adding cycles to execution rather than planning might be a poor tradeoff --- although if we only do the work when EPQ is invoked, maybe it'd be the best way. It might be that any of these things is too messy to be considered for back-patching, and we ought to apply what you did in the back branches. I'd like a better fix in HEAD though. regards, tom lane diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index c7059e7528..113217e607 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -288,8 +288,65 @@ static TupleTableSlot * ExecAppend(PlanState *pstate) { AppendState *node = castNode(AppendState, pstate); + EState *estate = node->ps.state; TupleTableSlot *result; + if (estate->es_epq_active != NULL) + { + /* + * We are inside an EvalPlanQual recheck. If there is a relevant + * rowmark for the append relation, return the test tuple if one is + * available. + */ + EPQState *epqstate = estate->es_epq_active; + int scanrelid; + + if (bms_get_singleton_member(castNode(Append, node->ps.plan)->apprelids, + )) + { + if (epqstate->relsubs_done[scanrelid - 1]) + { +/* + * Return empty slot, as either there is no EPQ tuple for this + * rel or we already returned it. + */ +TupleTableSlot *slot = node->ps.ps_ResultTupleSlot; + +return ExecClearTuple(slot); + } + else if (epqstate->relsubs_slot[scanrelid - 1] != NULL) + { +/* + * Return replacement tuple provided by the EPQ caller. + */ +TupleTableSlot *slot = epqstate->relsubs_slot[scanrelid - 1]; + +Assert(epqstate->relsubs_rowmark[scanrelid - 1] == NULL); + +/* Mark to remember that we shouldn't return it again */ +epqstate->relsubs_done[scanrelid - 1] = true; + +return slot; + } + else if (epqstate->relsubs_rowmark[scanrelid - 1] != NULL) + { +/* + * Fetch and return replacement tuple using a non-locking + * rowmark. + */ +TupleTableSlot *slot = node->ps.ps_ResultTupleSlot; + +/* Mark to remember that we shouldn't return more */ +epqstate->relsubs_done[scanrelid - 1] = true; + +if (!EvalPlanQualFetchRowMark(epqstate, scanrelid, slot)) + return NULL; + +return slot; + } + } + } + /* * If this is the first call after Init or ReScan, we need to do the * initialization work. @@ -405,6 +462,7 @@ ExecEndAppend(AppendState *node)
Re: Running the fdw test from the terminal crashes into the core-dump
Alvaro Herrera writes: > After having pushed that, I wonder if we should document this. It seems > quite the minor thing, but I'm sure somebody will complain if we don't. Yup, no doubt. > I propose the attached. (Extra context so that the full paragraph can > be read from the comfort of your email program.) This reads awkwardly to me. I think it'd be better to construct it so that DO NOTHING's requirement is stated exactly parallel to the other three clause types, more or less as attached. > (While at it, I found the placement of the previous-to-last sentence in > that paragraph rather strange, so I moved it to the end.) Agreed, and done in my version too. BTW, if you read it without paying attention to markup, you'll notice that we are saying things like If you specify an insert action, you must have the INSERT privilege on the target_table_name. which is fairly nonsensical: we don't define privileges on the name of something. While I've not done anything about that here, I wonder if we shouldn't just write "privilege on the target table" without any special markup. regards, tom lane diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml index 655f7dcc05..79ebff1033 100644 --- a/doc/src/sgml/ref/merge.sgml +++ b/doc/src/sgml/ref/merge.sgml @@ -104,12 +104,15 @@ DELETE privilege on the target_table_name. If you specify a delete action, you must have the DELETE privilege on the target_table_name. - Privileges are tested once at statement start and are checked - whether or not particular WHEN clauses are executed. - You will require the SELECT privilege on any column(s) + If you specify a DO NOTHING action, you must have + the SELECT privilege on at least one column + of target_table_name. + You will also need SELECT privilege on any column(s) of the data_source and target_table_name referred to in any condition or expression. + Privileges are tested once at statement start and are checked + whether or not particular WHEN clauses are executed.
Re: WIP Incremental JSON Parser
On 2024-02-22 Th 15:29, Jacob Champion wrote: On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan wrote: Patch 5 in this series fixes those issues and adjusts most of the tests to add some trailing junk to the pieces of json, so we can be sure that this is done right. This fixes the test failure for me, thanks! I've attached my current mesonification diff, which just adds test_json_parser to the suite. It relies on the PATH that's set up, which appears to include the build directory for both VPATH builds and Meson. OK, thanks, will add this in the next version. Are there plans to fill out the test suite more? Since we should be able to control all the initial conditions, it'd be good to get fairly comprehensive coverage of the new code. Well, it's tested (as we know) by the backup manifest tests. During development, I tested by making the regular parser use the non-recursive parser (see FORCE_JSON_PSTACK). That doesn't test the incremental piece of it, but it does check that the rest of it is doing the right thing. We could probably extend the incremental test by making it output a stream of tokens and making sure that was correct. As an aside, I find the behavior of need_escapes=false to be completely counterintuitive. I know the previous code did this, but it seems like the opposite of "provides unescaped strings" should be "provides raw strings", not "all strings are now NULL". Yes, we could possibly call it "need_strings" or something like that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Documentation to upgrade logical replication cluster
Hi Vignesh, I have no further comments. Patch v9 LGTM. == Kind Regards, Peter Smith. Fujitsu Australia
Re: Avoid stack frame setup in performance critical routines using tail calls
Hi, On 2024-02-23 00:46:26 +1300, David Rowley wrote: > I've rebased the 0001 patch and gone over it again and made a few > additional changes besides what I mentioned in my review. > > On Wed, 9 Aug 2023 at 20:44, David Rowley wrote: > > Here's a review of v2-0001: > > 2. Why do you need to add the NULL check here? > > > > #ifdef USE_VALGRIND > > - if (method != MCTX_ALIGNED_REDIRECT_ID) > > + if (ret != NULL && method != MCTX_ALIGNED_REDIRECT_ID) > > VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size); > > #endif > > I removed this NULL check as we're calling the realloc function with > no flags, so it shouldn't return NULL as it'll error out from any OOM > errors. That was probably a copy-paste issue... > > 4. It would be good to see some API documentation in the > > MemoryContextMethods struct. This adds a lot of responsibility onto > > the context implementation without any extra documentation to explain > > what, for example, palloc is responsible for and what the alloc > > function needs to do itself. > > I've done that too. > > I also added header comments for MemoryContextAllocationFailure and > MemoryContextSizeFailure and added some comments to explain in places > like palloc() to warn people not to add checks after the 'alloc' call. > > The rebased patch is 0001 and all of my changes are in 0002. I will > rebase your original 0002 patch later. Thanks! > I think 0001 is much more important, as evident by the reported benchmarks > on this thread. I agree that it's good to tackle 0001 first. I don't understand the benchmark point though. Your benchmark seems to suggest that 0002 improves aset performance by *more* than 0001: for 8 byte aset allocs: time master: 8.86 0001: 8.12 0002: 7.02 So 0001 reduces time by 0.92x and 0002 by 0.86x. John's test shows basically no change for 0002 - which is unsurprising, as 0002 changes aset.c, but the test seems to solely exercise slab, as only SlabAlloc() shows up in the profile. As 0002 only touches aset.c it couldn't really have affected that test. > In absence of anyone else looking at this, I think it's ready to go. > If anyone is following along and wants to review or test it, please do > so soon. Makes sense! > @@ -1061,6 +1072,16 @@ MemoryContextAlloc(MemoryContext context, Size size) > > context->isReset = false; > For a moment this made me wonder if we could move the isReset handling into the allocator slow paths as well - it's annoying to write that bit (and thus dirty the cacheline) over and ove. But it'd be somewhat awkward due to pre-allocated blocks. So that'd be a larger change better done separately. Greetings, Andres Freund
Re: Improve readability by using designated initializers when possible
On Wed, 2024-02-21 at 16:03 +0100, Jelte Fennema-Nio wrote: > Usage of designated initializers came up in: > https://www.postgresql.org/message-id/flat/ZdWXhAt9Tz4d-lut%40paquier.xyz#9dc17e604e58569ad35643672bf74acc > > This converts all arrays that I could find that could clearly benefit > from this without any other code changes being necessary. Looking at the object_classes array and the ObjectClass enum, I don't quite understand the point. It seems like a way to write OCLASS_OPCLASS instead of OperatorClassRelationId, and similar? Am I missing something? Regards, Jeff Davis
Re: Add lookup table for replication slot invalidation causes
On Thu, Feb 22, 2024 at 5:56 PM Michael Paquier wrote: > > On Thu, Feb 22, 2024 at 05:30:08PM +1100, Peter Smith wrote: > > Oops. Perhaps I meant more like below -- in any case, the point was > > the same -- to ensure RS_INVAL_NONE is what returns if something > > unexpected happens. > > You are right that this could be a bit confusing, even if we should > never reach this state. How about avoiding to return the index of the > loop as result, as of the attached? Would you find that cleaner? > -- Hi, yes, it should never happen, but thanks for making the changes. I would've just removed every local variable instead of adding more of them. I also felt the iteration starting from RS_INVAL_NONE instead of 0 is asserting RS_INVAL_NONE must always be the first enum and can't be rearranged. Probably it will never happen, but why require it? -- ReplicationSlotInvalidationCause GetSlotInvalidationCause(const char *conflict_reason) { for (ReplicationSlotInvalidationCause cause = 0; cause <= RS_INVAL_MAX_CAUSES; cause++) if (strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0) return cause; Assert(0); return RS_INVAL_NONE; } -- But maybe those nits are a matter of personal choice. Your patch code addressed my main concern, so it LGTM. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]
Hi, On Wed, Jan 24, 2024 at 02:50:52PM -0500, Kirk Wolak wrote: > On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak wrote: > > On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson wrote: > >> > On 19 Jan 2024, at 23:09, Kirk Wolak wrote: > > Thank you, that made it possible to build and run... > > UNFORTUNATELY this has a CLEAR memory leak (visible in htop) > > I am watching it already consuming 6% of my system memory. > > > Daniel, > In the previous email, I made note that once the JIT was enabled, the > problem exists in 17Devel. > I re-included my script, which forced the JIT to be used... > > I attached an updated script that forced the settings. > But this is still leaking memory (outside of the > pg_backend_memory_context() calls). > Probably because it's at the LLVM level? And it does NOT happen from > planning/opening the query. It appears I have to fetch the rows to > see the problem. I had a look at this (and blogged about it here[1]) and was also wondering what was going on with 17devel and the recent back-branch releases, cause I could also reproduce those continuing memory leaks. Adding some debug logging when llvm_inline_reset_caches() is called solves the mystery: as you are calling a function, the fix that is in 17devel and the back-branch releases is not applicable and only after the function returns llvm_inline_reset_caches() is being called (as llvm_jit_context_in_use_count is greater than zero, presumably, so it never reaches the call-site of llvm_inline_reset_caches()). If you instead run your SQL in a DO-loop (as in the blog post) and not as a PL/PgSQL function, you should see that it no longer leaks. This might be obvious to some (and Andres mentioned it in https://www.postgresql.org/message-id/20210421002056.gjd6rpe6toumiqd6%40alap3.anarazel.de) but it took me a while to figure out/find. Michael [1] https://www.credativ.de/en/blog/postgresql-en/quick-benchmark-postgresql-2024q1-release-performance-improvements/
Re: speed up a logical replica setup
On Thu, Feb 22, 2024, at 9:43 AM, Hayato Kuroda (Fujitsu) wrote: > > The possible solution would be > > 1) allow to run pg_createsubscriber if standby is initially stopped . > > I observed that pg_logical_createsubscriber also uses this approach. > > 2) read GUCs via SHOW command and restore them when server restarts > > 3. add a config-file option. That's similar to what pg_rewind does. I expect that Debian-based installations will have this issue. > I also prefer the first solution. > Another reason why the standby should be stopped is for backup purpose. > Basically, the standby instance should be saved before running > pg_createsubscriber. > An easiest way is hard-copy, and the postmaster should be stopped at that > time. > I felt it is better that users can run the command immediately later the > copying. > Thought? It was not a good idea if you want to keep the postgresql.conf outside PGDATA. I mean you need extra steps that can be error prone (different settings between files). Shlok, I didn't read your previous email carefully. :-/ -- Euler Taveira EDB https://www.enterprisedb.com/
Re: WIP Incremental JSON Parser
On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan wrote: > Patch 5 in this series fixes those issues and > adjusts most of the tests to add some trailing junk to the pieces of > json, so we can be sure that this is done right. This fixes the test failure for me, thanks! I've attached my current mesonification diff, which just adds test_json_parser to the suite. It relies on the PATH that's set up, which appears to include the build directory for both VPATH builds and Meson. Are there plans to fill out the test suite more? Since we should be able to control all the initial conditions, it'd be good to get fairly comprehensive coverage of the new code. As an aside, I find the behavior of need_escapes=false to be completely counterintuitive. I know the previous code did this, but it seems like the opposite of "provides unescaped strings" should be "provides raw strings", not "all strings are now NULL". --Jacob commit 590ea7bec167058340624313d98c72976fa89d7a Author: Jacob Champion Date: Wed Feb 21 06:36:55 2024 -0800 WIP: mesonify diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build index 8fbe742d38..e5c9bd10cf 100644 --- a/src/test/modules/meson.build +++ b/src/test/modules/meson.build @@ -21,6 +21,7 @@ subdir('test_dsm_registry') subdir('test_extensions') subdir('test_ginpostinglist') subdir('test_integerset') +subdir('test_json_parser') subdir('test_lfind') subdir('test_misc') subdir('test_oat_hooks') diff --git a/src/test/modules/test_json_parser/meson.build b/src/test/modules/test_json_parser/meson.build new file mode 100644 index 00..a5bedce94e --- /dev/null +++ b/src/test/modules/test_json_parser/meson.build @@ -0,0 +1,50 @@ +# Copyright (c) 2024, PostgreSQL Global Development Group + +test_json_parser_incremental_sources = files( + 'test_json_parser_incremental.c', +) + +if host_system == 'windows' + test_json_parser_incremental_sources += rc_bin_gen.process(win32ver_rc, extra_args: [ +'--NAME', 'test_json_parser_incremental', +'--FILEDESC', 'standalone json parser tester', + ]) +endif + +test_json_parser_incremental = executable('test_json_parser_incremental', + test_json_parser_incremental_sources, + dependencies: [frontend_code], + kwargs: default_bin_args + { +'install': false, + }, +) + +test_json_parser_perf_sources = files( + 'test_json_parser_perf.c', +) + +if host_system == 'windows' + test_json_parser_perf_sources += rc_bin_gen.process(win32ver_rc, extra_args: [ +'--NAME', 'test_json_parser_perf', +'--FILEDESC', 'standalone json parser tester', + ]) +endif + +test_json_parser_perf = executable('test_json_parser_perf', + test_json_parser_perf_sources, + dependencies: [frontend_code], + kwargs: default_bin_args + { +'install': false, + }, +) + +tests += { + 'name': 'test_json_parser', + 'sd': meson.current_source_dir(), + 'bd': meson.current_build_dir(), + 'tap': { +'tests': [ + 't/001_test_json_parser_incremental.pl', +], + }, +} diff --git a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl index fc9718baf3..8eeb7f5b91 100644 --- a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl +++ b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl @@ -8,7 +8,7 @@ use FindBin; my $test_file = "$FindBin::RealBin/../tiny.json"; -my $exe = "$ENV{TESTDATADIR}/../test_json_parser_incremental"; +my $exe = "test_json_parser_incremental"; my ($stdout, $stderr) = run_command( [$exe, $test_file] );
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-07, Dilip Kumar wrote: > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera wrote: > > Sure, but is that really what we want? > > So your question is do we want these buffers to be in multiple of > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > don't think it should create any problem logically. I mean we can > look again in the patch to see if we have made any such assumptions > but that should be fairly easy to fix, then maybe if we are going in > this way we should get rid of the check_slru_buffers() function as > well. Not really, I just don't think the macro should be in slru.h. Another thing I've been thinking is that perhaps it would be useful to make the banks smaller, when the total number of buffers is small. For example, if you have 16 or 32 buffers, it's not really clear to me that it makes sense to have just 1 bank or 2 banks. It might be more sensible to have 4 banks with 4 or 8 buffers instead. That should make the algorithm scale down as well as up ... I haven't done either of those things in the attached v19 version. I did go over the comments once again and rewrote the parts I was unhappy with, including some existing ones. I think it's OK now from that point of view ... at some point I thought about creating a separate README, but in the end I thought it not necessary. I did add a bunch of Assert()s to make sure the locks that are supposed to be held are actually held. This led me to testing the page status to be not EMPTY during SimpleLruWriteAll() before calling SlruInternalWritePage(), because the assert was firing. The previous code is not really *buggy*, but to me it's weird to call WritePage() on a slot with no contents. Another change was in TransactionGroupUpdateXidStatus: the original code had the leader doing pg_atomic_read_u32(>clogGroupFirst) to know which bank to lock. I changed it to simply be the page used by the leader process; this doesn't need an atomic read, and should be the same page anyway. (If it isn't, it's no big deal). But what's more: even if we do read ->clogGroupFirst at that point, there's no guarantee that this is going to be exactly for the same process that ends up being the first in the list, because since we have not set it to INVALID by the time we grab the bank lock, it is quite possible for more processes to add themselves to the list. I realized all this while rewriting the comments in a way that would let me understand what was going on ... so IMO the effort was worthwhile. Anyway, what I send now should be pretty much final, modulo the change to the check_slru_buffers routines and documentation additions to match pg_stat_slru to the new GUC names. > > Another painful point is that pg_stat_slru uses internal names in the > > data it outputs, which obviously do not match the new GUCs. > > Yeah, that's true, but I think this could be explained somewhere not > sure what is the right place for this. Or we can change those names in the view. > FYI, I have also repeated all the performance tests I performed in my > first email[1], and I am seeing a similar gain. Excellent, thanks for doing that. > Some comments on v18 in my first pass of the review. > > 1. > @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) > lsnindex = GetLSNIndex(slotno, xid); > *lsn = XactCtl->shared->group_lsn[lsnindex]; > > - LWLockRelease(XactSLRULock); > + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); > > Maybe here we can add an assert before releasing the lock for a safety check > > Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); Hmm, I think this would just throw a warning or error "you don't hold such lwlock", so it doesn't seem necessary. > 2. > + * > + * XXX could we make the LSNs to be bank-based? > */ > XLogRecPtr *group_lsn; > > IMHO, the flush still happens at the page level so up to which LSN > should be flush before flushing the particular clog page should also > be maintained at the page level. Yeah, this was a misguided thought, nevermind. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ It does it in a really, really complicated way why does it need to be complicated? Because it's MakeMaker. >From e1aabcbf5ce1417decfe24f513e5cfe8b6de77f2 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 22 Feb 2024 18:42:56 +0100 Subject: [PATCH v19] Make SLRU buffer sizes configurable Also, divide the slot array in banks, so that the LRU algorithm can be made more scalable. Also remove the centralized control lock for even better scalability. Authors: Dilip Kumar, Andrey Borodin --- doc/src/sgml/config.sgml | 139 +++ src/backend/access/transam/clog.c | 235 src/backend/access/transam/commit_ts.c| 89 +++-- src/backend/access/transam/multixact.c| 190 +++--- src/backend/access/transam/slru.c | 345 +-
Re: Running the fdw test from the terminal crashes into the core-dump
On 2024-Feb-20, Tom Lane wrote: > > So, this means we can fix this by simply requiring ACL_SELECT privileges > > on a DO NOTHING action. We don't need to request specific privileges on > > any particular column (perminfo->selectedCols continues to be the empty > > set) -- which means that any role that has privileges on *any* column > > would get a pass. > > LGTM. Thanks for looking! After having pushed that, I wonder if we should document this. It seems quite the minor thing, but I'm sure somebody will complain if we don't. I propose the attached. (Extra context so that the full paragraph can be read from the comfort of your email program.) (While at it, I found the placement of the previous-to-last sentence in that paragraph rather strange, so I moved it to the end.) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones) diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml index 655f7dcc05..85938eda07 100644 --- a/doc/src/sgml/ref/merge.sgml +++ b/doc/src/sgml/ref/merge.sgml @@ -97,26 +97,29 @@ DELETE There is no separate MERGE privilege. If you specify an update action, you must have the UPDATE privilege on the column(s) of the target_table_name that are referred to in the SET clause. If you specify an insert action, you must have the INSERT privilege on the target_table_name. If you specify a delete action, you must have the DELETE privilege on the target_table_name. - Privileges are tested once at statement start and are checked - whether or not particular WHEN clauses are executed. You will require the SELECT privilege on any column(s) of the data_source and target_table_name referred to - in any condition or expression. + in any condition or expression; + in addition, if a DO NOTHING action is specified, you + will require the SELECT privilege on at least one column + of target_table_name. + Privileges are tested once at statement start and are checked + whether or not particular WHEN clauses are executed. MERGE is not supported if the target_table_name is a materialized view, foreign table, or if it has any rules defined on it.
Re: speed up a logical replica setup
Hello, On 2024-Feb-22, Hayato Kuroda (Fujitsu) wrote: > Dear Alvaro, > > Hmm, but doesn't this mean that the server will log an ugly message > > that "client closed connection unexpectedly"? I think it's nicer to > > close the connection before terminating the process (especially > > since the code for that is already written). > > OK. So we should disconnect properly even if the process exits. I > added the function call again. Note that PQclear() was not added > because it is only related with the application. Sounds about right, but I didn't verify the patches in detail. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)
Re: locked reads for atomics
On Thu, 2024-02-22 at 12:58 +0530, Bharath Rupireddy wrote: > There's some immediate use for reads/writes with barrier semantics - Is this mainly a convenience for safety/readability? Or is it faster in some cases than doing an atomic access with separate memory barriers? Regards, Jeff Davis
Re: LogwrtResult contended spinlock
On Thu, 2024-02-22 at 10:17 +0530, Robert Haas wrote: > I think the problems tend to be worst when you have some bit of data > that's being frequently modified by multiple backends. Every backend > that wants to modify the value needs to steal the cache line, and > eventually you spend most of your CPU time stealing cache lines from > other sockets and not much of it doing any actual work. If you have a > value that's just being read by a lot of backends without > modification, I think the cache line can be shared in read only mode > by all the CPUs and it's not too bad. That makes sense. I guess they'd be on the same cache line as well, which means a write to either will invalidate both. Some places (XLogWrite, XLogInsertRecord, XLogSetAsyncXactLSN, GetFlushRecPtr, GetXLogWriteRecPtr) already update it from shared memory anyway, so those are non-issues. The potential problem areas (unless I missed something) are: * AdvanceXLInsertBuffer reads it as an early check to see if a buffer is already written out. * XLogFlush / XLogNeedsFlush use it for an early return * WALReadFromBuffers reads it for an error check (currently an Assert, but we might want to make that an elog). * We are discussing adding a Copy pointer, which would be advanced by WaitXLogInsertionsToFinish(), and if we do replication-before-flush, we may want to be more eager about advancing it. That could cause an issue, as well. I don't see the global non-shared variable as a huge problem, so if it serves a purpose then I'm fine keeping it. Perhaps we could make it a bit safer by using some wrapper functions. Something like: bool IsWriteRecPtrAtLeast(XLogRecPtr recptr) { XLogRecPtr writeptr; if (LogwrtResult.Write >= recptr) return true; writeptr = GetXLogWriteRecPtr(); return (writeptr >= recptr); } That would reduce the number of direct references to LogwrtResult, callers would never see a stale value, and would avoid the cache miss problem that you're concerned about. Not sure if they'd need to be inline or not. Regards, Jeff Davis
Re: Unlinking Parallel Hash Join inner batch files sooner
On Thu, Feb 22, 2024 at 5:37 PM Andrei Lepikhov wrote: > On 22/2/2024 06:42, Thomas Munro wrote: > > extreme skew for one version of the problem, but even with zero/normal > > skewness and perfect estimation of the number of partitions, if you Sorry, I meant to write "but even with no duplicates" there (mention of "normal" was brain fade).
re: Speeding up COPY TO for uuids and arrays
Hi, On 2024-02-17 17:48:23 +0100, Laurenz Albe wrote: > As a test case, I created a table with 1 rows, each of which > had an array of 1 uuids. The table resided in shared buffers. Can you share exactly script used to create a table? best regards, Ranier Vilela
Re: Optimize planner memory consumption for huge arrays
Hi! On 20.02.2024 07:41, Andrei Lepikhov wrote: On 20/2/2024 04:51, Tom Lane wrote: Tomas Vondra writes: On 2/19/24 16:45, Tom Lane wrote: Tomas Vondra writes: For example, I don't think we expect selectivity functions to allocate long-lived objects, right? So maybe we could run them in a dedicated memory context, and reset it aggressively (after each call). Here's a quick and probably-incomplete implementation of that idea. I've not tried to study its effects on memory consumption, just made sure it passes check-world. Thanks for the sketch. The trick with the planner_tmp_cxt_depth especially looks interesting. I think we should design small memory contexts - one per scalable direction of memory utilization, like selectivity or partitioning (appending ?). My coding experience shows that short-lived GEQO memory context forces people to learn on Postgres internals more precisely :). I think there was a problem in your patch when you freed up the memory of a variable in the eqsel_internal function, because we have a case where the variable was deleted by reference in the eval_const_expressions_mutator function (it is only for T_SubPlan and T_AlternativeSubPlan type of nodes. This query just causes an error in your case: create table a (id bigint, c1 bigint, primary key(id)); create table b (a_id bigint, b_id bigint, b2 bigint, primary key(a_id, b_id)); explain select id from a, b where id = a_id and b2 = (select min(b2) from b where id = a_id); drop table a; drop table b; We can return a copy of the variable or not release the memory of this variable. I attached two patch: the first one is removing your memory cleanup and another one returns the copy of variable. The author of the corrections is not only me, but also Daniil Anisimov. -- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c index edc25d712e9..ac560b1605e 100644 --- a/src/backend/optimizer/util/clauses.c +++ b/src/backend/optimizer/util/clauses.c @@ -2915,7 +2915,7 @@ eval_const_expressions_mutator(Node *node, * XXX should we ereport() here instead? Probably this routine * should never be invoked after SubPlan creation. */ - return node; + return CopyObject(node); case T_RelabelType: { RelabelType *relabel = (RelabelType *) node; diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index cea777e9d40..e5bad75ec1c 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -281,6 +281,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate) selec = var_eq_non_const(, operator, collation, other, varonleft, negate); + pfree(other); ReleaseVariableStats(vardata); return selec; @@ -1961,15 +1962,15 @@ scalararraysel(PlannerInfo *root, { List *args; Selectivity s2; - - args = list_make2(leftop, - makeConst(nominal_element_type, - -1, - nominal_element_collation, - elmlen, - elem_values[i], - elem_nulls[i], - elmbyval)); + Const *c = makeConst(nominal_element_type, + -1, + nominal_element_collation, + elmlen, + elem_values[i], + elem_nulls[i], + elmbyval); + + args = list_make2(leftop, c); if (is_join_clause) s2 = DatumGetFloat8(FunctionCall5Coll(, clause->inputcollid, @@ -1985,7 +1986,8 @@ scalararraysel(PlannerInfo *root,
how to read table options during smgropen()
Dear pgsql hackers, I am developing custom storage for pgsql tables. I am using md* functions and smgrsw[] structure to switch between different magnetic disk access methods. I want to add some custom options while table created psql# create table t(...) with (my_option='value'); And thus I want to set "reln->smgr_which" conditionally during smgropen(). If myoption='value' i would use another smgr_which I am really stuck at this point. smgr.c: SMgrRelation smgropen(RelFileNode rnode, BackendId backend){ ... if ( HasOption(rnode, "my_option","value")){ //<< how to implement this check ? reln->smgr_which = 1; //new access method }else{ reln->smgr_which = 0; //old access method } ... } The question is --- can I read table options while the table is identified by "RelFileNode rnode" ?? The only available information is typedef struct RelFileNode { Oid spcNode; /* tablespace */ Oid dbNode; /* database */ Oid relNode; /* relation */ } RelFileNode; But there are no table options available directly from this structure. What is the best way to implement HasOption(rnode, "my_option","value") Thank you in advance for any ideas. Sincerely, Dmitry R
Re: Transaction timeout
> On 19 Feb 2024, at 15:17, Japin Li wrote: > > > +1 PFA patch set of 4 patches: 1. remove all potential flaky tests. BTW recently we had a bingo when 3 of them failed together [0] 2-3. waiting injection points patchset by Michael Paquier, intact v2 from nearby thread. 4. prototype of simple TAP tests for timeouts. I did not add a test for statement_timeout, because it still have good coverage in isolation tests. But added test for idle_sessoin_timeout. Maybe these tests could be implemented with NOTICE injection points (not requiring steps 2-3), but I'm afraid that they might be flaky too: FATALed connection might not send information necesary for test success (we will see something like "PQconsumeInput failed: server closed the connection unexpectedly" as in [1]). Best regards, Andrey Borodin. [0] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-02-20%2010%3A20%3A13 [1] https://www.postgresql.org/message-id/flat/CAAhFRxiQsRs2Eq5kCo9nXE3HTugsAAJdSQSmxncivebAxdmBjQ%40mail.gmail.com 0001-Remove-flacky-isolation-tests-for-timeouts.patch Description: Binary data 0004-Add-timeouts-TAP-tests.patch Description: Binary data 0003-Add-regression-test-for-restart-points-during-promot.patch Description: Binary data 0002-injection_points-Add-routines-to-wait-and-wake-proce.patch Description: Binary data
Re: Experiments with Postgres and SSL
On 22/02/2024 01:43, Matthias van de Meent wrote: On Wed, 10 Jan 2024 at 09:31, Heikki Linnakangas wrote: 4. The number of combinations of sslmode, gssencmode and sslnegotiation settings is scary. And we have very few tests for them. Yeah, it's not great. We could easily automate this better though. I mean, can't we run the tests using a "cube" configuration, i.e. test every combination of parameters? We would use a mapping function of (psql connection parameter values -> expectations), which would be along the lines of the attached pl testfile. I feel it's a bit more approachable than the lists of manual option configurations, and makes it a bit easier to program the logic of which connection security option we should have used to connect. The attached file would be a drop-in replacement; it's tested to work with SSL only - without GSS - because I've been having issues getting GSS working on my machine. +1 testing all combinations. I don't think the 'mapper' function approach in your version is much better than the original though. Maybe it would be better with just one 'mapper' function that contains all the rules, along the lines of: (This isn't valid perl, just pseudo-code) sub expected_outcome { my ($user, $sslmode, $negotiation, $gssmode) = @_; my @possible_outcomes = { 'plain', 'ssl', 'gss' } delete $possible_outcomes{'plain'} if $sslmode eq 'require'; delete $possible_outcomes{'ssl'} if $sslmode eq 'disable'; delete $possible_outcomes{'plain'} if $user eq 'ssluser'; delete $possible_outcomes{'plain'} if $user eq 'ssluser'; if $sslmode eq 'allow' { # move 'plain' before 'ssl' in the list } if $sslmode eq 'prefer' { # move 'ssl' before 'plain' in the list } # more rules here # If there are no outcomes left in $possible_outcomes, return 'fail' # If there's exactly one outcome left, return that. # If there's more, return the first one. } Or maybe a table that lists all the combinations and the expected outcome. Something lieke this: nossluser nogssuser ssluser gssuser sslmode=require fail... sslmode=prefer plain sslmode=disable plain The problem is that there are more than two dimensions. So maybe an exhaustive list like this: usersslmode gssmode outcome nossluser require disable fail nossluser prefer disable plain nossluser disable disable plain ssluser require disable ssl ... I'm just throwing around ideas here, can you experiment with different approaches and see what looks best? ALPN Does the TLS ALPN spec allow protocol versions in the protocol tag? It would be very useful to detect clients with new capabilities at the first connection, rather than having to wait for one round trip, and would allow one avenue for changing the protocol version. Looking at the list of registered ALPN tags [0], I can see "http/0.9"; "http/1.0" and "http/1.1". I think we'd want to changing the major protocol version in a way that would introduce a new roundtrip, though. -- Heikki Linnakangas Neon (https://neon.tech)
Potential issue in ecpg-informix decimal converting functions
Hi, everyone! I found a potential bug in dectoint() and dectolong() functions from informix.c. "Informix Compatibility Mode" doc chapter says that ECPG_INFORMIX_NUM_OVERFLOW is returned if an overflow occurred. But check this line in dectoint() or dectolong() (it is present in both): if (ret == PGTYPES_NUM_OVERFLOW) - condition is always false because PGTYPESnumeric_to_int() and PGTYPESnumeric_to_long() functions return only 0 or -1. So ECPG_INFORMIX_NUM_OVERFLOW can never be returned. I think dectoint(), dectolong() and PGTYPESnumeric_to_int() functions should be a little bit different like in proposing patch. What do you think? The flaw was catched with the help of Svace static analyzer. https://svace.pages.ispras.ru/svace-website/en/ Thank you!diff --git a/doc/src/sgml/ecpg.sgml b/doc/src/sgml/ecpg.sgml index 73351a9136..8390a61cf4 100644 --- a/doc/src/sgml/ecpg.sgml +++ b/doc/src/sgml/ecpg.sgml @@ -8882,7 +8882,7 @@ int dectodbl(decimal *np, double *dblp); dectoint -Convert a variable to type decimal to an integer. +Convert a variable of type decimal to an integer. int dectoint(decimal *np, int *ip); @@ -8892,8 +8892,9 @@ int dectoint(decimal *np, int *ip); On success, 0 is returned and a negative value if the conversion -failed. If an overflow occurred, ECPG_INFORMIX_NUM_OVERFLOW -is returned. +failed. If overflow or underflow occurred, the function returns +ECPG_INFORMIX_NUM_OVERFLOW or +ECPG_INFORMIX_NUM_UNDERFLOW respectively. Note that the ECPG implementation differs from the Informix @@ -8908,7 +8909,7 @@ int dectoint(decimal *np, int *ip); dectolong -Convert a variable to type decimal to a long integer. +Convert a variable of type decimal to a long integer. int dectolong(decimal *np, long *lngp); @@ -8918,8 +8919,9 @@ int dectolong(decimal *np, long *lngp); On success, 0 is returned and a negative value if the conversion -failed. If an overflow occurred, ECPG_INFORMIX_NUM_OVERFLOW -is returned. +failed. If overflow or underflow occurred, the function returns +ECPG_INFORMIX_NUM_OVERFLOW or +ECPG_INFORMIX_NUM_UNDERFLOW respectively. Note that the ECPG implementation differs from the Informix diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c index 80d40aa3e0..73e84bc6e3 100644 --- a/src/interfaces/ecpg/compatlib/informix.c +++ b/src/interfaces/ecpg/compatlib/informix.c @@ -435,6 +435,7 @@ dectoint(decimal *np, int *ip) { int ret; numeric*nres = PGTYPESnumeric_new(); + int errnum = 0; if (nres == NULL) return ECPG_INFORMIX_OUT_OF_MEMORY; @@ -445,11 +446,18 @@ dectoint(decimal *np, int *ip) return ECPG_INFORMIX_OUT_OF_MEMORY; } + errno = 0; ret = PGTYPESnumeric_to_int(nres, ip); + errnum = errno; PGTYPESnumeric_free(nres); - if (ret == PGTYPES_NUM_OVERFLOW) - ret = ECPG_INFORMIX_NUM_OVERFLOW; + if (ret) + { + if (errnum == PGTYPES_NUM_UNDERFLOW) + ret = ECPG_INFORMIX_NUM_UNDERFLOW; + else if (errnum == PGTYPES_NUM_OVERFLOW) + ret = ECPG_INFORMIX_NUM_OVERFLOW; + } return ret; } @@ -459,6 +467,7 @@ dectolong(decimal *np, long *lngp) { int ret; numeric*nres = PGTYPESnumeric_new(); + int errnum = 0; if (nres == NULL) return ECPG_INFORMIX_OUT_OF_MEMORY; @@ -469,11 +478,18 @@ dectolong(decimal *np, long *lngp) return ECPG_INFORMIX_OUT_OF_MEMORY; } + errno = 0; ret = PGTYPESnumeric_to_long(nres, lngp); + errnum = errno; PGTYPESnumeric_free(nres); - if (ret == PGTYPES_NUM_OVERFLOW) - ret = ECPG_INFORMIX_NUM_OVERFLOW; + if (ret) + { + if (errnum == PGTYPES_NUM_UNDERFLOW) + ret = ECPG_INFORMIX_NUM_UNDERFLOW; + else if (errnum == PGTYPES_NUM_OVERFLOW) + ret = ECPG_INFORMIX_NUM_OVERFLOW; + } return ret; } diff --git a/src/interfaces/ecpg/pgtypeslib/numeric.c b/src/interfaces/ecpg/pgtypeslib/numeric.c index 35e7b92da4..52b49e1ce9 100644 --- a/src/interfaces/ecpg/pgtypeslib/numeric.c +++ b/src/interfaces/ecpg/pgtypeslib/numeric.c @@ -1502,7 +1502,12 @@ PGTYPESnumeric_to_int(numeric *nv, int *ip) /* silence compilers that might complain about useless tests */ #if SIZEOF_LONG > SIZEOF_INT - if (l < INT_MIN || l > INT_MAX) + if (l < INT_MIN) + { + errno = PGTYPES_NUM_UNDERFLOW; + return -1; + } + else if(l > INT_MAX) { errno = PGTYPES_NUM_OVERFLOW; return -1; diff --git a/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout b/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout index 1f8675b3f3..71a5afa4a7 100644 --- a/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout +++ b/src/interfaces/ecpg/test/expected/compat_informix-dec_test.stdout @@ -3,8 +3,8 @@ (no errno set) - dec[0,3]: r: -1, * (no errno set) - dec[0,4]: r: -1, * dec[0,5]: r: 0, 0.00
Re: Sequence Access Methods, round two
Hi Michael, I took a quick look at this patch series, mostly to understand how it works and how it might interact with the logical decoding patches discussed in a nearby thread. First, some general review comments: 0001 -- I think this bit in pg_proc.dat is not quite right: proallargtypes => '{regclass,bool,int8}', proargmodes => '{i,o,o}', proargnames => '{seqname,is_called,last_value}', the first argument should not be "seqname" but rather "seqid". 0002, 0003 seems fine, cosmetic changes 0004 -- I don't understand this bit in AlterSequence: last_value = newdataform->last_value; is_called = newdataform->is_called; UnlockReleaseBuffer(buf); /* Check and set new values */ init_params(pstate, stmt->options, stmt->for_identity, false, seqform, _value, _state, _called, _seq_rewrite, _by); Why set the values only to pass them to init_params(), which will just overwrite them anyway? Or do I get this wrong? Also, isn't "reset_state" just a different name for (original) log_cnt? 0005 -- I don't quite understand what "elt" stands for :-( stmt->tableElts = NIL; Do we need AT_AddColumnToSequence? It seems to work exactly like AT_AddColumn. OTOH we do have AT_AddColumnToView too ... Thinking about this code: case T_CreateSeqStmt: EventTriggerAlterTableStart(parsetree); address = DefineSequence(pstate, (CreateSeqStmt *) parsetree); /* stashed internally */ commandCollected = true; EventTriggerAlterTableEnd(); break; Does this actually make sense? I mean, are sequences really relations? Or was that just a side effect of storing the state in a heap table (which is more of an implementation detail)? 0006 -- no comment, just moving code 0007 -- I wonder why heap_create_with_catalog needs to do this (check that it's a sequence): if ((RELKIND_HAS_TABLE_AM(relkind) && relkind != RELKIND_TOASTVALUE) || relkind == RELKIND_SEQUENCE) Presumably this is to handle sequences that use heap to store the state? Maybe the comment should explain that. Also, will the other table AMs need to do something similar, just in case some sequence happens to use that table AM (which seems out of control of the table AM)? I don't understand why DefineSequence need to copy the string: stmt->accessMethod = seq->accessMethod ? pstrdup(seq->accessMethod) : NULL; RelationInitTableAccessMethod now does not need to handle sequences, or rather should not be asked to handle sequences. Is there a risk we'd pass a sequence to the function anyway? Maybe an assert / error would be appropriate? This bit in RelationBuildLocalRelation looks a bit weird ... if (RELKIND_HAS_TABLE_AM(relkind)) RelationInitTableAccessMethod(rel); else if (relkind == RELKIND_SEQUENCE) RelationInitSequenceAccessMethod(rel); It's not a fault of this patch, but shouldn't we now have something like RELKIND_HAS_SEQUENCE_AM()? 0008-0010 --- no comment logical decoding / replication Now, regarding the logical decoding / replication, would introducing the sequence AM interfere with that in some way? Either in general, or with respect to the nearby patch. That is, what would it take to support logical replication of sequences with some custom sequence AM? I believe that requires (a) synchronizing the initial value, and (b) decoding the sequence WAL and (c) apply the decoded changes. I don't think the sequence AM breaks any of this, as long as it allows selecting "current value", decoding the values from WAL, sending them to the subscriber, etc. I guess the decoding would be up to the RMGR, and this patch maintains the 1:1 mapping of sequences to relfilenodes, right? (That is, CREATE and ALTER SEQUENCE would still create a new relfilenode, which is rather important to decide if a sequence change is transactional.) It seems to me this does not change the non-transactional behavior of sequences, right? alternative sequence AMs -- I understand one of the reasons for adding sequence AMs is to allow stuff like global/distributed sequences, etc. But will people actually use that? For example, I believe Simon originally proposed this in 2016 because the plan was to implement distributed sequences in BDR on top of it. But I believe BDR ultimately went with a very different approach, not with custom sequence AMs. So I'm a bit skeptical this being suitable for other active-active systems ... Especially when the general consensus seems to be that for active-active systems it's much better to use e.g. UUID, because that does not require any coordination between the nodes, etc. I'm not claiming there are no use cases for sequence AMs, of course. For example the PRNG-based sequences mentioned by Mattias seems interesting. I don't know how widely useful that is, though, and if it's worth it (considering they managed to
Re: speed up a logical replica setup
On Thu, 22 Feb 2024 at 21:17, Hayato Kuroda (Fujitsu) wrote: > > Dear Vignesh, > > > Few comments on the tests: > > 1) If the dry run was successful because of some issue then the server > > will be stopped so we can check for "pg_ctl status" if the server is > > running otherwise the connection will fail in this case. Another way > > would be to check if it does not have "postmaster was stopped" > > messages in the stdout. > > + > > +# Check if node S is still a standby > > +is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'), > > + 't', 'standby is in recovery'); > > Just to confirm - your suggestion is to add `pg_ctl status`, right? Added. Yes, that is correct. > > 2) Can we add verification of "postmaster was stopped" messages in > > the stdout for dry run without --databases testcase > > +# pg_createsubscriber can run without --databases option > > +command_ok( > > + [ > > + 'pg_createsubscriber', '--verbose', > > + '--dry-run', '--pgdata', > > + $node_s->data_dir, '--publisher-server', > > + $node_p->connstr('pg1'), '--subscriber-server', > > + $node_s->connstr('pg1') > > + ], > > + 'run pg_createsubscriber without --databases'); > > + > > Hmm, in case of --dry-run, the server would be never shut down. > See below part. > > ``` > if (!dry_run) > stop_standby_server(pg_ctl_path, opt.subscriber_dir); > ``` One way to differentiate whether the server is run in dry_run mode or not is to check if the server was stopped or not. So I mean we can check that the stdout does not have a "postmaster was stopped" message from the stdout. Can we add validation based on this code: + if (action) + pg_log_info("postmaster was started"); Or another way is to check pg_ctl status to see that the server is not shutdown. > > 3) This message "target server must be running" seems to be wrong, > > should it be cannot specify cascading replicating standby as standby > > node(this is for v22-0011 patch : > > + 'pg_createsubscriber', '--verbose', '--pgdata', > > $node_c->data_dir, > > + '--publisher-server', $node_s->connstr('postgres'), > > + '--port', $node_c->port, '--socketdir', $node_c->host, > > + '--database', 'postgres' > > ], > > - 'primary server is in recovery'); > > + 1, > > + [qr//], > > + [qr/primary server cannot be in recovery/], > > + 'target server must be running'); > > Fixed. > > > 4) Should this be "Wait for subscriber to catch up" > > +# Wait subscriber to catch up > > +$node_s->wait_for_subscription_sync($node_p, $subnames[0]); > > +$node_s->wait_for_subscription_sync($node_p, $subnames[1]); > > Fixed. > > > 5) Should this be 'Subscriptions has been created on all the specified > > databases' > > +); > > +is($result, qq(2), > > + 'Subscriptions has been created to all the specified databases' > > +); > > Fixed, but "has" should be "have". > > > 6) Add test to verify current_user is not a member of > > ROLE_PG_CREATE_SUBSCRIPTION, has no create permissions, has no > > permissions to execution replication origin advance functions > > > > 7) Add tests to verify insufficient max_logical_replication_workers, > > max_replication_slots and max_worker_processes for the subscription > > node > > > > 8) Add tests to verify invalid configuration of wal_level, > > max_replication_slots and max_wal_senders for the publisher node > > Hmm, but adding these checks may increase the test time. we should > measure the time and then decide. We can check and see if it does not take significantly more time, then we can have these tests. Regards, Vignesh
Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)
Dean Rasheed writes: > Attached is a patch that prevents UNION ALL subquery pullup in MERGE only. I think that this is a band-aid that's just masking an error in the rowmarking logic: it's not doing the right thing for appendrels made from UNION ALL subqueries. I've not wrapped my head around exactly where it's going off the rails, but I feel like this ought to get fixed somewhere else. Please hold off pushing your patch for now. (The test case looks valuable though, thanks for working on that.) regards, tom lane
RE: speed up a logical replica setup
Dear Vignesh, > Few comments on the tests: > 1) If the dry run was successful because of some issue then the server > will be stopped so we can check for "pg_ctl status" if the server is > running otherwise the connection will fail in this case. Another way > would be to check if it does not have "postmaster was stopped" > messages in the stdout. > + > +# Check if node S is still a standby > +is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'), > + 't', 'standby is in recovery'); Just to confirm - your suggestion is to add `pg_ctl status`, right? Added. > 2) Can we add verification of "postmaster was stopped" messages in > the stdout for dry run without --databases testcase > +# pg_createsubscriber can run without --databases option > +command_ok( > + [ > + 'pg_createsubscriber', '--verbose', > + '--dry-run', '--pgdata', > + $node_s->data_dir, '--publisher-server', > + $node_p->connstr('pg1'), '--subscriber-server', > + $node_s->connstr('pg1') > + ], > + 'run pg_createsubscriber without --databases'); > + Hmm, in case of --dry-run, the server would be never shut down. See below part. ``` if (!dry_run) stop_standby_server(pg_ctl_path, opt.subscriber_dir); ``` > 3) This message "target server must be running" seems to be wrong, > should it be cannot specify cascading replicating standby as standby > node(this is for v22-0011 patch : > + 'pg_createsubscriber', '--verbose', '--pgdata', > $node_c->data_dir, > + '--publisher-server', $node_s->connstr('postgres'), > + '--port', $node_c->port, '--socketdir', $node_c->host, > + '--database', 'postgres' > ], > - 'primary server is in recovery'); > + 1, > + [qr//], > + [qr/primary server cannot be in recovery/], > + 'target server must be running'); Fixed. > 4) Should this be "Wait for subscriber to catch up" > +# Wait subscriber to catch up > +$node_s->wait_for_subscription_sync($node_p, $subnames[0]); > +$node_s->wait_for_subscription_sync($node_p, $subnames[1]); Fixed. > 5) Should this be 'Subscriptions has been created on all the specified > databases' > +); > +is($result, qq(2), > + 'Subscriptions has been created to all the specified databases' > +); Fixed, but "has" should be "have". > 6) Add test to verify current_user is not a member of > ROLE_PG_CREATE_SUBSCRIPTION, has no create permissions, has no > permissions to execution replication origin advance functions > > 7) Add tests to verify insufficient max_logical_replication_workers, > max_replication_slots and max_worker_processes for the subscription > node > > 8) Add tests to verify invalid configuration of wal_level, > max_replication_slots and max_wal_senders for the publisher node Hmm, but adding these checks may increase the test time. we should measure the time and then decide. > 9) We can use the same node name in comment and for the variable > +# Set up node P as primary > +$node_p = PostgreSQL::Test::Cluster->new('node_p'); > +$node_p->init(allows_streaming => 'logical'); > +$node_p->start; Fixed. > 10) Similarly we can use node_f instead of F in the comments. > +# Set up node F as about-to-fail node > +# Force it to initialize a new cluster instead of copying a > +# previously initdb'd cluster. > +{ > + local $ENV{'INITDB_TEMPLATE'} = undef; > + > + $node_f = PostgreSQL::Test::Cluster->new('node_f'); > + $node_f->init(allows_streaming => 'logical'); > + $node_f->start; > Fixed. Also, recent commit [1] allows to run the initdb forcibly. So followed. New patch can be available in [2]. [1]: https://github.com/postgres/postgres/commit/ff9e1e764fcce9a34467d614611a34d4d2a91b50 [2]: https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Vignesh, > Few comments: > 1) The below code can lead to assertion failure if the publisher is > stopped while dropping the replication slot: > + if (primary_slot_name != NULL) > + { > + conn = connect_database(dbinfo[0].pubconninfo); > + if (conn != NULL) > + { > + drop_replication_slot(conn, [0], > primary_slot_name); > + } > + else > + { > + pg_log_warning("could not drop replication > slot \"%s\" on primary", > + primary_slot_name); > + pg_log_warning_hint("Drop this replication > slot soon to avoid retention of WAL files."); > + } > + disconnect_database(conn); > + } > > pg_createsubscriber: error: connection to database failed: connection > to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or > directory > Is the server running locally and accepting connections on that socket? > pg_createsubscriber: warning: could not drop replication slot > "standby_1" on primary > pg_createsubscriber: hint: Drop this replication slot soon to avoid > retention of WAL files. > pg_createsubscriber: pg_createsubscriber.c:432: disconnect_database: > Assertion `conn != ((void *)0)' failed. > Aborted (core dumped) > > This is happening because we are calling disconnect_database in case > of connection failure case too which has the following assert: > +static void > +disconnect_database(PGconn *conn) > +{ > + Assert(conn != NULL); > + > + PQfinish(conn); > +} Right. disconnect_database() was moved to if (conn != NULL) block. > 2) There is a CheckDataVersion function which does exactly this, will > we be able to use this: > + /* Check standby server version */ > + if ((ver_fd = fopen(versionfile, "r")) == NULL) > + pg_fatal("could not open file \"%s\" for reading: %m", > versionfile); > + > + /* Version number has to be the first line read */ > + if (!fgets(rawline, sizeof(rawline), ver_fd)) > + { > + if (!ferror(ver_fd)) > + pg_fatal("unexpected empty file \"%s\"", versionfile); > + else > + pg_fatal("could not read file \"%s\": %m", > versionfile); > + } > + > + /* Strip trailing newline and carriage return */ > + (void) pg_strip_crlf(rawline); > + > + if (strcmp(rawline, PG_MAJORVERSION) != 0) > + { > + pg_log_error("standby server is of wrong version"); > + pg_log_error_detail("File \"%s\" contains \"%s\", > which is not compatible with this program's version \"%s\".", > + versionfile, > rawline, PG_MAJORVERSION); > + exit(1); > + } > + > + fclose(ver_fd); > 3) Should this be added to typedefs.list: > +enum WaitPMResult > +{ > + POSTMASTER_READY, > + POSTMASTER_STILL_STARTING > +}; But the comment from Peter E. [1] was opposite. I did not handle this. > 4) pgCreateSubscriber should be mentioned after pg_controldata to keep > the ordering consistency: > diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml > index aa94f6adf6..c5edd244ef 100644 > --- a/doc/src/sgml/reference.sgml > +++ b/doc/src/sgml/reference.sgml > @@ -285,6 +285,7 @@ > > > > + > This has been already pointed out by Peter E. I did not handle this. > 5) Here pg_replication_slots should be pg_catalog.pg_replication_slots: > + if (primary_slot_name) > + { > + appendPQExpBuffer(str, > + "SELECT 1 FROM > pg_replication_slots " > + "WHERE active AND > slot_name = '%s'", > + primary_slot_name); Fixed. > 6) Here pg_settings should be pg_catalog.pg_settings: > +* - max_worker_processes >= 1 + number of dbs to be converted > + > * > +*/ > + res = PQexec(conn, > +"SELECT setting FROM pg_settings > WHERE name IN (" > +"'max_logical_replication_workers', " > +"'max_replication_slots', " > +"'max_worker_processes', " > +"'primary_slot_name') " > +"ORDER BY name"); Fixed. New version can be available in [2] [1]: https://www.postgresql.org/message-id/3ee79f2c-e8b3-4342-857c-a31b87e1afda%40eisentraut.org [2]: https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Vignesh, > Few comments regarding the documentation: > 1) max_replication_slots information seems to be present couple of times: > > + > + The target instance must have > + linkend="guc-max-replication-slots">max_replication_slots me> > + and linkend="guc-max-logical-replication-workers">max_logical_replica > tion_workers > + configured to a value greater than or equal to the number of target > + databases. > + > > + > + > + The target instance must have > + linkend="guc-max-replication-slots">max_replication_slots me> > + configured to a value greater than or equal to the number of target > + databases and replication slots. > + > + Fixed. > 2) Can we add an id to prerequisites and use it instead of referring > to r1-app-pg_createsubscriber-1: > - pg_createsubscriber checks if the > given target data > - directory has the same system identifier than the source data directory. > - Since it uses the recovery process as one of the steps, it starts the > - target server as a replica from the source server. If the system > - identifier is not the same, > pg_createsubscriber will > - terminate with an error. > + Checks the target can be converted. In particular, things listed in > + above section > would be > + checked. If these are not met > pg_createsubscriber > + will terminate with an error. > Changed. > 3) The code also checks the following: > Verify if a PostgreSQL binary (progname) is available in the same > directory as pg_createsubscriber. > > But this is not present in the pre-requisites of documentation. I think it is quite trivial so that I did not add. > 4) Here we mention that the target server should be stopped, but the > same is not mentioned in prerequisites: > + Here is an example of using > pg_createsubscriber. > + Before running the command, please make sure target server is stopped. > + > +$ pg_ctl -D /usr/local/pgsql/data > stop > + > + Oh, it is opposite, it should NOT be stopped. Fixed. > 5) If there is an error during any of the pg_createsubscriber > operation like if create subscription fails, it might not be possible > to rollback to the earlier state which had physical-standby > replication. I felt we should document this and also add it to the > console message like how we do in case of pg_upgrade. Added. New version can be available in [1] [1]: https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Alvaro, > > 15. > > > > You said in case of failure, cleanups is not needed if the process exits > > soon [1]. > > But some functions call PQfinish() then exit(1) or pg_fatal(). Should we > > follow? > > Hmm, but doesn't this mean that the server will log an ugly message that > "client closed connection unexpectedly"? I think it's nicer to close > the connection before terminating the process (especially since the > code for that is already written). OK. So we should disconnect properly even if the process exits. I added the function call again. Note that PQclear() was not added because it is only related with the application. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Re: btree: downlink right separator/HIKEY optimization
On Sat, 6 Jan 2024 at 16:40, vignesh C wrote: > > CFBot shows the following compilation error at [1]: > [16:56:22.153] FAILED: > src/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj > [...] > ../src/backend/access/nbtree/nbtsearch.c > [16:56:22.153] ../src/backend/access/nbtree/nbtsearch.c(112): error > C2143: syntax error: missing ';' before 'type' > [16:56:22.280] ../src/backend/access/nbtree/nbtsearch.c(112): warning > C4091: ' ': ignored on left of 'int' when no variable is declared I forgot to address this in the previous patch, so here's v3 which fixes the issue warning. Kind regards, Matthias van de Meent Neon (https://neon.tech) v3-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patch Description: Binary data
Re: Reducing output size of nodeToString
On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent wrote: > > On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent > wrote: > > Attached the updated version of the patch on top of 5497daf3, which > > incorporates this last round of feedback. > > Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b > and an issue in the previous patchset: I attached one too many v3-0001 > from a previous patch I worked on. ... and now with a fix for not overwriting newly deserialized location attributes with -1, which breaks test output for READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant changes since the patch of last Monday. Sorry for the noise, -Matthias v7-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch Description: Binary data v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch Description: Binary data v7-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch Description: Binary data v7-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch Description: Binary data v7-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch Description: Binary data v7-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch Description: Binary data v7-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch Description: Binary data v7-0008-nodeToString-omit-serializing-0s-in-enum-typed-fi.patch Description: Binary data
Re: Test to dump and restore objects left behind by regression
Peter Eisentraut writes: > The problem is, we don't really have any end-to-end coverage of > dump > restore > dump again > compare the two dumps > with a database with lots of interesting objects in it. I'm very much against adding another full run of the core regression tests to support this. But beyond the problem of not bloating the check-world test runtime, there is the question of what this would actually buy us. I doubt that it is worth very much, because it would not detect bugs-of-omission in pg_dump. As I remarked in the other thread, if pg_dump is blind to the existence of some feature or field, testing that the dumps compare equal will fail to reveal that it didn't restore that property. I'm not sure what we could do about that. One could imagine writing some test infrastructure that dumps out the contents of the system catalogs directly, and comparing that instead of pg_dump output. But that'd be a lot of infrastructure to write and maintain ... and it's not real clear why it wouldn't *also* suffer from I-forgot-to-add-this hazards. On balance, I think there are good reasons that we've not added such a test, and I don't believe those tradeoffs have changed. regards, tom lane
Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c
> On 6 Dec 2023, at 14:03, Nazir Bilal Yavuz wrote: > There is an ongoing thread [1] for adding missing SQL error codes to > PANIC and FATAL error reports in xlogrecovery.c file. I did the same > but for xlog.c and relcache.c files. - elog(PANIC, "space reserved for WAL record does not match what was written"); + ereport(PANIC, + (errcode(ERRCODE_DATA_CORRUPTED), +errmsg("space reserved for WAL record does not match what was written"))); elogs turned into ereports should use errmsg_internal() to keep the strings from being translated. - elog(FATAL, "could not write init file"); + ereport(FATAL, + (errcode_for_file_access(), +errmsg("could not write init file"))); Is it worthwhile adding %m on these to get a little more help when debugging errors that shouldn't happen? - elog(FATAL, "could not write init file"); + ereport(FATAL, + (errcode_for_file_access(), The extra parenthesis are no longer needed, I don't know if we have a policy to remove them when changing an ereport call but we should at least not introduce new ones. - elog(FATAL, "cannot read pg_class without having selected a database"); + ereport(FATAL, + (errcode(ERRCODE_INTERNAL_ERROR), ereport (and thus elog) already defaults to ERRCODE_INTERNAL_ERROR for ERROR or higher, so unless there is a better errcode an elog() call if preferrable here. > I couldn't find a suitable error code for the "cache lookup failed for > relation" error in relcache.c and this error comes up in many places. > Would it be reasonable to create a new error code specifically for > this? We use ERRCODE_UNDEFINED_OBJECT for similar errors elsewhere, perhaps we can use that for these as well? -- Daniel Gustafsson
Re: Documentation to upgrade logical replication cluster
On Thu, 22 Feb 2024 at 09:35, Peter Smith wrote: > > Here are some minor comments for patch v8-0001. > > == > doc/src/sgml/glossary.sgml > > 1. > + > + > + A set of publisher and subscriber instances with publisher instance > + replicating changes to the subscriber instance. > + > + > > /with publisher instance/with the publisher instance/ Modified > ~~~ > > 2. > There are 2 SQL fragments but they are wrapped differently (see > below). e.g. it is not clear why is the 2nd fragment wrapped since it > is shorter than the 1st. OTOH, maybe you want the 1st fragment to > wrap. Either way, consistency wrapping would be better. Modified Thanks for the comments, the attached v9 version patch has the changes for the same. I have added a commitfest entry for this: https://commitfest.postgresql.org/47/4848/ Regards, Vignesh From 3ff1910ec221c4e5f2e1eb8291848528d6ec479a Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Tue, 30 Jan 2024 08:55:20 +0530 Subject: [PATCH v9] Documentation for upgrading logical replication cluster. Documentation for upgrading logical replication cluster. --- doc/src/sgml/glossary.sgml| 10 + doc/src/sgml/logical-replication.sgml | 821 ++ doc/src/sgml/ref/pgupgrade.sgml | 131 +--- 3 files changed, 839 insertions(+), 123 deletions(-) diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml index 8c2f11480d..45cea16e8f 100644 --- a/doc/src/sgml/glossary.sgml +++ b/doc/src/sgml/glossary.sgml @@ -1068,6 +1068,16 @@ + + Logical replication cluster + + + A set of publisher and subscriber instances with the publisher instance + replicating changes to the subscriber instance. + + + + Log record diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml index ec2130669e..d1eebe60e8 100644 --- a/doc/src/sgml/logical-replication.sgml +++ b/doc/src/sgml/logical-replication.sgml @@ -1926,6 +1926,827 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER + + Upgrade + + + Migration of logical replication clusters + is possible only when all the members of the old logical replication + clusters are version 17.0 or later. + + + + Prepare for publisher upgrades + + + attempts to migrate logical +slots. This helps avoid the need for manually defining the same +logical slots on the new publisher. Migration of logical slots is +only supported when the old cluster is version 17.0 or later. +Logical slots on clusters before version 17.0 will silently be +ignored. + + + +Before you start upgrading the publisher cluster, ensure that the +subscription is temporarily disabled to avoid the subscriber connection +failures during publisher upgrade, by executing +ALTER SUBSCRIPTION ... DISABLE. +Re-enable the subscription after the upgrade. + + + +The following prerequisites are required for pg_upgrade +to be able to upgrade the logical slots. If these are not met an error +will be reported. + + + + + + The new cluster must have + wal_level as + logical. + + + + + The new cluster must have + max_replication_slots + configured to a value greater than or equal to the number of slots + present in the old cluster. + + + + + The output plugins referenced by the slots on the old cluster must be + installed in the new PostgreSQL executable + directory. + + + + + The old cluster must have replicated all the transactions and logical + decoding messages to subscribers. + + + + + All slots on the old cluster must be usable, i.e., there are no slots + whose + pg_replication_slots.conflict_reason + is not NULL, e.g.: + +postgres=# SELECT slot_name +postgres-# FROM pg_replication_slots +postgres-# WHERE slot_type = 'logical' AND conflict_reason IS NOT NULL; + slot_name +--- +(0 rows) + + + + + + The new cluster must not have permanent logical slots, i.e., + there must be no slots listed with: + +SELECT count(*) + FROM pg_replication_slots + WHERE slot_type = 'logical' AND temporary IS false; + + + + + + + + Prepare for subscriber upgrades + + +Setup the +subscriber configurations in the new subscriber. + + + +pg_upgrade attempts to migrate subscription +dependencies which includes the subscription's table information present in +pg_subscription_rel +system catalog and also the subscription's replication origin. This allows +logical replication on the new subscriber to continue from where the +old subscriber was up to. Migration of subscription dependencies is only +supported when the old cluster is version 17.0 or later. Subscription +dependencies on clusters before
Re: btree: downlink right separator/HIKEY optimization
On Tue, 5 Dec 2023 at 08:43, Heikki Linnakangas wrote: > > On 01/11/2023 00:08, Matthias van de Meent wrote: > > By caching the right separator index tuple in _bt_search, we can > > compare the downlink's right separator and the HIKEY, and when they > > are equal (memcmp() == 0) we don't have to call _bt_compare - the > > HIKEY is known to be larger than the scan key, because our key is > > smaller than the right separator, and thus transitively also smaller > > than the HIKEY because it contains the same data. As _bt_compare can > > call expensive user-provided functions, this can be a large > > performance boon, especially when there are only a small number of > > column getting compared on each page (e.g. index tuples of many 100s > > of bytes, or dynamic prefix truncation is enabled). > > What would be the worst case scenario for this? One situation where the > memcmp() would not match is when there is a concurrent page split. I > think it's OK to pessimize that case. Are there any other situations? There is also concurrent page deletion which can cause downlinked pages to get removed from the set of accessible pages, but that's quite rare, too: arguably even more rare than page splits. > When the memcmp() matches, I think this is almost certainly not slower > than calling the datatype's comparison function. > > > if (offnum < PageGetMaxOffsetNumber(page)) > > [...] > > else if (!P_RIGHTMOST(opaque)) > > [...] > > } > > This could use a one-line comment above this, something like "Remember > the right separator of the downlink we follow, to speed up the next > _bt_moveright call". Done. > Should there be an "else rightsep = NULL;" here? Is it possible that we > follow the non-rightmost downlink on a higher level and rightmost > downlink on next level? Concurrent page deletion? While possible, the worst this could do is be less efficient in those fringe cases: The cached right separator is a key that is known to compare larger than the search key and thus always correct to use as an optimization for "is this HIKEY larger than my search key", as long as we don't clobber the data in that cache (which we don't). Null-ing the argument, while not incorrect, could be argued to be worse than useless here, as the only case where NULL may match an actual highkey is on the rightmost page, which we already special-case in _bt_moveright before hitting the 'compare the highkey' code. Removal of the value would thus remove any chance of using the optimization after hitting the rightmost page in a layer below. I've added a comment to explain this in an empty else block in the attached version 2 of the patch. > Please update the comment above _bt_moveright to describe the new > argument. Perhaps the text from README should go there, this feels like > a detail specific to _bt_search and _bt_moveright. Done. Thank you for the review. Kind regards, Matthias van de Meent Neon (https://neon.tech) v2-0001-btree-optimize-_bt_moveright-using-downlink-s-rig.patch Description: Binary data
Slow catchup of 2PC (twophase) transactions on replica in LR
Dear All, I'd like to present and talk about a problem when 2PC transactions are applied quite slowly on a replica during logical replication. There is a master and a replica with established logical replication from the master to the replica with twophase = true. With some load level on the master, the replica starts to lag behind the master, and the lag will be increasing. We have to significantly decrease the load on the master to allow replica to complete the catchup. Such problem may create significant difficulties in the production. The problem appears at least on REL_16_STABLE branch. To reproduce the problem: * Setup logical replication from master to replica with subscription parameter twophase = true. * Create some intermediate load on the master (use pgbench with custom sql with prepare+commit) * Optionally switch off the replica for some time (keep load on master). * Switch on the replica and wait until it reaches the master. The replica will never reach the master with even some low load on the master. If to remove the load, the replica will reach the master for much greater time, than expected. I tried the same for regular transactions, but such problem doesn't appear even with a decent load. I think, the main proplem of 2PC catchup bad performance - the lack of asynchronous commit support for 2PC. For regular transactions asynchronous commit is used on the replica by default (subscrition sycnronous_commit = off). It allows the replication worker process on the replica to avoid fsync (XLogFLush) and to utilize 100% CPU (the background wal writer or checkpointer will do fsync). I agree, 2PC are mostly used in multimaster configurations with two or more nodes which are performed synchronously, but when the node in catchup (node is not online in a multimaster cluster), asynchronous commit have to be used to speedup the catchup. There is another thing that affects on the disbalance of the master and replica performance. When the master executes requestes from multiple clients, there is a fsync optimization takes place in XLogFlush. It allows to decrease the number of fsync in case when a number of parallel backends write to the WAL simultaneously. The replica applies received transactions in one thread sequentially, such optimization is not applied. I see some possible solutions: * Implement asyncronous commit for 2PC transactions. * Do some hacking with enableFsync when it is possible. I think, asynchronous commit support for 2PC transactions should significantly increase replica performance and help to solve this problem. I tried to implement it (like for usual transactions) but I've found another problem: 2PC state is stored in WAL on prepare, on commit we have to read 2PC state from WAL but the read is delayed until WAL is flushed by the background wal writer (read LSN should be less than flush LSN). Storing 2PC state in a shared memory (as it proposed earlier) may help. I used the following query to monitor the catchup progress on the master:SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication; I used the following script for pgbench to the master:SELECT md5(random()::text) as mygid \gset BEGIN; DELETE FROM test WHERE v = pg_backend_pid(); INSERT INTO test(v) SELECT pg_backend_pid(); PREPARE TRANSACTION $$:mygid$$; COMMIT PREPARED $$:mygid$$; What do you think? With best regards, Vitaly Davydov
Re: When extended query protocol ends?
On Thu, 22 Feb 2024 at 13:01, Michael Zhilin wrote: > I would like to say this document states that "at completion... frontend > should issue a Sync message... causes the backend to close the current > transaction" > It looks like the sense of wording is "to complete transaction" at the > eventual end of traffic, but not "to switch to single protocol". > Otherwise, we can't use both protocols under same transaction that looks too > strict limitation. Sync only closes the current transaction when you didn't explicitly open one, i.e. you're using an implicit transaction (part of t. If you open an explicit transaction (, then you can use both extended and simple protocol messages in the same transaction. The way I understand it is: Multiple extended messages together form a single pipeline (easier understood as SQL statement), until a Sync is reached. So Parse-Bind-Execute-Parse-Bind-Execute-Sync counts as a single unit from postgres its visibility/rollback behaviour standpoint. And that's also where sending a Query instead of a Sync introduces a problem: What is supposed to happen if something fails. Let's take the simple example Bind-Execute-Query: If the Execute causes an error, is the Query still executed or not? And what about if the Query fails, is the Execute before committed or rolled back? That's why we have the Sync messages, clarify what happens when a failure occurs somewhere in the pipeline. And only extended protocol messages are allowed to be part of a pipeline: "Use of the extended query protocol allows pipelining"
RE: Synchronizing slots from primary to standby
On Thursday, February 22, 2024 8:41 PM Amit Kapila wrote: > > On Thu, Feb 22, 2024 at 5:23 PM Amit Kapila > wrote: > > > > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > > wrote: > > > > > > On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote: > > > > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot > > > > wrote: > > > > > > > > > > Hi, > > > > > > > > > > Thanks! > > > > > > > > > > Some random comments about v92_001 (Sorry if it has already been > > > > > discussed > > > > > up-thread): > > > > > > > > Thanks for the feedback. The patch is pushed 15 minutes back. > > > > > > Yeah, saw that after I send the comments ;-) > > > > > > > There is a BF failure. See > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-0 > 2-22%2010%3A13%3A03. > > > > The initial analysis suggests that for some reason, the primary went > > down after the slot sync worker was invoked the first time. See the > > below in the primary's LOG: > > > > The reason is that the test failed waiting on below LOG: > > ### Reloading node "standby1" > # Running: pg_ctl -D > /home/ec2-user/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_ > 040_standby_failover_slots_sync_standby1_data/pgdata > reload > server signaled > timed out waiting for match: (?^:LOG: slot sync worker started) at > t/040_standby_failover_slots_sync.pl line 376. > > Now, on standby, we see a LOG like 2024-02-22 10:57:35.432 UTC [2721638:1] > LOG: 0: slot sync worker started. Even then the test failed and the > reason is > that it has an extra before the actual message which is due to > log_error_verbosity = verbose in config. I think here the test's log matching > code needs to have a more robust log line matching code. Agreed. Here is a small patch to change the msg in wait_for_log so that it only search the message part. Best Regards, Hou zj 0001-Make-recovery-test-pass-with-log_error_verbosity-ver.patch Description: 0001-Make-recovery-test-pass-with-log_error_verbosity-ver.patch
RE: speed up a logical replica setup
Dear Shlok, > Hi, > > I have reviewed the v21 patch. And found an issue. > > Initially I started the standby server with a new postgresql.conf file > (not the default postgresql.conf that is present in the instance). > pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf" > > And I have made 'max_replication_slots = 1' in new postgresql.conf and > made 'max_replication_slots = 0' in the default postgresql.conf file. > Now when we run pg_createsubscriber on standby we get error: > pg_createsubscriber: error: could not set replication progress for the > subscription "pg_createsubscriber_5_242843": ERROR: cannot query or > manipulate replication origin when max_replication_slots = 0 > NOTICE: dropped replication slot "pg_createsubscriber_5_242843" on publisher > pg_createsubscriber: error: could not drop publication > "pg_createsubscriber_5" on database "postgres": ERROR: publication > "pg_createsubscriber_5" does not exist > pg_createsubscriber: error: could not drop replication slot > "pg_createsubscriber_5_242843" on database "postgres": ERROR: > replication slot "pg_createsubscriber_5_242843" does not exist > > I observed that when we run the pg_createsubscriber command, it will > stop the standby instance (the non-default postgres configuration) and > restart the standby instance which will now be started with default > postgresql.conf, where the 'max_replication_slot = 0' and > pg_createsubscriber will now fail with the error given above. > I have added the script file with which we can reproduce this issue. > Also similar issues can happen with other configurations such as port, etc. Possible. So the issue is that GUC settings might be changed after the restart so that the verification phase may not be enough. There are similar GUCs in [1] and they may have similar issues. E.g., if "hba_file" is set when the server started, the file cannot be seen after the restart, so pg_createsubscriber may not connect to it anymore. > The possible solution would be > 1) allow to run pg_createsubscriber if standby is initially stopped . > I observed that pg_logical_createsubscriber also uses this approach. > 2) read GUCs via SHOW command and restore them when server restarts > I also prefer the first solution. Another reason why the standby should be stopped is for backup purpose. Basically, the standby instance should be saved before running pg_createsubscriber. An easiest way is hard-copy, and the postmaster should be stopped at that time. I felt it is better that users can run the command immediately later the copying. Thought? [1]: https://www.postgresql.org/docs/current/storage-file-layout.html Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Re: Synchronizing slots from primary to standby
On Thu, Feb 22, 2024 at 5:23 PM Amit Kapila wrote: > > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > wrote: > > > > On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote: > > > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot > > > wrote: > > > > > > > > Hi, > > > > > > > > Thanks! > > > > > > > > Some random comments about v92_001 (Sorry if it has already been > > > > discussed > > > > up-thread): > > > > > > Thanks for the feedback. The patch is pushed 15 minutes back. > > > > Yeah, saw that after I send the comments ;-) > > > > There is a BF failure. See > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-22%2010%3A13%3A03. > > The initial analysis suggests that for some reason, the primary went > down after the slot sync worker was invoked the first time. See the > below in the primary's LOG: > The reason is that the test failed waiting on below LOG: ### Reloading node "standby1" # Running: pg_ctl -D /home/ec2-user/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_040_standby_failover_slots_sync_standby1_data/pgdata reload server signaled timed out waiting for match: (?^:LOG: slot sync worker started) at t/040_standby_failover_slots_sync.pl line 376. Now, on standby, we see a LOG like 2024-02-22 10:57:35.432 UTC [2721638:1] LOG: 0: slot sync worker started. Even then the test failed and the reason is that it has an extra before the actual message which is due to log_error_verbosity = verbose in config. I think here the test's log matching code needs to have a more robust log line matching code. -- With Regards, Amit Kapila.
Re: Reducing output size of nodeToString
On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent wrote: > Attached the updated version of the patch on top of 5497daf3, which > incorporates this last round of feedback. Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b and an issue in the previous patchset: I attached one too many v3-0001 from a previous patch I worked on. Kind regards, Matthias van de Meent Neon (https://neon.tech) v6-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch Description: Binary data v6-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch Description: Binary data v6-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch Description: Binary data v6-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch Description: Binary data v6-0008-nodeToString-omit-serializing-0s-in-enum-typed-fi.patch Description: Binary data v6-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch Description: Binary data v6-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch Description: Binary data v6-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch Description: Binary data
Re: Porting PostgresSQL libraries for QNX710
> On 22 Feb 2024, at 11:35, Rajith Rao .B(App Software) > wrote: > I have been using the Qt IDE with C++ for database connection and query > execution, and unfortunately, I cannot share the code with you. No worries, I have no intention to work on this. > You mentioned that PostgreSQL support for QNX was removed starting from > version 8.2. Are there any alternative methods to port or build PostgreSQL > libraries for QNX 7.1.0? There is no other way to build any software on a new architecture than rolling up the sleeves and getting started. I suggest looking at commits f55808828569, a1675649e402 and 6f84b2da75d3 in the postgres repo as a starting point for research. The QNX support which was removed in 8.2 was targeting QNX4, so it may or may not be helpful. -- Daniel Gustafsson
Re: When extended query protocol ends?
Hi, On 2/22/24 14:09, Jelte Fennema-Nio wrote: Apparently, sending an extra message would increase the overhead of the protocol, thus reducing the efficiency of the application. What is the benefit of sending extra Sync? https://www.postgresql.org/docs/current/protocol-overview.html#PROTOCOL-MESSAGE-CONCEPTS suggests that is is fine to mix both simple and extended messages depending on the needs of the application. Yes, it's fine to mix and match extended and simple protocol. But the protocol docs quite clearly state that a sync is required before going back to the Simple protocol: "At completion of each series of extended-query messages, the frontend should issue a Sync message." https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY I would like to say this document states that "at completion... frontend should issue a Sync message... causes the backend to close the current transaction" It looks like the sense of wording is "to complete transaction" at the eventual end of traffic, but not "to switch to single protocol". Otherwise, we can't use both protocols under same transaction that looks too strict limitation. Terminating a sequence of extended messages with a Query message instead of a Sync message is definitely undefined behaviour. -- Michael Zhilin Postgres Professional +7(925)3366270 https://www.postgrespro.ru
Re: Synchronizing slots from primary to standby
On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot wrote: > > On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote: > > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot > > wrote: > > > > > > Hi, > > > > > > Thanks! > > > > > > Some random comments about v92_001 (Sorry if it has already been discussed > > > up-thread): > > > > Thanks for the feedback. The patch is pushed 15 minutes back. > > Yeah, saw that after I send the comments ;-) > There is a BF failure. See https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion=2024-02-22%2010%3A13%3A03. The initial analysis suggests that for some reason, the primary went down after the slot sync worker was invoked the first time. See the below in the primary's LOG: 2024-02-22 10:59:56.896 UTC [2721639:29] standby1_slotsync worker LOG: 0: statement: SELECT slot_name, plugin, confirmed_flush_lsn, restart_lsn, catalog_xmin, two_phase, failover, database, conflict_reason FROM pg_catalog.pg_replication_slots WHERE failover and NOT temporary 2024-02-22 10:59:56.896 UTC [2721639:30] standby1_slotsync worker LOCATION: exec_simple_query, postgres.c:1070 2024-02-22 11:00:26.967 UTC [2721639:31] standby1_slotsync worker LOG: 0: statement: SELECT slot_name, plugin, confirmed_flush_lsn, restart_lsn, catalog_xmin, two_phase, failover, database, conflict_reason FROM pg_catalog.pg_replication_slots WHERE failover and NOT temporary 2024-02-22 11:00:26.967 UTC [2721639:32] standby1_slotsync worker LOCATION: exec_simple_query, postgres.c:1070 2024-02-22 11:00:35.908 UTC [2721435:309] LOG: 0: received immediate shutdown request 2024-02-22 11:00:35.908 UTC [2721435:310] LOCATION: process_pm_shutdown_request, postmaster.c:2859 2024-02-22 11:00:35.911 UTC [2721435:311] LOG: 0: database system is shut down 2024-02-22 11:00:35.911 UTC [2721435:312] LOCATION: UnlinkLockFiles, miscinit.c:1138 -- With Regards, Amit Kapila.
Re: Avoid stack frame setup in performance critical routines using tail calls
I've rebased the 0001 patch and gone over it again and made a few additional changes besides what I mentioned in my review. On Wed, 9 Aug 2023 at 20:44, David Rowley wrote: > Here's a review of v2-0001: > 2. Why do you need to add the NULL check here? > > #ifdef USE_VALGRIND > - if (method != MCTX_ALIGNED_REDIRECT_ID) > + if (ret != NULL && method != MCTX_ALIGNED_REDIRECT_ID) > VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size); > #endif I removed this NULL check as we're calling the realloc function with no flags, so it shouldn't return NULL as it'll error out from any OOM errors. > 3. > > /* > * XXX: Probably no need to check for huge allocations, we only support > * one size? Which could theoretically be huge, but that'd not make > * sense... > */ > > They can't be huge per Assert(fullChunkSize <= MEMORYCHUNK_MAX_VALUE) > in SlabContextCreate(). I removed this comment and adjusted the comment just below that which checks the 'size' matches the expected slab chunk size. i.e. /* * Make sure we only allow correct request size. This doubles as the * MemoryContextCheckSize check. */ if (unlikely(size != slab->chunkSize)) > 4. It would be good to see some API documentation in the > MemoryContextMethods struct. This adds a lot of responsibility onto > the context implementation without any extra documentation to explain > what, for example, palloc is responsible for and what the alloc > function needs to do itself. I've done that too. I also added header comments for MemoryContextAllocationFailure and MemoryContextSizeFailure and added some comments to explain in places like palloc() to warn people not to add checks after the 'alloc' call. The rebased patch is 0001 and all of my changes are in 0002. I will rebase your original 0002 patch later. I think 0001 is much more important, as evident by the reported benchmarks on this thread. In absence of anyone else looking at this, I think it's ready to go. If anyone is following along and wants to review or test it, please do so soon. David From 854751aeb1d0b917bf6e96c30e80f8480f96e883 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 18 Jul 2023 18:55:58 -0700 Subject: [PATCH v3 1/2] Optimize palloc() etc to allow sibling calls The reason palloc() etc previously couldn't use sibling calls is that they did check the returned value (to e.g. raise an error when the allocation fails). Push the error handling down into the memory context implementation - they can avoid performing the check at all in the hot code paths. --- src/backend/utils/mmgr/alignedalloc.c | 11 +- src/backend/utils/mmgr/aset.c | 23 ++-- src/backend/utils/mmgr/generation.c | 14 ++- src/backend/utils/mmgr/mcxt.c | 172 ++ src/backend/utils/mmgr/slab.c | 12 +- src/include/nodes/memnodes.h | 4 +- src/include/utils/memutils_internal.h | 29 +++-- 7 files changed, 101 insertions(+), 164 deletions(-) diff --git a/src/backend/utils/mmgr/alignedalloc.c b/src/backend/utils/mmgr/alignedalloc.c index 7204fe64ae..c266fb3dbb 100644 --- a/src/backend/utils/mmgr/alignedalloc.c +++ b/src/backend/utils/mmgr/alignedalloc.c @@ -57,7 +57,7 @@ AlignedAllocFree(void *pointer) * memory will be uninitialized. */ void * -AlignedAllocRealloc(void *pointer, Size size) +AlignedAllocRealloc(void *pointer, Size size, int flags) { MemoryChunk *redirchunk = PointerGetMemoryChunk(pointer); Sizealignto; @@ -97,14 +97,17 @@ AlignedAllocRealloc(void *pointer, Size size) #endif ctx = GetMemoryChunkContext(unaligned); - newptr = MemoryContextAllocAligned(ctx, size, alignto, 0); + newptr = MemoryContextAllocAligned(ctx, size, alignto, flags); /* * We may memcpy beyond the end of the original allocation request size, * so we must mark the entire allocation as defined. */ - VALGRIND_MAKE_MEM_DEFINED(pointer, old_size); - memcpy(newptr, pointer, Min(size, old_size)); + if (likely(newptr != NULL)) + { + VALGRIND_MAKE_MEM_DEFINED(pointer, old_size); + memcpy(newptr, pointer, Min(size, old_size)); + } pfree(unaligned); return newptr; diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index 2f99fa9a2f..81c3120c2b 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -700,7 +700,7 @@ AllocSetDelete(MemoryContext context) * return space that is marked NOACCESS - AllocSetRealloc has to beware! */ void * -AllocSetAlloc(MemoryContext context, Size size) +AllocSetAlloc(MemoryContext context, Size size, int flags) { AllocSetset = (AllocSet) context; AllocBlock block; @@ -717,6 +717,9 @@ AllocSetAlloc(MemoryContext context, Size size) */ if (size > set->allocChunkLimit) { + /* check size, only allocation path where the limits could be hit */ +
Re: When extended query protocol ends?
On Thu, 22 Feb 2024 at 10:28, Vladimir Sitnikov wrote: > > >When splitting a multi insert statement you're going to duplicate some work > > I do not know how this could be made more efficient as I execute parse only > once, and then I send bind+exec+bind+exec > without intermediate sync messages, so the data should flow nicely in TCP > packets. I agree you cannot change that flow to be more efficient, but I meant that your comparison was not fair: 1. Multi-insert vs multiple single inserts is actually executing different queries 2. Going from Query -> Parse+Bind+Exec for the same query, only changes protocol related things > Here are some measurements regarding savepoints for simple vs extended > Sure they are not very scientific, however, they show improvement for simple > protocol Alright, those improvements are not huge, but I agree it's clear that the extended protocol has some overhead. So probably you'd want to keep using the simple protocol to send the SAVEPOINT query. > Apparently, sending an extra message would increase the overhead of the > protocol, thus reducing the efficiency of the application. > What is the benefit of sending extra Sync? > > https://www.postgresql.org/docs/current/protocol-overview.html#PROTOCOL-MESSAGE-CONCEPTS > suggests that is is fine to mix both simple and extended messages > depending on the needs of the application. Yes, it's fine to mix and match extended and simple protocol. But the protocol docs quite clearly state that a sync is required before going back to the Simple protocol: "At completion of each series of extended-query messages, the frontend should issue a Sync message." https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY Terminating a sequence of extended messages with a Query message instead of a Sync message is definitely undefined behaviour.
Re: Synchronizing slots from primary to standby
Hi, On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote: > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot > wrote: > > > > Hi, > > > > Thanks! > > > > Some random comments about v92_001 (Sorry if it has already been discussed > > up-thread): > > Thanks for the feedback. The patch is pushed 15 minutes back. Yeah, saw that after I send the comments ;-) > I will > prepare a top-up patch for your comments. Thanks! > > 4 === > > > > + if (wal_level < WAL_LEVEL_LOGICAL) > > + { > > + ereport(ERROR, > > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("slot synchronization requires > > wal_level >= \"logical\"")); > > + return false; > > + } > > > > I think the return is not needed here as it won't be reached due to the > > "ERROR". > > Or should we use "elevel" instead of "ERROR"? > > It was suggested to raise ERROR for wal_level validation, please see > [1]. But yes, I will remove the return value. Yeah, thanks, ERROR makes sense here. > > 5 === > > > > +* operate as a superuser. This is safe because the slot sync > > worker does > > +* not interact with user tables, eliminating the risk of executing > > +* arbitrary code within triggers. > > > > Right. I did not check but if we are using operators in our remote SPI calls > > then it would be worth to ensure they are coming from the pg_catalog schema? > > Using something like "OPERATOR(pg_catalog.=)" using "=" as an example. > > Can you please elaborate this one, I am not sure if I understood it. Suppose that in synchronize_slots() the query would be: const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," " restart_lsn, catalog_xmin, two_phase, failover," " database, conflict_reason" " FROM pg_catalog.pg_replication_slots" " WHERE failover and NOT temporary and 1 = 1"; Then my comment is to rewrite it to: const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," " restart_lsn, catalog_xmin, two_phase, failover," " database, conflict_reason" " FROM pg_catalog.pg_replication_slots" " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; to ensure the operator "=" is coming from the pg_catalog schema. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)
On Thu, 22 Feb 2024 at 03:46, zwj wrote: > > If I want to get the same results as Oracle, do I need to adjust the lock > behavior of the update and merge statements? > If I want to achieve the same results as Oracle, can I achieve exclusive > locking by adjusting update and merge? Do you have any suggestions? > I think that trying to get the same results in Oracle and Postgres may not always be possible. Each has their own (probably quite different) implementation of these features, that simply may not be compatible. In Postgres, MERGE aims to make UPDATE and DELETE actions behave in the same way as standalone UPDATE and DELETE commands under concurrent modifications. However, it does not attempt to prevent INSERT actions from inserting duplicates. In that context, the UNION ALL issue is a clear bug, and I'll aim to get that patch committed and back-patched sometime in the next few days, if there are no objections from other hackers. However, the issue with INSERT actions inserting duplicates is a design choice, rather than something that we regard as a bug. It's possible that a future version of Postgres might improve MERGE, providing some way round that issue, but there's no guarantee of that ever happening. Similarly, it sounds like Oracle also sometimes allows duplicates, as well as having other "bugs" like the one discussed in [1], that may be difficult for them to fix within their implementation. In Postgres, if the target table is subject to concurrent inserts (or primary key updates), it might be better to use INSERT ... ON CONFLICT DO UPDATE [2] instead of MERGE. That would avoid inserting duplicates (though I can't say how compatible that is with anything in Oracle). Regards, Dean [1] https://www.postgresql.org/message-id/CAEZATCV_6t5E57q7HsWQBX6a5YOjN5o7K-HicZ8a73EPzfwo=a...@mail.gmail.com [2] https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT
Re: About a recently-added message
On Thu, Feb 22, 2024 at 11:10 AM Kyotaro Horiguchi wrote: > > At Thu, 22 Feb 2024 10:51:07 +0530, Amit Kapila > wrote in > > > Do you think some additional tests for the rest of the messages are > > > worth the trouble? > > > > > > > We have discussed this during development and didn't find it worth > > adding tests for all misconfigured parameters. However, in the next > > patch where we are planning to add a slot sync worker that will > > automatically sync slots, we are adding a test for one more parameter. > > I am not against adding tests for all the parameters but it didn't > > appeal to add more test cycles for this. > > Thanks for the explanation. I'm fine with that. > Pushed. -- With Regards, Amit Kapila.
Re: speed up a logical replica setup
On Mon, 19 Feb 2024 at 11:15, Hayato Kuroda (Fujitsu) wrote: > > Dear hackers, > > > Since it may be useful, I will post top-up patch on Monday, if there are no > > updating. > > And here are top-up patches. Feel free to check and include. > > v22-0001: Same as v21-0001. > === rebased patches === > v22-0002: Update docs per recent changes. Same as v20-0002. > v22-0003: Add check versions of the target. Extracted from v20-0003. > v22-0004: Remove -S option. Mostly same as v20-0009, but commit massage was > slightly changed. > === Newbie === > V22-0005: Addressed my comments which seems to be trivial[1]. > Comments #1, 3, 4, 8, 10, 14, 17 were addressed here. > v22-0006: Consider the scenario when commands are failed after the recovery. > drop_subscription() is removed and some messages are added per [2]. > V22-0007: Revise server_is_in_recovery() per [1]. Comments #5, 6, 7, were > addressed here. > V22-0008: Fix a strange report when physical_primary_slot is null. Per > comment #9 [1]. > V22-0009: Prohibit reuse publications when it has already existed. Per > comments #11 and 12 [1]. > V22-0010: Avoid to call PQclear()/PQfinish()/pg_free() if the process exits > soon. Per comment #15 [1]. > V22-0011: Update testcode. Per comments #17- [1]. Few comments on the tests: 1) If the dry run was successful because of some issue then the server will be stopped so we can check for "pg_ctl status" if the server is running otherwise the connection will fail in this case. Another way would be to check if it does not have "postmaster was stopped" messages in the stdout. + +# Check if node S is still a standby +is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'), + 't', 'standby is in recovery'); 2) Can we add verification of "postmaster was stopped" messages in the stdout for dry run without --databases testcase +# pg_createsubscriber can run without --databases option +command_ok( + [ + 'pg_createsubscriber', '--verbose', + '--dry-run', '--pgdata', + $node_s->data_dir, '--publisher-server', + $node_p->connstr('pg1'), '--subscriber-server', + $node_s->connstr('pg1') + ], + 'run pg_createsubscriber without --databases'); + 3) This message "target server must be running" seems to be wrong, should it be cannot specify cascading replicating standby as standby node(this is for v22-0011 patch : + 'pg_createsubscriber', '--verbose', '--pgdata', $node_c->data_dir, + '--publisher-server', $node_s->connstr('postgres'), + '--port', $node_c->port, '--socketdir', $node_c->host, + '--database', 'postgres' ], - 'primary server is in recovery'); + 1, + [qr//], + [qr/primary server cannot be in recovery/], + 'target server must be running'); 4) Should this be "Wait for subscriber to catch up" +# Wait subscriber to catch up +$node_s->wait_for_subscription_sync($node_p, $subnames[0]); +$node_s->wait_for_subscription_sync($node_p, $subnames[1]); 5) Should this be 'Subscriptions has been created on all the specified databases' +); +is($result, qq(2), + 'Subscriptions has been created to all the specified databases' +); 6) Add test to verify current_user is not a member of ROLE_PG_CREATE_SUBSCRIPTION, has no create permissions, has no permissions to execution replication origin advance functions 7) Add tests to verify insufficient max_logical_replication_workers, max_replication_slots and max_worker_processes for the subscription node 8) Add tests to verify invalid configuration of wal_level, max_replication_slots and max_wal_senders for the publisher node 9) We can use the same node name in comment and for the variable +# Set up node P as primary +$node_p = PostgreSQL::Test::Cluster->new('node_p'); +$node_p->init(allows_streaming => 'logical'); +$node_p->start; 10) Similarly we can use node_f instead of F in the comments. +# Set up node F as about-to-fail node +# Force it to initialize a new cluster instead of copying a +# previously initdb'd cluster. +{ + local $ENV{'INITDB_TEMPLATE'} = undef; + + $node_f = PostgreSQL::Test::Cluster->new('node_f'); + $node_f->init(allows_streaming => 'logical'); + $node_f->start; Regards, Vignesh
Re: Synchronizing slots from primary to standby
On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot wrote: > > Hi, > > Thanks! > > Some random comments about v92_001 (Sorry if it has already been discussed > up-thread): Thanks for the feedback. The patch is pushed 15 minutes back. I will prepare a top-up patch for your comments. > 1 === > > +* We do not update the 'synced' column from true to false here > > Worth to mention from which system view the 'synced' column belongs to? Sure, I will change it. > 2 === (Nit) > > +#define MIN_WORKER_NAPTIME_MS 200 > +#define MAX_WORKER_NAPTIME_MS 3 /* 30s */ > > [MIN|MAX]_SLOTSYNC_WORKER_NAPTIME_MS instead? It is used only in slotsync.c so > more a Nit. Okay, will change it, > 3 === > > res = walrcv_exec(wrconn, query, SLOTSYNC_COLUMN_COUNT, slotRow); > - > if (res->status != WALRCV_OK_TUPLES) > > Line removal intended? I feel the current style is better, where we do not have space between the function call and return value checking. > 4 === > > + if (wal_level < WAL_LEVEL_LOGICAL) > + { > + ereport(ERROR, > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("slot synchronization requires > wal_level >= \"logical\"")); > + return false; > + } > > I think the return is not needed here as it won't be reached due to the > "ERROR". > Or should we use "elevel" instead of "ERROR"? It was suggested to raise ERROR for wal_level validation, please see [1]. But yes, I will remove the return value. Thanks for catching this. > 5 === > > +* operate as a superuser. This is safe because the slot sync worker > does > +* not interact with user tables, eliminating the risk of executing > +* arbitrary code within triggers. > > Right. I did not check but if we are using operators in our remote SPI calls > then it would be worth to ensure they are coming from the pg_catalog schema? > Using something like "OPERATOR(pg_catalog.=)" using "=" as an example. Can you please elaborate this one, I am not sure if I understood it. [1]: https://www.postgresql.org/message-id/CAD21AoB2ipSzQb5-o5pEYKie4oTPJTsYR1ip9_wRVrF6HbBWDQ%40mail.gmail.com thanks Shveta
Re: Test to dump and restore objects left behind by regression
On 22.02.24 11:00, Daniel Gustafsson wrote: On 22 Feb 2024, at 10:55, Ashutosh Bapat wrote: On Thu, Feb 22, 2024 at 3:03 PM Daniel Gustafsson wrote: Somebody looking for dump/restore tests wouldn't search src/bin/pg_upgrade, I think. Quite possibly not, but pg_upgrade is already today an important testsuite for testing pg_dump in binary-upgrade mode so maybe more developers touching pg_dump should? Yeah, I think attaching this to the existing pg_upgrade test would be a good idea. Not only would it save test run time, it would probably also reduce code duplication.
Re: Synchronizing slots from primary to standby
Hi, On Thu, Feb 22, 2024 at 12:16:34PM +0530, shveta malik wrote: > On Thu, Feb 22, 2024 at 10:31 AM shveta malik wrote: > There was a recent commit 801792e to improve error messaging in > slotsync.c which resulted in conflict. Thus rebased the patch. There > is no new change in the patch attached Thanks! Some random comments about v92_001 (Sorry if it has already been discussed up-thread): 1 === +* We do not update the 'synced' column from true to false here Worth to mention from which system view the 'synced' column belongs to? 2 === (Nit) +#define MIN_WORKER_NAPTIME_MS 200 +#define MAX_WORKER_NAPTIME_MS 3 /* 30s */ [MIN|MAX]_SLOTSYNC_WORKER_NAPTIME_MS instead? It is used only in slotsync.c so more a Nit. 3 === res = walrcv_exec(wrconn, query, SLOTSYNC_COLUMN_COUNT, slotRow); - if (res->status != WALRCV_OK_TUPLES) Line removal intended? 4 === + if (wal_level < WAL_LEVEL_LOGICAL) + { + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("slot synchronization requires wal_level >= \"logical\"")); + return false; + } I think the return is not needed here as it won't be reached due to the "ERROR". Or should we use "elevel" instead of "ERROR"? 5 === +* operate as a superuser. This is safe because the slot sync worker does +* not interact with user tables, eliminating the risk of executing +* arbitrary code within triggers. Right. I did not check but if we are using operators in our remote SPI calls then it would be worth to ensure they are coming from the pg_catalog schema? Using something like "OPERATOR(pg_catalog.=)" using "=" as an example. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Porting PostgresSQL libraries for QNX710
> On 22 Feb 2024, at 06:12, Rajith Rao .B(App Software) > wrote: > Hope my query is clear for you and expecting a resolution for this. There is no official port of libpq to QNX, so the short answer is that you're on your own. QNX support was removed in 8.2, so maybe looking at the code before that happened might give some insights on how to get started? -- Daniel Gustafsson
Porting PostgresSQL libraries for QNX710
Hello, I have been working on ubuntu 22.04 LTS with postgres in my applications and need to deploy that application on QNX710. I have a requirement to port postgresSQL 12.18 to QNX 7.1 ,is it possible to build/port postgreSQL libraries for QNX7.1 Intel and Aarch64 architectures. Hope my query is clear for you and expecting a resolution for this. Thanks & Regards, Ranjith Rao.B *** Disclaimer:The information contained in this e-mail and/or attachments to it may contain confidential data (or) privileged information of Medha. If you are not the intended recipient, any dissemination, use in any manner, review, distribution, printing, copying of the information contained in this e-mail and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify the sender and immediately delete the message and attachments (if any) permanently. "Please consider the environment before printing this message." ***
?????? bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)
Thanks your reply. I understand what you mean and have tried to correct this patch. According to the previous use case, the result obtained is as follows: id | name | year | xmax | xmin | ctid +--+--+--+--+--- 1 | liuwei | 20 | 0 | 859 | (0,1) 2 | zhangbin | 30 | 866 | 866 | (0,7) 3 | fuguo | 44 | 866 | 866 | (0,8) 4 | yihe | 33 | 0 | 865 | (0,6) 4 | yihe | 33 | 0 | 866 | (0,9) (5 rows) At present, the behavior of the number of rows for ??id?? 2 and 3 appears to be normal, but there is duplicate data in the data for ??id?? 4. According to what you said, this is a normal manifestation of transaction isolation level. But there are still differences between the results and those of Oracle(no duplicate data 'id' 4). After that I have tried several scenarios in Oracle and PG: 1??session1?? insert?? session2??merge into?? duplicate data may also occur ??pg and oracle consistent??. 2??session1: update + insert ,session2?? merge into?? there will be no duplicate data in oracle ??pg has duplicate data. It looks like there is an exclusive lock between the update statement and merge statement in oracle. After submitting both update and insert, merge will proceed with locking and execution. (Of course, this is just my guess.) However, it seems that both PG and Oracle have no obvious issues, and their respective designs are reasonable. If I want to get the same results as Oracle, do I need to adjust the lock behavior of the update and merge statements? If I want to achieve the same results as Oracle, can I achieve exclusive locking by adjusting update and merge? Do you have any suggestions? Regards, wenjiang zhang ---- ??: "Dean Rasheed" https://www.postgresql.org/docs/current/transaction-iso.html Regards, Dean
Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions
Thanks for the very thorough comments! I've attached a new version of the patch. On Thu, Feb 15, 2024 at 4:17 PM Andres Freund wrote: > Hi, > > On 2024-02-11 13:19:00 -0500, David Benjamin wrote: > > I've attached a patch for the master branch to fix up the custom BIOs > used > > by PostgreSQL, in light of the issues with the OpenSSL update recently. > > While c82207a548db47623a2bfa2447babdaa630302b9 (switching from > BIO_get_data > > to BIO_get_app_data) resolved the immediate conflict, I don't think it > > addressed the root cause, which is that PostgreSQL was mixing up two BIO > > implementations, each of which expected to be driving the BIO. > > Yea, that's certainly not nice - and I think we've been somewhat lucky it > hasn't caused more issues. There's some nasty possibilities, e.g. > sock_ctrl() > partially enabling ktls without our initialization having called > ktls_enable(). Right now that just means ktls is unusable, but it's not > hard > to imagine accidentally ending up sending unencrypted data. > Yeah. Even if, say, the ktls bits work, given you all care enough about I/O to have wanted to wrap the BIO, I assume you'd want to pick up those features on your own terms, e.g. by implementing the BIO_CTRLs yourself. > I've in the past looked into not using a custom BIO [1], but I have my > doubts > about that being a good idea. I think medium term we want to be able to do > network IO asynchronously, which seems quite hard to do when using > openssl's > socket BIO. > Once we've done that, we're free to use BIO_set_data. While > BIO_set_app_data > > works fine, I've reverted back to BIO_set_data because it's more > commonly used. > > app_data depends on OpenSSL's "ex_data" mechanism, which is a tad > heavier under > > the hood. > > At first I was a bit wary of that, because it requires us to bring back the > fallback implementation. But you're right, it's noticeably heavier than > BIO_get_data(), and we do call BIO_get_app_data() fairly frequently. > TBH, I doubt it makes any real difference perf-wise. But I think BIO_get_data is a bit more common in this context. > > That leaves ctrl. ctrl is a bunch of operations (it's ioctl). The only > > operation that matters is BIO_CTRL_FLUSH, which is implemented as a > no-op. All > > other operations are unused. It's once again good that they're unused > because > > otherwise OpenSSL might mess with postgres's socket, break the > assumptions > > around interrupt handling, etc. > > How did you determine that only FLUSH is required? I didn't even really > find > documentation about what the intended semantics actually are. > The unhelpful answer is that my day job is working on BoringSSL, so I've spent a lot of time with this mess. :-) But, yeah, it's not well-documented at all. OpenSSL ends up calling BIO_flush at the end of each batch of writes in libssl. TBH, I suspect that was less intentional and more an emergent property of them internally layering a buffer BIO at one point in the process, but it's long been part of the (undocumented) API contract. Conversely, I don't think OpenSSL can possibly make libssl *require* a new BIO_CTRL because they'd break every custom BIO anyone has ever used with the library. > E.g. should we implement BIO_CTRL_EOF? Sure, it wasn't really supported so > far, because we never set it, but is that right? Ah hmm, BIO_CTRL_EOF is... a bit of a mess. OpenSSL kind of messed things up. So, up until recently, I would have said that BIO_CTRL_EOF was not part of the interface here. OpenSSL 1.0.x did not implement it for sockets, and the BIO_read API *already* had a way to signal EOF: did you return zero or -1? Then OpenSSL 1.1.x introduced size_t-based BIO_read_ex APIs. However, in the process, they *forgot that EOF and error are different things* and made it impossible to detect EOFs if you use BIO_read_ex! They never noticed this, because they didn't actually convert their own code to their new API. See this discussion, which alas ended with OpenSSL deciding to ignore the problem and not even document their current interface. https://github.com/openssl/openssl/issues/8208 Though they never responded, they seem to have tacitly settled using the out-of-band BIO_eof function (which is BIO_CTRL_EOF) as the way to signal EOF for BIO_read_ex. This is kind of fiddly, but is at least a well-defined option. But the problem is no one's BIO_METHODs, including their own, are read_ex-based. They all implement the old read callback. But someone might call BIO_read_ex on a read-based BIO_METHOD. IMO, BIO_read_ex should be responsible for translating between the two EOF conventions. For example, it could automatically set a flag when the read callback returns 0 and then make BIO_ctrl check the flag and automatically implement BIO_CTRL_EOF for BIOs that don't do it themselves. Alas, OpenSSL did not do this and instead they just made their built-in BIOs implement BIO_CTRL_EOF, even though this new expectation is a compatibility
Re: Test to dump and restore objects left behind by regression
> On 22 Feb 2024, at 10:55, Ashutosh Bapat wrote: > On Thu, Feb 22, 2024 at 3:03 PM Daniel Gustafsson wrote: > Somebody looking for dump/restore tests wouldn't search > src/bin/pg_upgrade, I think. Quite possibly not, but pg_upgrade is already today an important testsuite for testing pg_dump in binary-upgrade mode so maybe more developers touching pg_dump should? >> When upgrading to the same version, we could perhaps also use this to test a >> scenario like: Dump A, restore into B, upgrade B into C, dump C and compare C >> to A. > > If comparison of C to A fails, we wouldn't know which step fails. I > would rather compare outputs of each step separately. To be clear, this wasn't intended to replace what you are proposing, but an idea for using it to test *more* scenarios. -- Daniel Gustafsson
Re: Test to dump and restore objects left behind by regression
On Thu, Feb 22, 2024 at 3:03 PM Daniel Gustafsson wrote: > > > On 22 Feb 2024, at 10:16, Peter Eisentraut wrote: > > > We have somewhat relied on the pg_upgrade test to provide this testing, but > > we have recently discovered that the dumps in binary-upgrade mode are > > different enough to not test the normal dumps well. > > > > Yes, this test is a bit expensive. We could save some time by doing the > > first dump at the end of the normal regress test and have the pg_dump test > > reuse that, but then that would make the regress test run a bit longer. Is > > that a better tradeoff? > > Something this expensive seems like what PG_TEST_EXTRA is intended for, we > already have important test suites there. That's ok with me. > > But. We know that the cluster has an interesting state when the pg_upgrade > test starts, could we use that to make a dump/restore test before continuing > with testing pg_upgrade? It can be argued that pg_upgrade shouldn't be > responsible for testing pg_dump, but it's already now a pretty important > testcase for pg_dump in binary upgrade mode so it's that far off. If pg_dump > has bugs then pg_upgrade risks subtly breaking. Somebody looking for dump/restore tests wouldn't search src/bin/pg_upgrade, I think. However if more people think we should just add this test 002_pg_upgrade.pl, I am fine with it. > > When upgrading to the same version, we could perhaps also use this to test a > scenario like: Dump A, restore into B, upgrade B into C, dump C and compare C > to A. If comparison of C to A fails, we wouldn't know which step fails. I would rather compare outputs of each step separately. -- Best Wishes, Ashutosh Bapat
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 22 Feb 2024 15:44:16 +0900, Michael Paquier wrote: > I was comparing what you have here, and what's been attached by Andres > at [1] and the top of the changes on my development branch at [2] > (v3-0008, mostly). And, it strikes me that there is no need to do any > major changes in any of the callbacks proposed up to v13 and v14 in > this thread, as all the changes proposed want to plug in more data > into each StateData for COPY FROM and COPY TO, the best part being > that v3-0008 can just reuse the proposed callbacks as-is. v1-0001 > from Sutou-san would need one slight tweak in the per-row callback, > still that's minor. I think so too. But I thought that some minor conflicts will be happen with this and the v15. So I worked on this before the v15. We agreed that this optimization doesn't block v15: [1] So we can work on the v15 without this optimization for now. [1] https://www.postgresql.org/message-id/flat/20240219195351.5vy7cdl3wxia66kg%40awork3.anarazel.de#20f9677e074fb0f8c5bb3994ef059a15 > I have been spending more time on the patch to introduce the COPY > APIs, leading me to the v15 attached, where I have replaced the > previous attribute callbacks for the output representation and the > reads with hardcoded routines that should be optimized by compilers, > and I have done more profiling with -O2. Thanks! I wanted to work on it but I didn't have enough time for it in a few days... I've reviewed the v15. > @@ -751,8 +751,9 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int > nbytes) > * > * NOTE: force_not_null option are not applied to the returned fields. > */ > -bool > -NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) > +static bool "inline" is missing here. > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, > + bool is_csv) > { > int fldct; How about adding "is_csv" to CopyReadline() and CopyReadLineText() too? diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 25b8d4bc52..79fabecc69 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -150,8 +150,8 @@ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0"; /* non-export function prototypes */ -static bool CopyReadLine(CopyFromState cstate); -static bool CopyReadLineText(CopyFromState cstate); +static inline bool CopyReadLine(CopyFromState cstate, bool is_csv); +static inline bool CopyReadLineText(CopyFromState cstate, bool is_csv); static inline int CopyReadAttributesText(CopyFromState cstate); static inline int CopyReadAttributesCSV(CopyFromState cstate); static Datum CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo, @@ -770,7 +770,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, tupDesc = RelationGetDescr(cstate->rel); cstate->cur_lineno++; - done = CopyReadLine(cstate); + done = CopyReadLine(cstate, is_csv); if (cstate->opts.header_line == COPY_HEADER_MATCH) { @@ -823,7 +823,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, cstate->cur_lineno++; /* Actually read the line into memory here */ - done = CopyReadLine(cstate); + done = CopyReadLine(cstate, is_csv); /* * EOF at start of line means we're done. If we see EOF after some @@ -1133,8 +1133,8 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, * by newline. The terminating newline or EOF marker is not included * in the final value of line_buf. */ -static bool -CopyReadLine(CopyFromState cstate) +static inline bool +CopyReadLine(CopyFromState cstate, bool is_csv) { boolresult; @@ -1142,7 +1142,7 @@ CopyReadLine(CopyFromState cstate) cstate->line_buf_valid = false; /* Parse data and transfer into line_buf */ - result = CopyReadLineText(cstate); + result = CopyReadLineText(cstate, is_csv); if (result) { @@ -1209,8 +1209,8 @@ CopyReadLine(CopyFromState cstate) /* * CopyReadLineText - inner loop of CopyReadLine for text mode */ -static bool -CopyReadLineText(CopyFromState cstate) +static inline bool +CopyReadLineText(CopyFromState cstate, bool is_csv) { char *copy_input_buf; int input_buf_ptr; @@ -1226,7 +1226,7 @@ CopyReadLineText(CopyFromState cstate) charquotec = '\0'; charescapec = '\0'; - if (cstate->opts.csv_mode) + if (is_csv) { quotec = cstate->opts.quote[0]; escapec = cstate->opts.escape[0]; @@ -1306,7 +1306,7 @@ CopyReadLineText(CopyFromState cstate) prev_raw_ptr =
Re: Test to dump and restore objects left behind by regression
> On 22 Feb 2024, at 10:16, Peter Eisentraut wrote: > We have somewhat relied on the pg_upgrade test to provide this testing, but > we have recently discovered that the dumps in binary-upgrade mode are > different enough to not test the normal dumps well. > > Yes, this test is a bit expensive. We could save some time by doing the > first dump at the end of the normal regress test and have the pg_dump test > reuse that, but then that would make the regress test run a bit longer. Is > that a better tradeoff? Something this expensive seems like what PG_TEST_EXTRA is intended for, we already have important test suites there. But. We know that the cluster has an interesting state when the pg_upgrade test starts, could we use that to make a dump/restore test before continuing with testing pg_upgrade? It can be argued that pg_upgrade shouldn't be responsible for testing pg_dump, but it's already now a pretty important testcase for pg_dump in binary upgrade mode so it's that far off. If pg_dump has bugs then pg_upgrade risks subtly breaking. When upgrading to the same version, we could perhaps also use this to test a scenario like: Dump A, restore into B, upgrade B into C, dump C and compare C to A. -- Daniel Gustafsson
Re: When extended query protocol ends?
>When splitting a multi insert statement you're going to duplicate some work I do not understand why I am going to duplicate some work. I assume the database does its best to perform all the needed preparation when processing "parse" message, and it should perform only the minimum required work when processing bind+exec messages. Unfortunately, it is not completely the case, so using bind+exec+bind+exec is suboptimal even for trivial insert statements. Please, take into account the following sequence: One-time-only: parse S1 as insert into table(id, name) values(?,?) Some time later: bind S1 ... exec S1 bind S1 ... exec S1 bind S1 ... exec S1 bind S1 ... exec S1 sync I do not know how this could be made more efficient as I execute parse only once, and then I send bind+exec+bind+exec without intermediate sync messages, so the data should flow nicely in TCP packets. As I said above, the same flow for multi-value insert beats the bind+exec+bind+exec sequence at a cost of poor reporting. For instance, for multivalue insert we can't tell how many rows are generated for each statement. --- Here are some measurements regarding savepoints for simple vs extended Sure they are not very scientific, however, they show improvement for simple protocol $ cat svpnt BEGIN; SAVEPOINT PGJDBC_AUTOSAVE; RELEASE SAVEPOINT PGJDBC_AUTOSAVE; COMMIT; $ pgbench -f svpnt --protocol=extended --time=10 --progress=1 -r progress: 1.0 s, 4213.8 tps, lat 0.237 ms stddev 0.034, 0 failed progress: 2.0 s, 4367.9 tps, lat 0.229 ms stddev 0.024, 0 failed progress: 3.0 s, 4296.2 tps, lat 0.233 ms stddev 0.038, 0 failed progress: 4.0 s, 4382.0 tps, lat 0.228 ms stddev 0.026, 0 failed progress: 5.0 s, 4374.1 tps, lat 0.228 ms stddev 0.026, 0 failed progress: 6.0 s, 4305.7 tps, lat 0.232 ms stddev 0.035, 0 failed progress: 7.0 s, 4111.1 tps, lat 0.243 ms stddev 0.182, 0 failed progress: 8.0 s, 4245.0 tps, lat 0.235 ms stddev 0.042, 0 failed progress: 9.0 s, 4219.9 tps, lat 0.237 ms stddev 0.036, 0 failed progress: 10.0 s, 4231.1 tps, lat 0.236 ms stddev 0.031, 0 failed transaction type: svpnt scaling factor: 1 query mode: extended number of clients: 1 number of threads: 1 maximum number of tries: 1 duration: 10 s number of transactions actually processed: 42748 number of failed transactions: 0 (0.000%) latency average = 0.234 ms latency stddev = 0.065 ms initial connection time = 2.178 ms tps = 4275.562760 (without initial connection time) statement latencies in milliseconds and failures: 0.058 0 BEGIN; 0.058 0 SAVEPOINT PGJDBC_AUTOSAVE; 0.058 0 RELEASE SAVEPOINT PGJDBC_AUTOSAVE; 0.060 0 COMMIT; $ pgbench -f svpnt --protocol=simple --time=10 --progress=1 -r progress: 1.0 s, 4417.7 tps, lat 0.225 ms stddev 0.033, 0 failed progress: 2.0 s, 4446.0 tps, lat 0.225 ms stddev 0.079, 0 failed progress: 3.0 s, 4377.1 tps, lat 0.228 ms stddev 0.048, 0 failed progress: 4.0 s, 4485.0 tps, lat 0.223 ms stddev 0.024, 0 failed progress: 5.0 s, 4355.9 tps, lat 0.229 ms stddev 0.353, 0 failed progress: 6.0 s, .3 tps, lat 0.225 ms stddev 0.035, 0 failed progress: 7.0 s, 4530.7 tps, lat 0.220 ms stddev 0.020, 0 failed progress: 8.0 s, 4431.1 tps, lat 0.225 ms stddev 0.022, 0 failed progress: 9.0 s, 4497.1 tps, lat 0.222 ms stddev 0.027, 0 failed progress: 10.0 s, 4507.0 tps, lat 0.222 ms stddev 0.024, 0 failed transaction type: svpnt scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 maximum number of tries: 1 duration: 10 s number of transactions actually processed: 44493 number of failed transactions: 0 (0.000%) latency average = 0.224 ms latency stddev = 0.116 ms initial connection time = 2.690 ms tps = 4450.372095 (without initial connection time) statement latencies in milliseconds and failures: 0.056 0 BEGIN; 0.056 0 SAVEPOINT PGJDBC_AUTOSAVE; 0.056 0 RELEASE SAVEPOINT PGJDBC_AUTOSAVE; 0.057 0 COMMIT; Vladimir
Re: A problem about partitionwise join
On Wed, Feb 21, 2024 at 4:55 PM Richard Guo wrote: > > > On Tue, Aug 2, 2022 at 4:24 AM Jacob Champion wrote: >> >> Once you think you've built up some community support and the patchset >> is ready for review, you (or any interested party) can resurrect the >> patch entry by visiting >> >> https://commitfest.postgresql.org/38/2266/ >> >> and changing the status to "Needs Review", and then changing the >> status again to "Move to next CF". (Don't forget the second step; >> hopefully we will have streamlined this in the near future!) > > > This patch was returned due to 'lack of interest'. However, upon > verification, it appears that the reported issue still exists, and the > proposed fix in the thread remains valid. Hence, resurrect this patch > after rebasing it on master. I've also written a detailed commit > message which hopefully can help people review the changes more > effectively. The concept looks useful. The SQL statement added in the test looks cooked though (it outputs data that has same value for two columns which is equal to primary key of other table - when would somebody do that?). Is there some real life example of this? The patch uses restrictclauses as well as EC's. Tom has proposed to make EC work with outer joins sensibly. Has that happened? Can this patch leverage it rather than having two loops? -- Best Wishes, Ashutosh Bapat
Re: Test to dump and restore objects left behind by regression
On 22.02.24 02:01, Michael Paquier wrote: On Wed, Feb 21, 2024 at 12:18:45PM +0530, Ashutosh Bapat wrote: Even with 1 and 2 the test is useful to detect dump/restore anomalies. I think we should improve 3, but I don't have a good and simpler solution. I didn't find any way to compare two given clusters in our TAP test framework. Building it will be a lot of work. Not sure if it's worth it. + my $rc = + system($ENV{PG_REGRESS} + . " $extra_opts " + . "--dlpath=\"$dlpath\" " + . "--bindir= " + . "--host=" + . $node->host . " " + . "--port=" + . $node->port . " " + . "--schedule=$srcdir/src/test/regress/parallel_schedule " + . "--max-concurrent-tests=20 " + . "--inputdir=\"$inputdir\" " + . "--outputdir=\"$outputdir\""); I am not sure that it is a good idea to add a full regression test cycle while we have already 027_stream_regress.pl that would be enough to test some dump scenarios. These are very expensive and easy to notice even with a high level of parallelization of the tests. The problem is, we don't really have any end-to-end coverage of dump restore dump again compare the two dumps with a database with lots of interesting objects in it. Note that each of these steps could fail. We have somewhat relied on the pg_upgrade test to provide this testing, but we have recently discovered that the dumps in binary-upgrade mode are different enough to not test the normal dumps well. Yes, this test is a bit expensive. We could save some time by doing the first dump at the end of the normal regress test and have the pg_dump test reuse that, but then that would make the regress test run a bit longer. Is that a better tradeoff? I have done some timing tests: master: pg_dump check: 22s pg_dump check -j8: 8s check-world -j8: 2min44s patched: pg_dump check: 34s pg_dump check -j8: 13s check-world -j8: 2min46s So overall it doesn't seem that bad.
Re: When extended query protocol ends?
On Wed, 21 Feb 2024 at 17:07, Vladimir Sitnikov wrote: > From many measurements we know that insert into table(id, name) > values(?,?),(?,?),(?,?) is much more efficient than > sending individual bind-exec-bind-exec-bind-exec-sync messages like "insert > into table(id, name) values(?,?)" > For instance, here are some measurements: > https://www.baeldung.com/spring-jdbc-batch-inserts#performance-comparisons > Based on that measurements I assume there's a non-trivial per-message > overhead. That's quite a different case. When splitting a multi insert statement you're going to duplicate some work, e.g. executor initialization and possibly even planning. But when replacing one Query packet with Parse-Bind-Exec-Sync, these 4 packets are not duplicating such expensive work. The only thing they should be doing extra is a bit of packet parsing, which is very cheap.
Re: Test to dump and restore objects left behind by regression
On Thu, Feb 22, 2024 at 6:32 AM Michael Paquier wrote: > > On Wed, Feb 21, 2024 at 12:18:45PM +0530, Ashutosh Bapat wrote: > > Even with 1 and 2 the test is useful to detect dump/restore anomalies. > > I think we should improve 3, but I don't have a good and simpler > > solution. I didn't find any way to compare two given clusters in our > > TAP test framework. Building it will be a lot of work. Not sure if > > it's worth it. > > + my $rc = > + system($ENV{PG_REGRESS} > + . " $extra_opts " > + . "--dlpath=\"$dlpath\" " > + . "--bindir= " > + . "--host=" > + . $node->host . " " > + . "--port=" > + . $node->port . " " > + . "--schedule=$srcdir/src/test/regress/parallel_schedule " > + . "--max-concurrent-tests=20 " > + . "--inputdir=\"$inputdir\" " > + . "--outputdir=\"$outputdir\""); > > I am not sure that it is a good idea to add a full regression test > cycle while we have already 027_stream_regress.pl that would be enough > to test some dump scenarios. That test *uses* pg_dump as a way to test whether the two clusters are in sync. The test might change in future to use some other method to make sure the two clusters are consistent. Adding the test here to that test will make that change much harder. It's not the dump, but restore, we are interested in here. No test that runs PG_REGRESS also runs pg_restore in non-binary mode. Also we need to keep this test near other pg_dump tests, not far from them. > These are very expensive and easy to > notice even with a high level of parallelization of the tests. I agree, but I didn't find a suitable test to ride on. -- Best Wishes, Ashutosh Bapat
Re: Removing unneeded self joins
On 21/2/2024 14:26, Richard Guo wrote: I think the right fix for these issues is to introduce a new element 'sublevels_up' in ReplaceVarnoContext, and enhance replace_varno_walker to 1) recurse into subselects with sublevels_up increased, and 2) perform the replacement only when varlevelsup is equal to sublevels_up. This code looks good. No idea how we have lost it before. While writing the fix, I noticed some outdated comments. Such as in remove_rel_from_query, the first for loop updates otherrel's attr_needed as well as lateral_vars, but the comment only mentions attr_needed. So this patch also fixes some outdated comments. Thanks, looks good. While writing the test cases, I found that the test cases for SJE are quite messy. Below are what I have noticed: * There are several test cases using catalog tables like pg_class, pg_stats, pg_index, etc. for testing join removal. I don't see a reason why we need to use catalog tables, and I think this just raises the risk of instability. I see only one unusual query with the pg_class involved. * In many test cases, a mix of uppercase and lowercase keywords is used in one query. I think it'd better to maintain consistency by using either all uppercase or all lowercase keywords in one query. I see uppercase -> lowercase change: select t1.*, t2.a as ax from sj t1 join sj t2 and lowercase -> uppercase in many other cases: explain (costs off) I guess it is a matter of taste, so give up for the committer decision. Technically, it's OK. * In most situations, we verify the plan and the output of a query like: explain (costs off) select ...; select ...; The two select queries are supposed to be the same. But in the SJE test cases, I have noticed instances where the two select queries differ from each other. This patch also includes some cosmetic tweaks for SJE test cases. It does not change the test cases using catalog tables though. I think that should be a seperate patch. I can't assess the necessity of changing these dozens of lines of code because I follow another commenting style, but technically, it's still OK. -- regards, Andrei Lepikhov Postgres Professional
Re: Reduce useless changes before reassembly during logical replication
Hi, > > This is all true but note that in successful cases (where the table is > published) all the work done by FilterByTable(accessing caches, > transaction-related stuff) can add noticeable overhead as anyway we do > that later in pgoutput_change(). I think I gave the same comment > earlier as well but didn't see any satisfactory answer or performance > data for successful cases to back this proposal. I did some benchmark yesterday at [1] and found it adds 20% cpu time. then come out a basic idea, I think it deserves a share. "transaction related stuff" comes from the syscache/systable access except the HistorySansphot. and the syscache is required in the following sistuations: 1. relfilenode (from wal) -> relid. 2. relid -> namespaceid (to check if the relid is a toast relation). 3. if toast, get its origianl relid. 4. access the data from pg_publication_tables. 5. see if the relid is a partition, if yes, we may get its root relation. Acutally we already has a RelationSyncCache for #4, and it *only* need to access syscache when replicate_valid is false, I think this case should be rare, but the caller doesn't know it, so the caller must prepare the transaction stuff in advance even in the most case they are not used. So I think we can get a optimization here. then the attached patch is made. Author: yizhi.fzh Date: Wed Feb 21 18:40:03 2024 +0800 Make get_rel_sync_entry less depending on transaction state. get_rel_sync_entry needs transaction only a replicate_valid = false entry is found, this should be some rare case. However the caller can't know if a entry is valid, so they have to prepare the transaction state before calling this function. Such preparation is expensive. This patch makes the get_rel_sync_entry can manage a transaction stage only if necessary. so the callers don't need to prepare it blindly. Then comes to #1, acutally we have RelfilenumberMapHash as a cache, when the cache is hit (suppose this is a usual case), no transaction stuff related. I have two ideas then: 1. Optimize the cache hit sistuation like what we just did for get_rel_sync_entry for the all the 5 kinds of data and only pay the effort for cache miss case. for the data for #2, #3, #5, all the keys are relid, so I think a same HTAB should be OK. 2. add the content for #1, #2, #3, #5 to wal when wal_level is set to logical. In either case, the changes for get_rel_sync_entry should be needed. > Note, users can > configure to stream_in_progress transactions in which case they > shouldn't see such a big problem. People would see the changes is spilled to disk, but the CPU cost for Reorder should be still paid I think. [1] https://www.postgresql.org/message-id/87o7cadqj3.fsf%40163.com -- Best Regards Andy Fan >From 90b8df330df049bd1d7e881dc6e9b108c17b0924 Mon Sep 17 00:00:00 2001 From: "yizhi.fzh" Date: Wed, 21 Feb 2024 18:40:03 +0800 Subject: [PATCH v1 1/1] Make get_rel_sync_entry less depending on transaction state. get_rel_sync_entry needs transaction only a replicate_valid = false entry is found, this should be some rare case. However the caller can't know if a entry is valid, so they have to prepare the transaction state before calling this function. Such preparation is expensive. This patch makes the get_rel_sync_entry can manage a transaction stage only if necessary. so the callers don't need to prepare it blindly. --- src/backend/replication/pgoutput/pgoutput.c | 60 - 1 file changed, 59 insertions(+), 1 deletion(-) diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index 998f92d671..25e55590a2 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -13,6 +13,7 @@ #include "postgres.h" #include "access/tupconvert.h" +#include "access/relation.h" #include "catalog/partition.h" #include "catalog/pg_publication.h" #include "catalog/pg_publication_rel.h" @@ -214,6 +215,11 @@ static void init_rel_sync_cache(MemoryContext cachectx); static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit); static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Relation relation); +static RelationSyncEntry *get_rel_sync_entry_by_relid(PGOutputData *data, + Oid relid); +static RelationSyncEntry *get_rel_sync_entry_internal(PGOutputData *data, + Relation relation, + Oid relid); static void rel_sync_cache_relation_cb(Datum arg, Oid relid); static void rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue); @@ -1962,11 +1968,29 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) */ static RelationSyncEntry * get_rel_sync_entry(PGOutputData *data, Relation relation) +{ + return get_rel_sync_entry_internal(data, relation, InvalidOid); +} + +static RelationSyncEntry * +__attribute__ ((unused))
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi, On Wed, Feb 21, 2024 at 08:10:00PM +0530, Bharath Rupireddy wrote: > On Wed, Feb 21, 2024 at 5:55 PM Bertrand Drouvot > wrote: > > My initial thought was to put "conflict" value in this new field in case of > > conflict (not to mention the conflict reason in it). With the current > > proposal > > invalidation_reason could report the same as conflict_reason, which sounds > > weird > > to me. > > > > Does that make sense to you to use "conflict" as value in > > "invalidation_reason" > > when the slot has "conflict_reason" not NULL? > > I'm thinking the other way around - how about we revert > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=007693f2a3ac2ac19affcb03ad43cdb36ccff5b5, > that is, put in place "conflict" as a boolean and introduce > invalidation_reason the text form. So, for logical slots, whenever the > "conflict" column is true, the reason is found in invaldiation_reason > column? How does it sound? Yeah, I think that looks fine too. We would need more change (like take care of ddd5f4f54a for example). CC'ing Amit, Hou-San and Shveta to get their point of view (as the ones behind 007693f2a3 and ddd5f4f54a). Regarding, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Injection points: some tools to wait and wake
Hi, On Thu, Feb 22, 2024 at 12:02:01PM +0900, Michael Paquier wrote: > On Wed, Feb 21, 2024 at 11:50:21AM +, Bertrand Drouvot wrote: > > A few comments: > > > > 1 === > > I think "up" is missing at several places in the patch where "wake" is used. > > I could be wrong as non native english speaker though. > > Patched up three places to be more consisten. Thanks! > > 5 === > > > > +PG_FUNCTION_INFO_V1(injection_points_wakeup); > > +Datum > > +injection_points_wakeup(PG_FUNCTION_ARGS) > > > > Empty line missing before "Datum"? > > I've used the same style as anywhere else. humm, looking at src/test/regress/regress.c for example, I can see an empty line between PG_FUNCTION_INFO_V1 and Datum for all the "PG_FUNCTION_INFO_V1" ones. > > While at it, should we add a second injection wait point in > > t/041_invalid_checkpoint_after_promote.pl to check that they are wake up > > individually? > > I'm not sure that's a good idea for the sake of this test, TBH, as > that would provide coverage outside the original scope of the > restartpoint/promote check. Yeah, that was my doubt too. > I have also looked at if it would be possible to get an isolation test > out of that, but the arbitrary wait does not make that possible with > the existing isolation APIs, see GetSafeSnapshotBlockingPids() used in > pg_isolation_test_session_is_blocked(). isolation/README seems to be > a bit off here, actually, mentioning pg_locks.. We could use some > tricks with transactions or locks internally, but I'm not really > tempted to make the wait callback more complicated for the sake of > more coverage as I'd rather keep it generic for anybody who wants to > control the order of events across processes. Makes sense to me, let's keep it as it is. > > Attaching a v3 for reference with everything that has been mentioned > up to now. Thanks! Sorry, I missed those ones previously: 1 === +/* Maximum number of wait usable in injection points at once */ s/Maximum number of wait/Maximum number of waits/ ? 2 === +# Check the logs that the restart point has started on standby. This is +# optional, but let's be sure. +my $log = slurp_file($node_standby->logfile, $logstart); +my $checkpoint_start = 0; +if ($log =~ m/restartpoint starting: immediate wait/) +{ + $checkpoint_start = 1; +} +is($checkpoint_start, 1, 'restartpoint has started'); what about? ok( $node_standby->log_contains( "restartpoint starting: immediate wait", $logstart), "restartpoint has started"); Except for the above, v3 looks good to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com