Re: Synchronizing slots from primary to standby
On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila wrote:
>
> Thanks for noticing this. I have pushed all your debug patches. Let's
> hope that if there is a BF failure next time, we can gather enough
> information to know the reason for it.
>

There is a new BF failure [1] after adding these LOGs and I think I know what is going wrong. First, let's look at the standby's LOGs:

2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG: segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4 available
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG: xmin required by slots: data 0, catalog 741
2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG: could not sync slot information as remote slot precedes local slot: remote slot "lsub1_slot": LSN (0/4000168), catalog xmin (739) local slot: LSN (0/4000168), catalog xmin (741)

So, from the above LOG, it is clear that the remote slot's catalog xmin (739) precedes the local catalog xmin (741), which prevents the sync on the standby from completing. Next, let's look at the LOG from the primary around the same time:

2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG: analyzing "pg_catalog.pg_depend"
2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG: "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0 dead rows; 1709 rows in sample, 1709 estimated total rows
...
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG: Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes, cost_limit=200, cost_delay=2 active=yes failsafe=no)
2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG: analyzing "information_schema.sql_features"
2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG: "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0 dead rows; 756 rows in sample, 756 estimated total rows

This shows that an autovacuum worker has analyzed catalog tables, and to update their statistics in the pg_statistic table it would have acquired a new transaction id. So, after the slot creation, a new transaction id that updated the catalog was generated on the primary and would have been replicated to the standby. Due to this, the catalog_xmin of the primary's slot precedes the standby's catalog_xmin, and we see this failure. As per this theory, we should disable autovacuum on the primary to avoid updates to catalog_xmin values.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-02-16%2006%3A12%3A59

--
With Regards,
Amit Kapila.
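If we go this route, the fix in the test would amount to a one-line configuration change on the primary, along these lines (a sketch only; where exactly the TAP test applies it is up to the patch author):

```
# postgresql.conf fragment for the primary node in the 040 test (sketch)
autovacuum = off
```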
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Wed, Feb 14, 2024 at 6:59 AM Jeff Davis wrote:
>
> Attached 2 patches.
>
> Per Andres's suggestion, 0001 adds an:
> Assert(startptr + count <= LogwrtResult.Write)
>
> Though if we want to allow the caller (e.g. in an extension) to
> determine the valid range, perhaps using WaitXLogInsertionsToFinish(),
> then the check is wrong.

Right.

> Maybe we should just get rid of that code
> entirely and trust the caller to request a reasonable range?

I'd suggest we strike a balance here - error out in assert builds if startptr + count is past the current insert position, and trust the callers in production builds. It has a couple of advantages over doing just Assert(startptr + count <= LogwrtResult.Write): 1) It allows the caller to read unflushed WAL directly from WAL buffers, see the attached 0005 for an example. 2) All the existing callers where WALReadFromBuffers() is expected to be used ensure WAL availability by reading up to the flush position, so there is no problem with it.

Also, I added a note before WALRead() stating that the caller must request only WAL that has already been written out (up to LogwrtResult.Write). I'm not so sure about this; perhaps we don't need this comment at all.

Here, I'm attaching the v23 patch set:

0001 - Adds an assertion in WALReadFromBuffers() to ensure the requested WAL isn't beyond the current insert position.
0002 - Adds a new test module to demonstrate how one can use WALReadFromBuffers(), calling WaitXLogInsertionsToFinish() if need be.
0003 - Uses WALReadFromBuffers() in more places like logical walsenders and backends.
0004 - Removes zero-padding related stuff as discussed in https://www.postgresql.org/message-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3=aeezpjmqhpeobg...@mail.gmail.com. This is needed in this patch set, otherwise the assertion added in 0001 fails after 0003.
0005 - Adds a page_read callback for reading from WAL buffers in the new test module added in 0002. Also adds tests.

Thoughts?
-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 5c0f95acda494904d02593e6ba305717b61c44b5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Fri, 16 Feb 2024 06:54:53 +
Subject: [PATCH v23 2/5] Add test module for verifying read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 ++
 .../modules/read_wal_from_buffers/Makefile    | 23 ++
 .../modules/read_wal_from_buffers/meson.build | 33 +
 .../read_wal_from_buffers--1.0.sql            | 14 
 .../read_wal_from_buffers.c                   | 54 ++
 .../read_wal_from_buffers.control             |  4 ++
 .../read_wal_from_buffers/t/001_basic.pl      | 72 +++
 9 files changed, 206 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..864a3dd72b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		dummy_seclabel \
 		libpq_pipeline \
 		plsample \
+		read_wal_from_buffers \
 		spgist_name_ops \
 		test_bloomfilter \
 		test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..4f3dd69e58 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -33,6 +33,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 00..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
Re: Fix a typo in pg_rotate_logfile
On Thu, Feb 15, 2024 at 2:18 AM Daniel Gustafsson wrote: > > >>> Searching on Github and Debian Codesearch I cannot find any reference to > >>> anyone > >>> using any function from adminpack. With pgAdminIII being EOL it might be > >>> to > >>> remove it now rather than be on the hook to maintain it for another 5 > >>> years > >>> until v17 goes EOL. It'll still be around for years in V16->. > >> > >> Works for me. > >> > >>> Attached is a diff to show what it would look like to remove adminpack > >>> (catalog > >>> version bump omitted on purpose to avoid conflicts until commit). > >> > >> I don't see any references you missed, so +1. > > > > Seems reasonable to me, too. > > Thanks! I'll put this in the next CF to keep it open for comments a bit > longer, but will close it early in the CF. LGTM. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi,

On Fri, Feb 16, 2024 at 12:32:45AM +, Zhijie Hou (Fujitsu) wrote:
> Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it
> could provide more information about communication between primary and standby.
> This also doesn't increase noticeable testing time on my machine for debug
> build.

Same here, and there is no big diff as far as the amount of log generated:

Without the patch:

$ du -sh ./src/test/recovery/tmp_check/log/*040*
4.0K	./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_cascading_standby.log
24K	./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_publisher.log
16K	./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_standby1.log
4.0K	./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_subscriber1.log
12K	./src/test/recovery/tmp_check/log/regress_log_040_standby_failover_slots_sync

With the patch:

$ du -sh ./src/test/recovery/tmp_check/log/*040*
4.0K	./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_cascading_standby.log
36K	./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_publisher.log
48K	./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_standby1.log
4.0K	./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_subscriber1.log
12K	./src/test/recovery/tmp_check/log/regress_log_040_standby_failover_slots_sync

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
RE: speed up a logical replica setup
Dear Euler, Thanks for updating the patch! Before reviewing deeply, here are replies for your comments. > > > Points raised by me [1] are not solved yet. > > > > * What if the target version is PG16-? pg_ctl and pg_resetwal won't work. $ pg_ctl start -D /tmp/blah waiting for server to start 2024-02-15 23:50:03.448 -03 [364610] FATAL: database files are incompatible with server 2024-02-15 23:50:03.448 -03 [364610] DETAIL: The data directory was initialized by PostgreSQL version 16, which is not compatible with this version 17devel. stopped waiting pg_ctl: could not start server Examine the log output. $ pg_resetwal -D /tmp/blah pg_resetwal: error: data directory is of wrong version pg_resetwal: detail: File "PG_VERSION" contains "16", which is not compatible with this program's version "17". > > * What if the found executables have different version from > > pg_createsubscriber? The new code takes care of it. > I preferred to have a common path and test one by one, but agreed this worked well. Let's keep it and hear opinions from others. > > > * What if the target is sending WAL to another server? > >I.e., there are clusters like `node1->node2->node3`, and the target is > > node2. The new code detects if the server is in recovery and aborts as you suggested. A new option can be added to ignore the fact there are servers receiving WAL from it. > Confirmed it can detect. > > > * Can we really clean up the standby in case of failure? > >Shouldn't we suggest to remove the target once? If it finishes the promotion, no. I adjusted the cleanup routine a bit to avoid it. However, we should provide instructions to inform the user that it should create a fresh standby and try again. > I think the cleanup function looks not sufficient. In v21, recovery_ended is kept true even after drop_publication() is done in setup_subscriber(). I think that made_subscription is not needed, and the function should output some messages when recovery_ended is on. 
Besides, in the case of pg_upgrade, it always reports the messages below before starting the migration. I think this is more helpful for users. ``` pg_log(PG_REPORT, "\n" "If pg_upgrade fails after this point, you must re-initdb the\n" "new cluster before continuing."); ``` > > > * Can we move outputs to stdout? Are you suggesting to use another logging framework? It is not a good idea because each client program is already using common/logging.c. > Hmm, indeed. Other programs in pg_basebackup seem to use the framework. > v20-0011: Do we really want to interrupt the recovery if the network was momentarily interrupted or if the OS killed walsender? Recovery is critical for the process. I think we should do our best to be resilient and deliver all changes required by the new subscriber. > It might be too strict to raise an ERROR as soon as we meet a disconnection. And at least 0011 was wrong - it should be PQgetvalue(res, 0, 1) for still_alive. > The proposal is not correct because the query returns no tuples if it is disconnected, so you cannot PQgetvalue(). > Sorry for misunderstanding, but you might be confused. pg_createsubscriber sends a query to the target, and we are discussing the disconnection between the target and source instances. I think the connection which pg_createsubscriber has is still alive, so PQgetvalue() can get a value. (BTW, callers of server_is_in_recovery() have not considered a disconnection from the target...) > The retry interval (the time that ServerLoop() will create another walreceiver) is determined by DetermineSleepTime() and it is a maximum of 5 seconds (SIGKILL_CHILDREN_AFTER_SECS). One idea is to retry 2 or 3 times before giving up, using the pg_stat_wal_receiver query. Do you have a better plan? > It's good to determine the threshold. It can define the number of acceptable losses of the walreceiver during the loop. I think we should retry at least until the postmaster revives the walreceiver. 
The check would run once per second, so more than 5 (or 10) may be better. Thoughts?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/
Re: 035_standby_logical_decoding unbounded hang
Hi,

On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote:
> On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote:
> > What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
> > and then make use of it in create_logical_slot_on_standby() and above?
> > (something like wait_for_restart_lsn_calculation-v1.patch attached).
>
> Waiting for restart_lsn is just a prerequisite for calling
> pg_log_standby_snapshot(), so I wouldn't separate those two.

Yeah, makes sense.

> If we're extracting a sub, I would move the pg_log_standby_snapshot() call
> into the sub and make the API like one of these:
>
> $standby->wait_for_subscription_starting_point($primary, $slot_name);
> $primary->log_standby_snapshot($standby, $slot_name);
>
> Would you like to finish the patch in such a way?

Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used
log_standby_snapshot() as it seems more generic to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 07da74cf56..5d3174d0b5 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3181,6 +3181,37 @@ $SIG{TERM} = $SIG{INT} = sub {
 
 =pod
 
+=item $node->log_standby_snapshot(self, standby, slot_name)
+
+Log a standby snapshot on primary once the slot restart_lsn is determined on
+the standby.
+
+=cut
+
+sub log_standby_snapshot
+{
+	my ($self, $standby, $slot_name) = @_;
+	my ($stdout, $stderr);
+
+	# Once the slot's restart_lsn is determined, the standby looks for
+	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
+	# until the slot restart_lsn is determined.
+
+	$standby->poll_query_until(
+		'postgres', qq[
+		SELECT restart_lsn IS NOT NULL
+		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
+	])
+	  or die
+	  "timed out waiting for logical slot to calculate its restart_lsn";
+
+	# Then arrange for the xl_running_xacts record for which the standby is
+	# waiting.
+	$self->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');
+}
+
+=pod
+
 =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname)
 
 Create logical replication slot on given standby
@@ -3206,21 +3237,9 @@ sub create_logical_slot_on_standby
 		'2>', \$stderr);
 
-	# Once the slot's restart_lsn is determined, the standby looks for
-	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
-	# until the slot restart_lsn is determined.
-
-	$self->poll_query_until(
-		'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
-	])
-	  or die
-	  "timed out waiting for logical slot to calculate its restart_lsn";
-
-	# Then arrange for the xl_running_xacts record for which pg_recvlogical is
+	# Arrange for the xl_running_xacts record for which pg_recvlogical is
 	# waiting.
-	$primary->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');
+	$primary->log_standby_snapshot($self, $slot_name);
 
 	$handle->finish();
 
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index b020361b29..0710bccc17 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -465,8 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n";
 
 $psql_subscriber{run}->pump_nb();
 
-# Speed up the subscription creation
-$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+# Log the standby snapshot to speed up the subscription creation
+$node_primary->log_standby_snapshot($node_standby, 'tap_sub');
 
 # Explicitly shut down psql instance gracefully - to avoid hangs
 # or worse on windows
Re: Synchronizing slots from primary to standby
On Fri, Feb 16, 2024 at 11:12 AM Zhijie Hou (Fujitsu) wrote:
>
> On Friday, February 16, 2024 8:33 AM Zhijie Hou (Fujitsu) wrote:
> > > Yeah, we can consider outputting some information via this function
> > > like how many slots are synced and persisted but not sure what would
> > > be appropriate here. Because one can anyway find that or more
> > > information by querying pg_replication_slots. I think we can keep
> > > discussing what makes more sense as a return value but let's commit
> > > the debug/log patches as they will be helpful to analyze BF failures
> > > or any other issues reported.
> >
> > Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it
> > could provide more information about communication between primary and
> > standby. This also doesn't increase noticeable testing time on my machine
> > for debug build.
>
> Sorry, there was a miss in the DEBUG message, I should have used
> UINT64_FORMAT for XLogSegNo (uint64) instead of %ld. Here is a small patch
> to fix this.
>

Thanks for noticing this. I have pushed all your debug patches. Let's hope that if there is a BF failure next time, we can gather enough information to know the reason for it.

--
With Regards,
Amit Kapila.
RE: Synchronizing slots from primary to standby
On Friday, February 16, 2024 8:33 AM Zhijie Hou (Fujitsu) wrote: > On Thursday, February 15, 2024 8:29 PM Amit Kapila > wrote: > > > > > On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot > > wrote: > > > > > > On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote: > > > > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > Attach the v2 patch here. > > > > > > > > > > Apart from the new log message. I think we can add one more > > > > > debug message in reserve_wal_for_local_slot, this could be > > > > > useful to analyze the > > failure. > > > > > > > > Yeah, that can also be helpful, but the added message looks naive to me. > > > > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, > > > > + segno); > > > > > > > > Instead of the above, how about something like: "segno: %ld of > > > > purposed restart_lsn for the synced slot, oldest_segno: %ld > > > > available"? > > > > > > Looks good to me. I'm not sure if it would make more sense to elog > > > only if segno < oldest_segno means just before the > > XLogSegNoOffsetToRecPtr() call? > > > > > > But I'm fine with the proposed location too. > > > > > > > I am also fine either way but the current location gives required > > information in more number of cases and could be helpful in debugging this > new facility. > > > > > > > > > > > And we > > > > > can also enable the DEBUG log in the 040 tap-test, I see we have > > > > > similar setting in 010_logical_decoding_timline and logging > > > > > debug1 message doesn't increase noticable time on my machine. > > > > > These are done > > in 0002. > > > > > > > > > > > > > I haven't tested it but I think this can help in debugging BF > > > > failures, if any. I am not sure if to keep it always like that but > > > > till the time these tests are stabilized, this sounds like a good > > > > idea. So, how, about just making test changes as a separate patch > > > > so that later if required we can revert/remove it easily? 
> > > > Bertrand, do you have any thoughts on this? > > > > > > +1 on having DEBUG log in the 040 tap-test until it's stabilized (I > > > +think we > > > took the same approach for 035_standby_logical_decoding.pl IIRC) and > > > then revert it back. > > > > > > > Good to know! > > > > > Also I was thinking: what about adding an output to > > pg_sync_replication_slots()? > > > The output could be the number of sync slots that have been created > > > and are not considered as sync-ready during the execution. > > > > > > > Yeah, we can consider outputting some information via this function > > like how many slots are synced and persisted but not sure what would > > be appropriate here. Because one can anyway find that or more > > information by querying pg_replication_slots. I think we can keep > > discussing what makes more sense as a return value but let's commit > > the debug/log patches as they will be helpful to analyze BF failures or any > other issues reported. > > Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it > could > provide more information about communication between primary and standby. > This also doesn't increase noticeable testing time on my machine for debug > build. Sorry, there was a miss in the DEBUG message, I should have used UINT64_FORMAT for XLogSegNo(uint64) instead of %ld. Here is a small patch to fix this. Best Regards, Hou zj 0001-fix-the-format-for-uint64.patch Description: 0001-fix-the-format-for-uint64.patch
Re: POC, WIP: OR-clause support for indexes
On 16/2/2024 07:00, jian he wrote:
> On Wed, Feb 14, 2024 at 11:21 AM Andrei Lepikhov wrote:
> My OS: Ubuntu 22.04.3 LTS
> I already set max_parallel_workers_per_gather to 10. So for all cases, it
> should use parallelism first?
> A better question would be: how to make the number of ORs less than 29
> still faster when enable_or_transformation is ON by only setting parameters?

In my test environment this example gives some subtle advantage to ORs over ANY with only 3 ORs or fewer. Please provide EXPLAIN ANALYZE results for the case you want to discuss here:
1. with enable_or_transformation enabled
2. with enable_or_transformation disabled
3. with enable_or_transformation disabled but with the OR -> ANY transformation done manually, to check the overhead of this optimization.

--
regards, Andrei Lepikhov
Postgres Professional
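For reference, the manual rewrite asked for in point 3 is the same transformation the patch performs automatically; with an illustrative table t and column a it looks like this:

```sql
-- Original OR form (illustrative table and values):
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 OR a = 2 OR a = 3;

-- Manually rewritten to the ANY form that enable_or_transformation would
-- produce, to measure the transformation's planning overhead in isolation:
EXPLAIN ANALYZE SELECT * FROM t WHERE a = ANY (ARRAY[1, 2, 3]);
```

Comparing the three plans separates the cost of doing the transformation in the planner from the benefit of executing the ANY form.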
RE: pg_upgrade and logical replication
Dear Vignesh, Thanks for reviewing! PSA new version. > > Thanks for the updated patch, few suggestions: > 1) Can we use a new publication for this subscription too so that the > publication and subscription naming will become consistent throughout > the test case: > +# Table will be in 'd' (data is being copied) state as table sync will fail > +# because of primary key constraint error. > +my $started_query = > + "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd'"; > +$old_sub->poll_query_until('postgres', $started_query) > + or die > + "Timed out while waiting for the table state to become 'd' (datasync)"; > + > +# Create another subscription and drop the subscription's replication origin > +$old_sub->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_sub3 CONNECTION '$connstr' > PUBLICATION regress_pub2 WITH (enabled = false)" > +); > > So after the change it will become like subscription regress_sub3 for > publication regress_pub3, subscription regress_sub4 for publication > regress_pub4 and subscription regress_sub5 for publication > regress_pub5. A new publication was defined. > 2) The tab_upgraded1 table can be created along with create > publication and create subscription itself: > $publisher->safe_psql('postgres', > "CREATE PUBLICATION regress_pub3 FOR TABLE tab_upgraded1"); > $old_sub->safe_psql('postgres', > "CREATE SUBSCRIPTION regress_sub4 CONNECTION '$connstr' PUBLICATION > regress_pub3 WITH (failover = true)" > ); The definition of tab_upgraded1 was moved to the place you pointed. 
> 3) The tab_upgraded2 table can be created along with create > publication and create subscription itself to keep it consistent: > $publisher->safe_psql('postgres', > - "ALTER PUBLICATION regress_pub2 ADD TABLE tab_upgraded2"); > + "CREATE PUBLICATION regress_pub4 FOR TABLE tab_upgraded2"); > $old_sub->safe_psql('postgres', > - "ALTER SUBSCRIPTION regress_sub2 REFRESH PUBLICATION"); > + "CREATE SUBSCRIPTION regress_sub5 CONNECTION '$connstr' > PUBLICATION regress_pub4" > +); Ditto. > With above fixes, the following can be removed: > # Initial setup > $publisher->safe_psql( > 'postgres', qq[ > CREATE TABLE tab_upgraded1(id int); > CREATE TABLE tab_upgraded2(id int); > ]); > $old_sub->safe_psql( > 'postgres', qq[ > CREATE TABLE tab_upgraded1(id int); > CREATE TABLE tab_upgraded2(id int); > ]); Yes, earlier definitions were removed instead. Also, some comments were adjusted based on these fixes. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ v4-0001-Fix-testcase.patch Description: v4-0001-Fix-testcase.patch
Re: pg_upgrade and logical replication
On Fri, 16 Feb 2024 at 08:22, Hayato Kuroda (Fujitsu) wrote: > > Dear Amit, > > > This sounds like a reasonable way to address the reported problem. > > OK, thanks! > > > Justin, do let me know if you think otherwise? > > > > Comment: > > === > > * > > -# Setup an enabled subscription to verify that the running status and > > failover > > -# option are retained after the upgrade. > > +# Setup a subscription to verify that the failover option are retained > > after > > +# the upgrade. > > $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1"); > > $old_sub->safe_psql('postgres', > > - "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION > > regress_pub1 WITH (failover = true)" > > + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' > > PUBLICATION > > regress_pub1 WITH (failover = true, enabled = false)" > > ); > > > > I think it is better not to create a subscription in the early stage > > which we wanted to use for the success case. Let's have separate > > subscriptions for failure and success cases. I think that will avoid > > the newly added ALTER statements in the patch. > > I made a patch to avoid creating objects as much as possible, but it > may lead some confusion. I recreated a patch for creating pub/sub > and dropping them at cleanup for every test cases. > > PSA a new version. Thanks for the updated patch, few suggestions: 1) Can we use a new publication for this subscription too so that the publication and subscription naming will become consistent throughout the test case: +# Table will be in 'd' (data is being copied) state as table sync will fail +# because of primary key constraint error. 
+my $started_query = + "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd'"; +$old_sub->poll_query_until('postgres', $started_query) + or die + "Timed out while waiting for the table state to become 'd' (datasync)"; + +# Create another subscription and drop the subscription's replication origin +$old_sub->safe_psql('postgres', + "CREATE SUBSCRIPTION regress_sub3 CONNECTION '$connstr' PUBLICATION regress_pub2 WITH (enabled = false)" +); So after the change it will become like subscription regress_sub3 for publication regress_pub3, subscription regress_sub4 for publication regress_pub4 and subscription regress_sub5 for publication regress_pub5. 2) The tab_upgraded1 table can be created along with create publication and create subscription itself: $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub3 FOR TABLE tab_upgraded1"); $old_sub->safe_psql('postgres', "CREATE SUBSCRIPTION regress_sub4 CONNECTION '$connstr' PUBLICATION regress_pub3 WITH (failover = true)" ); 3) The tab_upgraded2 table can be created along with create publication and create subscription itself to keep it consistent: $publisher->safe_psql('postgres', - "ALTER PUBLICATION regress_pub2 ADD TABLE tab_upgraded2"); + "CREATE PUBLICATION regress_pub4 FOR TABLE tab_upgraded2"); $old_sub->safe_psql('postgres', - "ALTER SUBSCRIPTION regress_sub2 REFRESH PUBLICATION"); + "CREATE SUBSCRIPTION regress_sub5 CONNECTION '$connstr' PUBLICATION regress_pub4" +); With above fixes, the following can be removed: # Initial setup $publisher->safe_psql( 'postgres', qq[ CREATE TABLE tab_upgraded1(id int); CREATE TABLE tab_upgraded2(id int); ]); $old_sub->safe_psql( 'postgres', qq[ CREATE TABLE tab_upgraded1(id int); CREATE TABLE tab_upgraded2(id int); ]); Regards, Vignesh
PGC_SIGHUP shared_buffers?
Hi,

I remember Magnus making a comment many years ago to the effect that every setting that is PGC_POSTMASTER is a bug, but some of those bugs are very difficult to fix. Perhaps the use of the word bug is arguable, but I think the sentiment is apt, especially with regard to shared_buffers. Changing it without a server restart would be really nice, but it's hard to figure out how to do it. I can think of a few basic approaches, and I'd like to know (a) which ones people think are good and which ones people think suck (maybe they all suck) and (b) if anybody's got any other ideas not mentioned here.

1. Complicate the Buffer->pointer mapping. Right now, BufferGetBlock() is basically just BufferBlocks + (buffer - 1) * BLCKSZ, which means that we're expecting to find all of the buffers in a single giant array. Years ago, somebody proposed changing the implementation to essentially WhereIsTheBuffer[buffer], which was heavily criticized on performance grounds, because it requires an extra memory access. A gentler version of this might be something like WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; i.e. instead of allowing every single buffer to be at some random address, manage chunks of the buffer pool. This makes the lookup array potentially quite a lot smaller, which might mitigate performance concerns. For example, if you had one chunk per GB of shared_buffers, your mapping array would need only a handful of cache lines, or a few handfuls on really big systems. (I am here ignoring the difficulties of how to orchestrate addition of or removal of chunks as a SMOP[1]. Feel free to criticize that hand-waving, but as of this writing, I feel like moderate determination would suffice.)

2. Make a Buffer just a disguised pointer. Imagine something like typedef struct { Page bp; } *Buffer. With this approach, BufferGetBlock() becomes trivial. The tricky part with this approach is that you still need a cheap way of finding the buffer header.
What I imagine might work here is to again have some kind of chunked representation of shared_buffers, where each chunk contains a bunch of buffer headers at, say, the beginning, followed by a bunch of buffers. Theoretically, if the chunks are sufficiently strongly aligned, you can figure out what offset you're at within the chunk without any additional information and the whole process of locating the buffer header is just math, with no memory access. But in practice, getting the chunks to be sufficiently strongly aligned sounds hard, and this also makes a Buffer 64 bits rather than the current 32. A variant on this concept might be to make the Buffer even wider and include two pointers in it i.e. typedef struct { Page bp; BufferDesc *bd; } Buffer.

3. Reserve lots of address space and then only use some of it. I hear rumors that some forks of PG have implemented something like this. The idea is that you convince the OS to give you a whole bunch of address space, but you try to avoid having all of it be backed by physical memory. If you later want to increase shared_buffers, you then get the OS to back more of it by physical memory, and if you later want to decrease shared_buffers, you hopefully have some way of giving the OS the memory back. As compared with the previous two approaches, this seems less likely to be noticeable to most PG code. Problems include (1) you have to somehow figure out how much address space to reserve, and that forms an upper bound on how big shared_buffers can grow at runtime and (2) you have to figure out ways to reserve address space and back more or less of it with physical memory that will work on all of the platforms that we currently support or might want to support in the future.

4. Give up on actually changing the size of shared_buffers per se, but stick some kind of resizable secondary cache in front of it. Data that is going to be manipulated gets brought into a (perhaps small?)
"real" shared_buffers that behaves just like today, but you have some larger data structure which is designed to be easier to resize and maybe simpler in some other ways that sits between shared_buffers and the OS cache. This doesn't seem super-appealing because it requires a lot of data copying, but maybe it's worth considering as a last resort. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com [1] https://en.wikipedia.org/wiki/Small_matter_of_programming
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada wrote: > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > > reduce memory regressions, so should be considered. Up to 3 tids can > > be stored in the last level child pointer. It's not polished, but I'll > > only proceed with that if we think we need this. "flags" is called > > that because it could hold tidbitmap.c booleans (recheck, lossy) in > > the future, in addition to reserving space for the pointer tag. Note: > > I hacked the tests to only have 2 offsets per block to demo, but of > > course both paths should be tested. > > Interesting. I've run the same benchmark tests we did[1][2] (the > median of 3 runs): > > monotonically ordered int column index: > > master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s > v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s > v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s Hmm, that's strange -- this test is intended to delete all records from the last 20% of the blocks, so I wouldn't expect any improvement here, only in the sparse case. Maybe something is wrong. All the more reason to put it off... > I'm happy to see a huge improvement. While it's really fascinating to > me, I'm concerned about the time left until the feature freeze. We > need to polish both tidstore and vacuum integration patches in 5 > weeks. Personally I'd like to have it as a separate patch for now, and > focus on completing the main three patches since we might face some > issues after pushing these patches. I think with 0007 patch it's a big > win but it's still a win even without 0007 patch. Agreed to not consider it for initial commit. I'll hold on to it for some future time. > > 2. Management of memory contexts. It's pretty verbose and messy. I > > think the abstraction could be better: > > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we > > can't destroy or reset it.
That means we have to do a lot of manual > > work. > > B: Passing "max_bytes" to the radix tree was my idea, I believe, but > > it seems the wrong responsibility. Not all uses will have a > > work_mem-type limit, I'm guessing. We only use it for limiting the max > > block size, and aset's default 8MB is already plenty small for > > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so > > smaller, and there it makes sense to limit the max blocksize this way. > > C: The context for values has complex #ifdefs based on the value > > length/varlen, but it's both too much and not enough. If we get a bump > > context, how would we shoehorn that in for values for vacuum but not > > for tidbitmap? > > > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to > > TidStoreCreate(), and then to RT_CREATE. That context will contain the > > values (for local mem), and the node slabs will be children of the > > value context. That way, measuring memory usage and free-ing can just > > call with this parent context, and let recursion handle the rest. > > Perhaps the passed context can also hold the radix-tree struct, but > > I'm not sure since I haven't tried it. What do you think? > > If I understand your idea correctly, RT_CREATE() creates the context > for values as a child of the passed context and the node slabs as > children of the value context. That way, measuring memory usage can > just call with the value context. It sounds like a good idea. But it > was not clear to me how to address point B and C. For B & C, vacuum would create a context to pass to TidStoreCreate, and it wouldn't need to bother changing max block size. RT_CREATE would use that directly for leaves (if any), and would only create child slab contexts under it. It would not need to know about max_bytes. 
Modifying your diagram a bit, something like: - caller-supplied radix tree memory context (the 3 structs -- and leaves, if any) (aset (or future bump?)) - node slab contexts This might only be workable with aset, if we need to individually free the structs. (I haven't studied this, it was a recent idea) It's simpler, because with small fixed length values, we don't need to detect that and avoid creating a leaf context. All leaves would live in the same context as the structs. > Another variant of this idea would be that RT_CREATE() creates the > parent context of the value context to store radix-tree struct. That > is, the hierarchy would be like: > > A MemoryContext (passed by vacuum through tidstore) > - radix tree memory context (store radix-tree struct, control > struct, and iterator) > - value context (aset, slab, or bump) > - node slab contexts The template handling the value context here is complex, and is what I meant by 'C' above. Most fixed length allocations in all of the backend are aset, so it seems fine to use it always. > Freeing can just call with the radix tree memory context. And perhaps > it works even if tidstore passes CurrentMemo
RE: pg_upgrade and logical replication
Dear Justin, Thanks for replying! > What optimizations? I can't see them, and since the patch is described > as rearranging test cases (and therefore already difficult to read), I > guess they should be a separate patch, or the optimizations described. The basic idea was to reduce the number of CREATE/DROP statements, but it was changed for now - publications and subscriptions are created and dropped per test case. E.g., in the successful-upgrade case, the steps below are done: 1. create two publications 2. create a subscription with failover = true 3. avoid further initial sync by setting max_logical_replication_workers = 0 4. create another subscription 5. confirm the status of each table is either 'i' or 'r' 6. run pg_upgrade 7. confirm table statuses are preserved 8. confirm replication origins are preserved. The new patch is available in [1]. [1]: https://www.postgresql.org/message-id/TYCPR01MB12077B16EEDA360BA645B96F8F54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Re: speed up a logical replica setup
On Thu, Feb 15, 2024, at 8:23 AM, Hayato Kuroda (Fujitsu) wrote: > > Points raised by me [1] are not solved yet. > > > > * What if the target version is PG16-? pg_ctl and pg_resetwal won't work. $ pg_ctl start -D /tmp/blah waiting for server to start 2024-02-15 23:50:03.448 -03 [364610] FATAL: database files are incompatible with server 2024-02-15 23:50:03.448 -03 [364610] DETAIL: The data directory was initialized by PostgreSQL version 16, which is not compatible with this version 17devel. stopped waiting pg_ctl: could not start server Examine the log output. $ pg_resetwal -D /tmp/blah pg_resetwal: error: data directory is of wrong version pg_resetwal: detail: File "PG_VERSION" contains "16", which is not compatible with this program's version "17". > > * What if the found executables have a different version from > > pg_createsubscriber? The new code takes care of it. > > * What if the target is sending WAL to another server? > > I.e., there are clusters like `node1->node2->node3`, and the target is > > node2. The new code detects if the server is in recovery and aborts as you suggested. A new option can be added to ignore the fact that there are servers receiving WAL from it. > > * Can we really clean up the standby in case of failure? > > Shouldn't we suggest to remove the target once? If it finishes the promotion, no. I adjusted the cleanup routine a bit to avoid it. However, we should provide instructions to inform the user that it should create a fresh standby and try again. > > * Can we move outputs to stdout? Are you suggesting to use another logging framework? It is not a good idea because each client program is already using common/logging.c. > 1. > Cfbot got angry [1]. This is because WIFEXITED and others are defined in > <sys/wait.h>, > but the inclusion was removed per comment. Added the inclusion again. Ok. > 2. > As Shubham pointed out [3], when we convert an intermediate node of cascading > replication, > the last node would get stuck.
This is because a walreceiver process requires > nodes to have the same > system identifier (in WalReceiverMain), but it would be changed by > pg_createsubscriber. Hopefully it was fixed. > 3. > Moreover, when we convert a last node of a cascade, it won't work well, because > we cannot create > publications on the standby node. Ditto. > 4. > If the standby server was initialized as PG16-, this command would fail, > because the API of pg_logical_create_replication_slot() was changed. See comment above. > 5. > Also, the pg_ctl commands used must have the same version as the instance. > I think we should require all the executables and servers to be the same > major version. It is enforced by the new code. See find_other_exec() in get_exec_path(). > Based on them, below part describes attached ones: Thanks for another review. I'm sharing a new patch to merge a bunch of improvements and fixes. Comments are below. v20-0002: I did some extensive documentation changes (including some of them related to the changes in the new patch). I will defer its update to check v20-0002. It will be included in the next one. v20-0003: I included most of it. There are a few things that pgindent reverted so I didn't apply. I also didn't like some SQL commands that were broken into multiple lines with spaces at the beginning. It seems nice in the code but it is not in the output. v20-0004: Nice catch. Applied. v20-0005: Applied. v20-0006: I prefer to remove the comment. v20-0007: I partially applied it. I only removed the states that were not used and propose another dry run mode message. Maybe it is clearer than it was. v20-0008: I refactored the get_bin_directory code. On reflection, I reverted the unified binary directory that we agreed on a few days ago. The main reason is to provide a specific error message for each program it is using. The get_exec_path will check if the program is available in the same directory as pg_createsubscriber and if it has the same version.
An absolute path is returned and is used by some functions. v20-0009: to be reviewed. v20-0010: As I said above, this code was refactored so I didn't apply this one. v20-0011: Do we really want to interrupt the recovery if the network was momentarily interrupted or if the OS killed walsender? Recovery is critical for the process. I think we should do our best to be resilient and deliver all changes required by the new subscriber. The proposal is not correct because the query returns no tuples if it is disconnected, so you cannot call PQgetvalue(). The retry interval (the time until ServerLoop() will create another walreceiver) is determined by DetermineSleepTime() and it is a maximum of 5 seconds (SIGKILL_CHILDREN_AFTER_SECS). One idea is to retry 2 or 3 times before giving up on the pg_stat_wal_receiver query. Do you have a better plan? v20-0012: I applied a different patch to accomplish the same thing. I included a refactor around the pg_is_in_recovery() function to be used in two other places. Besides
Re: Memory consumed by paths during partitionwise join planning
On 15/2/2024 19:06, Ashutosh Bapat wrote: On Thu, Feb 15, 2024 at 9:41 AM Andrei Lepikhov But I'm not sure about freeing unreferenced paths. I would have to see alternatives in the pathlist. I didn't understand this. Can you please elaborate? A path in any pathlist is referenced. An unreferenced path should not be in any pathlist. I mean that at some point, an extension can reconsider the path tree after building the top node of this path. I vaguely recall that we already have (or had) this kind of optimization in the core, where part of the plan changes after it has been built. Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths and switching between them in real time. It requires both paths in the path list at the moment the extension hooks are called. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension. About partitioning. As I discovered planning issues connected to partitions, the painful problem is a rule according to which we try the full repertoire of possible paths for each partition. With indexes, it quickly increases optimization work. IMO, a 'symmetrical' approach could help here: it could restrict the scope of possible paths for upcoming partitions if we filter some paths in a set of previously planned partitions. filter or free? Filter. I meant that Postgres tries to apply IndexScan, BitmapScan, IndexOnlyScan, and other strategies, iterating over each partition's indexes. The optimizer spends a lot of time doing that. So, why not introduce a symmetrical strategy and exclude some scan types from the search, based on the planning experience of previous partitions of the same table: if you have dozens of partitions, is it beneficial for the system to find a slightly more optimal IndexScan on one partition while having SeqScans on 999 others? -- regards, Andrei Lepikhov Postgres Professional
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 15, 2024 at 8:26 PM John Naylor wrote: > > On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada > wrote: > > > > On Sat, Feb 10, 2024 at 9:29 PM John Naylor wrote: > > > I've also run the same scripts in my environment just in case and got > > similar results: > > Thanks for testing, looks good as well. > > > > There are still some micro-benchmarks we could do on tidstore, and > > > it'd be good to find out worse-case memory use (1 dead tuple each on > > > spread-out pages), but this is decent demonstration. > > > > I've tested a simple case where vacuum removes 33k dead tuples spread > > about every 10 pages. > > > > master: > > 198,000 bytes (=33000 * 6) > > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s > > > > v-59: > > 2,834,432 bytes (reported by TidStoreMemoryUsage()) > > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s > > The memory usage for the sparse case may be a concern, although it's > not bad -- a multiple of something small is probably not huge in > practice. See below for an option we have for this. > > > > > > I'm pretty sure there's an > > > > > accidental memset call that crept in there, but I'm running out of > > > > > steam today. > > > > > > I have just a little bit of work to add for v59: > > > > > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero > > > any bitmapwords. That can only happen if e.g. there is an offset > 128 > > > and there are none between 64 and 128, so not a huge deal but I think > > > it's a bit nicer in this patch. > > > > LGTM. > > Okay, I've squashed this. > > > I've drafted the commit message. > > Thanks, this is a good start. > > > I've run regression tests with valgrind and run the coverity scan, and > > I don't see critical issues. > > Great! > > Now, I think we're in pretty good shape. There are a couple of things > that might be objectionable, so I want to try to improve them in the > little time we have: > > 1. Memory use for the sparse case. 
I shared an idea a few months ago > of how runtime-embeddable values (true combined pointer-value slots) > could work for tids. I don't think this is a must-have, but it's not a > lot of code, and I have this working: > > v61-0006: Preparatory refactoring -- I think we should do this anyway, > since the intent seems more clear to me. Looks good refactoring to me. > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > reduce memory regressions, so should be considered. Up to 3 tids can > be stored in the last level child pointer. It's not polished, but I'll > only proceed with that if we think we need this. "flags" is called > that because it could hold tidbitmap.c booleans (recheck, lossy) in > the future, in addition to reserving space for the pointer tag. Note: > I hacked the tests to only have 2 offsets per block to demo, but of > course both paths should be tested. Interesting. I've run the same benchmark tests we did[1][2] (the median of 3 runs):

monotonically ordered int column index:
master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

uuid column index:
master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
v-59: system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
v-62: system usage: CPU: user: 4.06 s, system: 0.98 s, elapsed: 5.06 s

int & uuid indexes in parallel:
master: system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s
v-59: system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
v-62: system usage: CPU: user: 2.54 s, system: 0.94 s, elapsed: 6.00 s

sparse case:
master: 198,000 bytes (=33000 * 6) system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
v-59: 2,834,432 bytes (reported by TidStoreMemoryUsage()) system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s
v-62: 729,088 bytes (reported
by TidStoreMemoryUsage()) system usage: CPU: user: 4.63 s, system: 0.86 s, elapsed: 5.50 s I'm happy to see a huge improvement. While it's really fascinating to me, I'm concerned about the time left until the feature freeze. We need to polish both tidstore and vacuum integration patches in 5 weeks. Personally I'd like to have it as a separate patch for now, and focus on completing the main three patches since we might face some issues after pushing these patches. I think with 0007 patch it's a big win but it's still a win even without 0007 patch. > > 2. Management of memory contexts. It's pretty verbose and messy. I > think the abstraction could be better: > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we > can't destroy or reset it. That means we have to do a lot of manual > work. > B: Passing "max_bytes" to the radix tree was my idea, I believe, but > it seems the wrong responsibility. Not all uses will have a > work_mem-type limit, I'm guessing
RE: pg_upgrade and logical replication
Dear Amit, > This sounds like a reasonable way to address the reported problem. OK, thanks! > Justin, do let me know if you think otherwise? > > Comment: > === > * > -# Setup an enabled subscription to verify that the running status and > failover > -# option are retained after the upgrade. > +# Setup a subscription to verify that the failover option are retained after > +# the upgrade. > $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1"); > $old_sub->safe_psql('postgres', > - "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION > regress_pub1 WITH (failover = true)" > + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' > PUBLICATION > regress_pub1 WITH (failover = true, enabled = false)" > ); > > I think it is better not to create a subscription in the early stage > which we wanted to use for the success case. Let's have separate > subscriptions for failure and success cases. I think that will avoid > the newly added ALTER statements in the patch. I made a patch to avoid creating objects as much as possible, but it may lead to some confusion. I recreated the patch so that it creates pub/sub and drops them at cleanup for every test case. PSA a new version. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ v3-0001-Fix-testcase.patch Description: v3-0001-Fix-testcase.patch
Re: make add_paths_to_append_rel aware of startup cost
On Fri, 16 Feb 2024 at 01:09, David Rowley wrote: > > On Thu, 15 Feb 2024 at 21:42, Andy Fan wrote: > > I found the both plans have the same cost, I can't get the accurate > > cause of this after some hours research, but it is pretty similar with > > 7516056c584e3, so I uses a similar strategy to stable it. is it > > acceptable? > > It's pretty hard to say. I can only guess why this test would be > flapping like this. I see it's happened before on mylodon, so probably > not a cosmic ray. It's not like add_path() chooses a random path when > the costs are the same, so I wondered if something similar is going on > here that was going on that led to f03a9ca4. In particular, see [1]. While it's not conclusive proof, the following demonstrates that relpages dropping by just 1 page causes the join order to change.

regression=# explain
regression-# select t1.unique1 from tenk1 t1
regression-# inner join tenk2 t2 on t1.tenthous = t2.tenthous
regression-# union all
regression-# (values(1)) limit 1;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Limit  (cost=0.00..150.08 rows=1 width=4)
   ->  Append  (cost=0.00..1500965.01 rows=10001 width=4)
         ->  Nested Loop  (cost=0.00..1500915.00 rows=1 width=4)
               Join Filter: (t1.tenthous = t2.tenthous)
               ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=1 width=8)
               ->  Materialize  (cost=0.00..495.00 rows=1 width=4)
                     ->  Seq Scan on tenk2 t2  (cost=0.00..445.00 rows=1 width=4)
         ->  Result  (cost=0.00..0.01 rows=1 width=4)

regression=# update pg_class set relpages=relpages - 1 where relname = 'tenk2';
UPDATE 1
regression=# explain
regression-# select t1.unique1 from tenk1 t1
regression-# inner join tenk2 t2 on t1.tenthous = t2.tenthous
regression-# union all
regression-# (values(1)) limit 1;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Limit  (cost=0.00..150.52 rows=1 width=4)
   ->  Append  (cost=0.00..1505315.30 rows=10001 width=4)
         ->  Nested Loop  (cost=0.00..1505265.29 rows=1 width=4)
               Join Filter: (t1.tenthous = t2.tenthous)
               ->  Seq Scan on tenk2 t2  (cost=0.00..445.29 rows=10029 width=4)
               ->  Materialize  (cost=0.00..495.00 rows=1 width=8)
                     ->  Seq Scan on tenk1 t1  (cost=0.00..445.00 rows=1 width=8)
         ->  Result  (cost=0.00..0.01 rows=1 width=4)

I tried this with the proposed changes to the test and the plan did not change. I've pushed the change now. David
Re: Do away with zero-padding assumption before WALRead()
At Tue, 13 Feb 2024 11:47:06 +0530, Bharath Rupireddy wrote in > Hi, > > I noticed an assumption [1] at WALRead() call sites expecting the > flushed WAL page to be zero-padded after the flush LSN. I think this > can't always be true as the WAL can get flushed after determining the > flush LSN before reading it from the WAL file using WALRead(). I've > hacked the code up a bit to check if that's true - Good catch! The comment also seems wrong to me. The subsequent bytes can be written simultaneously, and it's very normal that there are unflushed bytes in the OS's page buffer. The objective of the comment seems to be to declare that there's no need to clear out the remaining bytes here. I agree that it's not a problem for now. However, I think we need two fixes here. 1. It's useless to copy the whole page regardless of the 'count'. It's enough to copy only up to the 'count'. The patch looks good in this regard. 2. Maybe we need a comment that states that the page_read callback functions leave garbage bytes beyond the returned count, due to partial copying without clearing the unused portion. > I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ > despite knowing exactly how much to read. Is it to tell the OS to > explicitly fetch the whole page from the disk? If yes, the OS will do > that anyway because the page transfers from disk to OS page cache are > always in terms of disk block sizes, no? If I understand your question correctly, I guess that the whole-page copy was expected to clear out the remaining bytes, as I mentioned earlier. > Although, there's no immediate problem with it right now, the > assumption is going to be incorrect when reading WAL from WAL buffers > using WALReadFromBuffers - > https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hla_hsgwnx2y5-g+mswd...@mail.gmail.com. > > If we have no reason, can the WALRead() callers just read how much > they want like walsender for physical replication?
> Attached a patch for the change. > > Thoughts? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Fix incorrect PG_GETARG in pgcrypto
On Tue, Feb 13, 2024 at 7:08 PM Michael Paquier wrote: > > On Tue, Feb 13, 2024 at 05:36:36PM +0900, Michael Paquier wrote: > > You've indeed grabbed some historical inconsistencies here. Please > > note that your patch has reversed diffs (for example, the SQL > > definition of pgp_sym_encrypt_bytea uses bytea,text,text as arguments > > and your resulting patch shows how HEAD does the job with > > bytea,bytea,bytea), but perhaps you have generated it with a command > > like `git diff -R`? > > The reversed part of the patch put aside aside, I've double-checked > your patch and the inconsistencies seem to be all addressed in this > area. Thanks for fixing and merging this patch, I appreciate it! Thanks, Shihao > The symmetry that we have now between the bytea and text versions of > the functions is stunning, but I cannot really get excited about > merging all of them either as it would imply a bump of pgcrypto to > update the prosrc of these functions, and we have to maintain runtime > compatibility with older versions. > -- > Michael
Re: make add_paths_to_append_rel aware of startup cost
David Rowley writes: > I'd be more happy using this one as percentage-wise, the cost > difference is much larger. +1 for the percentage-wise. > > I checked that the t2.thounsand = 0 query still tests the cheap > startup paths in add_paths_to_append_rel(). I get the same conclusion here. > > So, if nobody has any better ideas, I'm just going to push the " and > t2.thousand = 0" adjustment. LGTM. -- Best Regards Andy Fan
Re: When extended query protocol ends?
Hi Dave, Oh, I see. > Hi Tatsuo, > > Actually no need, I figured it out. > > I don't have a solution yet though. > > Dave Cramer > www.postgres.rocks > > > On Thu, 15 Feb 2024 at 19:43, Tatsuo Ishii wrote: > >> > Can you ask the OP what they are doing in the startup. I'm trying to >> > replicate their situation. >> > Looks like possibly 'setReadOnly' and 'select version()' >> >> Sure I will. By the way 'select version()' may be issued by Pgpool-II >> itself. In this case it should be 'SELECT version()', not 'select >> version()'. >> >> Best regards, >> -- >> Tatsuo Ishii >> SRA OSS LLC >> English: http://www.sraoss.co.jp/index_en/ >> Japanese: http://www.sraoss.co.jp >>
Re: [PoC] Improve dead tuple storage for lazy vacuum
v61 had a brown-paper-bag bug in the embedded tids patch that didn't show up in the tidstore test but caused vacuum to fail; fixed in v62. v62-ART.tar.gz Description: application/gzip
Re: When extended query protocol ends?
Hi Tatsuo, Actually no need, I figured it out. I don't have a solution yet though. Dave Cramer www.postgres.rocks On Thu, 15 Feb 2024 at 19:43, Tatsuo Ishii wrote: > > Can you ask the OP what they are doing in the startup. I'm trying to > > replicate their situation. > > Looks like possibly 'setReadOnly' and 'select version()' > > Sure I will. By the way 'select version()' may be issued by Pgpool-II > itself. In this case it should be 'SELECT version()', not 'select > version()'. > > Best regards, > -- > Tatsuo Ishii > SRA OSS LLC > English: http://www.sraoss.co.jp/index_en/ > Japanese: http://www.sraoss.co.jp >
Re: When extended query protocol ends?
> Can you ask the OP what they are doing in the startup. I'm trying to > replicate their situation. > Looks like possibly 'setReadOnly' and 'select version()' Sure I will. By the way 'select version()' may be issued by Pgpool-II itself. In this case it should be 'SELECT version()', not 'select version()'. Best regards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese: http://www.sraoss.co.jp
RE: Synchronizing slots from primary to standby
On Thursday, February 15, 2024 8:29 PM Amit Kapila wrote: > > On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot > wrote: > > > > On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote: > > > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) > > > wrote: > > > > Attach the v2 patch here. > > > > > > > > Apart from the new log message, I think we can add one more debug > > > > message in reserve_wal_for_local_slot; this could be useful to analyze the failure. > > > > > > Yeah, that can also be helpful, but the added message looks naive to me. > > > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno); > > > > > > Instead of the above, how about something like: "segno: %ld of > > > purposed restart_lsn for the synced slot, oldest_segno: %ld > > > available"? > > > > Looks good to me. I'm not sure if it would make more sense to elog > > only if segno < oldest_segno, meaning just before the > XLogSegNoOffsetToRecPtr() call? > > > > But I'm fine with the proposed location too. > > > > I am also fine either way, but the current location gives the required information in > more cases and could be helpful in debugging this new facility. > > > > > > > > And we > > > > can also enable the DEBUG log in the 040 tap-test; I see we have > > > > a similar setting in 010_logical_decoding_timelines and logging the debug1 > > > > message doesn't increase noticeable time on my machine. These are done in 0002. > > > > > > > > > > I haven't tested it but I think this can help in debugging BF > > > failures, if any. I am not sure if we should keep it like that permanently, but > > > until these tests are stabilized, this sounds like a good > > > idea. So, how about making the test changes a separate patch so > > > that later, if required, we can revert/remove it easily? Bertrand, do > > > you have any thoughts on this?
> > > > +1 on having the DEBUG log in the 040 tap-test until it's stabilized (I > > think we > > took the same approach for 035_standby_logical_decoding.pl IIRC) and > > then revert it back. > > > > Good to know! > > > Also I was thinking: what about adding an output to > pg_sync_replication_slots()? > > The output could be the number of sync slots that have been created > > and are not considered as sync-ready during the execution. > > > > Yeah, we can consider outputting some information via this function, like how > many slots are synced and persisted, but I am not sure what would be appropriate > here, because one can anyway find that (and more) information by querying > pg_replication_slots. I think we can keep discussing what makes more sense as a > return value, but let's commit the debug/log patches as they will be helpful to > analyze BF failures or any other issues reported. Agreed. Here is the new patch set as suggested. I used debug2 in the 040 as it could provide more information about communication between the primary and standby. This also doesn't increase noticeable testing time on my machine for a debug build. Best Regards, Hou zj v3-0002-Add-a-DEBUG1-message-for-slot-sync.patch Description: v3-0002-Add-a-DEBUG1-message-for-slot-sync.patch v3-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch Description: v3-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch v3-0003-change-the-log-level-of-sync-slot-test-to-debug2.patch Description: v3-0003-change-the-log-level-of-sync-slot-test-to-debug2.patch
Re: POC, WIP: OR-clause support for indexes
On Wed, Feb 14, 2024 at 11:21 AM Andrei Lepikhov wrote: > > So, this example is more about the subtle balance between > parallel/sequential execution, which can vary from one platform to another. > Hi, here I have attached two files, expression_num_or_1_100.sql and expression_num_or_1_1.sql; they contain step-by-step test cases along with their output. For both sql files, I already set max_parallel_workers_per_gather to 10 and work_mem to 4GB, so I think the parameter settings should be fine. In expression_num_or_1_100.sql, the main test table is: create table test_1_100 as (select (random()*1000)::int x, (random()*1000) y from generate_series(1,1_000_000) i); If the number of OR clauses exceeds 29, the performance with enable_or_transformation (ON) begins to outpace enable_or_transformation (OFF). If the number of OR clauses is less than 29, the performance with enable_or_transformation (OFF) is better than with enable_or_transformation (ON). In expression_num_or_1_1.sql, enable_or_transformation (ON) is always better than enable_or_transformation (OFF). My OS: Ubuntu 22.04.3 LTS. I already set max_parallel_workers_per_gather to 10, so for all cases it should use parallelism first? A better question would be: how can queries with fewer than 29 OR clauses still be made faster when enable_or_transformation is ON, by setting parameters only? expression_num_or_1_100.sql Description: application/sql expression_num_or_1_1.sql Description: application/sql
Re: glibc qsort() vulnerability
Here is what I have staged for commit. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 8e5a66fb3a0787f15a900a89742862c89da38a1d Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 15 Feb 2024 10:53:11 -0600 Subject: [PATCH v8 1/3] Remove direct calls to pg_qsort(). Calls to this function might give the idea that pg_qsort() is somehow different than qsort(), when in fact all calls to qsort() in the Postgres tree are redirected to pg_qsort() via a macro. This commit removes all direct calls to pg_qsort() in favor of letting the macro handle the conversions. Discussion: https://postgr.es/m/CA%2B14426g2Wa9QuUpmakwPxXFWG_1FaY0AsApkvcTBy-YfS6uaw%40mail.gmail.com --- contrib/pg_prewarm/autoprewarm.c| 4 ++-- src/backend/access/brin/brin_minmax_multi.c | 2 +- src/backend/optimizer/plan/analyzejoins.c | 4 ++-- src/backend/storage/buffer/bufmgr.c | 4 ++-- src/backend/utils/cache/syscache.c | 10 +- src/include/port.h | 3 +++ src/port/qsort.c| 3 +++ 7 files changed, 18 insertions(+), 12 deletions(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index 06ee21d496..248b9914a3 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -346,8 +346,8 @@ apw_load_buffers(void) FreeFile(file); /* Sort the blocks to be loaded. */ - pg_qsort(blkinfo, num_elements, sizeof(BlockInfoRecord), - apw_compare_blockinfo); + qsort(blkinfo, num_elements, sizeof(BlockInfoRecord), + apw_compare_blockinfo); /* Populate shared memory state. */ apw_state->block_info_handle = dsm_segment_handle(seg); diff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c index 3ffaad3e42..2c29aa3d4e 100644 --- a/src/backend/access/brin/brin_minmax_multi.c +++ b/src/backend/access/brin/brin_minmax_multi.c @@ -1369,7 +1369,7 @@ build_distances(FmgrInfo *distanceFn, Oid colloid, * Sort the distances in descending order, so that the longest gaps are at * the front. 
*/ - pg_qsort(distances, ndistances, sizeof(DistanceValue), compare_distances); + qsort(distances, ndistances, sizeof(DistanceValue), compare_distances); return distances; } diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index 7dcb74572a..e494acd51a 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -2307,8 +2307,8 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove) j++; } - pg_qsort(candidates, numRels, sizeof(SelfJoinCandidate), - self_join_candidates_cmp); + qsort(candidates, numRels, sizeof(SelfJoinCandidate), + self_join_candidates_cmp); /* * Iteratively form a group of relation indexes with the same oid and diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index eb1ec3b86d..07575ef312 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3915,7 +3915,7 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators) /* sort the list of rlocators if necessary */ if (use_bsearch) - pg_qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator); + qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator); for (i = 0; i < NBuffers; i++) { @@ -4269,7 +4269,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels) /* sort the list of SMgrRelations if necessary */ if (use_bsearch) - pg_qsort(srels, nrels, sizeof(SMgrSortArray), rlocator_comparator); + qsort(srels, nrels, sizeof(SMgrSortArray), rlocator_comparator); for (i = 0; i < NBuffers; i++) { diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index 162855b158..662f930864 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -146,14 +146,14 @@ InitCatalogCache(void) Assert(SysCacheSupportingRelOidSize <= lengthof(SysCacheSupportingRelOid)); /* Sort and de-dup OID arrays, so we can use binary search. 
*/ - pg_qsort(SysCacheRelationOid, SysCacheRelationOidSize, - sizeof(Oid), oid_compare); + qsort(SysCacheRelationOid, SysCacheRelationOidSize, + sizeof(Oid), oid_compare); SysCacheRelationOidSize = qunique(SysCacheRelationOid, SysCacheRelationOidSize, sizeof(Oid), oid_compare); - pg_qsort(SysCacheSupportingRelOid, SysCacheSupportingRelOidSize, - sizeof(Oid), oid_compare); + qsort(SysCacheSupportingRelOid, SysCacheSupportingRelOidSize, + sizeof(Oid), oid_compare); SysCacheSupportingRelOidSize = qunique(SysCacheSupportingRelOid, SysCacheSupportingRelOidSize, sizeof(Oid), oid_compare); @@ -668,7 +668,7 @@ RelationSupportsSysCache(Oid relid) /* - * OID comparator for pg_qsort + * OID comparator for qsort */ static int oid_compare(const void *a, const void *b) diff --git a/src/include/port.h b/
Re: Transaction timeout
Hi, On 2024-02-13 23:42:35 +0200, Alexander Korotkov wrote: > diff --git a/src/backend/access/transam/xact.c > b/src/backend/access/transam/xact.c > index 464858117e0..a124ba59330 100644 > --- a/src/backend/access/transam/xact.c > +++ b/src/backend/access/transam/xact.c > @@ -2139,6 +2139,10 @@ StartTransaction(void) >*/ > s->state = TRANS_INPROGRESS; > > + /* Schedule transaction timeout */ > + if (TransactionTimeout > 0) > + enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout); > + > ShowTransactionState("StartTransaction"); > } Isn't it a problem that all uses of StartTransaction() trigger a timeout, but transaction commit/abort don't? What if e.g. logical replication apply starts a transaction, commits it, and then goes idle? The timer would still be active, afaict? I don't think it works well to enable timeouts in xact.c and to disable them in PostgresMain(). > @@ -4491,12 +4511,18 @@ PostgresMain(const char *dbname, const char *username) > > pgstat_report_activity(STATE_IDLEINTRANSACTION_ABORTED, NULL); > > /* Start the idle-in-transaction timer */ > - if (IdleInTransactionSessionTimeout > 0) > + if (IdleInTransactionSessionTimeout > 0 > + && (IdleInTransactionSessionTimeout < > TransactionTimeout || TransactionTimeout == 0)) > { > idle_in_transaction_timeout_enabled = > true; > > enable_timeout_after(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, > > IdleInTransactionSessionTimeout); > } > + > + /* Schedule or reschedule transaction timeout */ > + if (TransactionTimeout > 0 && > !get_timeout_active(TRANSACTION_TIMEOUT)) > + > enable_timeout_after(TRANSACTION_TIMEOUT, > + > TransactionTimeout); > } > else if (IsTransactionOrTransactionBlock()) > { > @@ -4504,12 +4530,18 @@ PostgresMain(const char *dbname, const char *username) > pgstat_report_activity(STATE_IDLEINTRANSACTION, > NULL); > > /* Start the idle-in-transaction timer */ > - if (IdleInTransactionSessionTimeout > 0) > + if (IdleInTransactionSessionTimeout > 0 > + && (IdleInTransactionSessionTimeout < > 
TransactionTimeout || TransactionTimeout == 0)) > { > idle_in_transaction_timeout_enabled = > true; > > enable_timeout_after(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, > > IdleInTransactionSessionTimeout); > } > + > + /* Schedule or reschedule transaction timeout */ > + if (TransactionTimeout > 0 && > !get_timeout_active(TRANSACTION_TIMEOUT)) > + > enable_timeout_after(TRANSACTION_TIMEOUT, > + > TransactionTimeout); > } > else > { Why do we need to do anything in these cases if the timer is started in StartTransaction()? > new file mode 100644 > index 000..ce2c9a43011 > --- /dev/null > +++ b/src/test/isolation/specs/timeouts-long.spec > @@ -0,0 +1,35 @@ > +# Tests for transaction timeout that require long wait times > + > +session s7 > +step s7_begin > +{ > +BEGIN ISOLATION LEVEL READ COMMITTED; > +SET transaction_timeout = '1s'; > +} > +step s7_commit_and_chain { COMMIT AND CHAIN; } > +step s7_sleep{ SELECT pg_sleep(0.6); } > +step s7_abort{ ABORT; } > + > +session s8 > +step s8_begin > +{ > +BEGIN ISOLATION LEVEL READ COMMITTED; > +SET transaction_timeout = '900ms'; > +} > +# to test that quick query does not restart transaction_timeout > +step s8_select_1 { SELECT 1; } > +step s8_sleep{ SELECT pg_sleep(0.6); } > + > +session checker > +step checker_sleep { SELECT pg_sleep(0.3); } Isn't this test going to be very fragile on busy / slow machines? What if the pg_sleep() takes one second, because there were other tasks to schedule? I'd be surprised if this didn't fail under valgrind, for example. Greetings, Andres Freund
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent wrote: > Attached 2 patches: v11.patch-a and v11.patch-b. Both are incremental > on top of your earlier set, and both don't allocate additional memory > in the merge operation in non-assertion builds. > > patch-a is a trivial and clean implementation of mergesort, which > tends to reduce the total number of compare operations if both arrays > are of similar size and value range. It writes data directly back into > the main array on non-assertion builds, and with assertions it reuses > your binary search join on scratch space for algorithm validation. This patch fails some of my tests on non-assert builds only (assert builds pass all my tests, though). I'm using the first patch on its own here. While I tend to be relatively in favor of complicated assertions (I tend to think the risk of introducing side-effects is worth it), it looks like you're only performing certain steps in release builds. This is evident from just looking at the code (there is an #else block just for the release build in the loop). Note also that "Assert(_bt_compare_array_elements(&merged[merged_nelems++], orig, &cxt) == 0)" has side-effects in assert-enabled builds only (it increments merged_nelems). While it's true that you *also* increment merged_nelems *outside* of the assertion (or in an #else block used during non-assert builds), that is conditioned on some other thing (so it's in no way equivalent to the debug #ifdef USE_ASSERT_CHECKING case). It's also just really hard to understand what's going on here. If I was going to do this kind of thing, I'd use two completely separate loops, that were obviously completely separate (maybe even two functions). I'd then memcmp() each array at the end. -- Peter Geoghegan
RE: Psql meta-command conninfo+
Hi! (v16) In this version, I made a small adjustment to the indentation of the \conninfo code and described the columns as returned by \conninfo+ as suggested by Jim Jones. Regards, Maiquel Grassi. v16-0001-psql-meta-command-conninfo-plus.patch Description: v16-0001-psql-meta-command-conninfo-plus.patch
Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions
Hi, On 2024-02-11 13:19:00 -0500, David Benjamin wrote: > I've attached a patch for the master branch to fix up the custom BIOs used > by PostgreSQL, in light of the issues with the OpenSSL update recently. > While c82207a548db47623a2bfa2447babdaa630302b9 (switching from BIO_get_data > to BIO_get_app_data) resolved the immediate conflict, I don't think it > addressed the root cause, which is that PostgreSQL was mixing up two BIO > implementations, each of which expected to be driving the BIO. Yea, that's certainly not nice - and I think we've been somewhat lucky it hasn't caused more issues. There's some nasty possibilities, e.g. sock_ctrl() partially enabling ktls without our initialization having called ktls_enable(). Right now that just means ktls is unusable, but it's not hard to imagine accidentally ending up sending unencrypted data. I've in the past looked into not using a custom BIO [1], but I have my doubts about that being a good idea. I think medium term we want to be able to do network IO asynchronously, which seems quite hard to do when using openssl's socket BIO. > Once we've done that, we're free to use BIO_set_data. While BIO_set_app_data > works fine, I've reverted back to BIO_set_data because it's more commonly > used. > app_data depends on OpenSSL's "ex_data" mechanism, which is a tad heavier > under > the hood. At first I was a bit wary of that, because it requires us to bring back the fallback implementation. But you're right, it's noticeably heavier than BIO_get_data(), and we do call BIO_get_app_data() fairly frequently. > That leaves ctrl. ctrl is a bunch of operations (it's ioctl). The only > operation that matters is BIO_CTRL_FLUSH, which is implemented as a no-op. All > other operations are unused. It's once again good that they're unused because > otherwise OpenSSL might mess with postgres's socket, break the assumptions > around interrupt handling, etc. How did you determine that only FLUSH is required? 
I didn't even really find documentation about what the intended semantics actually are. E.g. should we implement BIO_CTRL_EOF? Sure, it wasn't really supported so far, because we never set it, but is that right? What about BIO_CTRL_GET_CLOSE/BIO_CTRL_SET_CLOSE? Another issue is that 0 doesn't actually seem like the universal error return - e.g. BIO_C_GET_FD seems to return -1, because 0 is a valid fd. As of your patch the bio doesn't actually have an FD anymore, should it still set BIO_TYPE_DESCRIPTOR? > +static long > +my_sock_ctrl(BIO *h, int cmd, long num, void *ptr) > +{ > + long res = 0; > + > + switch (cmd) > + { > + case BIO_CTRL_FLUSH: > + /* libssl expects all BIOs to support BIO_flush. */ > + res = 1; > + break; > + } > + > + return res; > +} I'd move the res = 0 into a default: block. That way the compiler can warn if some case doesn't set it in all branches. > static BIO_METHOD * > my_BIO_s_socket(void) > { Wonder if we should rename this. It's pretty odd that we still call it that when it's not really related to s_socket anymore, and it doesn't even really implement the same interface (e.g. get_fd doesn't work anymore). Similarly, my_SSL_set_fd() doesn't actually call set_fd() anymore, which sure seems odd. Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20210715021747.h2hih7mw56ivyntt%40alap3.anarazel.de
Re: 035_standby_logical_decoding unbounded hang
On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote: > On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote: > > The 035_standby_logical_decoding.pl hang is > > a race condition arising from an event sequence like this: > > > > - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU. > > - Test script calls pg_log_standby_snapshot() on primary. Emits > > XLOG_RUNNING_XACTS. > > - checkpoint_timeout makes a primary checkpoint finish. Emits > > XLOG_RUNNING_XACTS. > > - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits > > XLOG_RUNNING_XACTS. > > - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby. > > > > Other test code already has a solution for this, so the attached patches > > add a > > timeout and copy the existing solution. I'm also attaching the hack that > > makes it 100% reproducible. > I did a few tests and confirm that the proposed solution fixes the corner > case. Thanks for reviewing. > What about creating a sub, say wait_for_restart_lsn_calculation() in > Cluster.pm > and then make use of it in create_logical_slot_on_standby() and above? > (something > like wait_for_restart_lsn_calculation-v1.patch attached). Waiting for restart_lsn is just a prerequisite for calling pg_log_standby_snapshot(), so I wouldn't separate those two. If we're extracting a sub, I would move the pg_log_standby_snapshot() call into the sub and make the API like one of these: $standby->wait_for_subscription_starting_point($primary, $slot_name); $primary->log_standby_snapshot($standby, $slot_name); Would you like to finish the patch in such a way?
Re: index prefetching
On Thu, Feb 15, 2024 at 3:13 PM Andres Freund wrote: > > This is why I don't think that the tuples with lower page offset > > numbers are in any way significant here. The significant part is > > whether or not you'll actually need to visit more than one leaf page > > in the first place (plus the penalty from not being able to reorder > > the work across page boundaries in your initial v1 of prefetching). > > To me this your phrasing just seems to reformulate the issue. What I said to Tomas seems very obvious to me. I think that there might have been some kind of miscommunication (not a real disagreement). I was just trying to work through that. > In practical terms you'll have to wait for the full IO latency when fetching > the table tuple corresponding to the first tid on a leaf page. Of course > that's also the moment you had to visit another leaf page. Whether the stall > is due to visit another leaf page or due to processing the first entry on such > a leaf page is a distinction without a difference. I don't think anybody said otherwise? > > > That's certainly true / helpful, and it makes the "first entry" issue > > > much less common. But the issue is still there. Of course, this says > > > nothing about the importance of the issue - the impact may easily be so > > > small it's not worth worrying about. > > > > Right. And I want to be clear: I'm really *not* sure how much it > > matters. I just doubt that it's worth worrying about in v1 -- time > > grows short. Although I agree that we should commit a v1 that leaves > > the door open to improving matters in this area in v2. > > I somewhat doubt that it's realistic to aim for 17 at this point. That's a fair point. Tomas? > We seem to > still be doing fairly fundamental architectual work. I think it might be the > right thing even for 18 to go for the simpler only-a-single-leaf-page > approach though. I definitely think it's a good idea to have that as a fall back option. 
And to not commit ourselves to having something better than that for v1 (though we probably should commit to making that possible in v2). > I wonder if there are prerequisites that can be tackled for 17. One idea is to > work on infrastructure to provide executor nodes with information about the > number of tuples likely to be fetched - I suspect we'll trigger regressions > without that in place. I don't think that there'll be regressions if we just take the simpler only-a-single-leaf-page approach. At least it seems much less likely. > One way to *sometimes* process more than a single leaf page, without having to > redesign kill_prior_tuple, would be to use the visibilitymap to check if the > target pages are all-visible. If all the table pages on a leaf page are > all-visible, we know that we don't need to kill index entries, and thus can > move on to the next leaf page It's possible that we'll need a variety of different strategies. nbtree already has two such strategies in _bt_killitems(), in a way. Though its "Modified while not pinned means hinting is not safe" path (LSN doesn't match canary value path) seems pretty naive. The prefetching stuff might present us with a good opportunity to replace that with something fundamentally better. -- Peter Geoghegan
Re: logical decoding and replication of sequences, take 2
On 2/15/24 05:16, Robert Haas wrote: > On Wed, Feb 14, 2024 at 10:21 PM Tomas Vondra > wrote: >> The way I think about non-transactional sequence changes is as if they >> were tiny transactions that happen "fully" (including commit) at the LSN >> where the LSN change is logged. > > 100% this. > >> It does not mean we can arbitrarily reorder the changes, it only means >> the changes are applied as if they were independent transactions (but in >> the same order as they were executed originally). Both with respect to >> the other non-transactional changes, and to "commits" of other stuff. > > Right, this is very important and I agree completely. > > I'm feeling more confident about this now that I heard you say that > stuff -- this is really the key issue I've been worried about since I > first looked at this, and I wasn't sure that you were in agreement, > but it sounds like you are. I think we should (a) fix the locking bug > I found (but that can be independent of this patch) and (b) make sure > that this patch documents the points from the quoted material above so > that everyone who reads the code (and maybe tries to enhance it) is > clear on what the assumptions are. > > (I haven't checked whether it documents that stuff or not. I'm just > saying it should, because I think it's a subtlety that someone might > miss.) > Thanks for thinking about these issues with reordering events. Good we seem to be in agreement and that you feel more confident about this. I'll check if there's a good place to document this. For me, the part that I feel most uneasy about is the decoding while the snapshot is still being built (and can flip to consistent snapshot between the relfilenode creation and sequence change, confusing the logic that decides which changes are transactional). It seems "a bit weird" that we either keep the "simple" logic that may end up with incorrect "non-transactional" result, but happens to then work fine because we immediately discard the change. 
But it still feels better than the alternative, which requires us to start decoding stuff (relfilenode creation) before the snapshot being built is consistent, which we didn't do before - or at least not in this particular way. While I don't have a practical example where it would cause trouble now, I have a nagging feeling it might easily cause trouble in the future by making some new features harder to implement. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: index prefetching
Hi, On 2024-02-15 12:53:10 -0500, Peter Geoghegan wrote: > On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra > wrote: > > I may be missing something, but it seems fairly self-evident to me an > > entry at the beginning of an index page won't get prefetched (assuming > > the page-at-a-time thing). > > Sure, if the first item on the page is also the first item that we > need the scan to return (having just descended the tree), then it > won't get prefetched under a scheme that sticks with the current > page-at-a-time behavior (at least in v1). Just like when the first > item that we need the scan to return is from the middle of the page, > or more towards the end of the page. > > It is of course also true that we can't prefetch the next page's > first item until we actually visit the next page -- clearly that's > suboptimal. Just like we can't prefetch any other, later tuples from > the next page (until such time as we have determined for sure that > there really will be a next page, and have called _bt_readpage for > that next page.) > > This is why I don't think that the tuples with lower page offset > numbers are in any way significant here. The significant part is > whether or not you'll actually need to visit more than one leaf page > in the first place (plus the penalty from not being able to reorder > the work across page boundaries in your initial v1 of prefetching). To me, your phrasing just seems to reformulate the issue. In practical terms you'll have to wait for the full IO latency when fetching the table tuple corresponding to the first tid on a leaf page. Of course that's also the moment you had to visit another leaf page. Whether the stall is due to visiting another leaf page or due to processing the first entry on such a leaf page is a distinction without a difference. > > That's certainly true / helpful, and it makes the "first entry" issue > > much less common.
Of course, this says > > nothing about the importance of the issue - the impact may easily be so > > small it's not worth worrying about. > > Right. And I want to be clear: I'm really *not* sure how much it > matters. I just doubt that it's worth worrying about in v1 -- time > grows short. Although I agree that we should commit a v1 that leaves > the door open to improving matters in this area in v2. I somewhat doubt that it's realistic to aim for 17 at this point. We seem to still be doing fairly fundamental architectual work. I think it might be the right thing even for 18 to go for the simpler only-a-single-leaf-page approach though. I wonder if there are prerequisites that can be tackled for 17. One idea is to work on infrastructure to provide executor nodes with information about the number of tuples likely to be fetched - I suspect we'll trigger regressions without that in place. One way to *sometimes* process more than a single leaf page, without having to redesign kill_prior_tuple, would be to use the visibilitymap to check if the target pages are all-visible. If all the table pages on a leaf page are all-visible, we know that we don't need to kill index entries, and thus can move on to the next leaf page Greetings, Andres Freund
RE: Psql meta-command conninfo+
>v14 applies cleanly and the SSL info is now shown as previously >suggested. Here is a more comprehensive test: > > >$ /usr/local/postgres-dev/bin/psql -x "\ >host=server.uni-muenster.de >hostaddr=172.19.42.1 >user=jim dbname=postgres >sslrootcert=server-certificates/server.crt >sslcert=jim-certificates/jim.crt >sslkey=jim-certificates/jim.key" > >psql (17devel) >SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, >compression: off) >Type "help" for help. > >postgres=# SET ROLE foo; >SET >postgres=> \conninfo+ >Current Connection Information >-[ RECORD 1 >]--+--- >Database | postgres >Authenticated User | jim >System User | cert:emailAddress=j...@uni-muenster.de,CN=jim,OU=WWU >IT,O=Universitaet Muenster,L=Muenster,ST=Nordrhein-Westfalen,C=DE >Current User | foo >Session User | jim >Backend PID | 278294 >Server Address | 172.19.42.1 >Server Port | 5432 >Client Address | 192.168.178.27 >Client Port | 54948 >Socket Directory | >Host | server.uni-muenster.de >Encryption | SSL >Protocol | TLSv1.3 >Cipher | TLS_AES_256_GCM_SHA384 >Compression | off > >postgres=> \conninfo >You are connected to database "postgres" as user "jim" on host >"server.uni-muenster.de" (address "127.0.0.1") at port "5432". >SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, >compression: off) >A little odd is that "Server Address" of \conninfo+ differs from >"address" of \conninfo... // Hi Jim! Tests performed on CentOS Linux 7. I made some further observations and concluded that there will be more cases where the "address" from \conninfo will differ from the "Server Address" from \conninfo+. Below is a more detailed example. The "hostaddr" or "host" input in the psql call can be anything pointing to the local loopback and the connection will still be established.
For example, all of these are accepted: Case (inet): psql -x --host 0 psql -x --host 0.0.0.0 psql -x hostaddr=0 psql -x hostaddr=0.0.0.0 All these examples will have "Server Address" = 127.0.0.1 Case (inet6): psql -x --host :: psql -x --host * (this entry is not accepted) psql -x --host \* psql -x --host "*" psql -x hostaddr=:: psql -x hostaddr=* All these examples will have "Server Address" = ::1 Thus, the inet_server_addr() function will return 127.0.0.1 or ::1 which in some cases will differ from the "address" from \conninfo. [postgres@localhost bin]$ ./psql -x hostaddr=0 Password for user postgres: psql (17devel) SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, compression: off) Type "help" for help. postgres=# SET ROLE maiquel; SET postgres=> \conninfo You are connected to database "postgres" as user "postgres" on host "0" (address "0.0.0.0") at port "5432". SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, compression: off) postgres=> \conninfo+ Current Connection Information -[ RECORD 1 ]--+ Database | postgres Authenticated User | postgres System User| scram-sha-256:postgres Current User | maiquel Session User | postgres Backend PID| 15205 Server Address | 127.0.0.1 Server Port| 5432 Client Address | 127.0.0.1 Client Port| 57598 Socket Directory | Host | 0 Encryption | SSL Protocol | TLSv1.2 Cipher | ECDHE-RSA-AES256-GCM-SHA384 Compression| off postgres=> \q [postgres@localhost bin]$ ping 0.0.0.0 PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.061 ms 64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.069 ms 64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.071 ms 64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=0.107 ms ^C --- 0.0.0.0 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3003ms rtt min/avg/max/mdev = 0.061/0.077/0.107/0.017 ms As demonstrated above, "address" = 0.0.0.0 and "Server Address" = 127.0.0.1 are divergent. 
In practice, these IPs are the "same", and the ping from the example proves it. However, we are concerned here with the psql user, and this may seem confusing to them at first glance. I would like to propose a change in "address" so that it always returns the same as "Server Address", that is, to use the inet_server_addr() function in "address". Result: [postgres@localhost bin]$ ./psql -x hostaddr=0 psql (17devel) Type "help" for help. postgres=# \conninfo You are connected to database "postgres" as user "postgres" on host "0" (address "127.0.0.1") at port "5432". postgres=# \conninfo+ Current Connection Information -[ RECORD 1 ]--+-- Database | postgres Authenticated User | postgres System User| Current User | postgres Sessi
Re: RFC: Logging plan of the running query
Hi, On 2024-02-15 14:42:11 +0530, Robert Haas wrote: > I think the issue is very general. We have lots of subsystems that > both (a) use global variables and (b) contain CHECK_FOR_INTERRUPTS(). > If we process an interrupt while that code is in the middle of > manipulating its global variables and then again re-enter that code, > bad things might happen. We could try to rule that out by analyzing > all such subsystems and all places where CHECK_FOR_INTERRUPTS() may > appear, but that's very difficult. Suppose we took the alternative > approach recommended by Andrey Lepikhov in > http://postgr.es/m/b1b110ae-61f6-4fd9-9b94-f967db9b5...@app.fastmail.com > and instead set a flag that gets handled in some suitable place in the > executor code, like ExecProcNode(). If we did that, then we would know > that we're not in the middle of a syscache lookup, a catcache lookup, > a heavyweight lock acquisition, an ereport, or any other low-level > subsystem call that might create problems for the EXPLAIN code. > > In that design, the hack shown above would go away, and we could be > much more certain that we don't need any other similar hacks for other > subsystems. The only downside is that we might experience a slightly > longer delay before a requested EXPLAIN plan actually shows up, but > that seems like a pretty small price to pay for being able to reason > about the behavior of the system. I am very wary of adding overhead to ExecProcNode() - I'm quite sure that adding code there would trigger visible overhead for query times. If we went with something like that approach, I think we'd have to do something like redirecting node->ExecProcNode to a wrapper, presumably from within a CFI. That wrapper could then implement the explain support, without slowing down the normal execution path. > I don't *think* there are any cases where we run in the executor for a > particularly long time without a new call to ExecProcNode(). I guess it depends on what you call a long time.
A large sort, for example, could spend a fair amount of time inside tuplesort, similarly, a gather node might need to wait for a worker for a while etc. > It's really hard for me to accept that the heavyweight lock problem > for which the patch contains a workaround is the only one that exists. > I can't see any reason why that should be true. I suspect you're right. Greetings, Andres Freund
Re: Document efficient self-joins / UPDATE LIMIT techniques.
> > > As for whether it's commonplace, when I was a consultant I had a number > > of customers who bemoaned how large updates caused big > > replica lag, basically punishing access to records they did care about > > in order to properly archive or backfill records they don't care about. > > I used the technique a lot, putting the update/delete in a loop, and > > often running multiple copies of the same script at times when I/O > > contention was low, but if load levels rose it was trivial to just kill > > a few of the scripts until things calmed down. > > I've also used the technique quite a lot, but only using the PK, > didn't know about the ctid trick, so many thanks for documenting it. tid-scans only became a thing a few versions ago (12?). Prior to that, PK was the only way to go.
Re: partitioning and identity column
Hello Ashutosh, 24.01.2024 09:34, Ashutosh Bapat wrote: There's another thing I found. The file isn't using check_stack_depth() in the functions which traverse inheritance hierarchies. This isn't just a problem of the identity-related functions but most of the functions in that file. Do you think it's worth fixing it? I suppose the number of inheritance levels is usually not a problem for stack depth? Practically it should not. I would rethink the application design if it requires so many inheritance or partition levels. But functions in optimizer like try_partitionwise_join() and set_append_rel_size() call /* Guard against stack overflow due to overly deep inheritance tree. */ check_stack_depth(); I am fine if we want to skip this. I've managed to reach stack overflow inside ATExecSetIdentity() with the following script: (echo "CREATE TABLE tp0 (a int PRIMARY KEY, b int GENERATED ALWAYS AS IDENTITY) PARTITION BY RANGE (a);"; for ((i=1;i<=8;i++)); do echo "CREATE TABLE tp$i PARTITION OF tp$(( $i - 1 )) FOR VALUES FROM ($i) TO (100) PARTITION BY RANGE (a);"; done; echo "ALTER TABLE tp0 ALTER COLUMN b SET GENERATED BY DEFAULT;") | psql >psql.log (with max_locks_per_transaction = 400 in the config) It runs about 15 minutes for me and ends with: Program terminated with signal SIGSEGV, Segmentation fault. 
#0  0x55a8ced20de9 in LWLockAcquire (lock=0x7faec200b900, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1169
1169    {
(gdb) bt
#0  0x55a8ced20de9 in LWLockAcquire (lock=0x7faec200b900, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1169
#1  0x55a8cea0342d in WALInsertLockAcquire () at xlog.c:1389
#2  XLogInsertRecord (rdata=0x55a8cf1ccee8 , fpw_lsn=fpw_lsn@entry=1261347512, flags=0 '\000', num_fpi=num_fpi@entry=0, topxid_included=false) at xlog.c:817
#3  0x55a8cea1396e in XLogInsert (rmid=rmid@entry=11 '\v', info=) at xloginsert.c:524
#4  0x55a8ce9c1541 in _bt_insertonpg (rel=0x7faeb8478c98, heaprel=0x7faecf63d378, itup_key=itup_key@entry=0x55a8d5064678, buf=3210, cbuf=cbuf@entry=0, stack=stack@entry=0x55a8d1063d08, itup=0x55a8d5064658, itemsz=16, newitemoff=, postingoff=0, split_only_page=) at nbtinsert.c:1389
#5  0x55a8ce9bf9a7 in _bt_doinsert (rel=, rel@entry=0x7faeb8478c98, itup=, itup@entry=0x55a8d5064658, checkUnique=, checkUnique@entry=UNIQUE_CHECK_YES, indexUnchanged=, heapRel=, heapRel@entry=0x7faecf63d378) at nbtinsert.c:260
#6  0x55a8ce9c92ad in btinsert (rel=0x7faeb8478c98, values=, isnull=, ht_ctid=0x55a8d50643cc, heapRel=0x7faecf63d378, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=, indexInfo=) at nbtree.c:205
#7  0x55a8cea41391 in CatalogIndexInsert (indstate=indstate@entry=0x55a8d0fc03e8, heapTuple=, heapTuple@entry=0x55a8d50643c8, updateIndexes=) at indexing.c:170
#8  0x55a8cea4172c in CatalogTupleUpdate (heapRel=heapRel@entry=0x7faecf63d378, otid=0x55a8d50643cc, tup=tup@entry=0x55a8d50643c8) at indexing.c:324
#9  0x55a8ceb18173 in ATExecSetIdentity (rel=0x7faeab1288a8, colName=colName@entry=0x55a8d0fbc2b8 "b", def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, recursing=) at tablecmds.c:8307
#10 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab127f28, colName=colName@entry=0x55a8d0fbc2b8 "b", def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, recursing=) at tablecmds.c:8337
#11 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab1275a8, colName=colName@entry=0x55a8d0fbc2b8 "b", def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, recursing=) at tablecmds.c:8337
#12 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab126c28, colName=colName@entry=0x55a8d0fbc2b8 "b", def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, recursing=) at tablecmds.c:8337
...
Functions ATExecAddIdentity() and ATExecDropIdentity() are recursive too, so I think they can be exploited as well. Best regards, Alexander
Re: index prefetching
On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra wrote: > I may be missing something, but it seems fairly self-evident to me an > entry at the beginning of an index page won't get prefetched (assuming > the page-at-a-time thing). Sure, if the first item on the page is also the first item that we need the scan to return (having just descended the tree), then it won't get prefetched under a scheme that sticks with the current page-at-a-time behavior (at least in v1). Just like when the first item that we need the scan to return is from the middle of the page, or more towards the end of the page. It is of course also true that we can't prefetch the next page's first item until we actually visit the next page -- clearly that's suboptimal. Just like we can't prefetch any other, later tuples from the next page (until such time as we have determined for sure that there really will be a next page, and have called _bt_readpage for that next page.) This is why I don't think that the tuples with lower page offset numbers are in any way significant here. The significant part is whether or not you'll actually need to visit more than one leaf page in the first place (plus the penalty from not being able to reorder the work across page boundaries in your initial v1 of prefetching). > If I understand your point about boundary cases / suffix truncation, > that helps us by (a) picking the split in a way to minimize a single key > spanning multiple pages, if possible and (b) increasing the number of > entries that fit onto a single index page. More like it makes the boundaries between leaf pages (i.e. high keys) align with the "natural boundaries of the key space". Simple point queries should practically never require more than a single leaf page access as a result. 
Even somewhat complicated index scans that are reasonably selective (think tens to low hundreds of matches) don't tend to need to read more than a single leaf page match, at least with equality type scan keys for the index qual. > That's certainly true / helpful, and it makes the "first entry" issue > much less common. But the issue is still there. Of course, this says > nothing about the importance of the issue - the impact may easily be so > small it's not worth worrying about. Right. And I want to be clear: I'm really *not* sure how much it matters. I just doubt that it's worth worrying about in v1 -- time grows short. Although I agree that we should commit a v1 that leaves the door open to improving matters in this area in v2. > One case I've been thinking about is sorting using index, where we often > read large part of the index. That definitely seems like a case where reordering work/desynchronization of the heap and index scans might be relatively important. > > I think that there is no question that this will need to not > > completely disable kill_prior_tuple -- I'd be surprised if one single > > person disagreed with me on this point. There is also a more nuanced > > way of describing this same restriction, but we don't necessarily need > > to agree on what exactly that is right now. > > > > Even for the page-at-a-time approach? Or are you talking about the v2? I meant that the current kill_prior_tuple behavior isn't sacred, and can be revised in v2, for the benefit of lifting the restriction on prefetching. But that's going to involve a trade-off of some kind. And not a particularly simple one. > Yeah. The basic idea was that by moving this above index AM it will work > for all indexes automatically - but given the current discussion about > kill_prior_tuple, locking etc. I'm not sure that's really feasible. > > The index AM clearly needs to have more control over this. Cool. I think that that makes the layering question a lot clearer, then. 
-- Peter Geoghegan
Re: Add trim_trailing_whitespace to editorconfig file
On Thu, 15 Feb 2024 at 16:57, Peter Eisentraut wrote: > Is there a command-line tool to verify the syntax of .editorconfig and > check compliance of existing files? > > I'm worried that expanding .editorconfig with detailed per-file rules > will lead to a lot of mistakes and blind editing, if we don't have > verification tooling. I tried this one just now: https://github.com/editorconfig-checker/editorconfig-checker.javascript I fixed all the issues by updating my patchset to use "unset" for insert_final_newline instead of "false". All other files were already clean, which makes sense because the new editorconfig rules are exactly the same as gitattributes (which I'm guessing we are checking in CI/buildfarm). So I don't think it makes sense to introduce another tool to check the same thing again.

Attachments:
v3-0001-Remove-non-existing-file-from-.gitattributes.patch
v3-0002-Require-final-newline-in-.po-files.patch
v3-0003-Bring-editorconfig-in-line-with-gitattributes.patch
v3-0004-Add-note-about-keeping-.editorconfig-and-.gitattr.patch
v3-0005-Add-indent-information-about-gitattributes-to-edi.patch
Re: index prefetching
On 2/15/24 17:42, Peter Geoghegan wrote: > On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra > wrote: >> On 2/15/24 00:06, Peter Geoghegan wrote: >>> I suppose that it might be much more important than I imagine it is >>> right now, but it'd be nice to have something a bit more concrete to >>> go on. >>> >> >> This probably depends on which corner cases are considered important. >> >> The page-at-a-time approach essentially means index items at the >> beginning of the page won't get prefetched (or vice versa, prefetch >> distance drops to 0 when we get to end of index page). > > I don't think that's true. At least not for nbtree scans. > > As I went into last year, you'd get the benefit of the work I've done > on "boundary cases" (most recently in commit c9c0589f from just a > couple of months back), which helps us get the most out of suffix > truncation. This maximizes the chances of only having to scan a single > index leaf page in many important cases. So I can see no reason why > index items at the beginning of the page are at any particular > disadvantage (compared to those from the middle or the end of the > page). > I may be missing something, but it seems fairly self-evident to me an entry at the beginning of an index page won't get prefetched (assuming the page-at-a-time thing). If I understand your point about boundary cases / suffix truncation, that helps us by (a) picking the split in a way to minimize a single key spanning multiple pages, if possible and (b) increasing the number of entries that fit onto a single index page. That's certainly true / helpful, and it makes the "first entry" issue much less common. But the issue is still there. Of course, this says nothing about the importance of the issue - the impact may easily be so small it's not worth worrying about. 
> Where you might have a problem is cases where it's just inherently > necessary to visit more than a single leaf page, despite the best > efforts of the nbtsplitloc.c logic -- cases where the scan just > inherently needs to return tuples that "straddle the boundary between > two neighboring pages". That isn't a particularly natural restriction, > but it's also not obvious that it's all that much of a disadvantage in > practice. > One case I've been thinking about is sorting using index, where we often read large part of the index. >> It certainly was a great improvement, no doubt about that. I dislike the >> restriction, but that's partially for aesthetic reasons - it just seems >> it'd be nice to not have this. >> >> That being said, I'd be OK with having this restriction if it makes v1 >> feasible. For me, the big question is whether it'd mean we're stuck with >> this restriction forever, or whether there's a viable way to improve >> this in v2. > > I think that there is no question that this will need to not > completely disable kill_prior_tuple -- I'd be surprised if one single > person disagreed with me on this point. There is also a more nuanced > way of describing this same restriction, but we don't necessarily need > to agree on what exactly that is right now. > Even for the page-at-a-time approach? Or are you talking about the v2? >> And I don't have answer to that :-( I got completely lost in the ongoing >> discussion about the locking implications (which I happily ignored while >> working on the PoC patch), layering tensions and questions which part >> should be "in control". > > Honestly, I always thought that it made sense to do things on the > index AM side. When you went the other way I was surprised. Perhaps I > should have said more about that, sooner, but I'd already said quite a > bit at that point, so... 
> > Anyway, I think that it's pretty clear that "naive desynchronization" > is just not acceptable, because that'll disable kill_prior_tuple > altogether. So you're going to have to do this in a way that more or > less preserves something like the current kill_prior_tuple behavior. > It's going to have some downsides, but those can be managed. They can > be managed from within the index AM itself, a bit like the > _bt_killitems() no-pin stuff does things already. > > Obviously this interpretation suggests that doing things at the index > AM level is indeed the right way to go, layering-wise. Does it make > sense to you, though? > Yeah. The basic idea was that by moving this above index AM it will work for all indexes automatically - but given the current discussion about kill_prior_tuple, locking etc. I'm not sure that's really feasible. The index AM clearly needs to have more control over this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Synchronizing slots from primary to standby
Hi, On Thu, Feb 15, 2024 at 05:58:47PM +0530, Amit Kapila wrote: > On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot > wrote: > > Also I was thinking: what about adding an output to > > pg_sync_replication_slots()? > > The output could be the number of sync slots that have been created and are > > not considered as sync-ready during the execution. > > > > Yeah, we can consider outputting some information via this function > like how many slots are synced and persisted but not sure what would > be appropriate here. Because one can anyway find that or more > information by querying pg_replication_slots. Right, so maybe just return a bool that would indicate that at least one newly created slot is not sync-ready? (If so, then the details could be found in pg_replication_slots). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
Hi, On Thu, Feb 15, 2024 at 06:13:38PM +0530, Amit Kapila wrote: > On Thu, Feb 15, 2024 at 12:07 PM Zhijie Hou (Fujitsu) > wrote: > > > > Since the slotsync function is committed, I rebased remaining patches. > > And here is the V88 patch set. > > Thanks! > > Please find the improvements in some of the comments in v88_0001* > attached. Kindly include these in next version, if you are okay with > it. Looking at v88_0001, random comments: 1 === Commit message "Be enabling slot synchronization" Typo? s:Be/By 2 === +It enables a physical standby to synchronize logical failover slots +from the primary server so that logical subscribers are not blocked +after failover. Not sure "not blocked" is the right wording. "can be resumed from the new primary" maybe? (was discussed in [1]) 3 === +#define SlotSyncWorkerAllowed()\ + (sync_replication_slots && pmState == PM_HOT_STANDBY && \ +SlotSyncWorkerCanRestart()) Maybe add a comment above the macro explaining the logic? 4 === +#include "replication/walreceiver.h" #include "replication/slotsync.h" should be reverse order? 5 === + if (SlotSyncWorker->syncing) { - SpinLockRelease(&SlotSyncCtx->mutex); + SpinLockRelease(&SlotSyncWorker->mutex); ereport(ERROR, errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot synchronize replication slots concurrently")); } worth to add a test in 040_standby_failover_slots_sync.pl for it? 6 === +static void +slotsync_reread_config(bool restart) +{ worth to add test(s) in 040_standby_failover_slots_sync.pl for it? [1]: https://www.postgresql.org/message-id/CAA4eK1JcBG6TJ3o5iUd4z0BuTbciLV3dK4aKgb7OgrNGoLcfSQ%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: planner chooses incremental but not the best one
On 2/15/24 13:45, Andrei Lepikhov wrote: > On 15/2/2024 18:10, Tomas Vondra wrote: >> >> >> On 2/15/24 07:50, Andrei Lepikhov wrote: >>> On 18/12/2023 19:53, Tomas Vondra wrote: On 12/18/23 11:40, Richard Guo wrote: The challenge is where to get usable information about correlation between columns. I only have a couple very rough ideas of what we might try. For example, if we have multi-column ndistinct statistics, we might look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from ndistinct(b,c,d) / ndistinct(b,c) If we know how many distinct values we have for the predicate column, we could then estimate the number of groups. I mean, we know that for the restriction "WHERE b = 3" we only have 1 distinct value, so we could estimate the number of groups as 1 * ndistinct(b,c) >>> Did you mean here ndistinct(c,d) and the formula: >>> ndistinct(b,c,d) / ndistinct(c,d) ? >> >> Yes, I think that's probably a more correct ... Essentially, the idea is >> to estimate the change in number of distinct groups after adding a >> column (or restricting it in some way). > Thanks, I got it. I just think how to implement such techniques with > extensions just to test the idea in action. In the case of GROUP-BY we > can use path hook, of course. But what if to invent a hook on clauselist > estimation? Maybe. I have thought about introducing such a hook to alter estimation of clauses, so I'm not opposed to it. Ofc, it depends on where the hook would be, what it would be allowed to do etc. And as it doesn't exist yet, it'd be more a "local" improvement to separate the changes into an extension. >>> Do you implicitly bear in mind here the necessity of tracking clauses >>> that were applied to the data up to the moment of grouping? >>> >> >> I don't recall what exactly I considered two months ago when writing the >> message, but I don't see why we would need to track that beyond what we >> already have. 
Shouldn't it be enough for the grouping to simply inspect >> the conditions on the lower levels? > Yes, exactly. I've thought about looking into baserestrictinfos and, if > group-by references a subquery targetlist, into subqueries too. > True. Something like that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: index prefetching
On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra wrote: > On 2/15/24 00:06, Peter Geoghegan wrote: > > I suppose that it might be much more important than I imagine it is > > right now, but it'd be nice to have something a bit more concrete to > > go on. > > > > This probably depends on which corner cases are considered important. > > The page-at-a-time approach essentially means index items at the > beginning of the page won't get prefetched (or vice versa, prefetch > distance drops to 0 when we get to end of index page). I don't think that's true. At least not for nbtree scans. As I went into last year, you'd get the benefit of the work I've done on "boundary cases" (most recently in commit c9c0589f from just a couple of months back), which helps us get the most out of suffix truncation. This maximizes the chances of only having to scan a single index leaf page in many important cases. So I can see no reason why index items at the beginning of the page are at any particular disadvantage (compared to those from the middle or the end of the page). Where you might have a problem is cases where it's just inherently necessary to visit more than a single leaf page, despite the best efforts of the nbtsplitloc.c logic -- cases where the scan just inherently needs to return tuples that "straddle the boundary between two neighboring pages". That isn't a particularly natural restriction, but it's also not obvious that it's all that much of a disadvantage in practice. > It certainly was a great improvement, no doubt about that. I dislike the > restriction, but that's partially for aesthetic reasons - it just seems > it'd be nice to not have this. > > That being said, I'd be OK with having this restriction if it makes v1 > feasible. For me, the big question is whether it'd mean we're stuck with > this restriction forever, or whether there's a viable way to improve > this in v2. 
I think that there is no question that this will need to not completely disable kill_prior_tuple -- I'd be surprised if one single person disagreed with me on this point. There is also a more nuanced way of describing this same restriction, but we don't necessarily need to agree on what exactly that is right now. > And I don't have answer to that :-( I got completely lost in the ongoing > discussion about the locking implications (which I happily ignored while > working on the PoC patch), layering tensions and questions which part > should be "in control". Honestly, I always thought that it made sense to do things on the index AM side. When you went the other way I was surprised. Perhaps I should have said more about that, sooner, but I'd already said quite a bit at that point, so... Anyway, I think that it's pretty clear that "naive desynchronization" is just not acceptable, because that'll disable kill_prior_tuple altogether. So you're going to have to do this in a way that more or less preserves something like the current kill_prior_tuple behavior. It's going to have some downsides, but those can be managed. They can be managed from within the index AM itself, a bit like the _bt_killitems() no-pin stuff does things already. Obviously this interpretation suggests that doing things at the index AM level is indeed the right way to go, layering-wise. Does it make sense to you, though? -- Peter Geoghegan
Re: MAINTAIN privilege -- what do we need to un-revert it?
On Wed, Feb 14, 2024 at 01:02:26PM -0600, Nathan Bossart wrote: > BTW I have been testing reverting commit 151c22d (i.e., un-reverting > MAINTAIN) every month or two, and last I checked, it still applies pretty > cleanly. The only changes I've needed to make are to the catversion and to > a hard-coded version in a test (16 -> 17). Posting to get some cfbot coverage. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From ce4f0c8cc87c2534d46060dc0725783ad6275f21 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 15 Feb 2024 10:02:41 -0600 Subject: [PATCH v1 1/1] Revert "Revert MAINTAIN privilege and pg_maintain predefined role." XXX: NEEDS CATVERSION BUMP This reverts commit 151c22deee66a3390ca9a1c3675e29de54ae73fc. --- doc/src/sgml/ddl.sgml | 35 -- doc/src/sgml/func.sgml| 2 +- .../sgml/ref/alter_default_privileges.sgml| 4 +- doc/src/sgml/ref/analyze.sgml | 6 +- doc/src/sgml/ref/cluster.sgml | 10 +- doc/src/sgml/ref/grant.sgml | 3 +- doc/src/sgml/ref/lock.sgml| 4 +- .../sgml/ref/refresh_materialized_view.sgml | 5 +- doc/src/sgml/ref/reindex.sgml | 23 ++-- doc/src/sgml/ref/revoke.sgml | 2 +- doc/src/sgml/ref/vacuum.sgml | 6 +- doc/src/sgml/user-manag.sgml | 12 ++ src/backend/catalog/aclchk.c | 15 +++ src/backend/commands/analyze.c| 13 +- src/backend/commands/cluster.c| 43 +-- src/backend/commands/indexcmds.c | 34 ++--- src/backend/commands/lockcmds.c | 2 +- src/backend/commands/matview.c| 3 +- src/backend/commands/tablecmds.c | 16 +-- src/backend/commands/vacuum.c | 65 +- src/backend/utils/adt/acl.c | 8 ++ src/bin/pg_dump/dumputils.c | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 2 +- src/bin/psql/tab-complete.c | 6 +- src/include/catalog/pg_authid.dat | 5 + src/include/commands/tablecmds.h | 5 +- src/include/commands/vacuum.h | 5 +- src/include/nodes/parsenodes.h| 3 +- src/include/utils/acl.h | 5 +- .../expected/cluster-conflict-partition.out | 8 +- .../specs/cluster-conflict-partition.spec | 2 +- .../perl/PostgreSQL/Test/AdjustUpgrade.pm | 11 ++ 
src/test/regress/expected/cluster.out | 7 ++ src/test/regress/expected/create_index.out| 4 +- src/test/regress/expected/dependency.out | 22 ++-- src/test/regress/expected/privileges.out | 116 ++ src/test/regress/expected/rowsecurity.out | 34 ++--- src/test/regress/sql/cluster.sql | 5 + src/test/regress/sql/dependency.sql | 2 +- src/test/regress/sql/privileges.sql | 68 ++ 40 files changed, 444 insertions(+), 178 deletions(-) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 9d7e2c756b..8616a8e9cc 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -1868,8 +1868,8 @@ ALTER TABLE products RENAME TO items; INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, CONNECT, TEMPORARY, - EXECUTE, USAGE, SET - and ALTER SYSTEM. + EXECUTE, USAGE, SET, + ALTER SYSTEM, and MAINTAIN. The privileges applicable to a particular object vary depending on the object's type (table, function, etc.). More detail about the meanings of these privileges appears below. @@ -2160,7 +2160,19 @@ REVOKE ALL ON accounts FROM PUBLIC; - + + +MAINTAIN + + + Allows VACUUM, ANALYZE, + CLUSTER, REFRESH MATERIALIZED VIEW, + REINDEX, and LOCK TABLE on a + relation. + + + + The privileges required by other commands are listed on the reference page of the respective command. @@ -2309,6 +2321,11 @@ REVOKE ALL ON accounts FROM PUBLIC; A PARAMETER + + MAINTAIN + m + TABLE + @@ -2399,7 +2416,7 @@ REVOKE ALL ON accounts FROM PUBLIC; TABLE (and table-like objects) - arwdDxt + arwdDxtm none \dp @@ -2465,11 +2482,11 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO miriam_rw; => \dp mytable Access privileges - Schema | Name | Type | Access privileges | Column privileges | Policies -+-+---+---+---+-- - public | mytable | table | miriam=arwdDxt/miriam+| col1:+| -| | | =r/miriam+| miriam_rw=rw/miriam | -| | | admin=arw/miriam | | + Schema | Name | Type |
Re: Add trim_trailing_whitespace to editorconfig file
On 15.02.24 10:26, Jelte Fennema-Nio wrote: On Wed, 14 Feb 2024 at 23:19, Daniel Gustafsson wrote: +1 from me. But when do we want it to be false? That is, why not declare it true for all file types? Regression test .out files commonly have spaces at the end of the line. (Not to mention the ECPG .c files but they probably really shouldn't have.) Attached is v2, which now makes the rules between gitattributes and editorconfig completely identical. As well as improving two minor things about .gitattributes before doing that. Is there a command-line tool to verify the syntax of .editorconfig and check compliance of existing files? I'm worried that expanding .editorconfig with detailed per-file rules will lead to a lot of mistakes and blind editing, if we don't have verification tooling.
Re: Replace current implementations in crypt() and gen_salt() to OpenSSL
On 15.02.24 13:42, Koshi Shibagaki (Fujitsu) wrote: However, crypt() and gen_salt() do not use OpenSSL as mentioned in [2]. Therefore, if we run crypt() and gen_salt() on a machine with FIPS mode enabled, they are not affected by FIPS mode. This means we can use encryption algorithms disallowed in FIPS. I would like to change the proprietary implementations of crypt() and gen_salt() to use OpenSSL API. If it's not a problem, I am going to create a patch, but if you have a better approach, please let me know. The problems are: 1. All the block ciphers currently supported by crypt() and gen_salt() are not FIPS-compliant. 2. The crypt() and gen_salt() methods built on top of them (modes of operation, kind of) are not FIPS-compliant. 3. The implementations (crypt-blowfish.c, crypt-des.c, etc.) are not structured in a way that OpenSSL calls can easily be patched in. So if you want FIPS-compliant cryptography, these interfaces look like a dead end. I don't know if there are any modern equivalents of these functions that we should be supplying instead.
Re: Allow passing extra options to initdb for tests
Daniel Gustafsson writes: > On 15 Feb 2024, at 11:38, Peter Eisentraut wrote: >> We don't have a man page for pg_regress, so there is no place to >> comprehensively document all the options and their interactions. > This comes up every now and again, just yesterday there was a question on > -general [0] about alternate output files which also are > undocumented. Really? https://www.postgresql.org/docs/devel/regress-variant.html > Maybe it's time to add documentation for pg_regress? I'm inclined to think that a formal man page wouldn't be that useful, since nobody ever invokes pg_regress directly; as Peter says, what is of interest is the "make check" targets and the corresponding meson behaviors. I think 32.1 is already reasonably thorough about the make targets; but the complete lack of equivalent info about what to do in a meson build clearly needs to be rectified. regards, tom lane
Re: Reducing output size of nodeToString
On Thu, 15 Feb 2024 at 13:59, Peter Eisentraut wrote: > > Thanks, this patch set is a good way to incrementally work through these > changes. > > I have looked at > v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch today. > Here are my thoughts: > > I believe we had discussed offline to not omit enum fields with value 0 > (WRITE_ENUM_FIELD). This is because the values of enum fields are > implementation artifacts, and this could be confusing for readers. Thanks for reminding me, I didn't remember this when I worked on updating the patchset. I'll update this soon. > I have some concerns about the round-trippability of float values. If > we do, effectively, if (node->fldname != 0.0), then I think this would > also match negative zero, but when we read it back, it would get > assigned positive zero. Maybe there are other edge cases like this. > Might be safer to not mess with this. That's a good point. Would an additional check that the sign of the field equals the default's sign be enough for this? As for other cases, I'm not sure we currently want to support non-normal floats, even if it is technically possible to do the round-trip in the current format. > On the reading side, the macro nesting has gotten a bit out of hand. :) > We had talked earlier in the thread about the _DIRECT macros and you > said they were left over from something else you want to try, but I see > nothing else in this patch set uses this. I think this could all be > much simpler, like (omitting required punctuation) [...] > Not only is this simpler, but it might also have better performance, > because we don't have separate pg_strtok_next() and pg_strtok() calls in > sequence. Good points. I'll see what I can do here. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: index prefetching
On 2/15/24 00:06, Peter Geoghegan wrote: > On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman > wrote: > >> ... > > 2. Are you sure that the leaf-page-at-a-time thing is such a huge > hindrance to effective prefetching? > > I suppose that it might be much more important than I imagine it is > right now, but it'd be nice to have something a bit more concrete to > go on. > This probably depends on which corner cases are considered important. The page-at-a-time approach essentially means index items at the beginning of the page won't get prefetched (or vice versa, prefetch distance drops to 0 when we get to end of index page). That may be acceptable, considering we can usually fit 200+ index items on a single page. Even then it limits what effective_io_concurrency values are sensible, but in my experience the benefits quickly diminish past ~32. > 3. Even if it is somewhat important, do you really need to get that > part working in v1? > > Tomas' original prototype worked with the leaf-page-at-a-time thing, > and that still seemed like a big improvement to me. While being less > invasive, in effect. If we can agree that something like that > represents a useful step in the right direction (not an evolutionary > dead end), then we can make good incremental progress within a single > release. > It certainly was a great improvement, no doubt about that. I dislike the restriction, but that's partially for aesthetic reasons - it just seems it'd be nice to not have this. That being said, I'd be OK with having this restriction if it makes v1 feasible. For me, the big question is whether it'd mean we're stuck with this restriction forever, or whether there's a viable way to improve this in v2. And I don't have answer to that :-( I got completely lost in the ongoing discussion about the locking implications (which I happily ignored while working on the PoC patch), layering tensions and questions which part should be "in control". 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: When extended query protocol ends?
On Wed, 14 Feb 2024 at 17:55, Tatsuo Ishii wrote: > >>> From [1] I think the JDBC driver sends something like below if > >>> autosave=always option is specified. > >>> > >>> "BEGIN READ ONLY" Parse/Bind/Execute (in the extended query protocol) > >>> "SAVEPOINT PGJDBC_AUTOSAVE" (in the simple query protocol) > >>> > >>> It seems the SAVEPOINT is sent without finishing the extended query > >>> protocol (i.e. without Sync message). Is it possible for the JDBC > >>> driver to issue a Sync message before sending SAVEPOINT in simple > >>> query protocol? Or you can send SAVEPOINT using the extended query > >>> protocol. > >>> > >>> [1] > >>> > https://www.pgpool.net/pipermail/pgpool-general/2023-December/009051.html > >> > >> > >> Can you ask the OP what version of the driver they are using. From what I > >> can tell we send BEGIN using SimpleQuery. > > > > Sure. I will get back once I get the JDBC version. > > Here it is: > > JDBC driver version used: 42.5.1 Regards, Karel. > Can you ask the OP what they are doing in the startup. I'm trying to replicate their situation. Looks like possibly 'setReadOnly' and 'select version()' Thanks, Dave
Re: Reducing output size of nodeToString
Thanks, this patch set is a good way to incrementally work through these changes. I have looked at v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch today. Here are my thoughts: I believe we had discussed offline to not omit enum fields with value 0 (WRITE_ENUM_FIELD). This is because the values of enum fields are implementation artifacts, and this could be confusing for readers. (This could be added as a squeeze-out-every-byte change later, but if we're going to keep the format fit for human reading, I think we should skip this.) I have some concerns about the round-trippability of float values. If we do, effectively, if (node->fldname != 0.0), then I think this would also match negative zero, but when we read it back, it would get assigned positive zero. Maybe there are other edge cases like this. Might be safer to not mess with this. On the reading side, the macro nesting has gotten a bit out of hand. :) We had talked earlier in the thread about the _DIRECT macros and you said they were left over from something else you want to try, but I see nothing else in this patch set uses this. I think this could all be much simpler, like (omitting required punctuation) #define READ_INT_FIELD(fldname, default) if ((token = next_field(fldname, &length))) local_node->fldname = atoi(token); else local_node->fldname = default; where next_field() would 1. read the next token 2. if it is ":fldname", continue; else rewind the read pointer and return NULL 3. read the next token and return that Not only is this simpler, but it might also have better performance, because we don't have separate pg_strtok_next() and pg_strtok() calls in sequence.
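For illustration, the proposed next_field() control flow could be sketched like this (Python pseudocode standing in for the C tokenizer; the class, names, and rewind mechanics are simplified assumptions, not pg_strtok itself):

```python
# Illustrative sketch of the proposed next_field() control flow
# (not PostgreSQL's actual pg_strtok machinery).

class Reader:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def next_field(self, fldname):
        """Return the value token if the next token is ':fldname';
        else rewind the read pointer and return None so the caller
        falls back to the default."""
        start = self.pos
        if self.pos < len(self.tokens) and self.tokens[self.pos] == ':' + fldname:
            self.pos += 1
            value = self.tokens[self.pos]
            self.pos += 1
            return value
        self.pos = start  # rewind: field was omitted (had its default value)
        return None

    def read_int_field(self, fldname, default):
        token = self.next_field(fldname)
        return int(token) if token is not None else default

# ":b" was omitted from the serialized form, so it takes its default:
r = Reader([':a', '42', ':c', '7'])
assert r.read_int_field('a', 0) == 42
assert r.read_int_field('b', -1) == -1
assert r.read_int_field('c', 0) == 7
```

The single peek-match-or-rewind step is what replaces the nested pg_strtok_next()/pg_strtok() call pairs in the suggestion above.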
Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions
> On 15 Feb 2024, at 04:58, David Benjamin wrote: > > By the way, I'm unable to add the patch to the next commitfest due to the > cool off period for new accounts. How long is that period? I don't suppose > there's a way to avoid it? There is a way to expedite the cooling-off period (it's a SPAM prevention measure), but I don't have access to it. In the meantime I've added the patch for you, and once the cooling off is over we can add your name as author. https://commitfest.postgresql.org/47/4835/ I'm currently reviewing it and will get back to you soon on that. -- Daniel Gustafsson
Re: planner chooses incremental but not the best one
On 15/2/2024 18:10, Tomas Vondra wrote: On 2/15/24 07:50, Andrei Lepikhov wrote: On 18/12/2023 19:53, Tomas Vondra wrote: On 12/18/23 11:40, Richard Guo wrote: The challenge is where to get usable information about correlation between columns. I only have a couple of very rough ideas of what we might try. For example, if we have multi-column ndistinct statistics, we might look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from ndistinct(b,c,d) / ndistinct(b,c) If we know how many distinct values we have for the predicate column, we could then estimate the number of groups. I mean, we know that for the restriction "WHERE b = 3" we only have 1 distinct value, so we could estimate the number of groups as 1 * ndistinct(b,c) Did you mean here ndistinct(c,d) and the formula: ndistinct(b,c,d) / ndistinct(c,d) ? Yes, I think that's probably a more correct ... Essentially, the idea is to estimate the change in number of distinct groups after adding a column (or restricting it in some way). Thanks, I got it. I'm just thinking about how to implement such techniques in an extension, just to test the idea in action. In the case of GROUP BY we can use a path hook, of course. But what about inventing a hook on clauselist estimation? Do you implicitly bear in mind here the necessity of tracking clauses that were applied to the data up to the moment of grouping? I don't recall what exactly I considered two months ago when writing the message, but I don't see why we would need to track that beyond what we already have. Shouldn't it be enough for the grouping to simply inspect the conditions on the lower levels? Yes, exactly. I've thought about looking into baserestrictinfos and, if group-by references a subquery targetlist, into subqueries too. -- regards, Andrei Lepikhov Postgres Professional
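To spell out the arithmetic behind the idea (a hedged sketch with made-up statistics, not planner code):

```python
# Illustrative sketch of the ndistinct-ratio estimate discussed above.
# All numbers are hypothetical extended-statistics values, not real stats.

def estimate_group_multiplier(ndistinct_full, ndistinct_subset):
    """How much adding the extra column(s) multiplies the distinct-group
    count, e.g. ndistinct(b,c,d) / ndistinct(c,d)."""
    return ndistinct_full / ndistinct_subset

nd_cd = 50    # hypothetical ndistinct(c, d)
nd_bcd = 200  # hypothetical ndistinct(b, c, d)

# Without any restriction, adding b multiplies the group count by ~4:
assert estimate_group_multiplier(nd_bcd, nd_cd) == 4.0

# Under "WHERE b = 3" only one distinct b value survives, so grouping by
# (b, c, d) is estimated to produce about 1 * ndistinct(c, d) groups:
groups_with_restriction = 1 * nd_cd
assert groups_with_restriction == 50
```

The key point is that the ratio measures the marginal contribution of the restricted column, which is exactly the quantity the restriction collapses to 1.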
Re: Synchronizing slots from primary to standby
On Thu, Feb 15, 2024 at 12:07 PM Zhijie Hou (Fujitsu) wrote: > > Since the slotsync function is committed, I rebased remaining patches. > And here is the V88 patch set. > Please find the improvements in some of the comments in v88_0001* attached. Kindly include these in next version, if you are okay with it. -- With Regards, Amit Kapila. diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 68f48c052c..6173143bcf 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1469,16 +1469,16 @@ FinishWalRecovery(void) XLogShutdownWalRcv(); /* -* Shutdown the slot sync workers to prevent potential conflicts between -* user processes and slotsync workers after a promotion. +* Shutdown the slot sync worker to prevent it from keep trying to fetch +* slots and drop any temporary slots. * * We do not update the 'synced' column from true to false here, as any * failed update could leave 'synced' column false for some slots. This * could cause issues during slot sync after restarting the server as a -* standby. While updating after switching to the new timeline is an -* option, it does not simplify the handling for 'synced' column. -* Therefore, we retain the 'synced' column as true after promotion as it -* may provide useful information about the slot origin. +* standby. While updating the 'synced' column after switching to the new +* timeline is an option, it does not simplify the handling for the +* 'synced' column. Therefore, we retain the 'synced' column as true after +* promotion as it may provide useful information about the slot origin. */ ShutDownSlotSync(); diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index b0334ce449..011b5f4828 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -168,7 +168,7 @@ * they will never become live backends. dead_end children are not assigned a * PMChildSlot. 
dead_end children have bkend_type NORMAL. * - * "Special" children such as the startup, bgwriter, autovacuum launcher and + * "Special" children such as the startup, bgwriter, autovacuum launcher, and * slot sync worker tasks are not in this list. They are tracked via StartupPID * and other pid_t variables below. (Thus, there can't be more than one of any * given "special" child process type. We use BackendList entries for any @@ -3415,7 +3415,7 @@ CleanupBackend(int pid, /* * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer, - * walwriter, autovacuum, archiver, slot sync worker or background worker. + * walwriter, autovacuum, archiver, slot sync worker, or background worker. * * The objectives here are to clean up our local state about the child * process, and to signal all other remaining children to quickdie. diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index e40cdbe4d9..04271ee703 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -6,8 +6,8 @@ * loaded as a dynamic module to avoid linking the main server binary with * libpq. * - * Apart from walreceiver, the libpq-specific routines here are now being used - * by logical replication workers and slot sync worker as well. + * Apart from walreceiver, the libpq-specific routines are now being used by + * logical replication workers and slot synchronization. * * Portions Copyright (c) 2010-2024, PostgreSQL Global Development Group * diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index 6492582156..56eb0ea11b 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -22,10 +22,10 @@ * which we can call pg_sync_replication_slots() periodically to perform * syncs. 
* - * The slot sync worker worker waits for a period of time before the next - * synchronization, with the duration varying based on whether any slots were - * updated during the last cycle. Refer to the comments above - * wait_for_slot_activity() for more details. + * The slot sync worker waits for some time before the next synchronization, + * with the duration varying based on whether any slots were updated during + * the last cycle. Refer to the comments above wait_for_slot_activity() for + * more details. * * Any standby synchronized slots will be dropped if they no longer need * to be synchronized. See comment atop drop_local_obsolete_slots() for more @@ -1449,9 +1449,8 @@ ShutDownSlotSync(void) /* * SlotSyncWorkerCanRestart * - * Returns true if the worker is allowed to restart if enough time has -
Replace current implementations in crypt() and gen_salt() to OpenSSL
Hi, this is Shibagaki. When FIPS mode is enabled, some encryption algorithms cannot be used. Since PostgreSQL 15, pgcrypto requires OpenSSL[1]; digest() and other functions also follow this policy. However, crypt() and gen_salt() do not use OpenSSL, as mentioned in [2]. Therefore, if we run crypt() and gen_salt() on a machine with FIPS mode enabled, they are not affected by FIPS mode. This means we can use encryption algorithms disallowed in FIPS. I would like to change the proprietary implementations of crypt() and gen_salt() to use the OpenSSL API. If it's not a problem, I am going to create a patch, but if you have a better approach, please let me know. Thank you [1] https://github.com/postgres/postgres/commit/db7d1a7b0530e8cbd045744e1c75b0e63fb6916f [2] https://peter.eisentraut.org/blog/2023/12/05/postgresql-and-fips-mode crypt() and gen_salt() are performed in the example below. / -- OS RHEL8.6 $openssl version OpenSSL 1.1.1k FIPS 25 Mar 2021 $fips-mode-setup --check FIPS mode is enabled. $./pgsql17/bin/psql psql (17devel) Type "help" for help. postgres=# SHOW server_version; server_version 17devel (1 row) postgres=# SELECT digest('data','md5'); ERROR: Cannot use "md5": Cipher cannot be initialized postgres=# SELECT crypt('new password',gen_salt('md5')); -- md5 is not available when fips mode is turned on. This is a normal behavior ERROR: crypt(3) returned NULL postgres=# SELECT crypt('new password',gen_salt('des')); -- however, des is available. This may break a FIPS rule crypt --- 32REGk7H6dSnE (1 row) / FYI - OpenSSL itself cannot use the DES algorithm while encrypting files. This is an expected behavior. --- Fujitsu Limited Shibagaki Koshi shibagaki.ko...@fujitsu.com
Re: Synchronizing slots from primary to standby
On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot wrote: > > On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote: > > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) > > wrote: > > > Attach the v2 patch here. > > > > > > Apart from the new log message. I think we can add one more debug message > > > in > > > reserve_wal_for_local_slot, this could be useful to analyze the failure. > > > > Yeah, that can also be helpful, but the added message looks naive to me. > > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno); > > > > Instead of the above, how about something like: "segno: %ld of > > purposed restart_lsn for the synced slot, oldest_segno: %ld > > available"? > > Looks good to me. I'm not sure if it would make more sense to elog only if > segno < oldest_segno means just before the XLogSegNoOffsetToRecPtr() call? > > But I'm fine with the proposed location too. > I am also fine either way but the current location gives required information in more number of cases and could be helpful in debugging this new facility. > > > > > And we > > > can also enable the DEBUG log in the 040 tap-test, I see we have similar > > > setting in 010_logical_decoding_timline and logging debug1 message doesn't > > > increase noticable time on my machine. These are done in 0002. > > > > > > > I haven't tested it but I think this can help in debugging BF > > failures, if any. I am not sure if to keep it always like that but > > till the time these tests are stabilized, this sounds like a good > > idea. So, how, about just making test changes as a separate patch so > > that later if required we can revert/remove it easily? Bertrand, do > > you have any thoughts on this? > > +1 on having DEBUG log in the 040 tap-test until it's stabilized (I think we > took the same approach for 035_standby_logical_decoding.pl IIRC) and then > revert > it back. > Good to know! > Also I was thinking: what about adding an output to > pg_sync_replication_slots()? 
> The output could be the number of sync slots that have been created and are > not considered as sync-ready during the execution. > Yeah, we can consider outputting some information via this function like how many slots are synced and persisted but not sure what would be appropriate here. Because one can anyway find that or more information by querying pg_replication_slots. I think we can keep discussing what makes more sense as a return value but let's commit the debug/log patches as they will be helpful to analyze BF failures or any other issues reported. -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
Hi, On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote: > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) > wrote: > > Attach the v2 patch here. > > > > Apart from the new log message. I think we can add one more debug message in > > reserve_wal_for_local_slot, this could be useful to analyze the failure. > > Yeah, that can also be helpful, but the added message looks naive to me. > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno); > > Instead of the above, how about something like: "segno: %ld of > purposed restart_lsn for the synced slot, oldest_segno: %ld > available"? Looks good to me. I'm not sure if it would make more sense to elog only if segno < oldest_segno means just before the XLogSegNoOffsetToRecPtr() call? But I'm fine with the proposed location too. > > > And we > > can also enable the DEBUG log in the 040 tap-test, I see we have similar > > setting in 010_logical_decoding_timline and logging debug1 message doesn't > > increase noticable time on my machine. These are done in 0002. > > > > I haven't tested it but I think this can help in debugging BF > failures, if any. I am not sure if to keep it always like that but > till the time these tests are stabilized, this sounds like a good > idea. So, how, about just making test changes as a separate patch so > that later if required we can revert/remove it easily? Bertrand, do > you have any thoughts on this? +1 on having DEBUG log in the 040 tap-test until it's stabilized (I think we took the same approach for 035_standby_logical_decoding.pl IIRC) and then revert it back. Also I was thinking: what about adding an output to pg_sync_replication_slots()? The output could be the number of sync slots that have been created and are not considered as sync-ready during the execution. I think that could be a good addition to v2-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch proposed here (should trigger special attention in case of non zero value). 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Add system identifier to backup manifest
On Thu, Feb 15, 2024 at 3:05 PM Amul Sul wrote: > Kindly have a look at the attached version. IMHO, 0001 looks fine, except probably the comment could be phrased a bit more nicely. That can be left for whoever commits this to wordsmith. Michael, what are your plans? 0002 seems like a reasonable approach, but there's a hunk in the wrong patch: 0004 modifies pg_combinebackup's check_control_files to use get_dir_controlfile() rather than get_controlfile(), but it looks to me like that should be part of 0002. -- Robert Haas EDB: http://www.enterprisedb.com
Re: make add_paths_to_append_rel aware of startup cost
On Thu, 15 Feb 2024 at 21:42, Andy Fan wrote: > I found that both plans have the same cost. I can't pin down the exact > cause after some hours of research, but it is pretty similar to > 7516056c584e3, so I used a similar strategy to stabilize it. Is it > acceptable? It's pretty hard to say. I can only guess why this test would be flapping like this. I see it's happened before on mylodon, so probably not a cosmic ray. It's not like add_path() chooses a random path when the costs are the same, so I wondered if something similar to what led to f03a9ca4 is going on here. In particular, see [1]. On master, I set a breakpoint in try_nestloop_path() to break on "outerrel->relid==1 && innerrel->relid==2". I see the total Nested Loop cost comes out the same with the join order reversed. Which is: -> Nested Loop (cost=0.00..1500915.00 rows=1 width=4) Doing the same with your patch applied, I get: -> Nested Loop (cost=0.00..600925.00 rows=4000 width=4) and forcing the join order to swap with the debugger, I see: -> Nested Loop (cost=0.00..600940.00 rows=4000 width=4) So there's a difference now, but it's quite small. If it was a problem like we had on [1], then since tenk1 and tenk2 have 345 pages (on my machine), if relpages is down 1 or 2 pages, we'll likely get more of a costing difference than 600925 vs 600940. If I use: explain select t1.unique1 from tenk1 t1 inner join tenk2 t2 on t1.tenthous = t2.tenthous and t2.thousand = 0 union all (values(1)) limit 1; I get: -> Nested Loop (cost=0.00..2415.03 rows=10 width=4) and with the join order reversed, I get: -> Nested Loop (cost=0.00..2440.00 rows=10 width=4) I'd be happier using this one as, percentage-wise, the cost difference is much larger. I don't quite have the will to go through proving what the actual problem is here. I think [1] already proved the relpages problem can (or could) happen.
I checked that the t2.thousand = 0 query still tests the cheap startup paths in add_paths_to_append_rel() and it does. If I flip startup_subpaths_valid to false in the debugger, the plan flips to: QUERY PLAN --- Limit (cost=470.12..514.00 rows=1 width=4) -> Append (cost=470.12..952.79 rows=11 width=4) -> Hash Join (cost=470.12..952.73 rows=10 width=4) Hash Cond: (t1.tenthous = t2.tenthous) -> Seq Scan on tenk1 t1 (cost=0.00..445.00 rows=1 width=8) -> Hash (cost=470.00..470.00 rows=10 width=4) -> Seq Scan on tenk2 t2 (cost=0.00..470.00 rows=10 width=4) Filter: (thousand = 0) -> Result (cost=0.00..0.01 rows=1 width=4) So, if nobody has any better ideas, I'm just going to push the " and t2.thousand = 0" adjustment. David [1] https://www.postgresql.org/message-id/4174.1563239552%40sss.pgh.pa.us
Re: Memory consumed by paths during partitionwise join planning
On Thu, Feb 15, 2024 at 9:41 AM Andrei Lepikhov wrote: > > On 6/2/2024 19:51, Ashutosh Bapat wrote: > > On Fri, Dec 15, 2023 at 5:22 AM Ashutosh Bapat > > The patches are raw. make check has some crashes that I need to fix. I > > am waiting to hear whether this is useful and whether the design is on > > the right track. > Let me write words of opinion on that feature. > I generally like the idea of a refcount field. We had problems > generating alternative paths in extensions and designing the Asymmetric > Join feature [1] when we proposed an alternative path one level below > and called the add_path() routine. We had lost the path in the path list > referenced from the upper RelOptInfo. This approach allows us to make > more handy solutions instead of hacking with a copy/restore pathlist. Thanks for your comments. I am glad that you find this useful. > But I'm not sure about freeing unreferenced paths. I would have to see > alternatives in the pathlist. I didn't understand this. Can you please elaborate? A path in any pathlist is referenced. An unreferenced path should not be in any pathlist. > About partitioning. As I discovered planning issues connected to > partitions, the painful problem is a rule, according to which we are > trying to use all nomenclature of possible paths for each partition. > With indexes, it quickly increases optimization work. IMO, this can help > a 'symmetrical' approach, which could restrict the scope of possible > pathways for upcoming partitions if we filter some paths in a set of > previously planned partitions. filter or free? > Also, I am glad to see a positive opinion about the path_walker() > routine. Somewhere else, for example, in [2], it seems we need it too. > I agree that we should introduce path_walker() yes. I am waiting for David's opinion about the performance numbers shared in [1] before working further on the patches and bring them out of WIP state. 
[1] postgr.es/m/CAExHW5uDkMQL8SicV3_=aycswwmtnr8gzjo310j7sv7tmlo...@mail.gmail.com -- Best Wishes, Ashutosh Bapat
Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions
By the way, I'm unable to add the patch to the next commitfest due to the cool off period for new accounts. How long is that period? I don't suppose there's a way to avoid it? On Mon, Feb 12, 2024 at 11:31 AM David Benjamin wrote: > On Mon, Feb 12, 2024 at 9:38 AM Daniel Gustafsson wrote: > >> > On 11 Feb 2024, at 19:19, David Benjamin wrote: >> > It turns out the parts that came from the OpenSSL socket BIO were a >> no-op, and in fact PostgreSQL is relying on it being a no-op. Instead, it's >> cleaner to just define a custom BIO the normal way, which then leaves the >> more standard BIO_get_data mechanism usable. This also avoids the risk that >> a future OpenSSL will add a new BIO_ctrl to the socket type, with libssl >> calling into it, and then break some assumptions made by PostgreSQL. >> >> + case BIO_CTRL_FLUSH: >> + /* libssl expects all BIOs to support BIO_flush. >> */ >> + res = 1; >> + break; >> >> Will this always be true? Returning 1 implies that we have flushed all >> data on >> the socket, but what if we just before called BIO_set_retry..XX()? >> > > The real one is also just a no-op. :-) > https://github.com/openssl/openssl/blob/master/crypto/bio/bss_sock.c#L215 > > This is used in places like buffer BIO or the FILE* BIO, where BIO_write > might accept data, but stage it into a buffer, to be flushed later. libssl > ends up calling BIO_flush at the end of every flight, which in turn means > that BIOs used with libssl need to support it, even if to return true > because there's nothing to flush. (Arguably TCP sockets could have used a > flush concept, to help control Nagle's algorithm, but for better or worse, > that's a socket-wide TCP_NODELAY option, rather than an explicit flush > call.) > > BIO_set_retry.. behaves like POSIX I/O, where a failed EWOULDBLOCK write > is as if you never wrote to the socket at all and doesn't impact > socket state. That is, the data hasn't been accepted yet.
It's not expected > for BIO_flush to care about the rejected write data. (Also I don't believe > libssl will ever trigger this case.) It's confusing because unlike an > EWOULDBLOCK errno, BIO_set_retry.. *is* itself BIO state, but that's just > because the BIO calling convention is goofy and didn't just return the > error out of the return value. So OpenSSL just stashes the bit on the BIO > itself, for you to query out immediately afterwards. > > >> > I've attached a patch which does that. The existing SSL tests pass with >> it, tested on Debian stable. (Though it took me a few iterations to figure >> out how to run the SSL tests, so it's possible I've missed something.) >> >> We've done a fair bit of work on making them easier to run, so I'm >> curious if >> you saw any room for improvements there as someone coming to them for the >> first >> time? >> > > A lot of my time was just trying to figure out how to run the tests in > the first place, so perhaps documentation? But I may just have been looking > in the wrong spot and honestly didn't really know what I was doing. I can > try to summarize what I did (from memory), and perhaps that can point to > possible improvements? > > - I looked in the repository for instructions on running the tests and > couldn't find any. At this point, I hadn't found src/test/README. > - I found > https://wiki.postgresql.org/wiki/Developer_FAQ#How_do_I_test_my_changes.3F, > linked from https://www.postgresql.org/developer/ > - I ran the configure build with --enable-cassert, ran make check, tests > passed. 
> - I wrote my patch and then spent a while intentionally adding bugs to see > if the tests would catch it (I wasn't sure whether there was ssl test > coverage), finally concluding that I wasn't running any ssl tests > - I looked some more and found src/test/ssl/README > - I reconfigured with --enable-tap-tests and ran make check > PG_TEST_EXTRA=ssl per those instructions, but the SSL tests still weren't > running > - I grepped for PG_TEST_EXTRA and found references in the CI config, but > using the meson build > - I installed meson, mimicked a few commands from the CI. That seemed to > work. > - I tried running only the ssl tests, looking up how you specify > individual tests in meson, to make my compile/test cycles a bit faster, but > they failed. > - I noticed that the first couple "tests" were named like setup tasks, and > guessed that the ssl tests depended on this setup to run. But by then I > just gave up and waited out the whole test suite per run. :-) > > Once I got it running, it was quite smooth. I just wasn't sure how to do > it. > > David >
Re: Synchronizing slots from primary to standby
On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) wrote: > > On Thursday, February 15, 2024 5:20 PM Amit Kapila > wrote: > > On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > On Thursday, February 15, 2024 10:49 AM Amit Kapila > > wrote: > > > > > > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > > > > > > > > Right, we can do that or probably this test would have made more > > > > sense with a worker patch where we could wait for the slot to be synced. > > > > Anyway, let's try to recreate the slot/subscription idea. BTW, do > > > > you think that adding a LOG when we are not able to sync will help > > > > in debugging such problems? I think eventually we can change it to > > > > DEBUG1 but for now, it can help with stabilizing BF and or some other > > reported issues. > > > > > > Here is the patch that attempts the re-create sub idea. > > > > > > > Pushed this. > > > > > > > I also think that a LOG/DEBUG > > > would be useful for such analysis, so the 0002 is to add such a log. > > > > > > > I feel such a LOG would be useful. > > > > + ereport(LOG, > > + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" > > +" (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)", > > > > I think waiting is a bit misleading here, how about something like: > > "could not sync slot information as remote slot precedes local slot: > > remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X), > > catalog xmin (%u)" > > Changed. > > Attach the v2 patch here. > > Apart from the new log message. I think we can add one more debug message in > reserve_wal_for_local_slot, this could be useful to analyze the failure. Yeah, that can also be helpful, but the added message looks naive to me. + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno); Instead of the above, how about something like: "segno: %ld of purposed restart_lsn for the synced slot, oldest_segno: %ld available"? 
> And we > can also enable the DEBUG log in the 040 tap-test, I see we have similar > setting in 010_logical_decoding_timline and logging debug1 message doesn't > increase noticable time on my machine. These are done in 0002. > I haven't tested it but I think this can help in debugging BF failures, if any. I am not sure if we should keep it like that permanently, but until these tests are stabilized, this sounds like a good idea. So, how about just making test changes as a separate patch so that later, if required, we can revert/remove it easily? Bertrand, do you have any thoughts on this? -- With Regards, Amit Kapila.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada wrote: > > On Sat, Feb 10, 2024 at 9:29 PM John Naylor wrote: > I've also run the same scripts in my environment just in case and got > similar results: Thanks for testing, looks good as well. > > There are still some micro-benchmarks we could do on tidstore, and > > it'd be good to find out worse-case memory use (1 dead tuple each on > > spread-out pages), but this is decent demonstration. > > I've tested a simple case where vacuum removes 33k dead tuples spread > about every 10 pages. > > master: > 198,000 bytes (=33000 * 6) > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s > > v-59: > 2,834,432 bytes (reported by TidStoreMemoryUsage()) > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s The memory usage for the sparse case may be a concern, although it's not bad -- a multiple of something small is probably not huge in practice. See below for an option we have for this. > > > > I'm pretty sure there's an > > > > accidental memset call that crept in there, but I'm running out of > > > > steam today. > > > > I have just a little bit of work to add for v59: > > > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero > > any bitmapwords. That can only happen if e.g. there is an offset > 128 > > and there are none between 64 and 128, so not a huge deal but I think > > it's a bit nicer in this patch. > > LGTM. Okay, I've squashed this. > I've drafted the commit message. Thanks, this is a good start. > I've run regression tests with valgrind and run the coverity scan, and > I don't see critical issues. Great! Now, I think we're in pretty good shape. There are a couple of things that might be objectionable, so I want to try to improve them in the little time we have: 1. Memory use for the sparse case. I shared an idea a few months ago of how runtime-embeddable values (true combined pointer-value slots) could work for tids. 
I don't think this is a must-have, but it's not a lot of code, and I have this working: v61-0006: Preparatory refactoring -- I think we should do this anyway, since the intent seems more clear to me. v61-0007: Runtime-embeddable tids -- Optional for v17, but should reduce memory regressions, so should be considered. Up to 3 tids can be stored in the last level child pointer. It's not polished, but I'll only proceed with that if we think we need this. "flags" is called that because it could hold tidbitmap.c booleans (recheck, lossy) in the future, in addition to reserving space for the pointer tag. Note: I hacked the tests to only have 2 offsets per block to demo, but of course both paths should be tested. 2. Management of memory contexts. It's pretty verbose and messy. I think the abstraction could be better: A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we can't destroy or reset it. That means we have to do a lot of manual work. B: Passing "max_bytes" to the radix tree was my idea, I believe, but it seems the wrong responsibility. Not all uses will have a work_mem-type limit, I'm guessing. We only use it for limiting the max block size, and aset's default 8MB is already plenty small for vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so smaller, and there it makes sense to limit the max blocksize this way. C: The context for values has complex #ifdefs based on the value length/varlen, but it's both too much and not enough. If we get a bump context, how would we shoehorn that in for values for vacuum but not for tidbitmap? Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to TidStoreCreate(), and then to RT_CREATE. That context will contain the values (for local mem), and the node slabs will be children of the value context. That way, measuring memory usage and freeing can just call with this parent context, and let recursion handle the rest.
Perhaps the passed context can also hold the radix-tree struct, but I'm not sure since I haven't tried it. What do you think? With this resolved, I think the radix tree is pretty close to committable. The tid store will likely need some polish yet, but no major issues I know of. (And, finally, a small thing that I wanted to share just so I don't forget, but maybe not worth the attention: In Andres's prototype, there is a comment wondering if an update can skip checking if it first needs to create a root node. This is pretty easy, and done in v61-0008.) v61-ART.tar.gz Description: application/gzip
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Hi, On Thu, Feb 15, 2024 at 02:09:45PM +0900, Michael Paquier wrote: > On Thu, Jan 18, 2024 at 02:20:28PM +, Bertrand Drouvot wrote: > > On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote: > >> I'm not sure if it > >> can happen at all, but I think we can rely on previous conflict reason > >> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn. > > > > I'm not sure what you mean here. I think we should still keep the "initial" > > LSN > > so that the next check done with it still makes sense. The previous conflict > > reason as you're proposing also makes sense to me but for another reason: > > PANIC > > in case the issue still happens (for cases we did not think about, meaning not > > covered by what the added previous LSNs are covering). > > Using a PANIC seems like an overreaction to me for this path. Sure, > we should not record a conflict reason twice, but wouldn't an > assertion be more adapted to check that a termination is not > logged twice for this code path run in the checkpointer? Thanks for looking at it! Agree, using an assertion instead in v3 attached. > > +if (!terminated) > +{ > +initial_restart_lsn = s->data.restart_lsn; > +initial_effective_xmin = s->effective_xmin; > +initial_catalog_effective_xmin = s->effective_catalog_xmin; > +} > > It seems important to document why we are saving this data here; while > we hold the LWLock for repslots, before we perform any termination, > and we want to protect conflict reports with incorrect data if the > slot data got changed while the lwlock is temporarily released while > there's a termination. Yeah, comments added in v3. > >> 3. Is there a way to reproduce this racy issue, may be by adding some > >> sleeps or something? If yes, can you share it here, just for the > >> records and verify that whatever fix is provided actually works? > > > > Alexander was able to reproduce it on a slow machine and the issue was not > > there > > anymore with v1 in place.
I think it's tricky to reproduce as it would need > > the > > slot to advance between the 2 checks. > > I'd really want a test for that in the long term. And an injection > point stuck between the point where ReplicationSlotControlLock is > released then re-acquired when there is an active PID should be > enough, isn't it? Yeah, that should be enough. > For example just after the kill()? It seems to me > that we should stuck the checkpointer, perform a forced upgrade of the > xmins, release the checkpointer and see the effects of the change in > the second loop. Now, modules/injection_points/ does not allow that, > yet, but I've posted a patch among these lines that can stop a process > and release it using a condition variable (should be 0006 on [1]). I > was planning to start a new thread with a patch posted for the next CF > to add this kind of facility with a regression test for an old bug, > the patch needs a few hours of love, first. I should be able to post > that next week. Great, that looks like a good idea! Are we going to fix this on master and 16 stable first and then later on add a test on master with the injection points? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
>From eb8bb9eda4c1ea5fe2abe87df66ef2b86cdb2441 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot
Date: Thu, 11 Jan 2024 13:57:53 +
Subject: [PATCH v3] Fix race condition in InvalidatePossiblyObsoleteSlot()

In case of an active slot we proceed in 2 steps:
- terminate the backend holding the slot
- report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that
time. Holding the mutex longer is not an option (given what we'd have to do
while holding it), so record instead the previous LSNs that were used during
the backend termination.
---
 src/backend/replication/slot.c | 39 --
 1 file changed, 33 insertions(+), 6 deletions(-) 100.0% src/backend/replication/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 2180a38063..33f76c4acc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1454,6 +1454,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 {
 	int			last_signaled_pid = 0;
 	bool		released_lock = false;
+	bool		terminated = false;
+	XLogRecPtr	initial_effective_xmin = InvalidXLogRecPtr;
+	XLogRecPtr	initial_catalog_effective_xmin = InvalidXLogRecPtr;
+	XLogRecPtr	initial_restart_lsn = InvalidXLogRecPtr;
+	ReplicationSlotInvalidationCause conflict_prev PG_USED_FOR_ASSERTS_ONLY = RS_INVAL_NONE;
 
 	for (;;)
 	{
@@ -1488,11 +1493,24 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 		 */
 		if (s->data.invalidated == RS_INVA
Re: planner chooses incremental but not the best one
On 2/15/24 07:50, Andrei Lepikhov wrote: > On 18/12/2023 19:53, Tomas Vondra wrote: >> On 12/18/23 11:40, Richard Guo wrote: >> The challenge is where to get usable information about correlation >> between columns. I only have a couple of very rough ideas of what we might >> try. For example, if we have multi-column ndistinct statistics, we might >> look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from >> >> ndistinct(b,c,d) / ndistinct(b,c) >> >> If we know how many distinct values we have for the predicate column, we >> could then estimate the number of groups. I mean, we know that for the >> restriction "WHERE b = 3" we only have 1 distinct value, so we could >> estimate the number of groups as >> >> 1 * ndistinct(b,c) > Did you mean here ndistinct(c,d) and the formula: > ndistinct(b,c,d) / ndistinct(c,d) ? Yes, I think that's probably more correct. Essentially, the idea is to estimate the change in number of distinct groups after adding a column (or restricting it in some way). > > Do you implicitly bear in mind here the necessity of tracking clauses > that were applied to the data up to the moment of grouping? > I don't recall what exactly I considered two months ago when writing the message, but I don't see why we would need to track that beyond what we already have. Shouldn't it be enough for the grouping to simply inspect the conditions on the lower levels? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
RE: Synchronizing slots from primary to standby
On Thursday, February 15, 2024 5:20 PM Amit Kapila wrote: > On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Thursday, February 15, 2024 10:49 AM Amit Kapila > wrote: > > > > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > > > > > > Right, we can do that or probably this test would have made more > > > sense with a worker patch where we could wait for the slot to be synced. > > > Anyway, let's try to recreate the slot/subscription idea. BTW, do > > > you think that adding a LOG when we are not able to sync will help > > > in debugging such problems? I think eventually we can change it to > > > DEBUG1 but for now, it can help with stabilizing BF and or some other > reported issues. > > > > Here is the patch that attempts the re-create sub idea. > > > > Pushed this. > > > > I also think that a LOG/DEBUG > > would be useful for such analysis, so the 0002 is to add such a log. > > > > I feel such a LOG would be useful. > > + ereport(LOG, > + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" > +" (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)", > > I think waiting is a bit misleading here, how about something like: > "could not sync slot information as remote slot precedes local slot: > remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X), > catalog xmin (%u)" Changed. Attached is the v2 patch. Apart from the new log message, I think we can add one more debug message in reserve_wal_for_local_slot; this could be useful to analyze failures. And we can also enable the DEBUG log in the 040 tap-test: I see we have a similar setting in 010_logical_decoding_timelines, and logging debug1 messages doesn't add noticeable time on my machine. These are done in 0002.
Best Regards, Hou zj v2-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch Description: v2-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch v2-0002-Add-a-debug-message-and-enable-DEBUG-log-in-slots.patch Description: v2-0002-Add-a-debug-message-and-enable-DEBUG-log-in-slots.patch
Re: Allow passing extra options to initdb for tests
> On 15 Feb 2024, at 11:38, Peter Eisentraut wrote: > We don't have a man page for pg_regress, so there is no place to > comprehensively document all the options and their interactions. This comes up every now and again, just yesterday there was a question on -general [0] about alternate output files which also are undocumented. Maybe it's time to add documentation for pg_regress? -- Daniel Gustafsson [0] CACX+KaPOzzRHEt4w_=iqkbtpmkjyrugvng1c749yp3r6dpr...@mail.gmail.com
Re: Allow passing extra options to initdb for tests
On 14.02.24 06:22, Ian Lawrence Barwick wrote: I had a longer look at this and can't find any issues with the code or documentation changes. Thanks, committed. I did wonder whether it would be worth mentioning that any initdb options set in "PG_TEST_INITDB_EXTRA_OPTS" will override those which can be set by pg_regress, but of the four ("--no-clean", "--no-sync", "--debug" and "--no-locale"), only the optional "--no-locale" can actually be overridden, so it doesn't seem important. I thought about this. We don't have a man page for pg_regress, so there is no place to comprehensively document all the options and their interactions. The documentation in regress.sgml works on a makefile level. AFAICT, the --debug option is not exposed via the makefiles at all, while --no-locale can be requested by the environment variable NOLOCALE, but that is not documented, and also not ported to meson. I think if you want tweaks on this level, there is some expectations right now that you might need to study the source code a bit. One thing that might be interesting would be to allow running initdb without the --no-sync option, to exercise fsync a bit. But there is no "--yes-sync" option to override --no-sync, otherwise you could do it with the just-committed feature.
Re: Do away with zero-padding assumption before WALRead()
Hi, On Tue, 13 Feb 2024 at 09:17, Bharath Rupireddy wrote: > > Hi, > > I noticed an assumption [1] at WALRead() call sites expecting the > flushed WAL page to be zero-padded after the flush LSN. I think this > can't always be true as the WAL can get flushed after determining the > flush LSN before reading it from the WAL file using WALRead(). I've > hacked the code up a bit to check if that's true - > https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP, > the tests hit the Assert(false); added. Which means, the zero-padding > comment around WALRead() call sites isn't quite right. > > I'm wondering why the WALRead() callers are always reading XLOG_BLCKSZ > despite knowing exactly how much to read. Is it to tell the OS to > explicitly fetch the whole page from the disk? If yes, the OS will do > that anyway because the page transfers from disk to OS page cache are > always in terms of disk block sizes, no? I am curious about the same. The page size and disk block size could be different, so the reason could be explicitly fetching the whole page from the disk as you said. Is this the reason or are there any other benefits of always reading XLOG_BLCKSZ instead of reading the sufficient part? I tried to search in older threads and code comments but I could not find an explanation. -- Regards, Nazir Bilal Yavuz Microsoft
Re: Returning non-terminated string in ECPG Informix-compatible function
Thanks for review! I added a regression test that is based on code from the previous email. New patch is attached.

Oleg Tselebrovskiy, Postgres Pro

diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c
index dccf39582da..80d40aa3e09 100644
--- a/src/interfaces/ecpg/compatlib/informix.c
+++ b/src/interfaces/ecpg/compatlib/informix.c
@@ -654,7 +654,7 @@ intoasc(interval * i, char *str)
 	if (!tmp)
 		return -errno;
 
-	memcpy(str, tmp, strlen(tmp));
+	strcpy(str, tmp);
 	free(tmp);
 	return 0;
 }
diff --git a/src/interfaces/ecpg/pgtypeslib/dt_common.c b/src/interfaces/ecpg/pgtypeslib/dt_common.c
index 99bdc94d6d7..d4ca0cbff6e 100644
--- a/src/interfaces/ecpg/pgtypeslib/dt_common.c
+++ b/src/interfaces/ecpg/pgtypeslib/dt_common.c
@@ -2659,6 +2659,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 				 */
 				pfmt++;
 				tmp = pgtypes_alloc(strlen("%m/%d/%y") + strlen(pstr) + 1);
+				if(!tmp)
+					return 1;
 				strcpy(tmp, "%m/%d/%y");
 				strcat(tmp, pfmt);
 				err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2784,6 +2786,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'r':
 				pfmt++;
 				tmp = pgtypes_alloc(strlen("%I:%M:%S %p") + strlen(pstr) + 1);
+				if(!tmp)
+					return 1;
 				strcpy(tmp, "%I:%M:%S %p");
 				strcat(tmp, pfmt);
 				err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2792,6 +2796,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'R':
 				pfmt++;
 				tmp = pgtypes_alloc(strlen("%H:%M") + strlen(pstr) + 1);
+				if(!tmp)
+					return 1;
 				strcpy(tmp, "%H:%M");
 				strcat(tmp, pfmt);
 				err = PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);
@@ -2837,6 +2843,8 @@ PGTYPEStimestamp_defmt_scan(char **str, char *fmt, timestamp * d,
 			case 'T':
 				pfmt++;
 				tmp = pgtypes_alloc(strlen("%H:%M:%S") + strlen(pstr) + 1);
+				if(!tmp)
+					return 1;
 				strcpy(tmp, "%H:%M:%S");
 				strcat(tmp, pfmt);
 				err =
PGTYPEStimestamp_defmt_scan(&pstr, tmp, d, year, month, day, hour, minute, second, tz);
diff --git a/src/interfaces/ecpg/test/compat_informix/.gitignore b/src/interfaces/ecpg/test/compat_informix/.gitignore
index f97706ba4be..6967ae77cd2 100644
--- a/src/interfaces/ecpg/test/compat_informix/.gitignore
+++ b/src/interfaces/ecpg/test/compat_informix/.gitignore
@@ -4,6 +4,8 @@
 /dec_test.c
 /describe
 /describe.c
+/intoasc
+/intoasc.c
 /rfmtdate
 /rfmtdate.c
 /rfmtlong
diff --git a/src/interfaces/ecpg/test/compat_informix/Makefile b/src/interfaces/ecpg/test/compat_informix/Makefile
index d50fdc29fd1..638b4e0af78 100644
--- a/src/interfaces/ecpg/test/compat_informix/Makefile
+++ b/src/interfaces/ecpg/test/compat_informix/Makefile
@@ -16,7 +16,8 @@ TESTS = test_informix test_informix.c \
 	rnull rnull.c \
 	sqlda sqlda.c \
 	describe describe.c \
-	charfuncs charfuncs.c
+	charfuncs charfuncs.c \
+	intoasc intoasc.c
 
 all: $(TESTS)
diff --git a/src/interfaces/ecpg/test/compat_informix/intoasc.pgc b/src/interfaces/ecpg/test/compat_informix/intoasc.pgc
new file mode 100644
index 000..d13c83bb7a7
--- /dev/null
+++ b/src/interfaces/ecpg/test/compat_informix/intoasc.pgc
@@ -0,0 +1,21 @@
+#include
+#include
+
+#include "pgtypes_interval.h"
+
+EXEC SQL BEGIN DECLARE SECTION;
+char dirty_str[100] = "a__c_d_";
+interval *interval_ptr;
+EXEC SQL END DECLARE SECTION;
+
+int main()
+{
+interval_ptr = (interval *) malloc(sizeof(interval));
+interval_ptr->time = 1;
+interval_ptr->month = 240;
+
+printf("dirty_str contents before intoasc: %s\n", dirty_str);
+intoasc(interval_ptr, dirty_str);
+printf("dirty_str contents after intoasc: %s\n", dirty_str);
+return 0;
+}
diff --git a/src/interfaces/ecpg/test/compat_informix/meson.build b/src/interfaces/ecpg/test/compat_informix/meson.build
index e2f8802330d..7e4790933ad 100644
--- a/src/interfaces/ecpg/test/compat_informix/meson.build
+++ b/src/interfaces/ecpg/test/compat_informix/meson.build
@@ -4,6 +4,7 @@ pgc_files = [
   'charfuncs',
   'dec_test',
  'describe',
+  'intoasc',
  'rfmtdate',
  'rfmtlong',
  'rnull',
diff --git a/src/interfaces/ecpg/test/ecpg_schedule b/src/interfaces/ecpg/test/ecpg_schedule
index 39814a39c17..f9c0a0e3c00 100644
--- a/src/interfaces/ecpg/test/ecpg_schedule
+++ b/src/interfaces/ecpg/test/ecpg_schedule
@@ -7,6 +7,7 @@ test: compat_informix/sqlda
 test: compat_informix/describe
 test: compat_informix/test_informix
 test: compat_informix/test_informix2
+test: compat_informix/intoasc
 test: compat_oracle/char_array
 test: connect/test2
 test: connect/test3
diff --git a/src/interfaces/ecpg/test/expected/compat_informix-intoasc.c b/src/interfaces/ecpg/test/expected/compat_informix-intoasc.c
new file mode 100644
index 00
Re: A new strategy for pull-up correlated ANY_SUBLINK
Hi Alexander, > Hi! > >> If the changes of Alena are ok, can you merge the changes and post an >> updated version so that CFBot can apply the patch and verify the >> changes. As currently CFBot is trying to apply only Alena's changes >> and failing with the following at [1]: > > I think this is a very nice and pretty simple optimization. I've > merged the changes by Alena, and slightly revised the code changes in > convert_ANY_sublink_to_join(). I'm going to push this if there are no > objections. Thanks for picking this up! I double-checked the patch; it looks good to me. -- Best Regards Andy Fan
Re: pg_upgrade and logical replication
On Wed, Feb 14, 2024 at 03:37:03AM +, Hayato Kuroda (Fujitsu) wrote: > Attached patch modified the test accordingly. Also, it contains some > optimizations. > This can pass the test on my env: What optimizations? I can't see them, and since the patch is described as rearranging test cases (and therefore already difficult to read), I guess they should be a separate patch, or the optimizations described. -- Justin
Re: 039_end_of_wal: error in "xl_tot_len zero" test
Hello, Thomas, On 19/01/2024 01:35, Thomas Munro wrote: I don't yet have an opinion on the best way to do it though. Would it be enough to add emit_message($node, 0) after advance_out_of_record_splitting_zone()? Yes, indeed that seems to be enough. At least I could not produce any more "xl_tot_len zero" failures with that addition. I like this solution the best. Tolerating the two different messages would weaken the test. I agree, I also don't really like this option. -- Anton Voloshin Postgres Professional, The Russian Postgres Company https://postgrespro.ru
Re: Synchronizing slots from primary to standby
Hi, On Thu, Feb 15, 2024 at 02:49:54PM +0530, Amit Kapila wrote: > On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) > wrote: > > > > On Thursday, February 15, 2024 10:49 AM Amit Kapila > > wrote: > > > > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > > > > > > Right, we can do that or probably this test would have made more sense > > > with a > > > worker patch where we could wait for the slot to be synced. > > > Anyway, let's try to recreate the slot/subscription idea. BTW, do you > > > think that > > > adding a LOG when we are not able to sync will help in debugging such > > > problems? I think eventually we can change it to DEBUG1 but for now, it > > > can help > > > with stabilizing BF and or some other reported issues. > > > > Here is the patch that attempts the re-create sub idea. > > > > Pushed this. > > > > I also think that a LOG/DEBUG > > would be useful for such analysis, so the 0002 is to add such a log. > > > > I feel such a LOG would be useful. Same here. > + ereport(LOG, > + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" > +" (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)", > > I think waiting is a bit misleading here, how about something like: > "could not sync slot information as remote slot precedes local slot: > remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN > (%X/%X), catalog xmin (%u)" > This wording works for me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Add system identifier to backup manifest
On Thu, Feb 15, 2024 at 7:18 AM Michael Paquier wrote: > On Wed, Feb 14, 2024 at 12:29:07PM +0530, Amul Sul wrote: > > Ok, I did that way in the attached version, I have passed the control > file's > > full path as a second argument to verify_system_identifier() what we > gets in > > verify_backup_file(), but that is not doing any useful things with it, > > since we > > were using get_controlfile() to open the control file, which takes the > > directory as an input and computes the full path on its own. > > I've read through the patch, and that's pretty cool. > Thank you for looking into this. > -static void > -parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p, > - manifest_wal_range > **first_wal_range_p) > +static manifest_data * > +parse_manifest_file(char *manifest_path) > > In 0001, should the comment describing this routine be updated as > well? > Ok, updated in the attached version. > > + identifier with pg_control of the backup directory or fails > verification > > This is missing a markup here. > Done, in the attached version. > > + PostgreSQL 17, it is 2; in older > versions, > + it is 1. > > Perhaps a couple of s here. > Done. > + if (strcmp(parse->manifest_version, "1") != 0 && > + strcmp(parse->manifest_version, "2") != 0) > + json_manifest_parse_failure(parse->context, > + > "unexpected manifest version"); > + > + /* Parse version. */ > + version = strtoi64(parse->manifest_version, &ep, 10); > + if (*ep) > + json_manifest_parse_failure(parse->context, > + > "manifest version not an integer"); > + > + /* Invoke the callback for version */ > + context->version_cb(context, version); > > Shouldn't these two checks be reversed? And is there actually a need > for the first check at all knowing that the version callback should be > in charge of performing the validation vased on the version received? > Make sense, reversed the order. 
I think the allowed versions should be kept in a central place, and the callback can check and react to the versions suitable to it, IMHO. > +my $node2; > +{ > + local $ENV{'INITDB_TEMPLATE'} = undef; > > Not sure that it is a good idea to duplicate this pattern twice. > Shouldn't this be something we'd want to control with an option in the > init() method instead? > Yes, I did that in a separate patch; see the 0001 patch. > +static void > +verify_system_identifier(verifier_context *context, char *controlpath) > > Relying both on controlpath, being a full path to the control file > including the data directory, and context->backup_directory to read > the contents of the control file looks a bit weird. Wouldn't it be > cleaner to just use one of them? > Well, yes, I had the same feeling. How about having another function that can accept the full path of pg_control? I tried that in the 0002 patch, where the original function is renamed to get_dir_controlfile(), which accepts the data directory path as before, and get_controlfile() now accepts the full path of the pg_control file. Kindly have a look at the attached version. Regards, Amul v7-0004-Add-system-identifier-to-backup-manifest.patch Description: Binary data v7-0003-pg_verifybackup-code-refactor.patch Description: Binary data v7-0001-Add-option-to-force-initdb-in-PostgreSQL-Test-Clu.patch Description: Binary data v7-0002-Code-refactor-get_controlfile-to-accept-full-path.patch Description: Binary data
Re: Add trim_trailing_whitespace to editorconfig file
On Wed, 14 Feb 2024 at 23:19, Daniel Gustafsson wrote: > > +1 from me. But when do we want it to be false? That is, why not > > declare it true for all file types? > > Regression test .out files commonly have spaces at the end of the line. (Not > to mention the ECPG .c files but they probably really shouldn't have.) Attached is v2, which now makes the rules between gitattributes and editorconfig completely identical. As well as improving two minor things about .gitattributes before doing that. v2-0002-Require-final-newline-in-.po-files.patch Description: Binary data v2-0003-Bring-editorconfig-in-line-with-gitattributes.patch Description: Binary data v2-0004-Add-note-about-keeping-.editorconfig-and-.gitattr.patch Description: Binary data v2-0001-Remove-non-existing-file-from-.gitattributes.patch Description: Binary data
Re: Synchronizing slots from primary to standby
On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) wrote: > > On Thursday, February 15, 2024 10:49 AM Amit Kapila > wrote: > > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > > > > Right, we can do that or probably this test would have made more sense with > > a > > worker patch where we could wait for the slot to be synced. > > Anyway, let's try to recreate the slot/subscription idea. BTW, do you think > > that > > adding a LOG when we are not able to sync will help in debugging such > > problems? I think eventually we can change it to DEBUG1 but for now, it can > > help > > with stabilizing BF and or some other reported issues. > > Here is the patch that attempts the re-create sub idea. > Pushed this. > I also think that a LOG/DEBUG > would be useful for such analysis, so the 0002 is to add such a log. > I feel such a LOG would be useful. + ereport(LOG, + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" +" (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)", I think waiting is a bit misleading here, how about something like: "could not sync slot information as remote slot precedes local slot: remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X), catalog xmin (%u)" -- With Regards, Amit Kapila.
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 15 Feb 2024 17:09:20 +0800, jian he wrote: > My environment is slow (around 10x) but consistent. > I see around 2-3 percent increase consistently. > (with patch 7369.068 ms, without patch 7574.802 ms) Thanks for sharing your numbers! It will help us to determine whether these changes improve performance or not. > the patchset looks good in my eyes, i can understand it. > however I cannot apply it cleanly against the HEAD. Hmm, I used 9bc1eee988c31e66a27e007d41020664df490214 as the base version. But both patches based on the same revision. So we may not be able to apply both patches at once cleanly. > +/* > + * Prepare callinfo for InputFunctionCallSafeWithInfo to reuse one callinfo > + * instead of initializing it for each call. This is for performance. > + */ > +FunctionCallInfoBaseData * > +PrepareInputFunctionCallInfo(void) > +{ > + FunctionCallInfoBaseData *fcinfo; > + > + fcinfo = (FunctionCallInfoBaseData *) palloc(SizeForFunctionCallInfo(3)); > > just wondering, I saw other similar places using palloc0, > do we need to use palloc0? I think that we don't need to use palloc0() here because the following InitFunctionCallInfoData() call initializes all members explicitly. Thanks, -- kou
Re: RFC: Logging plan of the running query
Hi, I've just been catching up on this thread.

+	if (MyProc->heldLocks)
+	{
+		ereport(LOG_SERVER_ONLY,
+				errmsg("ignored request for logging query plan due to lock conflicts"),
+				errdetail("You can try again in a moment."));
+		return;
+	}

I don't like this for several reasons. First, I think it's not nice to have a request just get ignored. A user will expect that if we cannot immediately respond to some request, we'll respond later at the first time that it's safe to do so, rather than just ignoring it and telling them to retry.

Second, I don't think that the error message is very good. It talks about lock conflicts, but we don't know that there is any real problem. We know that, if we enter this block, the server is in the middle of trying to acquire some lock, and we also know that we could attempt to acquire a lock as part of generating the EXPLAIN output, and maybe that's an issue. But that's not a lock conflict. That's a re-entrancy problem. I don't know that we want to talk about re-entrancy problems in an error message, and I don't really think this error message should exist in the first place, but if we're going to error out in this case surely we shouldn't do so with an error message that describes a problem we don't have.

Third, I think that the re-entrancy problems with this patch may extend well beyond lock acquisition. This is one example of how it can be unsafe to do something as complicated as EXPLAIN at any arbitrary CHECK_FOR_INTERRUPTS(). It is not correct to say, as http://postgr.es/m/caaaqye9euuzd8bkjxtvcd9e4n5c7kzhzcvucjxt-xds9x4c...@mail.gmail.com does, that the problems with running EXPLAIN at an arbitrary point are specific to running this code in an aborted transaction. The existence of this code shows that there is at least one hazard even if the current transaction is not aborted, and I see no analysis on this thread indicating whether there are any more such hazards, or how we could go about finding them all.
I think the issue is very general. We have lots of subsystems that both (a) use global variables and (b) contain CHECK_FOR_INTERRUPTS(). If we process an interrupt while that code is in the middle of manipulating its global variables and then again re-enter that code, bad things might happen. We could try to rule that out by analyzing all such subsystems and all places where CHECK_FOR_INTERRUPTS() may appear, but that's very difficult. Suppose we took the alternative approach recommended by Andrey Lepikhov in http://postgr.es/m/b1b110ae-61f6-4fd9-9b94-f967db9b5...@app.fastmail.com and instead set a flag that gets handled in some suitable place in the executor code, like ExecProcNode(). If we did that, then we would know that we're not in the middle of a syscache lookup, a catcache lookup, a heavyweight lock acquisition, an ereport, or any other low-level subsystem call that might create problems for the EXPLAIN code. In that design, the hack shown above would go away, and we could be much more certain that we don't need any other similar hacks for other subsystems. The only downside is that we might experience a slightly longer delay before a requested EXPLAIN plan actually shows up, but that seems like a pretty small price to pay for being able to reason about the behavior of the system. I don't *think* there are any cases where we run in the executor for a particularly long time without a new call to ExecProcNode(). I think this case is a bit like vacuum_delay_point(). You might think that vacuum_delay_point() could be moved inside of CHECK_FOR_INTERRUPTS(), but we've made the opposite decision: every vacuum_delay_point() calls CHECK_FOR_INTERRUPTS() but not every CHECK_FOR_INTERRUPTS() calls vacuum_delay_point(). That means that we can allow vacuum_delay_point() only at cases where we know it's safe, rather than at every CHECK_FOR_INTERRUPTS(). 
I think that's a pretty smart decision, even for vacuum_delay_point(), and AFAICS the code we're proposing to run here does things that are substantially more complicated than what vacuum_delay_point() does. That code just does a bit of reading of shared memory, reports a wait event, and sleeps. That's really simple compared to code that could try to do relcache builds, which can trigger syscache lookups, which can both trigger heavyweight lock acquisition, lightweight lock acquisition, bringing pages into shared_buffers possibly through physical I/O, processing of invalidation messages, and a bunch of other stuff. It's really hard for me to accept that the heavyweight lock problem for which the patch contains a workaround is the only one that exists. I can't see any reason why that should be true. ...Robert
Re: Make COPY format extendable: Extract COPY TO format implementations
On Thu, Feb 15, 2024 at 2:34 PM Sutou Kouhei wrote:
>
> Thanks for the info. Let's use InputFunctionCallSafeWithInfo().
> See the attached patch:
> v2-0001-Reuse-fcinfo-used-in-COPY-FROM.patch
>
> I also attach a patch for COPY TO:
> v1-0001-Reuse-fcinfo-used-in-COPY-TO.patch
>
> I measured the COPY TO patch on my environment with:
> COPY (SELECT
> 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2,
> generate_series(1, 100::int4)) TO '/dev/null' \watch c=5
>
> master:
> 740.066ms
> 734.884ms
> 738.579ms
> 734.170ms
> 727.953ms
>
> patched:
> 730.714ms
> 741.483ms
> 714.149ms
> 715.436ms
> 713.578ms
>
> It seems that it improves performance a bit but my
> environment isn't suitable for benchmark. So they may not
> be valid numbers.

My environment is slow (around 10x) but consistent. I see around a 2-3 percent improvement consistently (with patch 7369.068 ms, without patch 7574.802 ms).

The patchset looks good to me, and I can understand it; however, I cannot apply it cleanly against HEAD.

+/*
+ * Prepare callinfo for InputFunctionCallSafeWithInfo to reuse one callinfo
+ * instead of initializing it for each call. This is for performance.
+ */
+FunctionCallInfoBaseData *
+PrepareInputFunctionCallInfo(void)
+{
+	FunctionCallInfoBaseData *fcinfo;
+
+	fcinfo = (FunctionCallInfoBaseData *) palloc(SizeForFunctionCallInfo(3));

Just wondering: I saw other similar places using palloc0; do we need to use palloc0 here?
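The palloc-versus-palloc0 question comes down to whether every field of the allocated structure is assigned before it is first read. A minimal standalone sketch of the reuse pattern, with plain malloc standing in for palloc and the CallInfo struct and function names invented for illustration:

```c
#include <stdlib.h>
#include <stdbool.h>

/*
 * Invented illustration of allocating one call-info structure up front and
 * reusing it across calls.  Because every field is assigned before it is
 * read, zero-initialization (palloc0/calloc) would buy nothing here.
 */
typedef struct CallInfo
{
	int		nargs;
	int		args[3];
	bool	argnull[3];
} CallInfo;

/* Allocate once up front; analogous to PrepareInputFunctionCallInfo(). */
static CallInfo *
PrepareCallInfo(void)
{
	CallInfo   *ci = malloc(sizeof(CallInfo));

	ci->nargs = 3;
	for (int i = 0; i < ci->nargs; i++)
	{
		ci->args[i] = 0;
		ci->argnull[i] = false;
	}
	return ci;
}

/* Reuse the same allocation per call, overwriting the argument slots. */
static int
CallWithInfo(CallInfo *ci, int a, int b, int c)
{
	ci->args[0] = a;
	ci->args[1] = b;
	ci->args[2] = c;
	return ci->args[0] + ci->args[1] + ci->args[2];
}
```

The performance win comes from skipping the per-call setup, not from the choice of allocator; palloc0 would only matter if some field could be read before being set.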
Re: [PATCH] Expose port->authn_id to extensions and triggers
Hi,

Michael Paquier writes:
> On Tue, Aug 23, 2022 at 10:04:30AM -0700, Jacob Champion wrote:
>
> My main worry here is EXEC_BACKEND, where we would just use our own
> implementation of fork(), and it is a bad idea at the end to leave
> that untouched while we could have code paths that attempt to access
> it. At the end, I have moved the initialization at the same place as
> where we set MyProcPort for a backend in BackendInitialize(), mainly
> as a matter of consistency because ClientConnectionInfo is aimed at
> being a subset of that. And applied.

I found a compiler complaint about this patch. The attached patch fixes that.

--
Best Regards
Andy Fan

>From 9e8a3fb7a044704fbfcd682a897f72260266bd54 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh"
Date: Thu, 15 Feb 2024 16:46:57 +0800
Subject: [PATCH v1 1/1] Remove the "parameter 'maxsize' set but not used" error.

maxsize is only used in assert-enabled builds, so to keep the compiler
quiet it is better to maintain it only in assert-enabled builds.

---
 src/backend/utils/init/miscinit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 23f77a59e5..0e72fcefab 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1050,7 +1050,9 @@ SerializeClientConnectionInfo(Size maxsize, char *start_address)
 	Assert(maxsize >= sizeof(serialized));

 	memcpy(start_address, &serialized, sizeof(serialized));
+#ifdef USE_ASSERT_CHECKING
 	maxsize -= sizeof(serialized);
+#endif
 	start_address += sizeof(serialized);

 	/* Copy authn_id into the space after the struct */
--
2.34.1
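The pattern the patch addresses — a variable that is written unconditionally but read only by assertions — can be reproduced in a small standalone example. This sketch uses the standard NDEBUG convention rather than PostgreSQL's USE_ASSERT_CHECKING macro; serialize_two and its arguments are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Minimal sketch of the warning the patch silences: maxsize is consumed
 * only by assert().  Without the #ifndef, a non-assert (NDEBUG) build
 * would decrement maxsize yet never read it, triggering the compiler's
 * "set but not used" complaint.
 */
static size_t
serialize_two(char *dst, size_t maxsize, const char *a, const char *b)
{
	size_t		la = strlen(a) + 1;
	size_t		lb = strlen(b) + 1;

	assert(maxsize >= la + lb);
	memcpy(dst, a, la);
#ifndef NDEBUG
	maxsize -= la;				/* only the assertion below reads this */
#endif
	memcpy(dst + la, b, lb);
	assert(maxsize >= lb);
	return la + lb;
}
```

PostgreSQL guards the decrement with USE_ASSERT_CHECKING instead of NDEBUG, but the effect is the same: the bookkeeping exists only in builds where the assertions that read it exist.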
Re: make add_paths_to_append_rel aware of startup cost
Andy Fan writes:
> Thomas Munro writes:
>
>> On Thu, Oct 5, 2023 at 9:07 PM David Rowley wrote:
>>> Thanks. Pushed.
>>
>> FYI somehow this plan from a8a968a8212e flipped in this run:
>>
>> === dumping /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/regression.diffs ===
>> diff -U3 /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
>> --- /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out 2024-01-15 00:31:13.947555940 +
>> +++ /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out 2024-02-14 00:06:17.075584839 +
>> @@ -1447,9 +1447,9 @@
>>  -> Append
>>  -> Nested Loop
>>  Join Filter: (t1.tenthous = t2.tenthous)
>> - -> Seq Scan on tenk1 t1
>> + -> Seq Scan on tenk2 t2
>>  -> Materialize
>> - -> Seq Scan on tenk2 t2
>> + -> Seq Scan on tenk1 t1
>>  -> Result
>>  (8 rows)
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-02-14%2000%3A01%3A03
>
> Thanks for this information! I will take a look at this.

I found that both plans have the same cost. I couldn't pin down the exact cause after some hours of research, but it is quite similar to 7516056c584e3, so I used a similar strategy to stabilize it. Is it acceptable?

--
Best Regards
Andy Fan

>From 01c2b5ae76a4493f33b894afdd4a8d3ce31e1a3e Mon Sep 17 00:00:00 2001
From: "yizhi.fzh"
Date: Thu, 15 Feb 2024 16:29:30 +0800
Subject: [PATCH v1 1/1] Attempt to stabilize new unionall + limit regression test

This test was recently added in a8a968a8212e, and it sometimes appears
to be unstable with regard to the join order, presumably because the
relations on either side of the join are equal in size. Here we add a
qual to make one of them smaller so that the planner is more likely to
choose to materialize the smaller of the two.
Reported-by: Thomas Munro
Author: Andy Fan
Discussion: https://postgr.es/m/CA%2BhUKGLqC-NobKYfjxNM3Gexv9OJ-Fhvy9bugUcXsZjTqH7W%3DQ%40mail.gmail.com#88e6420b59a863403b9b67a0128fdacc
---
 src/test/regress/expected/union.out | 11 ++-
 src/test/regress/sql/union.sql      |  4 ++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 73e320bad4..0e527b65b3 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1438,8 +1438,8 @@ where (x = 0) or (q1 >= q2 and q1 <= q2);
 -- Ensure we get a Nested Loop join between tenk1 and tenk2
 explain (costs off)
 select t1.unique1 from tenk1 t1
-inner join tenk2 t2 on t1.tenthous = t2.tenthous
- union all
+inner join tenk2 t2 on t1.tenthous = t2.tenthous and t1.ten > 5
+union all
 (values(1)) limit 1;
 QUERY PLAN
@@ -1447,11 +1447,12 @@ inner join tenk2 t2 on t1.tenthous = t2.tenthous
 -> Append
 -> Nested Loop
 Join Filter: (t1.tenthous = t2.tenthous)
- -> Seq Scan on tenk1 t1
+ -> Seq Scan on tenk2 t2
 -> Materialize
- -> Seq Scan on tenk2 t2
+ -> Seq Scan on tenk1 t1
+ Filter: (ten > 5)
 -> Result
-(8 rows)
+(9 rows)

 -- Ensure there is no problem if cheapest_startup_path is NULL
 explain (costs off)
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index 6c509ac80c..13ba9f21ad 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -548,8 +548,8 @@ where (x = 0) or (q1 >= q2 and q1 <= q2);
 -- Ensure we get a Nested Loop join between tenk1 and tenk2
 explain (costs off)
 select t1.unique1 from tenk1 t1
-inner join tenk2 t2 on t1.tenthous = t2.tenthous
- union all
+inner join tenk2 t2 on t1.tenthous = t2.tenthous and t1.ten > 5
+union all
 (values(1)) limit 1;

 -- Ensure there is no problem if cheapest_startup_path is NULL
--
2.34.1
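The instability being fixed comes from an exact cost tie: with a strict less-than comparison, whichever equal-cost path is generated first wins, so the chosen plan hinges on enumeration order. A toy C illustration of that tie-breaking behavior (pick_cheapest is invented for the example and is not PostgreSQL's add_path):

```c
/*
 * Toy illustration of why equal-cost plans can flip: with a strict '<'
 * comparison, equal-cost candidates tie and the first one seen wins,
 * so the "cheapest" choice depends entirely on enumeration order.
 */
static int
pick_cheapest(const double *costs, int n)
{
	int		best = 0;

	for (int i = 1; i < n; i++)
		if (costs[i] < costs[best])		/* strict: ties keep first-seen */
			best = i;
	return best;
}
```

Adding the `t1.ten > 5` qual breaks the tie by making one input genuinely smaller, which is the same idea 7516056c584e3 used: rather than relying on enumeration order, give the planner an actual cost difference to act on.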