Re: Add basic tests for the low-level backup method.
On 3/15/24 18:32, Michael Paquier wrote:
> On Fri, Mar 15, 2024 at 06:23:15PM +1300, David Steele wrote:
>> Well, this is what we recommend in the docs, i.e. using bin mode to save
>> backup_label, so it seems OK to me.
> Indeed, I didn't notice that this is actually documented, so what I did
> took the right angle. French flair, perhaps..

This seems like a reasonable explanation to me.

-David
Re: Add basic tests for the low-level backup method.
On Fri, Mar 15, 2024 at 06:23:15PM +1300, David Steele wrote:
> Well, this is what we recommend in the docs, i.e. using bin mode to save
> backup_label, so it seems OK to me.

Indeed, I didn't notice that this is actually documented, so what I did
took the right angle. French flair, perhaps..
--
Michael
Re: Add basic tests for the low-level backup method.
On Fri, Mar 15, 2024 at 08:38:47AM +0900, Michael Paquier wrote:
> That's why these tests are not that easy, they can be racy. I've run
> the test 5~10 times in the CI this time to gain more confidence, and
> saw zero failures with the stability fixes in place including Windows.
> I've applied it now, as I can still monitor the buildfarm for a few
> more days. Let's see what happens, but that should be better.

So, it looks like the buildfarm is clear. sidewinder has reported a
green state, and the recent runs of the CFbot across all the patches are
looking stable as well on all platforms. There are still a few buildfarm
members on Windows that will take some more time before running.
--
Michael
Re: Add basic tests for the low-level backup method.
On 3/15/24 12:38, Michael Paquier wrote:
> On Fri, Mar 15, 2024 at 09:40:38AM +1300, David Steele wrote:
>> Is the missing test in meson the reason we did not see test failures
>> for Windows in CI?
> The test has to be listed in src/test/recovery/meson.build or the CI
> would ignore it.

Right -- I will keep this in mind for the future.

>>> The second LOG is something that can be acted on. I've added some
>>> debugging to the parsing of the backup_label file in the backend, and
>>> noticed that the first fscanf() for START WAL LOCATION is failing
>>> because the last %c is detected as \r rather than \n. Tweaking the
>>> contents stored from pg_backup_stop() with a sed won't help, because
>>> the issue is that we write the CRLFs with append_to_file, and the
>>> startup process cannot cope with that. The simplest method I can
>>> think of is to use binmode, as of the attached.
>> Yeah, that makes sense.
> I am wondering if there is a better trick here that would not require
> changes in the backend to make the backup_label parsing more flexible,
> though.

Well, this is what we recommend in the docs, i.e. using bin mode to save
backup_label, so it seems OK to me.

>>> I am attaching an updated patch with all that fixed, which is stable
>>> in the CI and any tests I've run. Do you have any comments about
>> These changes look good to me. Sure wish we had an easier way to test
>> commits in the build farm.
> That's why these tests are not that easy, they can be racy. I've run
> the test 5~10 times in the CI this time to gain more confidence, and
> saw zero failures with the stability fixes in place including Windows.
> I've applied it now, as I can still monitor the buildfarm for a few
> more days. Let's see what happens, but that should be better.

At least sidewinder is happy now -- and the build farm in general as far
as I can see. Thank you for your help on this!

-David
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Wed, Mar 13, 2024 at 9:38 AM Amit Kapila wrote:
>
> BTW, does the XID-based parameter 'max_slot_xid_age' not have
> similarity with 'max_slot_wal_keep_size'? I think it will impact the
> rows we remove based on xid horizons. Don't we need to consider it
> while vacuum computes the xid horizons in ComputeXidHorizons() similar
> to what we do for WAL w.r.t 'max_slot_wal_keep_size'?

I'm having a hard time understanding why we'd need something up there
in ComputeXidHorizons(). Can you elaborate a bit, please?

What's proposed with max_slot_xid_age is that during checkpoint we look
at the slot's xmin and catalog_xmin, and the current system txn id.
Then, if the XID age of (xmin, catalog_xmin) and current_xid crosses
max_slot_xid_age, we invalidate the slot.

Let me illustrate how all this works:

1. Set up a primary and standby with hot_standby_feedback set to on on
the standby. For instance, check my scripts at [1].

2. Stop the standby to make the slot inactive on the primary. Check
that the slot is holding an xmin of 738.

./pg_ctl -D sbdata -l logfilesbdata stop

postgres=# SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name           | sb_repl_slot
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                | 738
catalog_xmin        |
restart_lsn         | 0/300
confirmed_flush_lsn |
wal_status          | reserved
safe_wal_size       |
two_phase           | f
conflict_reason     |
failover            | f
synced              | f

3. Start consuming the XIDs on the primary with, for instance, the
following script:

./psql -d postgres -p 5432

DROP TABLE tab_int;
CREATE TABLE tab_int (a int);
do $$
begin
  for i in 1..268435 loop
    -- use an exception block so that each iteration eats an XID
    begin
      insert into tab_int values (i);
    exception
      when division_by_zero then null;
    end;
  end loop;
end$$;

4. Make some dead rows in the table.

update tab_int set a = a+1;
delete from tab_int where a%4=0;

postgres=# SELECT n_dead_tup, n_tup_ins, n_tup_upd, n_tup_del FROM pg_stat_user_tables WHERE relname = 'tab_int';
-[ RECORD 1 ]------
n_dead_tup | 335544
n_tup_ins  | 268435
n_tup_upd  | 268435
n_tup_del  | 67109

5. Try vacuuming to delete the dead rows, observe 'tuples: 0 removed,
536870 remain, 335544 are dead but not yet removable'. The dead rows
can't be removed because the inactive slot is holding an xmin, see
'removable cutoff: 738, which was 268441 XIDs old when operation ended'.

postgres=# vacuum verbose tab_int;
INFO:  vacuuming "postgres.public.tab_int"
INFO:  finished vacuuming "postgres.public.tab_int": index scans: 0
pages: 0 removed, 2376 remain, 2376 scanned (100.00% of total)
tuples: 0 removed, 536870 remain, 335544 are dead but not yet removable
removable cutoff: 738, which was 268441 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 4759 hits, 0 misses, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes
system usage: CPU: user: 0.07 s, system: 0.00 s, elapsed: 0.07 s
VACUUM

6. Now, repeat the above steps but with max_slot_xid_age = 20 set on
the primary.

7. Do a checkpoint to invalidate the slot.

postgres=# checkpoint;
CHECKPOINT
postgres=# SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name           | sb_repl_slot
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | f
active_pid          |
xmin                | 738
catalog_xmin        |
restart_lsn         | 0/300
confirmed_flush_lsn |
wal_status          | lost
safe_wal_size       |
two_phase           | f
conflicting         |
failover            | f
synced              | f
invalidation_reason | xid_aged

8. And, then vacuum the table, observe 'tuples: 335544 removed, 201326
remain, 0 are dead but not yet removable'.

postgres=# vacuum verbose tab_int;
INFO:  vacuuming "postgres.public.tab_int"
INFO:  finished vacuuming "postgres.public.tab_int": index scans: 0
pages: 0 removed, 2376 remain, 2376 scanned (100.00% of total)
tuples: 335544 removed, 201326 remain, 0 are dead but not yet removable
removable cutoff: 269179, which was 0 XIDs old when operation ended
new relfrozenxid: 269179, which is 268441 XIDs ahead of previous value
frozen: 1189 pages from table (50.04% of total) had 201326 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 193.100 MB/s
buffer usage: 4760 hits, 0 misses, 2381 dirtied
WAL usage: 5942 records, 2378 full page images, 8343275 bytes
system usage: CPU: user: 0.09 s, system: 0.00 s, elapsed: 0.09 s
VACUUM

[1] cd
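To make the rule in step 7 concrete, the checkpoint-time check described
above amounts to something like the following sketch. The names and
structure here are illustrative assumptions, not the patch's actual code,
and real XID arithmetic should go through the wraparound-aware helpers:

/*
 * Hedged sketch: invalidate a slot once the age of its xmin (or
 * catalog_xmin, checked the same way) relative to the next XID
 * crosses max_slot_xid_age.
 */
static bool
SlotXidAgeExceeded(TransactionId slot_xmin, TransactionId next_xid,
				   int max_slot_xid_age)
{
	int32		age;

	if (!TransactionIdIsNormal(slot_xmin))
		return false;			/* slot holds no xmin, nothing to age out */

	/* modulo-2^32 distance, in the same spirit as the SQL age() function */
	age = (int32) (next_xid - slot_xmin);

	return age > max_slot_xid_age;
}

In the demo above, the slot's xmin of 738 is 268441 XIDs behind the
current XID, so any max_slot_xid_age below that causes the checkpoint to
mark the slot 'lost' with invalidation_reason = xid_aged.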
Re: POC, WIP: OR-clause support for indexes
On 14/3/2024 17:39, Alexander Korotkov wrote:
> Thank you, Andrei. Looks like a very undesirable side effect. Do you
> have any idea why it happens? Partition pruning should work correctly
> for both transformed and non-transformed quals, why does transformation
> hurt it?

Now we have the v23-0001-* patch with all issues resolved. The last one,
which caused execution-stage pruning, was about the necessity to evaluate
the SAOP expression right after transformation. In the previous version,
the core executed it on transformed expressions.

> As you can see this case is not related to partial indexes. Just no
> index selective for the whole query. However, splitting scan by the
> OR qual lets use a combination of two selective indexes.

Thanks for the case. I will try to resolve it.

--
regards, Andrei Lepikhov
Postgres Professional

From 156c00c820a38e5e1856f07363af87b3109b5d77 Mon Sep 17 00:00:00 2001
From: Alena Rybakina
Date: Fri, 2 Feb 2024 22:01:09 +0300
Subject: [PATCH 1/2] Transform OR clauses to ANY expression.

Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2,
...]) on the preliminary stage of optimization when we are still working
with the expression tree. Here C is a constant expression, 'expr' is
non-constant expression, 'op' is an operator which returns boolean result
and has a commuter (for the case of reverse order of constant and
non-constant parts of the expression, like 'CX op expr').

Sometimes it can lead to not optimal plan. But we think it is better to
have array of elements instead of a lot of OR clauses. Here is a room for
further optimizations on decomposing that array into more optimal parts.

Authors: Alena Rybakina, Andrey Lepikhov
Reviewed-by: Peter Geoghegan, Ranier Vilela
Reviewed-by: Alexander Korotkov, Robert Haas
Reviewed-by: jian he
---
 .../postgres_fdw/expected/postgres_fdw.out    |   8 +-
 doc/src/sgml/config.sgml                      |  17 +
 src/backend/nodes/queryjumblefuncs.c          |  27 ++
 src/backend/optimizer/prep/prepqual.c         | 374 +-
 src/backend/utils/misc/guc_tables.c           |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/queryjumble.h               |   1 +
 src/include/optimizer/optimizer.h             |   2 +
 src/test/regress/expected/create_index.out    | 156 +++-
 src/test/regress/expected/join.out            |  62 ++-
 src/test/regress/expected/partition_prune.out | 215 +-
 src/test/regress/expected/stats_ext.out       |  12 +-
 src/test/regress/expected/sysviews.out        |   3 +-
 src/test/regress/expected/tidscan.out         |  23 +-
 src/test/regress/sql/create_index.sql         |  35 ++
 src/test/regress/sql/join.sql                 |  10 +
 src/test/regress/sql/partition_prune.sql      |  22 ++
 src/test/regress/sql/tidscan.sql              |   6 +
 src/tools/pgindent/typedefs.list              |   2 +
 19 files changed, 929 insertions(+), 58 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 58a603ac56..a965b43cc6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8838,18 +8838,18 @@ insert into utrtest values (2, 'qux');
 -- Check case where the foreign partition is a subplan target rel
 explain (verbose, costs off)
 update utrtest set a = 1 where a = 1 or a = 2 returning *;
-                                QUERY PLAN
-----------------------------------------------------------------------------
+                                QUERY PLAN
+------------------------------------------------------------------------------
 Update on public.utrtest
   Output: utrtest_1.a, utrtest_1.b
   Foreign Update on public.remp utrtest_1
   Update on public.locp utrtest_2
   ->  Append
         ->  Foreign Update on public.remp utrtest_1
-              Remote SQL: UPDATE public.loct SET a = 1 WHERE (((a = 1) OR (a = 2))) RETURNING a, b
+              Remote SQL: UPDATE public.loct SET a = 1 WHERE ((a = ANY ('{1,2}'::integer[]))) RETURNING a, b
         ->  Seq Scan on public.locp utrtest_2
               Output: 1, utrtest_2.tableoid, utrtest_2.ctid, NULL::record
-              Filter: ((utrtest_2.a = 1) OR (utrtest_2.a = 2))
+              Filter: (utrtest_2.a = ANY ('{1,2}'::integer[]))
 (10 rows)

 -- The new values are concatenated with ' triggered !'
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c408..2de6ae301a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5472,6 +5472,23 @@ ANY num_sync (
+      enable_or_transformation (boolean)
+      
+       enable_or_transformation configuration parameter
+      
+      
+      
+       
+        Enables or disables the query
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Thu, Mar 14, 2024 at 7:58 PM Bharath Rupireddy wrote:
>
> On Thu, Mar 14, 2024 at 12:24 PM Amit Kapila wrote:
>>
>> On Wed, Mar 13, 2024 at 9:24 PM Bharath Rupireddy wrote:
>>>
>>> Yes, there will be some sort of duplicity if we emit conflict_reason
>>> as a text field. However, I still think the better way is to turn
>>> conflict_reason text to conflict boolean and set it to true only on
>>> rows_removed and wal_level_insufficient invalidations. When conflict
>>> boolean is true, one (including all the tests that we've added
>>> recently) can look for invalidation_reason text field for the reason.
>>> This sounds reasonable to me as opposed to we just mentioning in the
>>> docs that "if invalidation_reason is rows_removed or
>>> wal_level_insufficient it's the reason for conflict with recovery".

+1 on maintaining both conflicting and invalidation_reason.

>> Fair point. I think we can go either way. Bertrand, Nathan, and
>> others, do you have an opinion on this matter?
>
> While we wait to hear from others on this, I'm attaching the v9 patch
> set implementing the above idea (check 0001 patch). Please have a
> look. I'll come back to the other review comments soon.

Thanks for the patch. JFYI, patch09 does not apply to HEAD; some recent
commit caused the conflict.

Some trivial comments on patch001 (yet to review other patches):

1) info.c:

-	"%s as caught_up, conflict_reason IS NOT NULL as invalid "
+	"%s as caught_up, invalidation_reason IS NOT NULL as invalid "

Can we revert to 'conflicting as invalid' since it is a query for
logical slots only?

2) 040_standby_failover_slots_sync.pl:

-	q{SELECT conflict_reason IS NULL AND synced AND NOT temporary FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';}
+	q{SELECT invalidation_reason IS NULL AND synced AND NOT temporary FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';}

Here too, can we have 'NOT conflicting' instead of 'invalidation_reason
IS NULL' as it is a logical slot test?

thanks
Shveta
Re: Add publisher and subscriber to glossary documentation.
On Fri, Mar 15, 2024, at 1:14 AM, Amit Kapila wrote:
> I think node should mean instance for both physical and logical
> replication, otherwise, it would be confusing. We need both the usages
> as a particular publication/subscription is defined at the database
> level but the server on which we define those is referred to as a
> node/instance.

If you are creating a subscription that connects to the same instance
(replication between 2 databases in the same cluster), your definition
is not correct and Alvaro's definition is accurate. The node definition
is closely linked to the connection string. While physical replication
does not specify a database (meaning "any database", referring to an
instance), logical replication requires a database.

--
Euler Taveira
EDB   https://www.enterprisedb.com/
Re: Skip collecting decoded changes of already-aborted transactions
On Fri, Mar 15, 2024 at 3:17 PM Masahiko Sawada wrote:
>
> I resumed working on this item. I've attached the new version patch.
>
> I rebased the patch to the current HEAD and updated comments and
> commit messages. The patch is straightforward and I'm somewhat
> satisfied with it, but I'm thinking of adding some tests for it.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com

I just had a look at the patch; it no longer applies because of the
removal of a header in a recent commit. Overall the patch looks fine,
and I didn't find any issues.

Some cosmetic comments: in ReorderBufferCheckTXNAbort()

+	/* Quick return if we've already knew the transaction status */
+	if (txn->aborted)
+		return true;

knew/know

+	/*
+	 * If logical_replication_mode is "immediate", we don't check the
+	 * transaction status so the caller always process this transaction.
+	 */
+	if (debug_logical_replication_streaming == DEBUG_LOGICAL_REP_STREAMING_IMMEDIATE)
+		return false;

/process/processes

regards,
Ajin Cherian
Fujitsu Australia
Re: Add publisher and subscriber to glossary documentation.
On Thu, Mar 14, 2024 at 7:51 PM Alvaro Herrera wrote: > > On 2024-Mar-14, Shlok Kyal wrote: > > > Andrew Atkinson wrote: > > > > > Anyway, hopefully these examples show “node” and “database” are > > > mixed and perhaps others agree using one consistently might help the > > > goals of the docs. > > > > For me the existing content looks good, I felt let's keep it as it is > > unless others feel differently. > > Actually it's these small terminology glitches that give me pause. If > we're going to have terms that are interchangeable (in this case "node" > and "database"), then they should be always interchangeable, not just in > some unspecified cases. Maybe the idea of using "node" (which sounds > like something that's instance-wide) is wrong for logical replication, > which is necessarily something that happens database-locally. > > Then again, maybe defining "node" as something that exists at a > database-local level when used in the context of logical replication is > sufficient. In that case, it would be better to avoid defining it as a > synonym of "instance". Then the terms are not always interchangeable, > but it's clear when they are and when they aren't. > > "Node: in replication, each of the endpoints to which or > from which data is replicated. In the context of physical replication, > each node is an instance. In the context of logical replication, each > node is a database". > I think node should mean instance for both physical and logical replication, otherwise, it would be confusing. We need both the usages as a particular publication/subscription is defined at the database level but the server on which we define those is referred to as a node/instance. One of the usages pointed out by Andrew: "The subscriber database..." [1] is unclear but I feel we can use node there as well instead of database. [1] - https://www.postgresql.org/docs/current/logical-replication-subscription.html -- With Regards, Amit Kapila.
Re: speed up a logical replica setup
On Wed, Mar 13, 2024, at 10:09 AM, Shlok Kyal wrote:
> Added a top-up patch v28-0005 to fix this issue.
> I am not changing the version as v28-0001 to v28-0004 is the same as
> above.

Thanks for your review! I'm posting a new patch (v29) that merges the
previous patches (v28-0002 and v28-0003). I applied the fix provided by
Hayato [1]. It was an oversight during a rebase. I also included the
patch proposed by Shlok [2] that stops the target server on error if it
is running.

Tomas suggested in [3] that maybe the PID should be replaced with
something else that has more entropy. Instead of the PID, a random number
is now used for the replication slot and subscription names. There is
also a concern about converting multiple standbys that would otherwise
end up with the same publication name; the same random number is added to
the publication name so the conversion doesn't fail because the
publication already exists.

Documentation was changed based on Tomas' feedback.

The user name was always included in the subscriber connection string.
Let's have libpq choose it. While on it, a new routine
(get_sub_conninfo) contains the code to build the subscriber connection
string.

As I said in [4], there wasn't a way to specify a different configuration
file. If your cluster has a postgresql.conf outside PGDATA, when
pg_createsubscriber starts the server it will fail. The new --config-file
option lets you specify the postgresql.conf location and the server is
started just fine.

I also did some changes in the start_standby_server routine. I replaced
the strcat and snprintf calls with appendPQExpBuffer, which is used to
build the pg_ctl command.

[1] https://www.postgresql.org/message-id/TYCPR01MB12077FD21BB186C5A685C0BF3F52A2%40TYCPR01MB12077.jpnprd01.prod.outlook.com
[2] https://www.postgresql.org/message-id/CANhcyEW6-dH28gLbFc5XpDTJ6JPizU%2Bt5g-aKUWJBf5W_Zriqw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/6423dfeb-a729-45d3-b71e-7bf1b3adb0c9%40enterprisedb.com
[4] https://www.postgresql.org/message-id/d898faad-f6d7-4b0d-b816-b9dcdf490685%40app.fastmail.com

--
Euler Taveira
EDB   https://www.enterprisedb.com/

v29-0001-pg_createsubscriber-creates-a-new-logical-replic.patch.gz
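As background for the start_standby_server change mentioned above, a
minimal sketch of building a command with the PQExpBuffer API follows;
the option names and variables here are assumptions for illustration,
not v29's actual code:

#include "pqexpbuffer.h"

/*
 * Hedged sketch: grow the pg_ctl command string dynamically instead of
 * relying on fixed-size buffers with strcat()/snprintf().
 */
static void
append_start_command(PQExpBuffer cmd, const char *pg_ctl_path,
					 const char *datadir, const char *config_file)
{
	appendPQExpBuffer(cmd, "\"%s\" start -D \"%s\" -s", pg_ctl_path, datadir);
	if (config_file != NULL)
		appendPQExpBuffer(cmd, " -o \"-c config_file=%s\"", config_file);
}

The advantage over snprintf() into a fixed array is that the buffer grows
as needed, and an allocation failure is recorded in the buffer state
rather than silently truncating the command.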
Re: Inconsistent printf placeholders
On Fri, 15 Mar 2024 at 15:27, Kyotaro Horiguchi wrote:
> I have considered only the two messages. Actually, buffile.c and md.c
> are already like that. The attached aligns the messages in
> pg_combinebackup.c and reconstruct.c with the precedents.

This looks like a worthy cause to make translator work easier. I don't
want to widen the goalposts or anything, but just wondering if you'd
searched for any others that could get similar treatment? I only just
had a quick look at the following.

$ cat src/backend/po/fr.po | grep -E "^msgid\s" | sed -E 's/%[a-zA-Z]+/\%/g' | sort | uniq -d -c
     31 msgid ""
      2 msgid "could not accept SSL connection: %"
      2 msgid "could not initialize LDAP: %"
      2 msgid "could not look up local user ID %: %"
      2 msgid "could not open file \"%\": %"
      2 msgid "could not read file \"%\": read % of %"
      2 msgid "could not read from log segment %, offset %: %"
      2 msgid "could not read from log segment %, offset %: read % of %"
      2 msgid "index % out of valid range, 0..%"
      2 msgid "invalid value for parameter \"%\": %"
      2 msgid "%%% is outside the valid range for parameter \"%\" (% .. %)"
      2 msgid "must be owner of large object %"
      2 msgid "oversize GSSAPI packet sent by the client (% > %)"
      2 msgid "permission denied for large object %"
      2 msgid "string is too long for tsvector (% bytes, max % bytes)"
      2 msgid "timestamp out of range: \"%\""
      2 msgid "Valid values are between \"%\" and \"%\"."

I've not looked at any of the above in detail to determine how hard it
would be to make the formats consistent. The 3rd last one seems similar
enough that it might be worth doing together with this?

David
Re: Possibility to disable `ALTER SYSTEM`
On Thu, Mar 14, 2024 at 07:43:15PM -0400, Robert Haas wrote:
> On Thu, Mar 14, 2024 at 5:15 PM Maciek Sakrejda wrote:
>> It's not a security feature: it's a usability feature.
>>
>> It's a usability feature because, when Postgres configuration is
>> managed by an outside mechanism (e.g., as in a Kubernetes
>> environment), ALTER SYSTEM currently allows a superuser to make
>> changes that appear to work, but may be discarded at some point in the
>> future when that outside mechanism updates the config. They may also
>> be represented incorrectly in a management dashboard if that dashboard
>> is based on the values in the outside configuration mechanism, rather
>> than values directly from Postgres.
>>
>> In this case, the end user with access to Postgres superuser
>> privileges presumably also has access to the outside configuration
>> mechanism. The goal is not to prevent them from changing settings, but
>> to offer guard rails that prevent them from changing settings in a way
>> that will be unstable (revertible by a future update) or confusing
>> (not showing up in a management UI).
>>
>> There are challenges here in making sure this is _not_ seen as a
>> security feature. But I do think the feature itself is sensible and
>> worthwhile.
>
> This is what I would have said if I'd tried to offer an explanation,
> except you said it better than I would have done.

I do think the docs need to clearly say this is not a security feature.
In fact, I wonder if the ALTER SYSTEM error message should explain the
GUC that is causing the failure.

--
Bruce Momjian   https://momjian.us
EDB             https://enterprisedb.com

Only you can decide what is important to you.
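If the error message were to name the responsible setting as suggested,
it could look roughly like the sketch below. The GUC name
"allow_alter_system" is an assumption (the thread had not settled on
one), and the hint wording deliberately stresses the
not-a-security-feature point:

/* hedged sketch only, not the committed behavior */
if (!allow_alter_system)
	ereport(ERROR,
			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
			 errmsg("ALTER SYSTEM is disabled on this server"),
			 errdetail("The parameter \"allow_alter_system\" is set to \"off\"."),
			 errhint("Server configuration is managed externally.")));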
Re: Reports on obsolete Postgres versions
On Thu, Mar 14, 2024 at 10:46:28PM -0400, Bruce Momjian wrote: > > In the end, while I certainly don't mind improving the web page, I > > think that a lot of what we're seeing here probably has to do with the > > growing popularity and success of PostgreSQL. If you have more people > > using your software, you're also going to have more people using > > out-of-date versions of your software. > > Yeah, probably, and we recently end-of-life'ed PG 11. In a way it is that we had more users during the PG 10/11 period than before that, and those people aren't upgrading as quickly. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 9:03 PM Masahiko Sawada wrote:
>
> On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote:
>>
>> On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada wrote:
>>>
>>> On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote:
>>>> Okay, here's another idea: Change test_lookup_tids() to be more
>>>> general and put the validation down into C as well. First we save
>>>> the blocks from do_set_block_offsets() into a table, then with all
>>>> those blocks lookup a sufficiently-large range of possible offsets
>>>> and save found values in another array. So the static items
>>>> structure would have 3 arrays: inserts, successful lookups, and
>>>> iteration (currently the iteration output is private to
>>>> check_set_block_offsets()). Then sort as needed and check they are
>>>> all the same.
>>>
>>> That's a promising idea. We can use the same mechanism for randomized
>>> tests too. If you're going to work on this, I'll do other tests on my
>>> environment in the meantime.
>>
>> Some progress on this in v72 -- I tried first without using SQL to
>> save the blocks, just using the unique blocks from the verification
>> array. It seems to work fine.
>
> Thanks!
>
>> - Since there are now three arrays we should reduce max bytes to
>> something smaller.
>
> Agreed.
>
>> - Further on that, I'm not sure if the "is full" test is telling us
>> much. It seems we could make max bytes a static variable and set it to
>> the size of the empty store. I'm guessing it wouldn't take much to add
>> enough tids so that the contexts need to allocate some blocks, and
>> then it would appear full and we can test that. I've made it so all
>> arrays repalloc when needed, just in case.
>
> How about using work_mem as max_bytes instead of having it as a static
> variable? In test_tidstore.sql we set work_mem before creating the
> tidstore. It would make the tidstore more controllable by SQL queries.
>
>> - Why are we switching to TopMemoryContext? It's not explained -- the
>> comment only tells what the code is doing (which is obvious), but not
>> why.
>
> This is because the tidstore needs to live across the transaction
> boundary. We can use TopMemoryContext or CacheMemoryContext.
>
>> - I'm not sure it's useful to keep test_lookup_tids() around. Since we
>> now have a separate lookup test, the only thing it can tell us is that
>> lookups fail on an empty store. I arranged it so that
>> check_set_block_offsets() works on an empty store. Although that's
>> even more trivial, it's just reusing what we already need.
>
> Agreed.

I have two questions on tidstore.c:

+/*
+ * Set the given TIDs on the blkno to TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */

Do we need some assertions to check if the given offset numbers are
sorted expectedly?

---

+	if (TidStoreIsShared(ts))
+		found = shared_rt_set(ts->tree.shared, blkno, page);
+	else
+		found = local_rt_set(ts->tree.local, blkno, page);
+
+	Assert(!found);

Given TidStoreSetBlockOffsets() is designed to always set (i.e.
overwrite) the value, I think we should not expect that found is always
false.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
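A concrete form of the assertion being asked about could be as simple as
the following sketch; this is an assumption about how it might look, not
committed code, and num_offsets/offsets are the presumed function
arguments:

	/*
	 * Hedged sketch: verify TidStoreSetBlockOffsets()'s documented
	 * precondition that the offset numbers arrive in ascending order.
	 * Compiled away unless assertions are enabled.
	 */
#ifdef USE_ASSERT_CHECKING
	for (int i = 1; i < num_offsets; i++)
		Assert(offsets[i] > offsets[i - 1]);
#endif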
Re: Reports on obsolete Postgres versions
On Thu, Mar 14, 2024 at 10:15:18AM -0400, Robert Haas wrote:
> I think that whatever we say here should focus on what we try to do or
> guarantee, not on what actions we think users ought to take, never
> mind must take. We can say that we try to avoid making any changes
> upon which an application might be relying -- but there surely is some
> weasel-wording there, because we have made such changes before in the
> name of security, and sometimes to fix bugs, and we will likely do
> so again in the future. But it's not for us to decide how much testing
> is warranted. It's the user's system, not ours.

Yes, good point, let's tell them what our goals are and they can decide
what testing they need.

> In the end, while I certainly don't mind improving the web page, I
> think that a lot of what we're seeing here probably has to do with the
> growing popularity and success of PostgreSQL. If you have more people
> using your software, you're also going to have more people using
> out-of-date versions of your software.

Yeah, probably, and we recently end-of-life'ed PG 11.

--
Bruce Momjian   https://momjian.us
EDB             https://enterprisedb.com

Only you can decide what is important to you.
Re: Typos in reorderbuffer.c.
At Thu, 14 Mar 2024 11:23:38 +0530, Amit Kapila wrote in > On Thu, Mar 14, 2024 at 9:58 AM Kyotaro Horiguchi > wrote: > > > > While examining reorderbuffer.c, I found several typos. I'm not sure > > if fixing them is worthwhile, but I've attached a fix just in case. > > > > LGTM. I'll push this in some time. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Inconsistent printf placeholders
Thank you for the suggestions.

At Thu, 14 Mar 2024 13:45:41 +0100, Daniel Gustafsson wrote in
> I've only skimmed this so far but +1 on keeping the messages the same
> where possible to reduce translation work. Adding a comment on the
> message where the casting is done to indicate that it is for
> translation might reduce the risk of it "getting fixed" down the line.

Added a comment "/* cast xxx to avoid extra translatable messages */".

At Thu, 14 Mar 2024 14:02:46 +0100, Peter Eisentraut wrote in
> If you want to make them uniform, then I suggest the error messages
> should both be "%zd of %zu bytes", which are the actual types read()
> deals with.

Yeah. Having the same messages with only the placeholders changed is not
very pleasing during translation. If possible, I would like to align
them.

I have considered only the two messages. Actually, buffile.c and md.c
are already like that. The attached aligns the messages in
pg_combinebackup.c and reconstruct.c with the precedents.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 6f0814d9ac..feb4d5dcf4 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -1301,8 +1301,9 @@ slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
 		if (rb < 0)
 			pg_fatal("could not read file \"%s\": %m", filename);
 		else
-			pg_fatal("could not read file \"%s\": read only %zd of %lld bytes",
-					 filename, rb, (long long int) st.st_size);
+			/* cast st_size to avoid extra translatable messages */
+			pg_fatal("could not read file \"%s\": read only %zd of %zu bytes",
+					 filename, rb, (size_t) st.st_size);
 	}

 	/* Adjust buffer length for new data and restore trailing-\0 invariant */
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 41f06bb26b..a4badb90e2 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -504,15 +504,16 @@ make_rfile(char *filename, bool missing_ok)
 static void
 read_bytes(rfile *rf, void *buffer, unsigned length)
 {
-	int			rb = read(rf->fd, buffer, length);
+	ssize_t		rb = read(rf->fd, buffer, length);

 	if (rb != length)
 	{
 		if (rb < 0)
 			pg_fatal("could not read file \"%s\": %m", rf->filename);
 		else
-			pg_fatal("could not read file \"%s\": read only %d of %u bytes",
-					 rf->filename, rb, length);
+			/* cast length to avoid extra translatable messages */
+			pg_fatal("could not read file \"%s\": read only %zd of %zu bytes",
+					 rf->filename, rb, (size_t) length);
 	}
 }
Re: Add bump memory context type and use it for tuplesorts
On Tue, 12 Mar 2024 at 23:57, Tomas Vondra wrote:
> Attached is an updated version of the mempool patch, modifying all the
> memory contexts (not just AllocSet), including the bump context. And
> then also PDF with results from the two machines, comparing results
> without and with the mempool. There's very little impact on small reset
> values (128kB, 1MB), but pretty massive improvements on the 8MB test
> (where it's a 2x improvement).

I think it would be good to have something like this. I've done some
experiments before with something like this [1]. However, mine was much
more trivial.

One thing my version did was get rid of the *context* freelist stuff in
aset. I wondered if we'd need that anymore as, if I understand
correctly, it's just there to stop malloc/free thrashing, which is what
the patch aims to do anyway. Aside from that, it's now a little weird
that aset.c has that but generation.c and slab.c do not.

One thing I found was that in btbeginscan(), we have "so = (BTScanOpaque)
palloc(sizeof(BTScanOpaqueData));", which on this machine is 27344 bytes
and results in a call to AllocSetAllocLarge() and therefore a malloc().
Ideally, there'd be no malloc() calls in a standard pgbench run, at
least once the rel and cat caches have been warmed up.

I think there are a few things in your patch that could be improved,
here's a quick review.

1. MemoryPoolEntryIndex() could follow the lead of AllocSetFreeIndex(),
which is quite well-tuned and has no looping. I think you can get rid of
MemoryPoolEntrySize() and just have MemoryPoolEntryIndex() round up to
the next power of 2.

2. The following could use "result = Max(MEMPOOL_MIN_BLOCK,
pg_nextpower2_size_t(size));"

+	 * should be very low, though (less than MEMPOOL_SIZES, i.e. 14).
+	 */
+	result = MEMPOOL_MIN_BLOCK;
+	while (size > result)
+		result *= 2;

3. "MemoryPoolFree": I wonder if this is a good name for such a
function. Really you want to return it to the pool. "Free" sounds like
you're going to free() it. I went for "Fetch" and "Release" which I
thought was less confusing.

4. MemoryPoolRealloc(), could this just do nothing if the old and new
indexes are the same?

5. It might be good to put a likely() around this:

+	/* only do this once every MEMPOOL_REBALANCE_DISTANCE allocations */
+	if (pool->num_requests < MEMPOOL_REBALANCE_DISTANCE)
+		return;

Otherwise, if that function is inlined then you'll bloat the functions
that inline it for not much good reason. Another approach would be to
have a static inline function which checks and calls a noinline function
that does the work so that the rebalance stuff is never inlined.

Overall, I wonder if the rebalance stuff might make performance testing
quite tricky. I see:

+/*
+ * How often to rebalance the memory pool buckets (number of allocations).
+ * This is a tradeoff between the pool being adaptive and more overhead.
+ */
+#define MEMPOOL_REBALANCE_DISTANCE 25000

Will TPS take a sudden jump after 25k transactions doing the same thing?
I'm not saying this shouldn't happen, but... benchmarking is pretty hard
already. I wonder if there's something more fine-grained that can be
done which makes the pool adapt faster but not all at once. (I've not
studied your algorithm for the rebalance.)

David

[1] https://github.com/david-rowley/postgres/tree/malloccache
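Points 1 and 2 above boil down to the same loop-free idiom. A sketch,
assuming MEMPOOL_MIN_BLOCK's value (illustrative only) and using the
existing pg_nextpower2_size_t() helper from pg_bitutils.h:

#include "port/pg_bitutils.h"

#define MEMPOOL_MIN_BLOCK	((Size) 1024)	/* illustrative value only */

/*
 * Hedged sketch: round the request up to the next power of two, clamped
 * below by the smallest pooled block size, without the doubling loop.
 */
static inline Size
MemoryPoolBlockSize(Size size)
{
	return Max(MEMPOOL_MIN_BLOCK, pg_nextpower2_size_t(size));
}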
Re: cleanup patches for incremental backup
On Thu, Mar 14, 2024 at 04:00:10PM -0500, Nathan Bossart wrote:
> Separately, I suppose it's probably time to revert the temporary
> debugging code added by commit 5ddf997. I can craft a patch for that,
> too.

As promised...

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From 3923e30acb1d4e21ce311b25edbcd8b96b4223d2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 14 Mar 2024 20:36:48 -0500
Subject: [PATCH v1 1/2] Revert "Temporary patch to help debug pg_walsummary
 test failures."

This reverts commit 5ddf9973477729cf161b4ad0a1efd52f4fea9c88.
---
 src/backend/backup/walsummary.c       |  7 ---
 src/bin/pg_walsummary/t/002_blocks.pl | 14 --
 2 files changed, 21 deletions(-)

diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
index 4d047e1c02..322ae3c3ad 100644
--- a/src/backend/backup/walsummary.c
+++ b/src/backend/backup/walsummary.c
@@ -252,15 +252,8 @@ RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not stat file \"%s\": %m", path)));
-	/* XXX temporarily changed to debug buildfarm failures */
-#if 0
 	ereport(DEBUG2,
 			(errmsg_internal("removing file \"%s\"", path)));
-#else
-	ereport(LOG,
-			(errmsg_internal("removing file \"%s\" cutoff_time=%llu", path,
-							 (unsigned long long) cutoff_time)));
-#endif
 }

 /*
diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index d4bae3d564..52d3bd8840 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -51,7 +51,6 @@
 my $summarized_lsn = $node1->safe_psql('postgres',

From 38dda89ee48736489d37e18dea186e90358468b0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 14 Mar 2024 20:46:02 -0500
Subject: [PATCH v1 2/2] Fix possible overflow in MaybeRemoveOldWalSummaries().

---
 src/backend/postmaster/walsummarizer.c | 4 ++--
 src/backend/utils/misc/guc_tables.c    | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ec2874c18c..c820d1f9ed 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@
 static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;

 /*
  * GUC parameters
  */
 bool		summarize_wal = false;
-int			wal_summary_keep_time = 10 * 24 * 60;
+int			wal_summary_keep_time = 10 * HOURS_PER_DAY * MINS_PER_HOUR;

 static void WalSummarizerShutdown(int code, Datum arg);
 static XLogRecPtr GetLatestLSN(TimeLineID *tli);
@@ -1474,7 +1474,7 @@ MaybeRemoveOldWalSummaries(void)
 	 * Files should only be removed if the last modification time precedes the
 	 * cutoff time we compute here.
 	 */
-	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+	cutoff_time = time(NULL) - wal_summary_keep_time * SECS_PER_MINUTE;

 	/* Get all the summaries that currently exist. */
 	wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57d9de4dd9..1e71e7db4a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3293,9 +3293,9 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MIN,
 		},
 		&wal_summary_keep_time,
-		10 * 24 * 60,			/* 10 days */
+		10 * HOURS_PER_DAY * MINS_PER_HOUR, /* 10 days */
 		0,
-		INT_MAX,
+		INT_MAX / SECS_PER_MINUTE,
 		NULL, NULL, NULL
 	},
--
2.25.1
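For readers wondering where the overflow is in the second patch:
wal_summary_keep_time is an int measured in minutes, so with the old GUC
maximum of INT_MAX, the int product "60 * wal_summary_keep_time" can
overflow before it is subtracted from time(NULL). A small illustration
(not PostgreSQL code; the function name is made up):

#include <time.h>

/*
 * keep_time_minutes > INT_MAX / 60 makes "60 * keep_time_minutes"
 * overflow int (undefined behavior).  Widening first avoids it; the
 * committed fix instead caps the GUC at INT_MAX / SECS_PER_MINUTE so
 * the int product always stays in range.
 */
static time_t
compute_cutoff(int keep_time_minutes)
{
	return time(NULL) - (time_t) keep_time_minutes * 60;
}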
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Thu, Mar 14, 2024 at 10:28 AM Robert Haas wrote: > > On Wed, Oct 4, 2023 at 9:12 PM James Coleman wrote: > > All right, attached is a v3 which attempts to fix the wrong > > information with an economy of words. I may at some point submit a > > separate patch that adds a broader pruning section, but this at least > > brings the docs inline with reality insofar as they address it. > > I don't think this is as good as what I proposed back on October 2nd. > IMHO, that version does a good job making the text accurate and clear, > and is directly responsive to your original complaint, namely, that > the root of the HOT chain can't be removed. But this version seems to > contain a number of incidental changes that are unrelated to that > point, e.g. "old versions" -> "old, no longer visible versions", "can > be completely removed" -> "may be pruned", and the removal of the > sentence "In summary, heap-only tuple updates can only be created - if > columns used by indexes are not updated" which AFAICT is both > completely correct as-is and unrelated to the original complaint. > > Maybe I shouldn't be, but I'm slightly frustrated here. I thought I > had proposed an alternative which you found acceptable, but then you > proposed several more versions that did other things instead, and I > never really understood why we couldn't just adopt the version that > you seemed to think was OK. If there's a problem with that, say what > it is. If there's not, let's do that and move on. I think there's simply a misunderstanding here. I read your proposal as "here's an idea to consider as you work on the patch" (as happens on many other threads), and so I attempted to incorporate your primary points of feedback into my next version of the patch. Obviously I have reasons for the other changes I made: for example, "no longer visible" improves the correctness, since being an old version isn't sufficient. I removed the "In summary" sentence because it simply doesn't follow from the prose before it. That sentence simply restates information already appearing earlier in almost as simple a form, so it's redundant. But more importantly it's just not actually a summary of the text before it, so removing it improves the documentation. I can explain my reasoning further if desired, but I fear it would simply frustrate you further, so I'll stop here. If the goal here is the most minimal patch possible, then please commit what you proposed. I am interested in improving the document further, but I don't know how to do that easily if the requirement is effectively "must only change one specific detail at a time". So, that leaves me feeling a bit frustrated also. Regards, James Coleman
Re: Support json_errdetail in FRONTEND builds
Michael Paquier writes: > Hmm. I am not sure how much protection this would offer, TBH. One > thing that I find annoying with common/stringinfo.c as it is currently > is that we have two exit() calls in the enlarge path, and it does not > seem wise to me to spread that even more. > My last argument sounds like a nit for HEAD knowing that this does not > impact libpq that has its own pqexpbuffer.c to avoid issues with > palloc, elog and exit, but that could be a problem if OAuth relies > more on these code paths in libpq. I hope nobody is expecting that such code will get accepted. We have a policy (and an enforcement mechanism) that libpq.so must not call exit(). OAuth code in libpq will need to cope with using pqexpbuffer. regards, tom lane
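For anyone unfamiliar with the distinction: pqexpbuffer offers the same
append-style API as StringInfo but reports allocation failure through the
buffer instead of exiting. A hedged sketch of the pattern (the error
handling details and function name are illustrative, not libpq code):

#include "pqexpbuffer.h"

static char *
build_detail_string(const char *token)
{
	PQExpBufferData buf;

	initPQExpBuffer(&buf);
	appendPQExpBuffer(&buf, "unexpected token: \"%s\"", token);

	if (PQExpBufferBroken(&buf))	/* an allocation failed above */
	{
		termPQExpBuffer(&buf);
		return NULL;				/* let the caller report OOM */
	}
	return buf.data;				/* caller takes ownership */
}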
Re: broken JIT support on Fedora 40
For me it seems that the LLVMRunPasses() call, new in

commit 76200e5ee469e4a9db5f9514b9d0c6a31b496bff
Author: Thomas Munro
Date:   Wed Oct 18 22:15:54 2023 +1300

    jit: Changes for LLVM 17.

is reaching code that segfaults inside libLLVM, specifically in
llvm::InlineFunction(llvm::CallBase&, llvm::InlineFunctionInfo&, bool,
llvm::AAResults*, bool, llvm::Function*). First obvious question would
be: is that NULL argument still acceptable? Perhaps it wants our
LLVMTargetMachineRef there:

	err = LLVMRunPasses(module, passes, NULL, options);

But then when we see what it does with that argument, it arrives at a
place that apparently accepts nullptr.

https://github.com/llvm/llvm-project/blob/6b2bab2839c7a379556a10287034bd55906d7094/llvm/lib/Passes/PassBuilderBindings.cpp#L56
https://github.com/llvm/llvm-project/blob/6b2bab2839c7a379556a10287034bd55906d7094/llvm/include/llvm/Passes/PassBuilder.h#L124

Hrmph. Might need an assertion build to learn more. I'll try to look
again next week or so.
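For anyone wanting to experiment with the target-machine idea before the
assertion build happens, the change would presumably look like the sketch
below. It is untested, and it assumes the existing
llvm_opt0_targetmachine/llvm_opt3_targetmachine globals in llvmjit.c and
the usual PGJIT_OPT3 flag check:

	/* hedged, untested sketch of passing a target machine to LLVMRunPasses() */
	LLVMTargetMachineRef tm =
		(context->base.flags & PGJIT_OPT3) ? llvm_opt3_targetmachine
										   : llvm_opt0_targetmachine;

	err = LLVMRunPasses(module, passes, tm, options);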
Re: Support json_errdetail in FRONTEND builds
On Thu, Mar 14, 2024 at 10:56:46AM +0100, Daniel Gustafsson wrote:
> +	/* don't allow destroys of read-only StringInfos */
> +	Assert(str->maxlen != 0);
> Considering that StringInfo.c don't own the memory here I think it's
> warranted to turn this assert into an elog() to avoid the risk of
> use-after-free bugs.

Hmm. I am not sure how much protection this would offer, TBH. One thing
that I find annoying with common/stringinfo.c as it is currently is that
we have two exit() calls in the enlarge path, and it does not seem wise
to me to spread that even more.

My last argument sounds like a nit for HEAD knowing that this does not
impact libpq that has its own pqexpbuffer.c to avoid issues with palloc,
elog and exit, but that could be a problem if OAuth relies more on these
code paths in libpq.
--
Michael
Re: broken JIT support on Fedora 40
On Fri, Mar 15, 2024 at 12:27 PM Daniel Gustafsson wrote:
>> On 14 Mar 2024, at 20:15, Pavel Stehule wrote:
>> build is ok, but regress tests fails with crash (it works without
>> -with-llvm)
> Can you post some details around this crash? It doesn't seem to be a
> combination we have covered in the buildfarm.

Yeah, 18.1 (note they switched to 1-based minor numbers, there was no
18.0) just came out a week or so ago. Despite testing their 18 branch
just before their "RC1" tag, as recently as

commit d282e88e50521a457fa1b36e55f43bac02a3167f
Author: Thomas Munro
Date:   Thu Jan 25 10:37:35 2024 +1300

    Track LLVM 18 changes.

at which point everything worked, it seems that something changed before
they released. I haven't looked into why yet but it's crashing on my
FreeBSD box too.
Re: Possibility to disable `ALTER SYSTEM`
On Thu, Mar 14, 2024 at 5:15 PM Maciek Sakrejda wrote: > It's not a security feature: it's a usability feature. > > It's a usability feature because, when Postgres configuration is > managed by an outside mechanism (e.g., as in a Kubernetes > environment), ALTER SYSTEM currently allows a superuser to make > changes that appear to work, but may be discarded at some point in the > future when that outside mechanism updates the config. They may also > be represented incorrectly in a management dashboard if that dashboard > is based on the values in the outside configuration mechanism, rather > than values directly from Postgres. > > In this case, the end user with access to Postgres superuser > privileges presumably also has access to the outside configuration > mechanism. The goal is not to prevent them from changing settings, but > to offer guard rails that prevent them from changing settings in a way > that will be unstable (revertible by a future update) or confusing > (not showing up in a management UI). > > There are challenges here in making sure this is _not_ seen as a > security feature. But I do think the feature itself is sensible and > worthwhile. This is what I would have said if I'd tried to offer an explanation, except you said it better than I would have done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Fix the synopsis of pg_md5_hash
Hi, Daniel and Michael, On Thu, Mar 14, 2024 at 09:32:55AM +0100, Daniel Gustafsson wrote: > > On 14 Mar 2024, at 07:02, Tatsuro Yamada wrote: > >> So, I created a patch to fix them. > > > > Thanks, applied. > > Oops. Thanks. > -- > Michael > Thank you guys! Regards, Tatsuro Yamada NTT Open Source Software Center
Re: Add basic tests for the low-level backup method.
On Fri, Mar 15, 2024 at 09:40:38AM +1300, David Steele wrote:
> Is the missing test in meson the reason we did not see test failures
> for Windows in CI?

The test has to be listed in src/test/recovery/meson.build or the CI
would ignore it.

>> The second LOG is something that can be acted on. I've added some
>> debugging to the parsing of the backup_label file in the backend, and
>> noticed that the first fscanf() for START WAL LOCATION is failing
>> because the last %c is detected as \r rather than \n. Tweaking the
>> contents stored from pg_backup_stop() with a sed won't help, because
>> the issue is that we write the CRLFs with append_to_file, and the
>> startup process cannot cope with that. The simplest method I can
>> think of is to use binmode, as of the attached.
>
> Yeah, that makes sense.

I am wondering if there is a better trick here that would not require
changes in the backend to make the backup_label parsing more flexible,
though.

>> I am attaching an updated patch with all that fixed, which is stable
>> in the CI and any tests I've run. Do you have any comments about
>
> These changes look good to me. Sure wish we had an easier way to test
> commits in the build farm.

That's why these tests are not that easy, they can be racy. I've run
the test 5~10 times in the CI this time to gain more confidence, and
saw zero failures with the stability fixes in place including Windows.
I've applied it now, as I can still monitor the buildfarm for a few
more days. Let's see what happens, but that should be better.
--
Michael
Re: broken JIT support on Fedora 40
> On 14 Mar 2024, at 20:15, Pavel Stehule wrote: > build is ok, but regress tests fails with crash (it works without -with-llvm) Can you post some details around this crash? It doesn't seem to be a combination we have covered in the buildfarm. -- Daniel Gustafsson
Re: small_cleanups around login event triggers
> On 14 Mar 2024, at 14:21, Robert Treat wrote: > On Thu, Mar 14, 2024 at 8:21 AM Daniel Gustafsson wrote: >> - canceling connection in psql wouldn't cancel >> + canceling a connection in psql will not >> cancel >> Nitpickery (perhaps motivated by english not being my first language), but >> since psql only deals with one connection I would expect this to read "the >> connection". >> > > My interpretation of this is that "a connection" is more correct > because it could be your connection or someone else's connection (ie, > you are canceling one of many possible connections). Definitely > nitpickery either way. Fair point. >> - * Returns true iff the lock was acquired. >> + * Returns true if the lock was acquired. >> Using "iff" here is being consistent with the rest of the file (and >> technically >> correct): > Ah, yeah, I was pretty focused on the event trigger stuff and didn't > notice it being used elsewhere; thought it was a typo, but I guess > it's meant as shorthand for "if and only if", I wonder how many people > are familiar with that. I would like to think it's fairly widely understood among programmers, but I might be dating myself in saying so. I went ahead and applied this with the fixes mentioned here with one more tiny change to the last hunk of the patch to make it say "login event trigger" rather than just "login trigger". Thanks for the submission! -- Daniel Gustafsson
Re: Weird test mixup
Michael Paquier writes: > On Thu, Mar 14, 2024 at 06:19:38PM -0400, Tom Lane wrote: >> I wonder if it'd be wise to adjust the injection point stuff so that >> it's active in only the specific database the injection point was >> activated in. > It can be made optional by extending InjectionPointAttach() to > specify a database OID or a database name. Note that > 041_checkpoint_at_promote.pl wants an injection point to run in the > checkpointer, where we don't have a database requirement. > Or we could just disable runningcheck because of the concurrency > requirement in this test. The test would still be able to run, just > less times. No, actually we *must* mark all these tests NO_INSTALLCHECK if we stick with the current definition of injection points. The point of installcheck mode is that the tests are supposed to be safe to run in a live installation. Side-effects occurring in other databases are completely not OK. I can see that some tests would want to be able to inject code cluster-wide, but I bet that's going to be a small minority. I suggest that we invent a notion of "global" vs "local" injection points, where a "local" one only fires in the DB it was defined in. Then only tests that require a global injection point need to be NO_INSTALLCHECK. regards, tom lane
Re: Weird test mixup
On Fri, Mar 15, 2024 at 07:53:57AM +0900, Michael Paquier wrote:
> It can be made optional by extending InjectionPointAttach() to
> specify a database OID or a database name. Note that
> 041_checkpoint_at_promote.pl wants an injection point to run in the
> checkpointer, where we don't have a database requirement.

Slight correction here. It is also possible to not touch
InjectionPointAttach() at all: just tweak the callbacks to do that as
long as the database that should be used is tracked in shmem with its
point name, say with new fields in InjectionPointSharedState. That keeps
the backend APIs in a cleaner state.
--
Michael
Re: Fix the synopsis of pg_md5_hash
On Thu, Mar 14, 2024 at 09:32:55AM +0100, Daniel Gustafsson wrote:
> On 14 Mar 2024, at 07:02, Tatsuro Yamada wrote:
>> So, I created a patch to fix them.
>
> Thanks, applied.

Oops. Thanks.
--
Michael
Re: type cache cleanup improvements
On Thu, Mar 14, 2024 at 04:27:43PM +0300, Teodor Sigaev wrote:
>> So I would like to suggest the attached patch for this first piece.
>> What do you think?
>
> I have not any objections

Okay, I've applied this piece for now. Not sure I'll have much room to
look at the rest.
--
Michael
Re: pg16: XX000: could not find pathkey item to sort
On Thu, 14 Mar 2024 at 12:00, David Rowley wrote: > I've attached a patch which fixes the problem for me. I've pushed the patch to fix gather_grouping_paths(). The issue with the RelOptInfo having the incorrect PathTarget->exprs after the partial phase of partition-wise aggregate remains. David
Re: Weird test mixup
On Thu, Mar 14, 2024 at 06:19:38PM -0400, Tom Lane wrote:
> Do they? It'd be fairly easy to explain this if these things were
> being run in "installcheck" style. I'm not sure about CI, but from
> memory, the buildfarm does use installcheck for some things.
>
> I wonder if it'd be wise to adjust the injection point stuff so that
> it's active in only the specific database the injection point was
> activated in.

It can be made optional by extending InjectionPointAttach() to specify a
database OID or a database name. Note that 041_checkpoint_at_promote.pl
wants an injection point to run in the checkpointer, where we don't have
a database requirement.

Or we could just disable runningcheck because of the concurrency
requirement in this test. The test would still be able to run, just
fewer times.
--
Michael
Re: Weird test mixup
Thomas Munro writes: > On Fri, Mar 15, 2024 at 11:19 AM Tom Lane wrote: >> Do they? It'd be fairly easy to explain this if these things were >> being run in "installcheck" style. I'm not sure about CI, but from >> memory, the buildfarm does use installcheck for some things. > Right, as mentioned here: > https://www.postgresql.org/message-id/flat/CA%2BhUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX%3D%3DAVnW84f-%2B8yamQ%40mail.gmail.com > That's the "running" test, which is like the old installcheck. Hmm. Seems like maybe we need to institute a rule that anything using injection points has to be marked NO_INSTALLCHECK. That's kind of a big hammer though. regards, tom lane
Re: Weird test mixup
I wrote: > Heikki Linnakangas writes: >> Somehow the 'gin-leave-leaf-split-incomplete' injection point was active >> in the 'intarray' test. That makes no sense. That injection point is >> only used by the test in src/test/modules/gin/. Perhaps that ran at the >> same time as the intarray test? But they run in separate instances, with >> different data directories. > Do they? It'd be fairly easy to explain this if these things were > being run in "installcheck" style. I'm not sure about CI, but from > memory, the buildfarm does use installcheck for some things. Hmm, Munro's comment yesterday[1] says that current CI does use installcheck mode in some cases. regards, tom lane [1] https://www.postgresql.org/message-id/CA+hUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX==avnw84f-+8y...@mail.gmail.com
Re: Weird test mixup
On Fri, Mar 15, 2024 at 11:19 AM Tom Lane wrote: > Heikki Linnakangas writes: > > Somehow the 'gin-leave-leaf-split-incomplete' injection point was active > > in the 'intarray' test. That makes no sense. That injection point is > > only used by the test in src/test/modules/gin/. Perhaps that ran at the > > same time as the intarray test? But they run in separate instances, with > > different data directories. > > Do they? It'd be fairly easy to explain this if these things were > being run in "installcheck" style. I'm not sure about CI, but from > memory, the buildfarm does use installcheck for some things. Right, as mentioned here: https://www.postgresql.org/message-id/flat/CA%2BhUKGJYhcG_o2nwSK6r01eOZJwNWUJUbX%3D%3DAVnW84f-%2B8yamQ%40mail.gmail.com That's the "running" test, which is like the old installcheck.
Re: Proposal to add page headers to SLRU pages
On Mon, 2024-03-11 at 10:01 +0000, Li, Yong wrote:
> - The clog LSN group has been brought back.
> Now the page LSN on each clog page is used for honoring the write-ahead rule
> and it is always the highest LSN of all the LSN groups on the page.
> The LSN groups are used by TransactionIdGetStatus() as before.

I like where this is going. Álvaro, do you still see a problem with this approach?

> - New comments have been added to pg_upgrade to mention the SLRU
> page header change as the reason for upgrading clog files.

That seems reasonable, but were any alternatives discussed? Do we have consensus that this is the right thing to do? And if we use this approach, is there extra validation or testing that can be done?

Regards, Jeff Davis
Re: Weird test mixup
Heikki Linnakangas writes: > Somehow the 'gin-leave-leaf-split-incomplete' injection point was active > in the 'intarray' test. That makes no sense. That injection point is > only used by the test in src/test/modules/gin/. Perhaps that ran at the > same time as the intarray test? But they run in separate instances, with > different data directories. Do they? It'd be fairly easy to explain this if these things were being run in "installcheck" style. I'm not sure about CI, but from memory, the buildfarm does use installcheck for some things. I wonder if it'd be wise to adjust the injection point stuff so that it's active in only the specific database the injection point was activated in. regards, tom lane
Re: BitmapHeapScan streaming read user and prelim refactoring
On Thu, Mar 14, 2024 at 5:26 PM Tomas Vondra wrote: > > On 3/14/24 19:16, Melanie Plageman wrote: > > On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote: > >> ... > >> > >> Ok, committed that for now. Thanks for looking! > > > > Attached v6 is rebased over your new commit. It also has the "fix" in > > 0010 which moves BitmapAdjustPrefetchIterator() back above > > table_scan_bitmap_next_block(). I've also updated the Streaming Read API > > commit (0013) to Thomas' v7 version from [1]. This has the update that > > we theorize should address some of the regressions in the bitmapheapscan > > streaming read user in 0014. > > > > Should I rerun the benchmarks with these new patches, to see if it > really helps with the regressions? That would be awesome! I will soon send out a summary of what we investigated off-list about 0010 (though we didn't end up concluding anything). My "fix" (leaving BitmapAdjustPrefetchIterator() above table_scan_bitmap_next_block()) eliminates the regression in 0010 on the one example that I repro'd upthread, but it would be good to know if it eliminates the regressions across some other tests. I think it would be worthwhile to run the subset of tests which seemed to fare the worst on 0010 against the patches 0001-0010-- cyclic uncached on your xeon machine with 4 parallel workers, IIRC -- even the 1 million scale would do the trick, I think. And then separately run the subset of tests which seemed to do the worst on 0014. There were several groups of issues across the different tests, but I think that the uniform pages data test would be relevant to use. It showed the regressions with eic 0. As for the other regressions showing with 0014, I think we would want to see at least one with fully-in-shared-buffers and one with fully uncached. Some of the fixes were around pinning fewer buffers when the blocks were already in shared buffers. - Melanie
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/14/24 19:16, Melanie Plageman wrote: > On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote: >> ... >> >> Ok, committed that for now. Thanks for looking! > > Attached v6 is rebased over your new commit. It also has the "fix" in > 0010 which moves BitmapAdjustPrefetchIterator() back above > table_scan_bitmap_next_block(). I've also updated the Streaming Read API > commit (0013) to Thomas' v7 version from [1]. This has the update that > we theorize should address some of the regressions in the bitmapheapscan > streaming read user in 0014. > Should I rerun the benchmarks with these new patches, to see if it really helps with the regressions? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Weird test mixup
I got a weird test failure while testing my forking refactor patches on Cirrus CI (https://cirrus-ci.com/task/5880724448870400?logs=test_running#L121):

[16:52:39.753] Summary of Failures:
[16:52:39.753]
[16:52:39.753] 66/73 postgresql:intarray-running / intarray-running/regress ERROR 6.27s exit status 1
[16:52:39.753]
[16:52:39.753] Ok: 72
[16:52:39.753] Expected Fail: 0
[16:52:39.753] Fail: 1
[16:52:39.753] Unexpected Pass: 0
[16:52:39.753] Skipped: 0
[16:52:39.753] Timeout: 0
[16:52:39.753]
[16:52:39.753] Full log written to /tmp/cirrus-ci-build/build/meson-logs/testlog-running.txt

And:

diff -U3 /tmp/cirrus-ci-build/contrib/intarray/expected/_int.out /tmp/cirrus-ci-build/build/testrun/intarray-running/regress/results/_int.out
--- /tmp/cirrus-ci-build/contrib/intarray/expected/_int.out	2024-03-14 16:48:48.690367000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/intarray-running/regress/results/_int.out	2024-03-14 16:52:05.759444000 +0000
@@ -804,6 +804,7 @@
 DROP INDEX text_idx;
 CREATE INDEX text_idx on test__int using gin ( a gin__int_ops );
+ERROR:  error triggered for injection point gin-leave-leaf-split-incomplete
 SELECT count(*) from test__int WHERE a && '{23,50}';
  count
 -------
@@ -877,6 +878,7 @@
 (1 row)
 
 DROP INDEX text_idx;
+ERROR:  index "text_idx" does not exist
 -- Repeat the same queries with an extended data set. The data set is the
 -- same that we used before, except that each element in the array is
 -- repeated three times, offset by 1000 and 2000. For example, {1, 5}

Somehow the 'gin-leave-leaf-split-incomplete' injection point was active in the 'intarray' test. That makes no sense. That injection point is only used by the test in src/test/modules/gin/. Perhaps that ran at the same time as the intarray test? But they run in separate instances, with different data directories. And the 'gin' test passed. I'm completely stumped. Anyone have a theory?

-- Heikki Linnakangas Neon (https://neon.tech)
Re: Possibility to disable `ALTER SYSTEM`
On Thu, Mar 14, 2024 at 1:38 PM Robert Haas wrote: > On Thu, Mar 14, 2024 at 4:08 PM Tom Lane wrote: > > The patch-of-record contains no such wording. > > I plan to fix that, if nobody else beats me to it. > > > And if this isn't a > > security feature, then what is it? If you have to say to your > > (super) users "please don't mess with the system configuration", > > you might as well just trust them not to do it the easy way as not > > to do it the hard way. If they're untrustworthy, why have they > > got superuser? > > I mean, I feel like this question has been asked and answered before, > multiple times, on this thread. If you sincerely don't understand the > use case, I can try again to explain it. But somehow I feel like it's > more that you just don't like the idea, which is fair, but it seems > like a considerable number of people feel otherwise. I know I'm jumping into a long thread here, but I've been following it out of interest. I'm sympathetic to the use case, since I used to work at a Postgres cloud provider, and while our system intentionally did not give our end users superuser privileges, I can imagine other managed environments where that's not an issue. I'd like to give answering this question again a shot, because I think this has been a persistent misunderstanding in this thread, and I don't think it's been made all that clear. It's not a security feature: it's a usability feature. It's a usability feature because, when Postgres configuration is managed by an outside mechanism (e.g., as in a Kubernetes environment), ALTER SYSTEM currently allows a superuser to make changes that appear to work, but may be discarded at some point in the future when that outside mechanism updates the config. They may also be represented incorrectly in a management dashboard if that dashboard is based on the values in the outside configuration mechanism, rather than values directly from Postgres. In this case, the end user with access to Postgres superuser privileges presumably also has access to the outside configuration mechanism. The goal is not to prevent them from changing settings, but to offer guard rails that prevent them from changing settings in a way that will be unstable (revertible by a future update) or confusing (not showing up in a management UI). There are challenges here in making sure this is _not_ seen as a security feature. But I do think the feature itself is sensible and worthwhile. Thanks, Maciek
Re: JIT compilation per plan node
On 3/14/24 20:14, Robert Haas wrote:
> On Tue, Feb 20, 2024 at 5:31 AM Tomas Vondra wrote:
>> I certainly agree that the current JIT costing is quite crude, and we've
>> all seen cases where the decision turns out to not be great. And I think
>> the plan to make the decisions at the node level makes sense, so +1 to
>> that in general.
>
> Seems reasonable to me also.
>
>> And I think you're right that looking just at the node total cost may
>> not be sufficient - that we may need a better cost model, considering
>> how many times an expression is executed and so on. But I think we
>> should try to do this in smaller steps, meaningful on their own,
>> otherwise we won't move at all. The two threads linked by Melih are ~4y
>> old and *nothing* changed since then, AFAIK.
>>
>> I think it's reasonable to start by moving the decision to the node
>> level - it's where the JIT happens, anyway. It may not be perfect, but
>> it seems like a clear improvement. And if we then choose to improve the
>> "JIT cost model" to address some of the issues you pointed out, surely
>> that would need to happen at the node level too ...
>
> I'm not sure I understand whether you (Tomas) think that this patch is
> a good idea or a bad idea as it stands. I read the first of these two
> paragraphs to suggest that the patch hasn't really evolved much in the
> last few years, perhaps suggesting that if it wasn't good enough to
> commit back then, it still isn't now. But the second of these two
> paragraphs seems more supportive.

To clarify, I think the patch is a step in the right direction, and a meaningful improvement. It may not be the perfect solution we imagine (but who knows how far we are from that), but AFAIK moving these decisions to the node level is something the ideal solution would need to do too. The reference to the 4y old patches was meant to support this patch as an improvement - perhaps incomplete, but still an improvement. We keep imagining "perfect solutions" and then end up doing nothing. I recognize there's a risk we may never get to have the ideal solution (e.g. because it requires information we don't possess). But I still think moving the decision to the node level would allow us to make better decisions compared to just doing it for the query as a whole.

> From my own point of view, I definitely agree with David's statement
> that what we really want to know is how many times each expression
> will be evaluated. If we had that information, or just an estimate, I
> think we could make much better decisions in this area. But we don't
> have that infrastructure now, and it doesn't seem easy to create, so
> it seems to me that what we have to decide now is whether applying a
> cost threshold on a per-plan-node basis will produce better or worse
> results than making one decision for the whole plan. David's
> provided an example of where it does indeed work better back in
> https://www.postgresql.org/message-id/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com
> - but could there be enough cases where the opposite happens to make
> us think that the patch is overall a bad idea?

Right, this risk of regression is always there, and I'm sure it'd be possible to construct such cases. But considering how crude the current costing is, I'd be surprised if this ends up being a net negative. Also, is the number of executions really the thing we're missing?
Surely we know the number of rows the node is dealing with, so we could use this (yes, I realize there are issues, but we deal with that when costing quals too). Isn't it a much bigger issue that we have pretty much no cost model for the actual JIT (compilation/optimization) depending on how many expressions it deals with?

> I personally find that a bit unlikely, although not impossible. I see
> a couple of ways that using the per-node cost can distort things -- it
> seems like it will tend to heavily feature JIT for "interior" plan
> nodes because the cost of a plan node includes its children -- and as
> was mentioned previously, it doesn't really care whether the node cost
> is high because of expression evaluation or something else. But
> neither of those things seem like they'd be bad enough to make this a
> bad way forward over all. For the patch to lose, it seems like we'd
> need a case where the overall plan cost would have been high enough to
> trigger JIT pre-patch, but most of the benefit would have come from
> relatively low-cost nodes that don't get JITted post-patch. The
> easiest way for that to happen is if the planner's estimates are off,
> but that's not really an argument against this patch as much as it is
> an argument that query planning is hard in general.
>
> A slightly subtler way the patch could lose is if the new threshold is
> harder to adjust than the old one. For example, imagine that you have
> a query that does a Cartesian join. That makes the cost of the input
> nodes rather small compared
Re: cleanup patches for incremental backup
On Wed, Jan 24, 2024 at 12:05:15PM -0600, Nathan Bossart wrote:
> There might be an overflow risk in the cutoff time calculation, but I doubt
> that's the root cause of these failures:
>
> /*
>  * Files should only be removed if the last modification time precedes the
>  * cutoff time we compute here.
>  */
> cutoff_time = time(NULL) - 60 * wal_summary_keep_time;

I've attached a short patch for fixing this overflow risk. Specifically, it limits wal_summary_keep_time to INT_MAX / SECS_PER_MINUTE, just like log_rotation_age. I considered checking for overflow when we subtract the keep-time from the result of time(2), but AFAICT there's only a problem if time_t is unsigned, which Wikipedia leads me to believe is unusual [0], so I figured we might be able to just wait this one out until 2038.

> Otherwise, I think we'll probably need to add some additional logging to
> figure out what is happening...

Separately, I suppose it's probably time to revert the temporary debugging code added by commit 5ddf997. I can craft a patch for that, too.

[0] https://en.wikipedia.org/wiki/Unix_time#Representing_the_number

-- Nathan Bossart Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ec2874c18c..c820d1f9ed 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
  * GUC parameters
  */
 bool		summarize_wal = false;
-int			wal_summary_keep_time = 10 * 24 * 60;
+int			wal_summary_keep_time = 10 * HOURS_PER_DAY * MINS_PER_HOUR;
 
 static void WalSummarizerShutdown(int code, Datum arg);
 static XLogRecPtr GetLatestLSN(TimeLineID *tli);
@@ -1474,7 +1474,7 @@ MaybeRemoveOldWalSummaries(void)
 	 * Files should only be removed if the last modification time precedes the
 	 * cutoff time we compute here.
 	 */
-	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+	cutoff_time = time(NULL) - wal_summary_keep_time * SECS_PER_MINUTE;
 
 	/* Get all the summaries that currently exist. */
 	wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57d9de4dd9..1e71e7db4a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3293,9 +3293,9 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MIN,
 		},
 		&wal_summary_keep_time,
-		10 * 24 * 60,			/* 10 days */
+		10 * HOURS_PER_DAY * MINS_PER_HOUR, /* 10 days */
 		0,
-		INT_MAX,
+		INT_MAX / SECS_PER_MINUTE,
 		NULL, NULL, NULL
 	},
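For illustration, a minimal standalone sketch (not the server code) of why capping the GUC at INT_MAX / SECS_PER_MINUTE removes the hazard: with a 32-bit int, 60 * keep_time is undefined behavior once keep_time exceeds INT_MAX / 60, so bounding the GUC makes the multiplication safe by construction.

    #include <limits.h>
    #include <stdio.h>
    #include <time.h>

    #define SECS_PER_MINUTE 60

    int
    main(void)
    {
        int     keep_time = INT_MAX / SECS_PER_MINUTE;  /* the new GUC maximum */
        time_t  cutoff;

        /* keep_time * SECS_PER_MINUTE is at most INT_MAX, so no int overflow */
        cutoff = time(NULL) - (time_t) keep_time * SECS_PER_MINUTE;
        printf("cutoff = %lld\n", (long long) cutoff);
        return 0;
    }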
Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
On 14/03/2024 22:00, Melanie Plageman wrote:
> On Thu, Mar 14, 2024 at 05:30:30PM +0200, Heikki Linnakangas wrote:
>> typedef struct SharedBitmapHeapInstrumentation
>> {
>>     int         num_workers;
>>     BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
>> } SharedBitmapHeapInstrumentation;
>>
>> typedef struct BitmapHeapScanState
>> {
>>     ScanState   ss;             /* its first field is NodeTag */
>>     ...
>>     SharedBitmapHeapInstrumentation sinstrument;
>> } BitmapHeapScanState;
>>
>> that compiles, at least with my compiler, but I find it weird to have a
>> variable-length inner struct embedded in an outer struct like that.
>
> In the attached patch, BitmapHeapScanState->sinstrument is a pointer,
> though. Or are you proposing the above as an alternative that you
> decided not to go with?

Right, the above is what I contemplated at first but decided it was a bad idea.

-- Heikki Linnakangas Neon (https://neon.tech)
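For readers unfamiliar with the pattern being discussed, here is a standalone sketch in plain C (not PostgreSQL internals; names are illustrative) of copying a flexible-array instrumentation struct out of shared memory into a private allocation before the shared region is released, which is what the Exec*RetrieveInstrumentation() functions do:

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct WorkerStats
    {
        long    pages_scanned;      /* stand-in for per-worker counters */
    } WorkerStats;

    typedef struct SharedStats
    {
        int         num_workers;
        WorkerStats sinstrument[];  /* flexible array member */
    } SharedStats;

    /* size of the struct for n workers, computed via offsetof() */
    static size_t
    shared_stats_size(int num_workers)
    {
        return offsetof(SharedStats, sinstrument) +
            num_workers * sizeof(WorkerStats);
    }

    int
    main(void)
    {
        int     nworkers = 3;

        /* pretend this lives in dynamic shared memory */
        SharedStats *shared = calloc(1, shared_stats_size(nworkers));
        shared->num_workers = nworkers;
        for (int i = 0; i < nworkers; i++)
            shared->sinstrument[i].pages_scanned = 100 * (i + 1);

        /* leader copies the stats out before shared memory is detached */
        SharedStats *private_copy = malloc(shared_stats_size(nworkers));
        memcpy(private_copy, shared, shared_stats_size(nworkers));
        free(shared);               /* "shared" memory goes away */

        for (int i = 0; i < private_copy->num_workers; i++)
            printf("worker %d: %ld pages\n", i,
                   private_copy->sinstrument[i].pages_scanned);
        free(private_copy);
        return 0;
    }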
Re: BitmapHeapScan streaming read user and prelim refactoring
On Fri, Mar 15, 2024 at 3:18 AM Tomas Vondra wrote: > So, IIUC this means (1) the patched code is more aggressive wrt > prefetching (because we prefetch more data overall, because master would > prefetch N pages and patched prefetches N ranges, each of which may be > multiple pages. And (2) it's not easy to quantify how much more > aggressive it is, because it depends on how we happen to coalesce the > pages into ranges. > > Do I understand this correctly? Yes. Parallelism must prevent coalescing here though. Any parallel aware executor node that allocates block numbers to workers without trying to preserve ranges will. That not only hides the opportunity to coalesce reads, it also makes (globally) sequential scans look random (ie locally they are more random), so that our logic to avoid issuing advice for sequential scan won't work, and we'll inject extra useless or harmful (?) fadvise calls. I don't know what to do about that yet, but it seems like a subject for future research. Should we recognise sequential scans with a window (like Linux does), instead of strictly next-block detection (like some other OSes do)? Maybe a shared streaming read that all workers pull blocks from, so it can see what's going on? I think the latter would be strictly more like what the ad hoc BHS prefetching code in master is doing, but I don't know if it'd be over-engineering, or hard to do for some reason. Another aspect of per-backend streaming reads in one parallel query that don't know about each other is that they will all have their own effective_io_concurrency limit. That is a version of a problem that comes up again and again in parallel query, to be solved by the grand unified resource control system of the future.
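To make the coalescing point concrete, here is a standalone simplified sketch (hypothetical, not the streaming read API) of merging an ascending stream of block numbers into contiguous ranges. Striping blocks across parallel workers destroys the adjacency, which is the problem described above:

    #include <stdio.h>

    typedef unsigned int BlockNumber;

    static void
    emit_ranges(const BlockNumber *blocks, int n)
    {
        int     i = 0;

        while (i < n)
        {
            BlockNumber start = blocks[i];
            BlockNumber end = start;

            /* extend the range while the next block is adjacent */
            while (i + 1 < n && blocks[i + 1] == end + 1)
                end = blocks[++i];
            printf("read blocks %u..%u (%u pages)\n",
                   start, end, end - start + 1);
            i++;
        }
    }

    int
    main(void)
    {
        /* a sequential stream coalesces into one large read */
        BlockNumber seq[] = {10, 11, 12, 13, 14, 15};
        /* the same blocks dealt alternately to one of two workers do not */
        BlockNumber striped[] = {10, 12, 14};

        emit_ranges(seq, 6);
        emit_ranges(striped, 3);
        return 0;
    }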
Re: Recent 027_streaming_regress.pl hangs
Thomas Munro writes: > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin wrote: >> Could it be that the timeout (360 sec?) is just not enough for the test >> under the current (changed due to switch to meson) conditions? > But you're right that under meson the test takes a lot longer, I guess > due to increased concurrency: What it seems to be is highly variable. Looking at calliphoridae's last half dozen successful runs, I see reported times for 027_stream_regress anywhere from 183 to 704 seconds. I wonder what else is happening on that machine. Also, this is probably not helping anything: 'extra_config' => { ... 'fsync = on' I would suggest turning that off and raising wait_timeout a good deal, and then we'll see if calliphoridae gets any more stable. regards, tom lane
Re: Built-in CTYPE provider
On Thu, 2024-03-14 at 15:38 +0100, Peter Eisentraut wrote:
> On 14.03.24 09:08, Jeff Davis wrote:
>> 0001 (the C.UTF-8 locale) is also close...
>
> I have tested this against the libc locale C.utf8 that was available on
> the OS, and the behavior is consistent.

That was the goal, in spirit. But to clarify: it's not guaranteed that the built-in C.UTF-8 is always the same as the libc UTF-8, because different implementations do different things. For instance, I saw significant differences on MacOS.

> I wonder if we should version the builtin locales too. We might make a
> mistake and want to change something sometime?

I'm fine with that, see v25-0004 in the reply to your other mail. The version only tracks sort order, and all of the builtin locales sort based on memcmp(). But it's possible there are bugs in the optimizations around memcmp() (e.g. abbreviated keys, or some future optimization).

> Tiny comments:
>
> * src/bin/scripts/t/020_createdb.pl
>
> The two added tests should have different names that tell them apart
> (like the new initdb tests).
>
> * src/include/catalog/pg_collation.dat

Done in v25-0002 (in reply to your other mail).

Regards, Jeff Davis
Re: Add basic tests for the low-level backup method.
On 3/14/24 20:00, Michael Paquier wrote: On Thu, Mar 14, 2024 at 09:12:52AM +1300, David Steele wrote: I think you are right that the start message is better since it can only appear once when the backup_label is found. The completed message could in theory appear after a restart, though the backup_label must have been found at some point.

So, I've given a try to this patch with 99b4a63bef94, to note that sidewinder failed because of a timing issue on Windows: the recovery of the node without backup_label, expected to fail, would try to back up the last segment it has replayed, because, as it has no backup_label, it behaves like the primary. It would try to use the same archive location as the primary, leading to a conflict failure on Windows. This one was easy to fix, by overwriting postgresql.conf on the node to not do archiving.

Hmmm, I wonder why this did not show up in the Windows tests on CI?

Following that, I've noticed a second race condition: we don't wait for the segment after pg_switch_wal() to be archived. This one can be easily avoided with a poll on pg_stat_archiver.

Ugh, yeah, good change.

After that, also because I've initially managed to, cough, forget an update of meson.build to list the new test, I've noticed a third failure on Windows for the case of the node that has a backup_label. Here is one of the failures: https://cirrus-ci.com/task/5245341683941376

Is the missing test in meson the reason we did not see test failures for Windows in CI?

regress_log_042_low_level_backup and 042_low_level_backup_replica_success.log have all the information needed, that can be summarized like that:

The system cannot find the file specified.
2024-03-14 06:02:37.670 GMT [560][startup] FATAL: invalid data in file "backup_label"

The first message is something new to me, that seems to point to a corruption failure of the file system. Why don't we see something similar in other tests, then? Leaving that aside..

The second LOG is something that can be acted on. I've added some debugging to the parsing of the backup_label file in the backend, and noticed that the first fscanf() for START WAL LOCATION is failing because the last %c is detected as \r rather than \n. Tweaking the contents stored from pg_backup_stop() with a sed won't help, because the issue is that we write the CRLFs with append_to_file, and the startup process cannot cope with that. The simplest method I can think of is to use binmode, as of the attached.

Yeah, that makes sense.

It is the first time that we'd take the contents received from a BackgroundPsql and write them to a file parsed by the backend, so perhaps we should try to do that in a more general way, but I'm not sure how, tbh, and the case of this test is special while adding handling for \r when reading the backup_label got discussed in the past but we were OK with what we are doing now on HEAD.

I think it makes sense to leave the parsing code as is and make the change in the test. If we add more tests to this module we'll probably need a function to avoid repeating code.

On top of all that, note that I have removed remove_tree as I am not sure if this would be OK in all the buildfarm animals, added a quit() for BackgroundPsql, moved queries to use less BackgroundPsql, as well as a few other things like avoiding the hardcoded segment names. meson.build is.. Cough.. Updated now.

OK.

I am attaching an updated patch with all that fixed, which is stable in the CI and any tests I've run. Do you have any comments about

These changes look good to me.
Sure wish we had an easier way to test commits in the build farm.

Regards, -David
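For illustration, a minimal standalone C sketch of the CRLF problem discussed in this thread; the label content shown is illustrative only. On Windows, a stream opened in text mode ("w") translates '\n' into "\r\n" on write, while binary mode ("wb") writes bytes verbatim, which is what the backup_label parser expects:

    #include <stdio.h>

    int
    main(void)
    {
        /* "wb", not "w": suppress newline translation on Windows */
        FILE   *f = fopen("backup_label", "wb");

        if (f == NULL)
            return 1;
        fputs("START WAL LOCATION: 0/2000028 (file 000000010000000000000002)\n", f);
        fclose(f);
        return 0;
    }

With text mode, a strict parser that matches a trailing '\n' (like the backend's fscanf() for START WAL LOCATION) would see '\r' instead and fail, which matches the symptom above.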
Re: Possibility to disable `ALTER SYSTEM`
On Thu, Mar 14, 2024 at 4:08 PM Tom Lane wrote: > The patch-of-record contains no such wording. I plan to fix that, if nobody else beats me to it. > And if this isn't a > security feature, then what is it? If you have to say to your > (super) users "please don't mess with the system configuration", > you might as well just trust them not to do it the easy way as not > to do it the hard way. If they're untrustworthy, why have they > got superuser? I mean, I feel like this question has been asked and answered before, multiple times, on this thread. If you sincerely don't understand the use case, I can try again to explain it. But somehow I feel like it's more that you just don't like the idea, which is fair, but it seems like a considerable number of people feel otherwise. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 13.03.2024 10:41, Anton A. Melnikov wrote: Here is a version updated for the current master. During patch updating I mistakenly added double counting of deallocated blocks. That's why the tests in the patch tester failed. Fixed it and squashed fix 0002 with 0001. Here is the fixed version.

With the best wishes!

-- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

From 7a0925a38acfb9a945087318a5d91fae4680db0e Mon Sep 17 00:00:00 2001
From: Tomas Vondra
Date: Tue, 26 Dec 2023 17:54:40 +0100
Subject: [PATCH 1/2] Add tracking of backend memory allocated

Add tracking of backend memory allocated in total and by allocation type (aset, dsm, generation, slab) by process. allocated_bytes tracks the current bytes of memory allocated to the backend process. aset_allocated_bytes, dsm_allocated_bytes, generation_allocated_bytes and slab_allocated_bytes track the allocation by type for the backend process. They are updated for the process as memory is malloc'd/freed. Memory allocated to items on the freelist is included. Dynamic shared memory allocations are included only in the value displayed for the backend that created them, they are not included in the value for backends that are attached to them to avoid double counting. DSM allocations that are not destroyed by the creating process prior to its exit are considered long lived and are tracked in a global counter global_dsm_allocated_bytes. We limit the floor of allocation counters to zero. Created views pg_stat_global_memory_allocation and pg_stat_memory_allocation for access to these trackers.
---
 doc/src/sgml/monitoring.sgml                | 246
 src/backend/catalog/system_views.sql        | 34 +++
 src/backend/storage/ipc/dsm.c               | 11 +-
 src/backend/storage/ipc/dsm_impl.c          | 78 +++
 src/backend/storage/lmgr/proc.c             | 1 +
 src/backend/utils/activity/backend_status.c | 114 +
 src/backend/utils/adt/pgstatfuncs.c         | 84 +++
 src/backend/utils/init/miscinit.c           | 3 +
 src/backend/utils/mmgr/aset.c               | 17 ++
 src/backend/utils/mmgr/generation.c         | 13 ++
 src/backend/utils/mmgr/slab.c               | 22 ++
 src/include/catalog/pg_proc.dat             | 17 ++
 src/include/storage/proc.h                  | 2 +
 src/include/utils/backend_status.h          | 144 +++-
 src/test/regress/expected/rules.out         | 27 +++
 src/test/regress/expected/stats.out         | 36 +++
 src/test/regress/sql/stats.sql              | 20 ++
 17 files changed, 867 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8736eac284..41d788be45 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4599,6 +4599,252 @@ description | Waiting for a newly initialized WAL file to reach durable storage

[The remainder of the attachment is garbled in this archive. The hunk adds documentation for the new pg_stat_memory_allocation view ("one row per server process, showing information related to the current memory allocation of that process in total and by allocator type; due to the dynamic nature of memory allocations the allocated bytes values may not be exact but should be sufficient for the intended purposes"), with columns datid, pid, allocated_bytes, aset_allocated_bytes, dsm_allocated_bytes, generation_allocated_bytes and slab_allocated_bytes, plus a pg_stat_global_memory_allocation view including global_dsm_allocated_bytes, followed by the code changes listed in the diffstat above.]
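For illustration, a standalone toy sketch (not the patch's code) of the balance-counting idea the patch implements: each allocation adds its size to a per-process counter and each free subtracts it, so the counter always holds the bytes currently outstanding. The real patch does this per allocator type (aset, dsm, generation, slab) inside the memory-context machinery.

    #include <stdio.h>
    #include <stdlib.h>

    static long long allocated_bytes = 0;   /* per-process running balance */

    static void *
    tracked_alloc(size_t size)
    {
        /* stash the size in a header so tracked_free() can subtract it */
        size_t *p = malloc(sizeof(size_t) + size);

        if (p == NULL)
            return NULL;
        *p = size;
        allocated_bytes += size;
        return p + 1;
    }

    static void
    tracked_free(void *ptr)
    {
        size_t *p = (size_t *) ptr - 1;

        allocated_bytes -= *p;
        if (allocated_bytes < 0)
            allocated_bytes = 0;    /* clamp the floor at zero, as the patch does */
        free(p);
    }

    int
    main(void)
    {
        void   *a = tracked_alloc(1024);
        void   *b = tracked_alloc(4096);

        printf("balance: %lld\n", allocated_bytes); /* 5120 */
        tracked_free(a);
        printf("balance: %lld\n", allocated_bytes); /* 4096 */
        tracked_free(b);
        return 0;
    }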
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, Mar 14, 2024 at 1:54 PM Jelte Fennema-Nio wrote:
> In my view there can be, **by definition,** only two general types of
> protocol changes:
> 1. A "simple protocol change": This is one that requires agreement by
> both the client and server, that they understand the new message types
> involved in this change. e.g. the ParameterSet message proposal (this
> message type is either supported or it's not)
> 2. A "parameterized protocol change": This requires the same as 1, but
> should also allow some parameterization from the client, e.g. for the
> compression proposal, the client should specify what compression
> algorithm the server should use to compress data when sending it to
> the client.

You seem to be supposing here that all protocol changes will consist of adding new message types. While I think that will be a common pattern, I do not think it will be a universal one. I do agree, however, that every possible variation of the protocol is either Boolean-valued (i.e. are we doing X or not?) or something more complicated (i.e. how exactly are we doing X?).

> Client and Server can agree that a "simple protocol change" is
> supported by both advertising a minimum minor protocol version. And
> for a "parameterized protocol change" the client can send a _pq_
> parameter in the startup packet.
>
> So, new _pq_ parameters should only ever be introduced for
> parameterized protocol changes. They are not meant to advertise
> support, they are meant to configure protocol features. For a
> non-configurable protocol feature, we'd simply bump the protocol
> version. And since _pq_ parameters thus always control some setting at
> the session level, we can simply store it as a GUC, because that's how
> we store all our parameters at a session level.

This is definitely not how I was thinking about it. I was imagining that we wanted to reserve bumping the protocol version for more significant changes, and that we'd use _pq_ parameters for relatively minor new functionality whether Boolean-valued or otherwise.

> I think given the automatic downgrade supported by the
> NegotiateProtocolVersion, there's no real down-side to requesting the
> newest version by default. The only downside I can see is when
> connecting to other applications (i.e. non PostgreSQL servers) that
> implement the postgres protocol, but don't implement
> NegotiateProtocolVersion. But for that I added the
> max_protocol_version connection option in 0006 (of my latest
> patchset).

You might be right. This is a minor point that's not worth arguing about too much.

>> I'm really unhappy with 0002-0004.
>
> Just to be clear, (afaict) your argument below seems to only really be
> about 0004, not about 0002 or 0003. Was there anything about 0002 &
> 0003 that you were unhappy with? 0002 & 0003 are not dependent on 0004
> imho. Because even when not making _pq_ parameters map to GUCs, we'd
> still need to change libpq to not instantly close the connection
> whenever a _pq_ parameter (or new protocol version) is not supported
> by the server (which is what 0002 & 0003 do).

I completely agree that we need to avoid slamming the connection shut. What I don't agree with is taking the list of protocol extensions that the server knows about and putting it into an array of strings that the user can see. I don't want the user to know or care so much about what's happening at the wire protocol level.
The user is entitled to know whether PQsetProtocolParameter() will work or not, and the user is entitled to know whether it has a chance of working next time if it didn't work this time, and when it fails, the user is entitled to a good error message explaining the reason for the failure. But the user is not entitled to know what negotiation took place over the wire to figure that out. They shouldn't need to know that the _pq_ namespace exists, and they shouldn't need to know whether we negotiated the availability or unavailability of PQsetProtocolParameter() using [a] the protocol minor version number, [b] the protocol major version number, [c] a protocol option called parameter_set, or [d] a protocol option called bananas_foster. Those are all things that might change in the future. Just as a for instance, I had a thought that we might accumulate a few new message types controlled by protocol options (ParameterSet, AlwaysSendTypeInBinaryFormat, whatever else) while keeping the protocol version as 3.0, and then eventually bump the protocol version to 3.1 where all of that would be mandatory functionality. So the protocol parameters wouldn't be specified any more when using 3.1, but they would be specified when talking to older 3.0 servers. That difference shouldn't be visible to the user. The user should only know the result of the negotiation. > Okay, our interpretation is very different here. From my perspective > introducing a non-GUC namespace is NOT AT ALL the benefit of the _pq_ > prefix. The main benefit (imho) is that it allows sending an >
Re: Recent 027_streaming_regress.pl hangs
On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin wrote:
> Could it be that the timeout (360 sec?) is just not enough for the test
> under the current (changed due to switch to meson) conditions?

Hmm, well it looks like he switched over to meson around 42 days ago (2024-02-01), looking at "calliphoridae" (skink has the extra complication of valgrind, let's look at a more 'normal' animal instead). The first failure that looks like that on calliphoridae is 19 days ago (2024-02-23), and after that it's happening every 3 days, sometimes in clusters.

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=calliphoridae&br=HEAD

But you're right that under meson the test takes a lot longer, I guess due to increased concurrency:

287/287 postgresql:recovery / recovery/027_stream_regress OK 684.50s 6 subtests passed

With make we don't have an individual time per script, but for all of the recovery tests we had for example:

t/027_stream_regress.pl ... ok
All tests successful.
Files=39, Tests=542, 65 wallclock secs ( 0.26 usr 0.06 sys + 20.16 cusr 31.65 csys = 52.13 CPU)
Re: Possibility to disable `ALTER SYSTEM`
Robert Haas writes: > On Thu, Mar 14, 2024 at 3:13 PM Tom Lane wrote: >> With the possible exception of #1, every one of these is easily >> defeatable by an uncooperative superuser. I'm not excited about >> adding a "security" feature with such obvious holes in it. > We're going to document that it's not a security feature along the > lines of what Magnus suggested in > http://postgr.es/m/CABUevEx9m=CV8=wpxvw+rtvvs858kdj6yprkexv7n+f6mk0...@mail.gmail.com The patch-of-record contains no such wording. And if this isn't a security feature, then what is it? If you have to say to your (super) users "please don't mess with the system configuration", you might as well just trust them not to do it the easy way as not to do it the hard way. If they're untrustworthy, why have they got superuser? What I think this is is a loaded foot-gun painted in kid-friendly colors. People will use it and then file CVEs about how it did not turn out to be as secure as they imagined (probably without reading the documentation). regards, tom lane
Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
On Thu, Mar 14, 2024 at 05:30:30PM +0200, Heikki Linnakangas wrote:
> On 18/02/2024 00:31, Tomas Vondra wrote:
>> Do you plan to continue working on this patch? I did take a look,
>> and on the whole it looks reasonable - it modifies the right places etc.
>
> +1
>
>> I think there are two things that may need an improvement:
>>
>> 1) Storing variable-length data in ParallelBitmapHeapState
>>
>> I agree with Robert the snapshot_and_stats name is not great. I see
>> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
>> the reasons are somewhat different (phs_snapshot_off exists because we
>> don't know which exact struct will be allocated), while here we simply
>> need to allocate two variable-length pieces of memory. But it seems like
>> it would work nicely for this. That is, we could try adding an offset
>> for each of those pieces of memory:
>>
>> - snapshot_off
>> - stats_off
>>
>> I don't like the GetSharedSnapshotData name very much, it seems very
>> close to GetSnapshotData - quite confusing, I think.
>>
>> Dmitry also suggested we might add a separate piece of shared memory. I
>> don't quite see how that would work for ParallelBitmapHeapState, but I
>> doubt it'd be simpler than having two offsets. I don't think the extra
>> complexity (paid by everyone) would be worth it just to make EXPLAIN
>> ANALYZE work.
>
> I just removed phs_snapshot_data in commit 84c18acaf6. I thought that would
> make this moot, but now that I rebased this, there are still some aesthetic
> questions on how best to represent this.
>
> In all the other node types that use shared instrumentation like this, the
> pattern is as follows: (using Memoize here as an example, but it's similar
> for Sort, IncrementalSort, Agg and Hash)
>
> /*
>  * Shared memory container for per-worker memoize information
>  */
> typedef struct SharedMemoizeInfo
> {
>     int         num_workers;
>     MemoizeInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
> } SharedMemoizeInfo;
>
> /* this struct is backend-private */
> typedef struct MemoizeState
> {
>     ScanState   ss;             /* its first field is NodeTag */
>     ...
>     MemoizeInstrumentation stats;   /* execution statistics */
>     SharedMemoizeInfo *shared_info; /* statistics for parallel workers */
> } MemoizeState;
>
> While the scan is running, the node updates its private data in
> MemoizeState->stats. At the end of a parallel scan, the worker process
> copies the MemoizeState->stats to MemoizeState->shared_info->stats, which
> lives in shared memory. The leader process copies
> MemoizeState->shared_info->stats to its own backend-private copy, which it
> then stores in its MemoizeState->shared_info, replacing the pointer to the
> shared memory with a pointer to the private copy. That happens in
> ExecMemoizeRetrieveInstrumentation().
>
> This is a little different for parallel bitmap heap scans, because a bitmap
> heap scan keeps some other data in shared memory too, not just
> instrumentation data. Also, the naming is inconsistent: the equivalent of
> SharedMemoizeInfo is actually called ParallelBitmapHeapState. I think we
> should rename it to SharedBitmapHeapInfo, to make it clear that it lives in
> shared memory, but I digress.

FWIW, if we merge a BHS streaming read user like the one I propose in [1] (not as a pre-condition to this but just as something to make you more comfortable with these names), the ParallelBitmapHeapState will basically only contain the shared iterator and the coordination state for accessing it and could be named as such.

Then if you really wanted to be consistent with Memoize, you could name the instrumentation SharedBitmapHeapInfo. But, personally I prefer the name you gave it: SharedBitmapHeapInstrumentation. I think that would have been a better name for SharedMemoizeInfo since num_workers is really just used as the length of the array of instrumentation info.

> We could now put the new stats at the end of ParallelBitmapHeapState as a
> varlen field. But I'm not sure that's a good idea. In
> ExecBitmapHeapRetrieveInstrumentation(), would we make a backend-private
> copy of the whole ParallelBitmapHeapState struct, even though the other
> fields don't make sense after the shared memory is released? Sounds
> confusing. Or we could introduce a separate struct for the stats, and copy
> just that:
>
> typedef struct SharedBitmapHeapInstrumentation
> {
>     int         num_workers;
>     BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
> } SharedBitmapHeapInstrumentation;
>
> typedef struct BitmapHeapScanState
> {
>     ScanState   ss;             /* its first field is NodeTag */
>     ...
>     SharedBitmapHeapInstrumentation sinstrument;
> } BitmapHeapScanState;
>
> that compiles, at least with my compiler, but I find it
Re: JIT compilation per plan node
Thanks for chipping in here. On Fri, 15 Mar 2024 at 08:14, Robert Haas wrote: > A slightly subtler way the patch could lose is if the new threshold is > harder to adjust than the old one. For example, imagine that you have > a query that does a Cartesian join. That makes the cost of the input > nodes rather small compared to the cost of the join node, and it also > means that JITting the inner join child in particular is probably > rather important. But if you set join_above_cost low enough to JIT > that node post-patch, then maybe you'll also JIT a bunch of things > that aren't on the inner side of a nested loop and which might > therefore not really need JIT. Unless I'm missing something, this is a > fairly realistic example of where this patch's approach to costing > could turn out to be painful ... but it's not like the current system > is pain-free either. I think this case would be covered as the cost of the inner side of the join would be multiplied by the estimated outer-side rows. Effectively, making this part work is the bulk of the patch as we currently don't know the estimated number of loops of a node during create plan. David
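To make the point about loop counts concrete, here is a standalone toy sketch (hypothetical names and numbers, not the patch's code) of a per-node JIT decision in which a node's cost is scaled by its estimated number of executions, so a cheap inner-side node of a nested loop still qualifies:

    #include <stdio.h>

    typedef struct PlanNode
    {
        const char *name;
        double      self_cost;      /* cost of one execution of this node */
        double      est_loops;      /* estimated rescans, e.g. outer rows */
    } PlanNode;

    /* JIT the node's expressions when total work crosses the threshold */
    static int
    jit_node(const PlanNode *n, double jit_above_cost)
    {
        return n->self_cost * n->est_loops >= jit_above_cost;
    }

    int
    main(void)
    {
        double      jit_above_cost = 100000.0;
        PlanNode    nodes[] = {
            {"SeqScan (outer)", 2000.0, 1.0},       /* 2000: no JIT */
            {"IndexScan (inner of nestloop)", 25.0, 50000.0},   /* 1.25M: JIT */
        };

        for (int i = 0; i < 2; i++)
            printf("%s: JIT=%s\n", nodes[i].name,
                   jit_node(&nodes[i], jit_above_cost) ? "yes" : "no");
        return 0;
    }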
RE: Popcount optimization using AVX512
> -Original Message-
> From: Nathan Bossart
> Sent: Monday, March 11, 2024 6:35 PM
> To: Amonson, Paul D

> Thanks. There's no need to wait to post the AVX portion. I recommend using
> "git format-patch" to construct the patch set for the lists.

After exploring the git format-patch command I think I understand what you need. Attached.

>> What exactly do you suggest here? I am happy to always call either
>> pg_popcount32() or pg_popcount64() with the understanding that it may
>> not be optimal, but I do need to know which to use.

> I'm recommending that we don't change any of the code in the pg_popcount()
> function (which is renamed to pg_popcount_slow() in your v6 patch). If
> pointers are 8 or more bytes, we'll try to process the buffer in 64-bit
> chunks. Else, we'll try to process it in 32-bit chunks. Any remaining bytes
> will be processed one-by-one.

Ok, we are on the same page now. :) It is already fixed that way in the refactor patch #1.

As for new performance numbers: I just ran a full suite like I did earlier in the process. My latest results, on the equivalent of a pgbench scale factor 10 DB with the target column having varying column widths and appropriate random data, show a 1.2% improvement with a 2.2% margin of error at a 98% confidence level. Still seeing improvement and no regressions. As stated in the previous separate chain I updated the code removing the extra "extern" keywords.

Thanks, Paul
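For illustration, a standalone sketch (assuming a GCC/Clang-style __builtin_popcountll; not the pg_popcount_slow() implementation itself) of the chunked strategy described above: count set bits 64 bits at a time and handle any remaining tail bytes one by one.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint64_t
    buffer_popcount(const unsigned char *buf, size_t len)
    {
        uint64_t    count = 0;

        /* process the bulk of the buffer in 64-bit chunks */
        while (len >= sizeof(uint64_t))
        {
            uint64_t    chunk;

            memcpy(&chunk, buf, sizeof(chunk)); /* avoids alignment issues */
            count += __builtin_popcountll(chunk);
            buf += sizeof(chunk);
            len -= sizeof(chunk);
        }
        /* remaining bytes one-by-one */
        while (len-- > 0)
            count += __builtin_popcountll(*buf++);
        return count;
    }

    int
    main(void)
    {
        unsigned char data[13];

        memset(data, 0xFF, sizeof(data));
        /* 13 bytes of 0xFF = 104 set bits */
        printf("%llu\n", (unsigned long long) buffer_popcount(data, sizeof(data)));
        return 0;
    }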
Re: WIP Incremental JSON Parser
I've been poking at the partial token logic. The json_errdetail() bug mentioned upthread (e.g. for an invalid input `[12zz]` and small chunk size) seems to be due to the disconnection between the "main" lex instance and the dummy_lex that's created from it. The dummy_lex contains all the information about the failed token, which is discarded upon an error return:

> partial_result = json_lex(&dummy_lex);
> if (partial_result != JSON_SUCCESS)
>     return partial_result;

In these situations, there's an additional logical error: lex->token_start is pointing to a spot in the string after lex->token_terminator, which breaks an invariant that will mess up later pointer math.

Nothing appears to be setting lex->token_start to point into the partial token buffer until _after_ the partial token is successfully lexed, which doesn't seem right -- in addition to the pointer math problems, if a previous chunk was freed (or on a stale stack frame), lex->token_start will still be pointing off into space. Similarly, wherever we set token_terminator, we need to know that token_start is pointing into the same buffer.

Determining the end of a token is now done in two separate places between the partial- and full-lexer code paths, which is giving me a little heartburn. I'm concerned that those could drift apart, and if the two disagree on where to end a token, we could lose data into the partial token buffer in a way that would be really hard to debug. Is there a way to combine them?

Thanks, --Jacob
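For readers following along, a standalone toy sketch of the invariant being discussed; the names here are illustrative, not the real JSON-parser API. When a token spans input chunks, its bytes must be copied into a buffer the lexer owns, and both token_start and token_terminator must point into that buffer, never into a previous chunk that may already be freed:

    #include <stdio.h>
    #include <string.h>

    typedef struct ToyLexer
    {
        char        partial[64];    /* survives across chunks */
        size_t      partial_len;
        const char *token_start;    /* must point into 'partial' once used */
        const char *token_terminator;
    } ToyLexer;

    /* stash an incomplete token's bytes before the chunk goes away */
    static void
    save_partial(ToyLexer *lex, const char *p, size_t n)
    {
        memcpy(lex->partial + lex->partial_len, p, n);
        lex->partial_len += n;
    }

    /* complete the token with bytes from the next chunk */
    static void
    finish_token(ToyLexer *lex, const char *p, size_t n)
    {
        save_partial(lex, p, n);
        lex->token_start = lex->partial;    /* both pointers, same buffer */
        lex->token_terminator = lex->partial + lex->partial_len;
    }

    int
    main(void)
    {
        ToyLexer    lex = {0};

        save_partial(&lex, "12z", 3);   /* chunk 1 of "[12zz]" ends mid-token */
        finish_token(&lex, "z", 1);     /* chunk 2 completes the bad token */
        printf("failed token: %.*s\n",
               (int) (lex.token_terminator - lex.token_start), lex.token_start);
        return 0;
    }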
Re: Possibility to disable `ALTER SYSTEM`
On Thu, Mar 14, 2024 at 3:13 PM Tom Lane wrote: > With the possible exception of #1, every one of these is easily > defeatable by an uncooperative superuser. I'm not excited about > adding a "security" feature with such obvious holes in it. > We reverted MAINTAIN last year for much less obvious holes; > how is it that we're going to look the other way on this one? We're going to document that it's not a security feature along the lines of what Magnus suggested in http://postgr.es/m/CABUevEx9m=CV8=wpxvw+rtvvs858kdj6yprkexv7n+f6mk0...@mail.gmail.com And then maybe someday we'll do this: > Really we'd need to do something about removing superusers' > access to the filesystem in order to build something with > fewer holes. I'm not against inventing such a feature, > but it'd take a fair amount of work and likely would end > in a noticeably less usable system (no plpython for example). Yep. It would be useful if you replied to the portion of http://postgr.es/m/ca+tgmoasugkz27x0xzh4edmq_b6jbrt6csuxf+phdgj-esk...@mail.gmail.com where I enumerate the methods that I know about for the superuser to get filesystem access. I don't think it's going to be practical to block all of those methods in a single commit, and I'm not entirely convinced that we can ever close all the holes without compromising the superuser's ability to do necessary system administration tasks, but maybe it's possible, and documenting the list of such methods would make it a lot easier for users to understand the risks and hackers to pick problems to try to tackle. -- Robert Haas EDB: http://www.enterprisedb.com
Re: broken JIT support on Fedora 40
Hi

On Thu, Mar 14, 2024 at 7:20 PM, Robert Haas wrote:
> On Wed, Mar 6, 2024 at 1:54 AM Pavel Stehule wrote:
>> after today's update, the build with --with-llvm produces broken code,
>> and make check fails with a crash
>>
>> Upgrade    hwdata-0.380-1.fc40.noarch                @updates-testing
>> Upgraded   hwdata-0.379-1.fc40.noarch                @@System
>> Upgrade    llvm-18.1.0~rc4-1.fc40.x86_64             @updates-testing
>> Upgraded   llvm-17.0.6-6.fc40.x86_64                 @@System
>> Upgrade    llvm-devel-18.1.0~rc4-1.fc40.x86_64       @updates-testing
>> Upgraded   llvm-devel-17.0.6-6.fc40.x86_64           @@System
>> Upgrade    llvm-googletest-18.1.0~rc4-1.fc40.x86_64  @updates-testing
>> Upgraded   llvm-googletest-17.0.6-6.fc40.x86_64      @@System
>> Upgrade    llvm-libs-18.1.0~rc4-1.fc40.i686          @updates-testing
>> Upgraded   llvm-libs-17.0.6-6.fc40.i686              @@System
>> Upgrade    llvm-libs-18.1.0~rc4-1.fc40.x86_64        @updates-testing
>> Upgraded   llvm-libs-17.0.6-6.fc40.x86_64            @@System
>> Upgrade    llvm-static-18.1.0~rc4-1.fc40.x86_64      @updates-testing
>> Upgraded   llvm-static-17.0.6-6.fc40.x86_64          @@System
>> Upgrade    llvm-test-18.1.0~rc4-1.fc40.x86_64        @updates-testing
>> Install    llvm17-libs-17.0.6-7.fc40.i686            @updates-testing
>> Install    llvm17-libs-17.0.6-7.fc40.x86_64          @updates-testing
>
> I don't know anything about LLVM, but somehow I'm guessing that people
> who do will need more detail than this in order to be able to fix the
> problem. I see that support for LLVM 18 was added in commit
> d282e88e50521a457fa1b36e55f43bac02a3167f on January 18th; perhaps you
> were building from an older commit?

I repeated the build and check today, on fresh master with today's Fedora updates, with the same result: the build is OK, but the regression tests fail with a crash (it works without --with-llvm).

Regards

Pavel
Re: JIT compilation per plan node
On Tue, Feb 20, 2024 at 5:31 AM Tomas Vondra wrote:
> I certainly agree that the current JIT costing is quite crude, and we've
> all seen cases where the decision turns out to not be great. And I think
> the plan to make the decisions at the node level makes sense, so +1 to
> that in general.

Seems reasonable to me also.

> And I think you're right that looking just at the node total cost may
> not be sufficient - that we may need a better cost model, considering
> how many times an expression is executed and so on. But I think we
> should try to do this in smaller steps, meaningful on their own,
> otherwise we won't move at all. The two threads linked by Melih are ~4y
> old and *nothing* changed since then, AFAIK.
>
> I think it's reasonable to start by moving the decision to the node
> level - it's where the JIT happens, anyway. It may not be perfect, but
> it seems like a clear improvement. And if we then choose to improve the
> "JIT cost model" to address some of the issues you pointed out, surely
> that would need to happen at the node level too ...

I'm not sure I understand whether you (Tomas) think that this patch is a good idea or a bad idea as it stands. I read the first of these two paragraphs to suggest that the patch hasn't really evolved much in the last few years, perhaps suggesting that if it wasn't good enough to commit back then, it still isn't now. But the second of these two paragraphs seems more supportive.

From my own point of view, I definitely agree with David's statement that what we really want to know is how many times each expression will be evaluated. If we had that information, or just an estimate, I think we could make much better decisions in this area. But we don't have that infrastructure now, and it doesn't seem easy to create, so it seems to me that what we have to decide now is whether applying a cost threshold on a per-plan-node basis will produce better or worse results than making one decision for the whole plan. David's provided an example of where it does indeed work better back in https://www.postgresql.org/message-id/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com - but could there be enough cases where the opposite happens to make us think that the patch is overall a bad idea?

I personally find that a bit unlikely, although not impossible. I see a couple of ways that using the per-node cost can distort things -- it seems like it will tend to heavily feature JIT for "interior" plan nodes because the cost of a plan node includes its children -- and as was mentioned previously, it doesn't really care whether the node cost is high because of expression evaluation or something else. But neither of those things seem like they'd be bad enough to make this a bad way forward over all. For the patch to lose, it seems like we'd need a case where the overall plan cost would have been high enough to trigger JIT pre-patch, but most of the benefit would have come from relatively low-cost nodes that don't get JITted post-patch. The easiest way for that to happen is if the planner's estimates are off, but that's not really an argument against this patch as much as it is an argument that query planning is hard in general.

A slightly subtler way the patch could lose is if the new threshold is harder to adjust than the old one. For example, imagine that you have a query that does a Cartesian join.
That makes the cost of the input nodes rather small compared to the cost of the join node, and it also means that JITting the inner join child in particular is probably rather important. But if you set join_above_cost low enough to JIT that node post-patch, then maybe you'll also JIT a bunch of things that aren't on the inner side of a nested loop and which might therefore not really need JIT. Unless I'm missing something, this is a fairly realistic example of where this patch's approach to costing could turn out to be painful ... but it's not like the current system is pain-free either. I don't really know what's best here, but I'm mildly inclined to believe that the patch might be a change for the better. I have not reviewed the implementation and have no comment on whether it's good or bad from that point of view. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Possibility to disable `ALTER SYSTEM`
Robert Haas writes: > As far as I can see from reading the thread, most people agree that > it's reasonable to have some way to disable ALTER SYSTEM, but there > are at least six competing ideas about how to do that: > 1. command-line option > 2. GUC > 3. event trigger > 4. extension > 5. sentinel file > 6. remove permissions on postgresql.auto.conf With the possible exception of #1, every one of these is easily defeatable by an uncooperative superuser. I'm not excited about adding a "security" feature with such obvious holes in it. We reverted MAINTAIN last year for much less obvious holes; how is it that we're going to look the other way on this one? #2 with the GUC_DISALLOW_IN_AUTO_FILE flag can be made secure (I think) by putting the main postgresql.conf file outside the data directory and then making it not owned by or writable by the postgres user. But I doubt that's a common configuration, and I'm sure we will get complaints from people who failed to set it up that way. The proposed patch certainly doesn't bother to document the hazard. Really we'd need to do something about removing superusers' access to the filesystem in order to build something with fewer holes. I'm not against inventing such a feature, but it'd take a fair amount of work and likely would end in a noticeably less usable system (no plpython for example). regards, tom lane
Re: Possibility to disable `ALTER SYSTEM`
On Thu, 14 Mar 2024 at 17:37, Robert Haas wrote:
> or in the alternative (2) but with the GUC being PGC_SIGHUP and
> GUC_DISALLOW_IN_AUTO_FILE. I believe there would be adequate consensus
> to proceed with either of those approaches. Anybody feel like coding
> it up?

Here is a slightly modified version of Gabriele's original patch, which already implemented the GUC approach. The changes I made are adding PGC_SIGHUP and GUC_DISALLOW_IN_AUTO_FILE, as well as adding some more docs.
Re: broken JIT support on Fedora 40
On Wed, Mar 6, 2024 at 1:54 AM Pavel Stehule wrote:
> after today's update, the build with --with-llvm produces broken code,
> and make check fails with a crash
>
> Upgrade    hwdata-0.380-1.fc40.noarch                @updates-testing
> Upgraded   hwdata-0.379-1.fc40.noarch                @@System
> Upgrade    llvm-18.1.0~rc4-1.fc40.x86_64             @updates-testing
> Upgraded   llvm-17.0.6-6.fc40.x86_64                 @@System
> Upgrade    llvm-devel-18.1.0~rc4-1.fc40.x86_64       @updates-testing
> Upgraded   llvm-devel-17.0.6-6.fc40.x86_64           @@System
> Upgrade    llvm-googletest-18.1.0~rc4-1.fc40.x86_64  @updates-testing
> Upgraded   llvm-googletest-17.0.6-6.fc40.x86_64      @@System
> Upgrade    llvm-libs-18.1.0~rc4-1.fc40.i686          @updates-testing
> Upgraded   llvm-libs-17.0.6-6.fc40.i686              @@System
> Upgrade    llvm-libs-18.1.0~rc4-1.fc40.x86_64        @updates-testing
> Upgraded   llvm-libs-17.0.6-6.fc40.x86_64            @@System
> Upgrade    llvm-static-18.1.0~rc4-1.fc40.x86_64      @updates-testing
> Upgraded   llvm-static-17.0.6-6.fc40.x86_64          @@System
> Upgrade    llvm-test-18.1.0~rc4-1.fc40.x86_64        @updates-testing
> Install    llvm17-libs-17.0.6-7.fc40.i686            @updates-testing
> Install    llvm17-libs-17.0.6-7.fc40.x86_64          @updates-testing

I don't know anything about LLVM, but somehow I'm guessing that people who do will need more detail than this in order to be able to fix the problem. I see that support for LLVM 18 was added in commit d282e88e50521a457fa1b36e55f43bac02a3167f on January 18th; perhaps you were building from an older commit?

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Pre-proposal: unicode normalized text
On Thu, 2024-02-29 at 17:02 -0800, Jeff Davis wrote: > Attached is an implementation of a per-database option STRICT_UNICODE > which enforces the use of assigned code points only. The CF app doesn't seem to point at the latest patch: https://www.postgresql.org/message-id/a0e85aca6e03042881924c4b31a840a915a9d349.ca...@j-davis.com which is perhaps why nobody has looked at it yet. But in any case, I'm OK if this gets bumped to 18. I still think it's a good feature, but some of the value will come later in v18 anyway, when I plan to propose support for case folding. Case folding is a version of lowercasing with compatibility guarantees when you only use assigned code points. Regards, Jeff Davis
Re: UUID v7
> On 14 Mar 2024, at 20:10, Peter Eisentraut wrote: > > I think the behavior of uuid_extract_var(iant) is wrong. The code > takes just two bits to return, but the draft document is quite clear > that the variant is 4 bits (see Table 1). Well, it was correct only for the implemented variant. I've made a version that implements the full Table 1 from Section 4.1. >>> I think we are still interpreting this differently. I think >>> uuid_extract_variant should just return whatever is in those four bits. >>> Your function comment says "Can return only 0, 0b10, 0b110 and 0b111.", >>> which I don't think is correct. It should return 0 through 15. >> We will return "do not care" bits. These bits can confuse someone. E.g. for >> variant 0b10 we can return 8, 9, 10 or 11 randomly. Is that OK? BTW, for some >> reason the document lists the numbers 1-15, but you are correct that the range is 0-15. > I agree it's confusing. Before I studied the RFC 4122bis project, I didn't even know about variant vs. version. I think overall people will find this more confusing than useful. If you just want to know, "is this UUID of the kind specified in RFC 4122", you can query it with uuid_extract_version(x) IS NOT NULL. So maybe we don't need the _extract_variant function? I think it's the best possible solution. The variant has no value besides detecting whether a version can be extracted. Best regards, Andrey Borodin.
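For concreteness, a sketch of the difference under discussion (illustrative C, not the patch's actual code): per the RFC 4122 layout, the variant field sits in the most significant bits of octet 8, so returning all four bits, as Peter suggests, is a plain shift:

    #include <stdint.h>

    /* Return the full 4-bit variant field of a UUID (octet 8, high nibble).
     * Values range over 0..15; e.g. any of 8..11 (0b10xx) indicates the
     * RFC 4122 variant, with the low two bits being "don't care". */
    static inline uint8_t
    uuid_variant_bits(const uint8_t uuid[16])
    {
        return uuid[8] >> 4;
    }

The patch instead pattern-matches the leading bits and returns only 0, 0b10, 0b110, or 0b111, which is where the "don't care" bits argument comes from.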
Re: linux cachestat in file Readv and Prefetch
On Sat, Feb 17, 2024 at 6:10 PM Tomas Vondra wrote: > I may be missing some important bit behind this idea, but this does not > seem like a great idea to me. The comment added to FilePrefetch says this: > > /* >* last time we visit this file (somewhere), nr_recently_evicted pages >* of the range were just removed from vm cache, it's a sign a memory >* pressure. so do not prefetch further. >* it is hard to guess if it is always the right choice in absence of >* more information like: >* - prefetching distance expected overall >* - access pattern/backend maybe >*/ > > Firstly, is this even a good way to detect memory pressure? It's clearly > limited to a single 1GB segment, so what's the chance we'll even see the > "pressure" on a big database with many files? > > If we close/reopen the file (which on large databases we tend to do very > often) how does that affect the data reported for the file descriptor? > > I'm not sure I even agree with the idea that we should stop prefetching > when there is memory pressure. IMHO it's perfectly fine to keep > prefetching stuff even if it triggers eviction of unnecessary pages from > page cache. That's kinda why the eviction exists. I agree with all of these criticisms. I think it's the job of pg_prewarm to do what the user requests, not to second-guess whether the user requested the right thing. One of the things that frustrates people about the ring-buffer system is that it's hard to get all of your data cached in shared_buffers by just reading it, e.g. SELECT * FROM my_table. If pg_prewarm also isn't guaranteed to actually read your data, and may decide that your data didn't need to be read after all, then what exactly is a user supposed to do if they're absolutely sure that they know better than PostgreSQL itself and want to guarantee that their data actually does get read? So I think a feature like this would at the very least need to be optional, but it's unclear to me why we'd want it at all, and I feel like Cedric's email doesn't really answer that question. I suppose that if you could detect useless prefetching and skip it, you'd save a bit of work, but under what circumstances does anyone use pg_prewarm so aggressively as to make that a problem, and why wouldn't the solution be for the user to just calm down a little bit? There shouldn't be any particular reason why the user can't know both the size of shared_buffers and the approximate size of the OS cache; indeed, they can probably know the latter much more reliably than PostgreSQL itself can. So it should be fairly easy to avoid just prefetching more data than will fit, and then you don't have to worry about any of this. And you'll probably get a better result, too, because, along the same lines as Tomas's remarks above, I doubt that this would be an accurate method anyway. > Well ... I'd argue at least some basic evaluation of performance is a > rather important / expected part of a proposal for a patch that aims to > improve a performance-focused feature. It's impossible to have any sort > of discussion about such patch without that. Right. I'm going to mark this patch as Rejected in the CommitFest application for now. If in subsequent discussion that comes to seem like the wrong result, then we can revise accordingly, but right now it looks extremely unlikely to me that this is something that we'd want. -- Robert Haas EDB: http://www.enterprisedb.com
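As a sketch of the "user knows best" approach Robert describes (the table name and block count are placeholders), the existing pg_prewarm interface already lets a user size the work explicitly:

    -- check how much is worth warming before asking for it
    SELECT pg_size_pretty(pg_relation_size('my_table'));
    SHOW shared_buffers;

    -- warm only the first 100000 blocks of the main fork into the OS cache,
    -- rather than relying on the server to guess at memory pressure
    SELECT pg_prewarm('my_table', 'prefetch', 'main', 0, 99999);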
Re: Recent 027_streaming_regress.pl hangs
Hello Thomas and Michael, 14.03.2024 06:16, Thomas Munro wrote: Yeah, I was wondering if its checkpoint delaying logic might have got the checkpointer jammed or something like that, but I don't currently see how. Yeah, the replay of bulk newpages could be relevant, but it's not exactly new technology. One thing I wondered about is whether the Perl "wait for catchup" thing, which generates large volumes of useless log, could be somehow changed to actually show the progress after some time. Something like "I'm still waiting for this replica to reach LSN X, but it has so far only reported LSN Y, and here's a dump of the WAL around there"? I have perhaps reproduced the issue here (at least I'm seeing something similar), and am going to investigate it in the coming days, but what I'm confused by now is the duration of poll_query_until. For the failure you referenced: [15:55:54.740](418.725s) # poll_query_until timed out executing this query: And a couple of others: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-08%2000%3A34%3A06 [00:45:57.747](376.159s) # poll_query_until timed out executing this query: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-04%2016%3A32%3A17 [16:45:24.870](407.970s) # poll_query_until timed out executing this query: Could it be that the timeout (360 sec?) is just not enough for the test under the current conditions (changed due to the switch to meson)? Best regards, Alexander
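A rough Perl sketch of the progress-reporting wait Thomas describes; the node methods follow PostgreSQL::Test::Cluster, but this loop itself is illustrative, not an existing helper:

    use Test::More;
    use Time::HiRes qw(usleep);

    my $target_lsn = $primary->lsn('write');
    my $deadline   = time() + 360;
    while (1)
    {
        my $replayed = $standby->safe_psql('postgres',
            'SELECT pg_last_wal_replay_lsn()');
        last
          if $standby->safe_psql('postgres',
            "SELECT '$replayed'::pg_lsn >= '$target_lsn'::pg_lsn") eq 't';
        die "timed out: standby still at $replayed, want $target_lsn"
          if time() > $deadline;
        # the useful part: say where we are stuck instead of looping silently
        diag("still waiting for catchup: want $target_lsn, have $replayed");
        usleep(100_000);
    }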
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, 14 Mar 2024 at 16:45, Robert Haas wrote: > I feel bad arguing about the patches that you think are a slam-dunk, > but I find myself disagreeing with the design choices. No worries, thanks a lot for responding. I'm happy to discuss this design further. I didn't necessarily mean these patches were a slam-dunk. I mainly meant that these first few patches were not specific to any protocol change, but are changes we should agree on before any change to the protocol is possible at all. Based on your response, we currently disagree on a bunch of core things. I'll first try to summarize my view on (future) protocol changes and why I think the current core design in this patchset is the correct path forward, and then go into some details inline in your response below: In my view there can be, **by definition,** only two general types of protocol changes: 1. A "simple protocol change": This is one that requires agreement by both the client and server that they understand the new message types involved in this change, e.g. the ParameterSet message proposal (this message type is either supported or it's not). 2. A "parameterized protocol change": This requires the same as 1, but should also allow some parameterization from the client, e.g. for the compression proposal, the client should specify what compression algorithm the server should use to compress data when sending it to the client. The client and server can agree that a "simple protocol change" is supported by both advertising a minimum minor protocol version. And for a "parameterized protocol change" the client can send a _pq_ parameter in the startup packet. So, new _pq_ parameters should only ever be introduced for parameterized protocol changes. They are not meant to advertise support; they are meant to configure protocol features. For a non-configurable protocol feature, we'd simply bump the protocol version. And since _pq_ parameters thus always control some setting at the session level, we can simply store them as GUCs, because that's how we store all our parameters at a session level. > Regarding 0001, I considered making this change as part of > ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed but decided against it, > because it seemed like it was making the assumption that we always > wanted to initiate new connections with the latest protocol version > that we know how to accept, and I wasn't sure that would always be > true. I think given the automatic downgrade supported by NegotiateProtocolVersion, there's no real downside to requesting the newest version by default. The only downside I can see is when connecting to other applications (i.e. non-PostgreSQL servers) that implement the postgres protocol but don't implement NegotiateProtocolVersion. But for that I added the max_protocol_version connection option in 0006 (of my latest patchset). > I'm really unhappy with 0002-0004. Just to be clear, (afaict) your argument below seems to only really be about 0004, not about 0002 or 0003. Was there anything about 0002 & 0003 that you were unhappy with? 0002 & 0003 are not dependent on 0004, imho. Because even when not making _pq_ parameters map to GUCs, we'd still need to change libpq to not instantly close the connection whenever a _pq_ parameter (or new protocol version) is not supported by the server (which is what 0002 & 0003 do).
> That left no room for any other > type of protocol modification, so that commit carved out an exception, > where when we see something that starts with _pq_., it's assumed to be > setting some other kind of thing, not a GUC. Okay, our interpretation is very different here. From my perspective, introducing a non-GUC namespace is NOT AT ALL the benefit of the _pq_ prefix. The main benefit (imho) is that it allows sending an "optional" parameter (i.e. GUC) in the startup packet. So, one where if the server doesn't recognize it, the connection attempt still succeeds. If you specify a normal GUC in the connection parameters and the server doesn't know about it, the server will close the connection. So, to be able to send a GUC that depends on the protocol and/or server version in an optional way, you'd need to wait for an extra network roundtrip until the server tells you what protocol and/or server version they are. > But 0004 throws that out > the window and decides, nope, those are just GUCs, too. Even if we > don't have a specific reason today why we'd ever want such a thing, it > seems short-sighted to give up on the possibility that in the future > we will. Since I believe a _pq_ parameter should only be used to control settings at the session level, I don't think it would be short-sighted to give up on the possibility to store them as anything other than GUCs. Because in the many years that we've had GUCs, we've been able to store all session settings using that infrastructure. BUT PLEASE NOTE: I don't think we are giving up on the thing you're describing (see explanation in final part of this email) > With this
Re: Psql meta-command conninfo+
Hi Maiquel, On Thu, Feb 29, 2024 at 10:02:21PM +, Maiquel Grassi wrote: > Sorry for the delay. I will make the adjustments as requested soon. We have only a few weeks left before feature-freeze for v17. Do you think you will be able to send an updated patch soon? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add system identifier to backup manifest
On Thu, Mar 14, 2024 at 11:05 AM Amul Sul wrote: > Thank you, Robert. Thanks for the patch! -- Robert Haas EDB: http://www.enterprisedb.com
Re: Possibility to disable `ALTER SYSTEM`
On Tue, Feb 13, 2024 at 2:05 AM Joel Jacobson wrote: > > Wouldn't having system wide EVTs be a generic solution which could be the > > infrastructure for this requested change as well as others in the same area? > > +1 > > I like the wider vision of providing the necessary infrastructure to provide > a solution for the general case. We don't seem to be making much progress here. As far as I can see from reading the thread, most people agree that it's reasonable to have some way to disable ALTER SYSTEM, but there are at least six competing ideas about how to do that: 1. command-line option 2. GUC 3. event trigger 4. extension 5. sentinel file 6. remove permissions on postgresql.auto.conf As I see it, (5) or (6) are most convenient for the system administrator, since they let that person make changes without needing to log into the database or, really, worry very much about the database's usual configuration mechanisms at all, and (5) seems like less work to implement than (6), because (6) probably breaks a bunch of client tools in weird ways that might not be easy for us to discover during patch review. (1) doesn't allow changing things at runtime, and might require the system administrator to fiddle with the startup scripts, which seems like it could be inconvenient. (2) and (3) seem like they put the superuser in a position to easily reverse a policy about what the superuser ought to do, but in the case of (2), that can be mitigated if the GUC can only be set in postgresql.conf and not elsewhere. (4) has no real advantages except for allowing core to maintain the fiction that we don't support this while actually supporting it; I think we should reject that approach outright. So what I'd like to see is a patch that implements (5), or in the alternative (2) but with the GUC being PGC_SIGHUP and GUC_DISALLOW_IN_AUTO_FILE. I believe there would be adequate consensus to proceed with either of those approaches. Anybody feel like coding it up? -- Robert Haas EDB: http://www.enterprisedb.com
Re: DOCS: add helpful partitioning links
Hi Robert, On Thu, Mar 7, 2024 at 10:49 PM Robert Treat wrote: > This patch adds a link to the "attach partition" command section > (similar to the detach partition link above it) as well as a link to > "create table like" as both commands contain additional information > that users should review beyond what is laid out in this section. > There's also a couple of wordsmithing changes in nearby areas to improve > readability. > Thanks. The patch gives an error when building HTML: ddl.sgml:4300: element link: validity error : No declaration for attribute linked of element link (in the CREATE TABLE ... LIKE link). On "outside" vs. "outside of", see https://english.stackexchange.com/questions/9700/outside-or-outside-of. But I think the meaning of the sentence will be more clear if we rephrase it as in the attached patch. - convenient, as not only will the existing partitions become indexed, but - also any partitions that are created in the future will. One limitation is + convenient as not only will the existing partitions become indexed, but + any partitions created in the future will as well. One limitation is I am finding the current construct hard to read. The comma is misplaced as you have pointed out. The pair of commas break the "not only" ... "but also" construct. I have tried to simplify the sentence in the attached. Please review. - the partitioned table; such an index is marked invalid, and the partitions - do not get the index applied automatically. The indexes on partitions can - be created individually using CONCURRENTLY, and then + the partitioned table; such an index is marked invalid and the partitions + do not get the index applied automatically. The partition indexes can "indexes on partition" is clearer than "partition index". Fixed in the attached patch. Please review. -- Best Wishes, Ashutosh Bapat diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 8616a8e9cc..043717136c 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -4283,18 +4283,20 @@ CREATE TABLE measurement_y2008m02 PARTITION OF measurement TABLESPACE fasttablespace; - As an alternative, it is sometimes more convenient to create the - new table outside the partition structure, and attach it as a - partition later. This allows new data to be loaded, checked, and - transformed prior to it appearing in the partitioned table. - Moreover, the ATTACH PARTITION operation requires - only SHARE UPDATE EXCLUSIVE lock on the - partitioned table, as opposed to the ACCESS - EXCLUSIVE lock that is required by CREATE TABLE - ... PARTITION OF, so it is more friendly to concurrent - operations on the partitioned table. - The CREATE TABLE ... LIKE option is helpful - to avoid tediously repeating the parent table's definition: + As an alternative, it is sometimes more convenient to create the partition + as a standalone new table, and attach it to the partitioned table later. + This allows new data to be loaded, checked, and transformed prior to it + appearing in the partitioned table. Moreover, the ATTACH + PARTITION operation requires only SHARE UPDATE + EXCLUSIVE lock on the partitioned table, as opposed to the + ACCESS EXCLUSIVE lock that is required by + CREATE TABLE ... PARTITION OF, so it is more friendly + to concurrent operations on the partitioned table; see ALTER TABLE ... ATTACH + PARTITION for additional details. The CREATE TABLE ...
+ LIKE command can be helpful to avoid tediously repeating + the parent table's definition, for example: CREATE TABLE measurement_y2008m02 @@ -4345,16 +4347,14 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02 As explained above, it is possible to create indexes on partitioned tables - so that they are applied automatically to the entire hierarchy. - This is very - convenient, as not only will the existing partitions become indexed, but - also any partitions that are created in the future will. One limitation is + so that they are applied automatically to the entire hierarchy. The operation of creating an index on a partitioned table also creates corresponding indexes on its partitions. Any new partition created in future also inherits the indexes on the partitioned table. + One limitation is that it's not possible to use the CONCURRENTLY qualifier when creating such a partitioned index. To avoid long lock times, it is possible to use CREATE INDEX ON ONLY - the partitioned table; such an index is marked invalid, and the partitions + the partitioned table; such an index is marked invalid and the partitions do not get the index applied automatically. The indexes on partitions can - be created individually using CONCURRENTLY, and then + then be created individually using CONCURRENTLY and attached to the index on the parent using ALTER INDEX .. ATTACH
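For readers skimming, the workflow that revised paragraph describes looks like this in SQL, following the docs' measurement example (the 2008-03 bounds are assumed for illustration):

    CREATE TABLE measurement_y2008m03
        (LIKE measurement INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

    -- load, check, and transform the data here, then:

    ALTER TABLE measurement ATTACH PARTITION measurement_y2008m03
        FOR VALUES FROM ('2008-03-01') TO ('2008-04-01');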
Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value
On Wed, Mar 13, 2024 at 01:09:18PM +0700, Yugo NAGATA wrote: > On Tue, 12 Mar 2024 22:07:17 -0500 > Nathan Bossart wrote: >> I did some light editing to prepare this for commit. > > Thank you. I confirmed the test you improved and I am fine with that. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Reports on obsolete Postgres versions
On 13.03.24 18:12, Bruce Momjian wrote: On Wed, Mar 13, 2024 at 09:21:27AM -0700, Jeremy Schneider wrote: It's not just roadmaps and release pages where we mix up these terms either, it's even in user-facing SQL and libpq routines: both PQserverVersion and current_setting('server_version_num') return the patch release version in the numeric patch field, rather than the numeric minor field (which is always 0). In my view, the best thing would be to move toward consistently using the word "patch" and moving away from the word "minor" for the PostgreSQL quarterly maintenance updates. I think "minor" is a better term since it contrasts with "major". We don't actually supply patches to upgrade minor versions. There are potentially different adjectives that could apply to "version" and "release". The version numbers can be called major and minor, because that just describes their ordering and significance. But I do agree that "minor release" isn't quite as clear, because one could also interpret that as "a release, but a bit smaller this time". (Also might not translate well, since "minor" and "small" could translate to the same thing.) One could instead, for example, describe those as "maintenance releases": "Are you on the latest maintenance release? Why not? Are you not maintaining your database?" That carries much more urgency than the same with "minor". A maintenance release corresponds to an increase in the minor version number. It's still tied together, but has different terminology. The last news item reads: "The PostgreSQL Global Development Group has released an update to all supported versions of PostgreSQL" which has no urgency, but consider "The PostgreSQL Global Development Group has published maintenance releases to all supported versions of PostgreSQL"
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Fri, Mar 8, 2024 at 6:47 AM Jelte Fennema-Nio wrote: > 1. 0001-0005 are needed for any protocol change, and hopefully > shouldn't require much discussion I feel bad arguing about the patches that you think are a slam-dunk, but I find myself disagreeing with the design choices. Regarding 0001, I considered making this change as part of ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed but decided against it, because it seemed like it was making the assumption that we always wanted to initiate new connections with the latest protocol version that we know how to accept, and I wasn't sure that would always be true. I don't think it would be catastrophic if this got committed or anything -- it could always be changed later if the need arose -- but I wanted to mention that I had a reason for not doing it, and I'm still not particularly convinced that we should do it. I'm really unhappy with 0002-0004. Before ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed, any parameter included in the startup message that wasn't in a short, hard-coded list was treated as a request to set a GUC. That left no room for any other type of protocol modification, so that commit carved out an exception, where when we see something that starts with _pq_., it's assumed to be setting some other kind of thing, not a GUC. But 0004 throws that out the window and decides, nope, those are just GUCs, too. Even if we don't have a specific reason today why we'd ever want such a thing, it seems short-sighted to give up on the possibility that in the future we will. Because if we implement what this patch wants to do in this way, basically consuming the entire namespace that ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed created in one shot, and then later we want the sort of thing that I'm postulating, we'll have to manufacture another new namespace for that need. And it seems to me that there are other ways we could do this. For example, suppose we introduced just one new protocol parameter; for the sake of argument, I'll call it _pq_.protocol_set. If the client sends this parameter and the server accepts it, then the client knows that the server supports a new protocol message ProtocolSetParameter, which is the only way to set GUCs in the new PROTOCOL_EXTENSION category. New libpq functions with names like, I don't know, PQsetProtocolParameter(), are added to send such messages, and they return an error if there are network problems or whatever, or if the server didn't accept the _pq_.protocol_set option at startup time. With this kind of design, you're just consuming one element of the _pq_ namespace, and the next person who wants to do something can consume one more element, and we'll be able to go on for a very long time without running out of room. This is really how I intended this mechanism to be used, and the only real downside I see as compared to what you've done is that you can't set the protocol GUCs in the startup packet, but must set them afterward via separate messages. If that's a problem, then the proposal I just outlined needs modification ... but no matter what we do exactly, I don't want the very first protocol extension we ever add to eat up all of _pq_. I intended that to support decades' worth of extensibility, not just one patch. -- Robert Haas EDB: http://www.enterprisedb.com
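A client-side sketch of the flow Robert outlines; note that PQsetProtocolParameter (and the return convention assumed here) is his hypothetical API, not an existing libpq function:

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* the startup packet would carry _pq_.protocol_set to opt in */
        PGconn *conn = PQconnectdb("host=localhost dbname=postgres");

        /* hypothetical API: fails cleanly if the server never accepted
         * _pq_.protocol_set, instead of killing the connection */
        if (PQsetProtocolParameter(conn, "some_protocol_setting", "on") != 0)
            fprintf(stderr, "ProtocolSetParameter unsupported: %s",
                    PQerrorMessage(conn));

        PQfinish(conn);
        return 0;
    }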
Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
On 18/02/2024 00:31, Tomas Vondra wrote: Do you plan to continue working on this patch? I did take a look, and on the whole it looks reasonable - it modifies the right places etc. +1 I think there are two things that may need an improvement: 1) Storing variable-length data in ParallelBitmapHeapState I agree with Robert that the snapshot_and_stats name is not great. I see Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData - the reasons are somewhat different (phs_snapshot_off exists because we don't know which exact struct will be allocated), while here we simply need to allocate two variable-length pieces of memory. But it seems like it would work nicely for this. That is, we could try adding an offset for each of those pieces of memory: - snapshot_off - stats_off I don't like the GetSharedSnapshotData name very much, it seems very close to GetSnapshotData - quite confusing, I think. Dmitry also suggested we might add a separate piece of shared memory. I don't quite see how that would work for ParallelBitmapHeapState, but I doubt it'd be simpler than having two offsets. I don't think the extra complexity (paid by everyone) would be worth it just to make EXPLAIN ANALYZE work. I just removed phs_snapshot_data in commit 84c18acaf6. I thought that would make this moot, but now that I rebased this, there are still some aesthetic questions on how best to represent this. In all the other node types that use shared instrumentation like this, the pattern is as follows (using Memoize here as an example, but it's similar for Sort, IncrementalSort, Agg and Hash):

/* Shared memory container for per-worker memoize information */
typedef struct SharedMemoizeInfo
{
	int		num_workers;
	MemoizeInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedMemoizeInfo;

/* this struct is backend-private */
typedef struct MemoizeState
{
	ScanState	ss;			/* its first field is NodeTag */
	...
	MemoizeInstrumentation stats;		/* execution statistics */
	SharedMemoizeInfo *shared_info;		/* statistics for parallel workers */
} MemoizeState;

While the scan is running, the node updates its private data in MemoizeState->stats. At the end of a parallel scan, the worker process copies the MemoizeState->stats to MemoizeState->shared_info->stats, which lives in shared memory. The leader process copies MemoizeState->shared_info->stats to its own backend-private copy, which it then stores in its MemoizeState->shared_info, replacing the pointer to the shared memory with a pointer to the private copy. That happens in ExecMemoizeRetrieveInstrumentation(). This is a little different for parallel bitmap heap scans, because a bitmap heap scan keeps some other data in shared memory too, not just instrumentation data. Also, the naming is inconsistent: the equivalent of SharedMemoizeInfo is actually called ParallelBitmapHeapState. I think we should rename it to SharedBitmapHeapInfo, to make it clear that it lives in shared memory, but I digress. We could now put the new stats at the end of ParallelBitmapHeapState as a varlen field. But I'm not sure that's a good idea. In ExecBitmapHeapRetrieveInstrumentation(), would we make a backend-private copy of the whole ParallelBitmapHeapState struct, even though the other fields don't make sense after the shared memory is released? Sounds confusing.
Or we could introduce a separate struct for the stats, and copy just that:

typedef struct SharedBitmapHeapInstrumentation
{
	int		num_workers;
	BitmapHeapScanInstrumentation sinstrument[FLEXIBLE_ARRAY_MEMBER];
} SharedBitmapHeapInstrumentation;

typedef struct BitmapHeapScanState
{
	ScanState	ss;			/* its first field is NodeTag */
	...
	SharedBitmapHeapInstrumentation sinstrument;
} BitmapHeapScanState;

that compiles, at least with my compiler, but I find it weird to have a variable-length inner struct embedded in an outer struct like that. Long story short, I think it's still better to store ParallelBitmapHeapInstrumentationInfo separately in the DSM chunk, not as part of ParallelBitmapHeapState. Attached patch does that, rebased over current master. I didn't address any of the other things that you, Tomas, pointed out, but I think they're valid concerns. -- Heikki Linnakangas Neon (https://neon.tech) From 04f11a37ee04b282c51e9dae68ead7c9c3d5fb3d Mon Sep 17 00:00:00 2001 From: David Geier Date: Tue, 8 Nov 2022 19:40:31 +0100 Subject: [PATCH v3 1/1] Parallel Bitmap Heap Scan reports per-worker stats Similarly to other nodes (e.g. hash join, sort, memoize), Bitmap Heap Scan now reports per-worker stats in the EXPLAIN ANALYZE output. Previously only the heap blocks stats for the leader were reported which was incomplete in parallel scans. Discussion:
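For readers unfamiliar with the retrieve step described above, here is a sketch of the leader-side copy, modeled on the Memoize pattern Heikki quotes; this illustrates the idiom, it is not the attached patch's code:

    /* Leader: after workers finish, replace the pointer into shared memory
     * with a backend-private copy, so EXPLAIN can still read the stats after
     * the DSM segment is detached. */
    static void
    RetrieveSharedInstrumentation(MemoizeState *node)
    {
        Size               size;
        SharedMemoizeInfo *si = node->shared_info;

        if (si == NULL)
            return;

        size = offsetof(SharedMemoizeInfo, sinstrument)
            + si->num_workers * sizeof(MemoizeInstrumentation);
        node->shared_info = palloc(size);
        memcpy(node->shared_info, si, size);
    }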
Re: UUID v7
On 14.03.24 12:25, Andrey M. Borodin wrote: I think the behavior of uuid_extract_var(iant) is wrong. The code takes just two bits to return, but the draft document is quite clear that the variant is 4 bits (see Table 1). Well, it was correct only for the implemented variant. I've made a version that implements the full Table 1 from Section 4.1. I think we are still interpreting this differently. I think uuid_extract_variant should just return whatever is in those four bits. Your function comment says "Can return only 0, 0b10, 0b110 and 0b111.", which I don't think is correct. It should return 0 through 15. We will return "do not care" bits. These bits can confuse someone. E.g. for variant 0b10 we can return 8, 9, 10 or 11 randomly. Is that OK? BTW, for some reason the document lists the numbers 1-15, but you are correct that the range is 0-15. I agree it's confusing. Before I studied the RFC 4122bis project, I didn't even know about variant vs. version. I think overall people will find this more confusing than useful. If you just want to know, "is this UUID of the kind specified in RFC 4122", you can query it with uuid_extract_version(x) IS NOT NULL. So maybe we don't need the _extract_variant function?
Re: Add system identifier to backup manifest
On Thu, Mar 14, 2024 at 12:48 AM Robert Haas wrote: > On Fri, Mar 8, 2024 at 12:14 AM Amul Sul wrote: > > Thank you for the improvement. > > > > The caller of verify_control_file() has the full path of the control > file that > > can pass it and avoid recomputing. With this change, it doesn't really > need > > verifier_context argument -- only the manifest's system identifier is > enough > > along with the control file path. Did the same in the attached delta > patch > > for v11-0002 patch, please have a look, thanks. > > Those seem like sensible changes. I incorporated them and committed. I > also: > > * ran pgindent, which changed a bit of your formatting > * changed some BAIL_OUT calls to die; I think we should hardly ever be > using BAIL_OUT, as that terminates the entire TAP test run, not just > the current file > Thank you, Robert. Regards, Amul
Re: psql: fix variable existence tab completion
On Sun, Mar 3, 2024 at 5:37 PM Erik Wienhold wrote: > On 2024-03-03 03:00 +0100, Steve Chavez wrote: > > psql has the :{?name} syntax for testing a psql variable's existence. > > > > But currently doing \echo :{?VERB doesn't trigger tab completion. > > > > This patch fixes it. I've also included a TAP test. > > Thanks. The code looks good, all tests pass, and the tab completion > works as expected when testing manually. A nice improvement. I've checked why we have the '{' in WORD_BREAKS at all, and whether we're going to break anything by removing it. It seems that '{' has been there from the very beginning, and it comes from the default value of rl_basic_word_break_characters [1]. It seems that :{?name} is the only usage of the '{' sign in psql. So, removing it from WORD_BREAKS shouldn't break anything. I'm going to push this patch if there are no objections. Links: 1. https://tiswww.case.edu/php/chet/readline/readline.html#index-rl_005fbasic_005fword_005fbreak_005fcharacters -- Regards, Alexander Korotkov
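For context, the relevant definition lives in src/bin/psql/tab-complete.c, and the change amounts to dropping '{' from the break list -- a sketch (the exact character set shown is from memory and may not match the tree precisely):

    -#define WORD_BREAKS "\t\n@$><=;|&{() "
    +#define WORD_BREAKS "\t\n@$><=;|&() "

With '{' no longer a word break, readline hands the completion code the whole :{?VERB token, so the existing variable-name matching can fire.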
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Thu, 14 Mar 2024 at 15:49, Amit Kapila wrote: > > On Thu, Mar 14, 2024 at 1:45 PM Masahiko Sawada wrote: > > > > On Thu, Mar 14, 2024 at 2:27 PM Amit Kapila wrote: > > > > > > On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada > > > wrote: > > > > > > > > This fact makes me think that the slotsync worker might be able to > > > > accept the primary_conninfo value even if there is no dbname in the > > > > value. That is, if there is no dbname in the primary_conninfo, it uses > > > > the username in accordance with the specs of the connection string. > > > > Currently, the slotsync worker connects to the local database first > > > > and then establishes the connection to the primary server. But if we > > > > can reverse the two steps, it can get the dbname that has actually > > > > been used to establish the remote connection and use it for the local > > > > connection too. That way, the primary_conninfo generated by > > > > pg_basebackup could work even without the patch. For example, if the > > > > OS user executing pg_basebackup is 'postgres', the slotsync worker > > > > would connect to the postgres database. Given the 'postgres' database > > > > is created by default and 'postgres' OS user is used in common, I > > > > guess it could cover many cases in practice actually. > > > > > > > > > > I think this is worth investigating but I suspect that in most cases > > > users will end up using a replication connection without specifying > > > the user name and we may not be able to give a meaningful error > > > message when slotsync worker won't be able to connect. The same will > > > be true even when the dbname same as the username would be used. > > > > What do you mean by not being able to give a meaningful error message? > > > > If the slotsync worker uses the user name as the dbname, and such a > > database doesn't exist, the error message the user will get is > > "database "test_user" does not exist". ISTM the same is true when the > > user specifies the wrong database in the primary_conninfo. > > > > Right, the exact error message as mentioned by Shveta will be: > ERROR: could not connect to the primary server: connection to server > at "127.0.0.1", port 5433 failed: FATAL: database "bckp_user" does not > exist > > Now, without this idea, the ERROR message will be: > ERROR: slot synchronization requires dbname to be specified in > primary_conninfo > > I am not sure how much this matters but the second message sounds more useful. > > > > > > > > Having said that, even with (or without) the above change, we might > > > > want to change the pg_basebackup so that it writes the dbname to the > > > > primary_conninfo if -R option is specified. Since the database where > > > > the slotsync worker connects cannot be dropped while the slotsync > > > > worker is running, the user might want to change the database to > > > > connect, and it would be useful if they can do that using > > > > pg_basebackup instead of modifying the configuration file manually. > > > > > > > > While the current approach makes sense to me, I'm a bit concerned that > > > > we might end up having the pg_basebackup search the actual database > > > > name (e.g. 'dbname=template1') from the .pgpass file instead of > > > > 'dbname=replication'. As far as I tested on my environment, suppose > > > > that I execute: > > > > > > > > pg_basebackup -D tmp -d "dbname=testdb" -R > > > > > > > > The pg_basebackup established a replication connection but looked for > > > > the password of the 'testdb' database. 
This could be another > > > > inconvenience for the existing users who want to use the slot > > > > synchronization. > > > > > > > > > > This is true because it is internally using a logical replication > > > connection (as we will set replication=database). > > Did you mean that pg_basebackup is using a logical replication > > connection in this case? As far as I tested, even if we specify dbname > > in the -d option of pg_basebackup, it uses a physical replication > > connection. For example, it can take a backup even if I specify a > > non-existing database name. > > You are right. I misunderstood some part of the code in GetConnection. > However, I think my point is still valid: if the user has provided > dbname in the connection string, it means that she wants that database > entry to be looked up, not the "replication" entry. > > > > > > > > > > But it's still just an idea and I might be missing something. And > > > > given we're getting closer to the feature freeze, it would be a PG18 > > > > item. > > > > > > > > > > +1. At this stage, it is important to discuss whether we should allow > > > pg_basebackup to write dbname (either a specified one or a default one) > > > along with other parameters in primary_conninfo? > > > > > > > True. While I basically agree that pg_basebackup writes dbname in > > primary_conninfo, I'm concerned that writing "dbname=replication" > > could be problematic. Quoting the case 3) Vignesh
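To illustrate what is being debated -- a purely hypothetical example, since today's pg_basebackup does not write dbname at all (connection details invented):

    $ pg_basebackup -D standby -d "dbname=postgres user=repl_user" -R

    # hypothetical line in standby/postgresql.auto.conf if dbname were kept:
    primary_conninfo = 'user=repl_user host=primary port=5432 dbname=postgres'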
Re: Make attstattarget nullable
On 3/14/24 11:13, Peter Eisentraut wrote: > On 12.03.24 14:32, Tomas Vondra wrote: >> On 3/12/24 13:47, Peter Eisentraut wrote: >>> On 06.03.24 22:34, Tomas Vondra wrote: 0001 1) I think this bit in ALTER STATISTICS docs is wrong: - SET STATISTICS <replaceable class="parameter">new_target</replaceable> + SET STATISTICS { <replaceable class="parameter">integer</replaceable> | DEFAULT } because it means we now have list entries for name, ..., new_name, new_schema, and then suddenly "SET STATISTICS { integer | DEFAULT }". That's a bit weird. >>> Ok, how would you change it? List out the full clauses of the other >>> variants under Parameters as well? >> I'd go with a parameter, essentially exactly as it used to be, except >> for adding the DEFAULT option. So the list would define new_target, and >> mention DEFAULT as a special value. > Ok, done that way (I think). Seems OK to me. 2) The newtarget handling in AlterStatistics seems rather confusing. Why does it get set to -1 just to ignore the value later? For a while I was 99% sure ALTER STATISTICS ... SET STATISTICS DEFAULT will set the field to -1. Maybe ditching the first if block and directly checking stmt->stxstattarget before setting repl_val/repl_null would be better? >>> But we also need to continue accepting -1 for default on input. The >>> current code achieves that, the proposed variant would not. >> OK, I did not realize that. But then maybe this should be explained in a >> comment before the new "if" block, because people won't realize why it >> needs to be this way. > In the new version, I tried to write this more explicitly, and updated > tablecmds.c to match. WFM. It still seems a bit hard to read, but I don't know how to do it better. I guess it's how it has to be to deal with multiple default values in a backwards-compatible way. Good thing is it's localized in two places. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
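For reference, the user-visible syntax being settled on, per the discussion above (the statistics object and columns are placeholders):

    CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;

    ALTER STATISTICS s1 SET STATISTICS 500;       -- explicit target
    ALTER STATISTICS s1 SET STATISTICS DEFAULT;   -- new spelling
    ALTER STATISTICS s1 SET STATISTICS -1;        -- still accepted on input,
                                                  -- for backward compatibility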
Re: Built-in CTYPE provider
On 14.03.24 09:08, Jeff Davis wrote: 0001 (the C.UTF-8 locale) is also close. Considering that most of the infrastructure is already in place, that's not a large patch. You may have some comments about the way I'm canonicalizing and validating in initdb -- that could be cleaner, but it feels like I should refactor the surrounding code separately first. I have tested this against the libc locale C.utf8 that was available on the OS, and the behavior is consistent. I wonder if we should version the builtin locales too. We might make a mistake and want to change something sometime? Tiny comments: * src/bin/scripts/t/020_createdb.pl The two added tests should have different names that tell them apart (like the new initdb tests). * src/include/catalog/pg_collation.dat Maybe use 'and' instead of '&' in the description. 0002 (inlining utf8 functions) is also ready. Seems ok. For 0003 and beyond, I'd like some validation that it's what you had in mind. I'll look into those later.
Re: abi-compliance-checker
On Mon, Mar 4, 2024 at 7:50 AM Peter Eisentraut wrote: > Looking at this again, if we don't want the .xml files in the tree, then > we don't really need this patch series. Based on this, I've updated the status of this patch in the CommitFest application to Withdrawn. If that's not correct, please feel free to adjust. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Should we remove -Wdeclaration-after-statement?
On Wed, Feb 7, 2024 at 7:55 PM Noah Misch wrote: > > So my suggestion is for people to respond with -1, -0.5, +-0, +0.5, or > > +1 to indicate support against/for the change. > > I'm +1 for the change, for these reasons: > > - Fewer back-patch merge conflicts. The decls section of long functions is a > classic conflict point. > - A mid-func decl demonstrates that its var is unused in the first half of the > func. > - We write Perl in the mixed decls style, without problems. > > For me personally, the "inconsistency" concern is negligible. We allowed "for > (int i = 0", and that inconsistency has been invisible to me. This thread was interesting as an opinion poll, but it seems clear that the consensus is still against the proposed change, so I've marked the CommitFest entry rejected. -- Robert Haas EDB: http://www.enterprisedb.com
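A tiny illustration of Noah's second point (invented for this summary, not from the thread): with a mid-function declaration, a reader can see at a glance that the variable plays no role in the first half of the function:

    /* returns -1 if any value is negative, else the sum */
    static int
    sum_if_all_nonnegative(const int *vals, int n)
    {
        for (int i = 0; i < n; i++)   /* C99 loop decls are already allowed */
        {
            if (vals[i] < 0)
                return -1;
        }

        int total = 0;   /* declared at first use: the style under discussion */

        for (int i = 0; i < n; i++)
            total += vals[i];
        return total;
    }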
Re: Catalog domain not-null constraints
On 14.03.24 15:03, Alvaro Herrera wrote: On 2024-Mar-14, Peter Eisentraut wrote: Perhaps it would make sense if we change the ALTER TABLE command to be like ALTER TABLE t1 ADD IF NOT EXISTS NOT NULL c1 Then the behavior is like one would expect. For ALTER TABLE, we would reject this command if IF NOT EXISTS is not specified. (Since this is mainly for pg_dump, it doesn't really matter for usability.) For ALTER DOMAIN, we could accept both variants. I don't understand why you want to change this behavior, though. Because in the abstract, the behavior of ALTER TABLE t1 ADD should be to add a constraint. In the current implementation, the behavior is different for different constraint types.
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Thu, Mar 14, 2024 at 12:24 PM Amit Kapila wrote: > > On Wed, Mar 13, 2024 at 9:24 PM Bharath Rupireddy > > > > Yes, there will be some sort of duplicity if we emit conflict_reason > > as a text field. However, I still think the better way is to turn > > conflict_reason text to conflict boolean and set it to true only on > > rows_removed and wal_level_insufficient invalidations. When conflict > > boolean is true, one (including all the tests that we've added > > recently) can look for invalidation_reason text field for the reason. > > This sounds reasonable to me as opposed to we just mentioning in the > > docs that "if invalidation_reason is rows_removed or > > wal_level_insufficient it's the reason for conflict with recovery". > > > Fair point. I think we can go either way. Bertrand, Nathan, and > others, do you have an opinion on this matter? While we wait to hear from others on this, I'm attaching the v9 patch set implementing the above idea (check 0001 patch). Please have a look. I'll come back to the other review comments soon. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 18855c08cd8bcbaf41aba10048f0ea23a246e546 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Thu, 14 Mar 2024 12:48:52 + Subject: [PATCH v9 1/4] Track invalidation_reason in pg_replication_slots Up until now, reason for replication slot invalidation is not tracked in pg_replication_slots. A recent commit 007693f2a added conflict_reason to show the reasons for slot invalidation, but only for logical slots. This commit adds a new column to show invalidation reasons for both physical and logical slots. And, this commit also turns conflict_reason text column to conflicting boolean column (effectively reverting commit 007693f2a). One now can look at the new invalidation_reason column for logical slots conflict with recovery. --- doc/src/sgml/ref/pgupgrade.sgml | 4 +- doc/src/sgml/system-views.sgml| 63 +++ src/backend/catalog/system_views.sql | 5 +- src/backend/replication/logical/slotsync.c| 2 +- src/backend/replication/slot.c| 8 +-- src/backend/replication/slotfuncs.c | 25 +--- src/bin/pg_upgrade/info.c | 4 +- src/include/catalog/pg_proc.dat | 6 +- src/include/replication/slot.h| 2 +- .../t/035_standby_logical_decoding.pl | 35 ++- .../t/040_standby_failover_slots_sync.pl | 4 +- src/test/regress/expected/rules.out | 7 ++- 12 files changed, 95 insertions(+), 70 deletions(-) diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml index 58c6c2df8b..8de52bf752 100644 --- a/doc/src/sgml/ref/pgupgrade.sgml +++ b/doc/src/sgml/ref/pgupgrade.sgml @@ -453,8 +453,8 @@ make prefix=/usr/local/pgsql.new install All slots on the old cluster must be usable, i.e., there are no slots whose - pg_replication_slots.conflict_reason - is not NULL. + pg_replication_slots.conflicting + is not true. diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index be90edd0e2..f3fb5ba1b0 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -2525,34 +2525,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx - conflict_reason text + conflicting bool - The reason for the logical slot's conflict with recovery. It is always - NULL for physical slots, as well as for logical slots which are not - invalidated. The non-NULL values indicate that the slot is marked - as invalidated. Possible values are: - - - - wal_removed means that the required WAL has been - removed. 
- - - - - rows_removed means that the required rows have - been removed. - - - - - wal_level_insufficient means that the - primary doesn't have a sufficient to - perform logical decoding. - - - + True if this logical slot conflicted with recovery (and so is now + invalidated). When this column is true, check + invalidation_reason column for the conflict + reason. @@ -2581,6 +2560,38 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx + + + invalidation_reason text + + + The reason for the slot's invalidation. NULL if the + slot is currently actively being used. The non-NULL values indicate that + the slot is marked as invalidated. Possible values are: + + + + wal_removed means that the required WAL has been + removed. + + + + +
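For reference, with the v9 patch applied, checking a slot's state would look like this (per the catalog changes above):

    SELECT slot_name, conflicting, invalidation_reason
    FROM pg_replication_slots;

    -- per the docs above: 'conflicting' is true only for logical slots
    -- invalidated by recovery (rows_removed, wal_level_insufficient),
    -- while 'invalidation_reason' covers both physical and logical
    -- slots, e.g. 'wal_removed'.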
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Wed, Oct 4, 2023 at 9:12 PM James Coleman wrote: > All right, attached is a v3 which attempts to fix the wrong > information with an economy of words. I may at some point submit a > separate patch that adds a broader pruning section, but this at least > brings the docs inline with reality insofar as they address it. I don't think this is as good as what I proposed back on October 2nd. IMHO, that version does a good job making the text accurate and clear, and is directly responsive to your original complaint, namely, that the root of the HOT chain can't be removed. But this version seems to contain a number of incidental changes that are unrelated to that point, e.g. "old versions" -> "old, no longer visible versions", "can be completely removed" -> "may be pruned", and the removal of the sentence "In summary, heap-only tuple updates can only be created - if columns used by indexes are not updated" which AFAICT is both completely correct as-is and unrelated to the original complaint. Maybe I shouldn't be, but I'm slightly frustrated here. I thought I had proposed an alternative which you found acceptable, but then you proposed several more versions that did other things instead, and I never really understood why we couldn't just adopt the version that you seemed to think was OK. If there's a problem with that, say what it is. If there's not, let's do that and move on. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add publisher and subscriber to glossary documentation.
On 2024-Mar-14, Shlok Kyal wrote: > Andrew Atkinson wrote: > > > Anyway, hopefully these examples show “node” and “database” are > > mixed and perhaps others agree using one consistently might help the > > goals of the docs. > > For me the existing content looks good, I felt let's keep it as it is > unless others feel differently. Actually it's these small terminology glitches that give me pause. If we're going to have terms that are interchangeable (in this case "node" and "database"), then they should be always interchangeable, not just in some unspecified cases. Maybe the idea of using "node" (which sounds like something that's instance-wide) is wrong for logical replication, which is necessarily something that happens database-locally. Then again, maybe defining "node" as something that exists at a database-local level when used in the context of logical replication is sufficient. In that case, it would be better to avoid defining it as a synonym of "instance". Then the terms are not always interchangeable, but it's clear when they are and when they aren't. "Node: in replication, each of the endpoints to which or from which data is replicated. In the context of physical replication, each node is an instance. In the context of logical replication, each node is a database". Does that make sense? I'd also look at altering "Primary" and "Standby" so that it's clearer that they're about physical replication, and don't mention "database" anymore, since that's the wrong level. Maybe turn them into "Primary (node)" and "Standby (node)" instead. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: BitmapHeapScan streaming read user and prelim refactoring
On 3/13/24 23:38, Thomas Munro wrote:
> On Sun, Mar 3, 2024 at 11:41 AM Tomas Vondra wrote:
>> On 3/2/24 23:28, Melanie Plageman wrote:
>>> On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra wrote:
>>>> With the current "master" code, eic=1 means we'll issue a prefetch for B and then read+process A. And then issue prefetch for C and read+process B, and so on. It's always one page ahead.
>>>
>>> Yes, that is what I mean for eic = 1
>
> I spent quite a few days thinking about the meaning of eic=0 and eic=1 for streaming_read.c v7[1], to make it agree with the above and with master. Here's why I was confused:
>
> Both eic=0 and eic=1 are expected to generate at most 1 physical I/O at a time, or I/O queue depth 1 if you want to put it that way. But this isn't just about concurrency of I/O, it's also about computation. Duh.
>
> eic=0 means that the I/O is not concurrent with executor computation. So, to annotate an excerpt from [1]'s random.txt, we have:
>
> effective_io_concurrency = 0, range size = 1
> unpatched                              patched
> ====================================================================
> pread(43,...,8192,0x58000) = 8192      pread(82,...,8192,0x58000) = 8192
> *** executor now has page at 0x58000 to work on ***
> pread(43,...,8192,0xb) = 8192          pread(82,...,8192,0xb) = 8192
> *** executor now has page at 0xb to work on ***
>
> eic=1 means that a single I/O is started and then control is returned to the executor code to do useful work concurrently with the background read that we assume is happening:
>
> effective_io_concurrency = 1, range size = 1
> unpatched                                 patched
> ====================================================================
> pread(43,...,8192,0x58000) = 8192         pread(82,...,8192,0x58000) = 8192
> posix_fadvise(43,0xb,0x2000,...)          posix_fadvise(82,0xb,0x2000,...)
> *** executor now has page at 0x58000 to work on ***
> pread(43,...,8192,0xb) = 8192             pread(82,...,8192,0xb) = 8192
> posix_fadvise(43,0x108000,0x2000,...)     posix_fadvise(82,0x108000,0x2000,...)
> *** executor now has page at 0xb to work on ***
> pread(43,...,8192,0x108000) = 8192        pread(82,...,8192,0x108000) = 8192
> posix_fadvise(43,0x16,0x2000,...)         posix_fadvise(82,0x16,0x2000,...)
>
> In other words, 'concurrency' doesn't mean 'number of I/Os running concurrently with each other', it means 'number of I/Os running concurrently with computation', and when you put it that way, 0 and 1 are different.

Interesting. For some reason I thought with eic=1 we'd issue the fadvise for page #2 before the pread of page #1, so that there'd be 2 IO requests in flight at the same time for a bit of time ... it'd give the fadvise more time to actually get the data into the page cache.

> Note that the first read is a bit special: by the time the consumer is ready to pull a buffer out of the stream when we don't have a buffer ready yet, it is too late to issue useful advice, so we don't bother. FWIW I think even in the AIO future we would have a synchronous read in that specific place, at least when using io_method=worker, because it would be stupid to ask another process to read a block for us that we want right now and then wait for it to wake us up when it's done.
>
> Note that even when we aren't issuing any advice because eic=0 or because we detected sequential access and we believe the kernel can do a better job than us, we still 'look ahead' (= call the callback to see which block numbers are coming down the pipe), but only as far as we need to coalesce neighbouring blocks.
> (I deliberately avoid using the word "prefetch" except in very general discussions because it means different things to different layers of the code, hence talk of "look ahead" and "advice".) That's how we get this change:
>
> effective_io_concurrency = 0, range size = 4
> unpatched                           patched
> ====================================================================
> pread(43,...,8192,0x58000) = 8192   pread(82,...,8192,0x58000) = 8192
> pread(43,...,8192,0x5a000) = 8192   preadv(82,...,2,0x5a000) = 16384
> pread(43,...,8192,0x5c000) = 8192   pread(82,...,8192,0x5e000) = 8192
> pread(43,...,8192,0x5e000) = 8192   preadv(82,...,4,0xb) = 32768
> pread(43,...,8192,0xb) = 8192       preadv(82,...,4,0x108000) = 32768
> pread(43,...,8192,0xb2000) = 8192   preadv(82,...,4,0x16) = 32768
>
> And then once we introduce eic > 0 to the picture with neighbouring blocks that can be coalesced, "patched" starts to diverge even more from "unpatched" because it tracks the number of wide I/Os in progress, not the number of single blocks.

So, IIUC this means (1) the patched code is more aggressive wrt prefetching (because we prefetch more data overall, because master would
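To make the eic=1 pattern in the traces above concrete, here is a toy C sketch of the "advise next, read current" loop; the block size and layout are illustrative, and this is not streaming_read.c:

    #include <fcntl.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* For each block: issue advice for the *next* block, then read the
     * current one, so the kernel readahead overlaps with computation. */
    static void
    scan_with_lookahead(int fd, const off_t *offsets, int nblocks, char *buf)
    {
        for (int i = 0; i < nblocks; i++)
        {
            if (i + 1 < nblocks)
                (void) posix_fadvise(fd, offsets[i + 1], BLCKSZ,
                                     POSIX_FADV_WILLNEED);
            (void) pread(fd, buf, BLCKSZ, offsets[i]);
            /* ... executor works on buf here, while the advised read runs */
        }
    }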
Re: Reports on obsolete Postgres versions
On Wed, Mar 13, 2024 at 3:05 PM Laurenz Albe wrote: > I think we are pretty conservative with backpatching changes to the > optimizer that could destabilize existing plans. We have gotten better about that, which is good. > I feel quite strongly that we should not use overly conservative language > there. If people feel that they have to test their applications for new > minor releases, the only effect will be that they won't install minor > releases. > Installing a minor release should be as routine as the operating system > patches that many companies apply automatically during weekend maintenance > windows. They can also introduce bugs, and everybody knows and accepts that. I don't agree with this. If we tell people that they don't need to test their applications after an upgrade, I do not think that the result will be that they skip the testing and everything works perfectly. I think that the result will be that we lose all credibility. Nobody who has been working on computers for longer than a week is going to believe that a software upgrade can't break anything, and someone whose entire business depends on things not breaking is going to want to validate that they haven't. And they're not wrong to think that way, either. I think that whatever we say here should focus on what we try to do or guarantee, not on what actions we think users ought to take, never mind must take. We can say that we try to avoid making any changes upon which an application might be relying -- but there surely is some weasel-wording there, because we have made such changes before in the name of security, and sometimes to fix bugs, and we will likely to do so again in the future. But it's not for us to decide how much testing is warranted. It's the user's system, not ours. In the end, while I certainly don't mind improving the web page, I think that a lot of what we're seeing here probably has to do with the growing popularity and success of PostgreSQL. If you have more people using your software, you're also going to have more people using out-of-date versions of your software. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MERGE ... RETURNING
On 2024-Mar-13, Dean Rasheed wrote:
> On Wed, 13 Mar 2024 at 06:44, jian he wrote:
>>
>>     [ WITH with_query [, ...] ]
>>     MERGE INTO [ ONLY ] target_table_name
>>
>> here the "WITH" part should have "[ RECURSIVE ]"
>
> Actually, no. MERGE doesn't support WITH RECURSIVE.
>
> It's not entirely clear to me why, though. I did a quick test, removing that restriction in the parse analysis code, and it seemed to work fine. Alvaro, do you remember why that restriction is there?

There's no real reason for it, other than I didn't want to have to think it through; I did suspect that it might Just Work, but I felt I would have had to come up with more nontrivial test cases than I wanted to write at the time.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "People get annoyed when you try to debug them." (Larry Wall)
Re: Catalog domain not-null constraints
On 2024-Mar-14, Peter Eisentraut wrote:
> Perhaps it would make sense if we changed the ALTER TABLE command to be like
>
>     ALTER TABLE t1 ADD IF NOT EXISTS NOT NULL c1
>
> Then the behavior is what one would expect.
>
> For ALTER TABLE, we would reject this command if IF NOT EXISTS is not specified. (Since this is mainly for pg_dump, it doesn't really matter for usability.) For ALTER DOMAIN, we could accept both variants.

I don't understand why you want to change this behavior, though.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "La victoria es para quien se atreve a estar solo" ("Victory is for those who dare to stand alone")
Re: REVOKE FROM warning on grantor
On Thursday, March 14, 2024, Étienne BERSAC wrote:
> However, I'd prefer it if Postgres failed properly, because the GRANT is actually not revoked. This prevents ldap2pg from reporting an issue in handling privileges on such roles.
>
> What do you think of making this warning an error?

The choice of a warning was made because, after the command ends, the grant in question does not exist: the revoke was a no-op and the final state is as the user intended. Historically, doing this didn't give any message at all, which was confusing, so we added a warning; that way the semantics of not failing were preserved but there was some indication that something was amiss. I don't have a compelling argument to change the long-standing behavior. Client code can and probably should look for warnings reported by the backend; it is indeed possible to treat this warning more severely than the server chooses to.

David J.
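For what it's worth, escalating the warning client-side is straightforward with libpq's notice receiver. Below is a hedged sketch, not ldap2pg code: the empty connection string, the REVOKE statement, and the exit-code policy are illustrative assumptions.

    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    static int  warning_seen = 0;

    /* called by libpq for every NOTICE/WARNING the server sends */
    static void
    notice_receiver(void *arg, const PGresult *res)
    {
        const char *sev = PQresultErrorField(res, PG_DIAG_SEVERITY);

        if (sev != NULL && strcmp(sev, "WARNING") == 0)
        {
            warning_seen = 1;
            fprintf(stderr, "escalated: %s", PQresultErrorMessage(res));
        }
    }

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");   /* settings from environment */
        PGresult   *res;
        int         ok;

        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        PQsetNoticeReceiver(conn, notice_receiver, NULL);

        res = PQexec(conn, "REVOKE SELECT ON mytable FROM somerole");
        ok = (PQresultStatus(res) == PGRES_COMMAND_OK) && !warning_seen;

        PQclear(res);
        PQfinish(conn);
        return ok ? 0 : 1;      /* treat the warning as a failure */
    }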
Re: BitmapHeapScan streaming read user and prelim refactoring
On 14/03/2024 12:55, Dilip Kumar wrote:
> On Thu, Mar 14, 2024 at 4:07 PM Heikki Linnakangas wrote:
>> _SPI_execute_plan() has code to deal with the possibility that the active snapshot is not set. That seems fishy; do we really support SPI without any snapshot? I'm inclined to turn that into an error. I ran the regression tests with an "Assert(ActiveSnapshotSet())" there, and everything worked.
>
> IMHO, we can call SPI_connect() and SPI_execute() from any C extension, so I don't think we can guarantee that the snapshot must be set, can we?

I suppose, although the things you could do without a snapshot would be pretty limited. The query couldn't access any tables. Could it even look up functions in the parser? Not sure.

> Maybe for now we can just handle this specific case to remove the snapshot serializing for the BitmapHeapScan, as you are doing in the patch. After looking into the code, your theory seems correct: we are just copying the ActiveSnapshot while building the query descriptor, and from there we are copying it into the EState, so logically there should not be any reason for the two to be different.

Ok, committed that for now. Thanks for looking!

-- Heikki Linnakangas Neon (https://neon.tech)
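For illustration, here is a minimal sketch of the extension-side pattern being discussed: an SPI call made with an explicitly pushed active snapshot, which is the state an Assert(ActiveSnapshotSet()) in _SPI_execute_plan() would require. The function name and query are assumptions, not code from the patch.

    #include "postgres.h"
    #include "executor/spi.h"
    #include "utils/snapmgr.h"

    static void
    run_query_with_snapshot(void)
    {
        if (SPI_connect() != SPI_OK_CONNECT)
            elog(ERROR, "SPI_connect failed");

        /* ensure a snapshot is active before executing the query */
        PushActiveSnapshot(GetTransactionSnapshot());

        if (SPI_execute("SELECT count(*) FROM pg_class", true, 0) < 0)
            elog(ERROR, "SPI_execute failed");

        PopActiveSnapshot();
        SPI_finish();
    }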
Re: type cache cleanup improvements
> One thing that first pops out to me is that we can do the refactor of hash_initial_lookup() as an independent piece, without the extra paths introduced. But rather than returning the bucket hash and having the bucket number as an in/out argument of hash_initial_lookup(), there is an argument for reversing them: hash_search_with_hash_value() does not care about the bucket number.

Ok, no problem. 02-hash_seq_init_with_hash_value.v5.patch introduces a hash_seq_init_with_hash_value() function.

>> hash_initial_lookup() is marked as inline, but I suppose modern compilers are smart enough to inline it automatically.
>
> Likely so, though it does not hurt to show the intention to the reader.

Agreed.

> So I would like to suggest the attached patch for this first piece. What do you think?

I have no objections.

> It may also be an idea to use `git format-patch` when generating a series of patches. That makes for easier reviews.

Thanks, will try.

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
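For readers following the thread, here is a sketch of what the patch's new entry point enables: scanning only the bucket that can hold entries for a given hash value, instead of walking the whole table with hash_seq_init(). The signature follows the patch as described above and should be treated as provisional; the hash table, entry type, and invalidation body are illustrative assumptions.

    #include "postgres.h"
    #include "utils/hsearch.h"
    #include "utils/typcache.h"

    static void
    invalidate_matching_entries(HTAB *htab, uint32 hashvalue)
    {
        HASH_SEQ_STATUS status;
        TypeCacheEntry *entry;

        /* visit only entries living in hashvalue's bucket */
        hash_seq_init_with_hash_value(&status, htab, hashvalue);
        while ((entry = (TypeCacheEntry *) hash_seq_search(&status)) != NULL)
        {
            /* ... reset or invalidate the cached entry here ... */
        }
    }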
Re: small_cleanups around login event triggers
On Thu, Mar 14, 2024 at 8:21 AM Daniel Gustafsson wrote:
> On 14 Mar 2024, at 02:47, Robert Treat wrote:
>> I was taking a look at the login event triggers work (nice work btw)
>
> Thanks for reviewing committed code, that's something which doesn't happen often enough and is much appreciated.
>
>> and saw a couple of minor items that I thought would be worth cleaning up. This is mostly just clarifying the existing docs and code comments.
>>
>> + either in a connection string or configuration file. Alternativly, you can
>
> This should be "Alternatively" I think.

Yes.

>> - canceling connection in psql wouldn't cancel
>> + canceling a connection in psql will not cancel
>
> Nitpickery (perhaps motivated by English not being my first language), but since psql only deals with one connection I would expect this to read "the connection".

My interpretation of this is that "a connection" is more correct because it could be your connection or someone else's connection (i.e., you are canceling one of many possible connections). Definitely nitpickery either way.

>> - * Returns true iff the lock was acquired.
>> + * Returns true if the lock was acquired.
>
> Using "iff" here is consistent with the rest of the file (and technically correct):
>
> $ grep -c "if the lock was" src/backend/storage/lmgr/lmgr.c
> 1
> $ grep -c "iff the lock was" src/backend/storage/lmgr/lmgr.c
> 5

Ah, yeah, I was pretty focused on the event trigger stuff and didn't notice it being used elsewhere; I thought it was a typo, but I guess it's meant as shorthand for "if and only if". I wonder how many people are familiar with that.

Robert Treat https://xzilla.net