Re: Add support for AT LOCAL
On Sat, Sep 23, 2023 at 12:54:01AM +0200, Vik Fearing wrote:
> On 9/22/23 23:46, cary huang wrote:
>> I think this feature can be a useful addition in dealing with time
>> zones. I have applied and tried out the patch. The feature works as
>> described and seems promising. The problem with compilation failure
>> was probably reported on CirrusCI when compiled on different
>> platforms. I have run the latest patch on my own Cirrus CI environment
>> and everything checked out fine.
>
> Thank you for reviewing!

+| a_expr AT LOCAL %prec AT
+{
+    /* Use the value of the session's time zone */
+    FuncCall   *tz = makeFuncCall(SystemFuncName("current_setting"),
+                                  list_make1(makeStringConst("TimeZone", -1)),
+                                  COERCE_SQL_SYNTAX,
+                                  -1);
+
+    $$ = (Node *) makeFuncCall(SystemFuncName("timezone"),
+                               list_make2(tz, $1),
+                               COERCE_SQL_SYNTAX,
+                               @2);

As the deparsing code introduced by this patch is showing, this leads to a lot
of extra complexity. And, actually, this can be quite expensive as well with
these two layers of functions. Note also that in comparison to
SQLValueFunctions, COERCE_SQL_SYNTAX does less inlining. So here comes my
question: why doesn't this stuff just use one underlying function to do this
job?
--
Michael
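For reference, a minimal SQL illustration of what the grammar production above builds (the two layers of functions Michael refers to); this is only an illustration of the rewrite, not code from the patch:

    -- AT LOCAL, as proposed, is roughly:
    SELECT timezone(current_setting('TimeZone'),
                    TIMESTAMP WITH TIME ZONE '2023-09-23 00:54:01+02');

    -- i.e. the same result as spelling out the session's zone by hand:
    SELECT TIMESTAMP WITH TIME ZONE '2023-09-23 00:54:01+02'
               AT TIME ZONE current_setting('TimeZone');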
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Thu, Sep 28, 2023 at 6:08 PM Zhijie Hou (Fujitsu) wrote:
>
> On Thursday, September 28, 2023 5:32 PM Bharath Rupireddy wrote:
> >
> > Perhaps, a function in logical/decode.c returning the WAL record as valid
> > if the record type is any of the above. A note in replication/decode.h
> > and/or access/rmgrlist.h asking rmgr adders to categorize the WAL record
> > type in the new function based on its decoding operation might help with
> > future new WAL record type additions.
> >
> > Thoughts?
>
> I think this approach can work, but I am not sure if it's better than other
> approaches. Mainly because it has almost the same maintenance burden as the
> current approach, i.e. we need to verify and update the check function each
> time we add a new WAL record type.

I think that's not a big problem if we have comments in replication/decode.h,
access/rmgrlist.h, and the docs to categorize the new WAL records as
decodable. Currently, adders of new WAL record types will have to do certain
things based on notes in comments or docs anyway.

Another idea to enforce categorizing decodability of WAL records is to have a
new RMGR API, rm_is_record_decodable or such; the RMGR implementers will then
add respective functions returning true/false depending on whether a given
WAL record is decodable or not:

    void        (*rm_decode) (struct LogicalDecodingContext *ctx,
                              struct XLogRecordBuffer *buf);
    bool        (*rm_is_record_decodable) (uint8 type);
} RmgrData;

PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL,
        NULL, NULL, xlog_is_record_decodable)

then xlog_is_record_decodable can look something like [1]. This approach can
also enforce/help custom RMGR implementers to define the decodability of
their WAL records.

> Apart from the WAL scan approach, we also considered an alternative approach
> that does not impose an additional maintenance burden and could potentially
> be less complex. For example, we can add a new field in pg_controldata to
> record the last checkpoint that happens in non-upgrade mode, so that we can
> compare the slot's confirmed_flush_lsn with this value. If they are the
> same, the WAL should have been consumed; otherwise we disallow upgrading
> this slot. I would appreciate it if you can share your thoughts about this
> approach.

I read
https://www.postgresql.org/message-id/CAA4eK1JVKZGRHLOEotWi%2Be%2B09jucNedqpkkc-Do4dh5FTAU%2B5w%40mail.gmail.com
and I agree with the concern about adding a new field in pg_controldata just
for this purpose and spreading the IsBinaryUpgrade code in the checkpointer.
Another concern for me with the new-field-in-pg_controldata approach is that
it makes it hard to make this patch support back branches. Therefore, -1 for
this approach from me.

> And if we decided to use the WAL scan approach, instead of checking each
> record, we could directly check if the WAL record can be decoded into
> meaningful results by using test_decoding to decode them. This approach
> also doesn't add a new maintenance burden as we anyway need to update
> test_decoding if any decode logic for new records changes. This was also
> mentioned [1].
>
> What do you think?
>
> [1] https://www.postgresql.org/message-id/OS0PR01MB5716FC0F814D78E82E4CC3B894C3A%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-1 for decoding the WAL with test_decoding; I don't think it's a great idea
to create temp slots and launch walsenders during upgrade. IMO, the WAL
scanning approach looks better.
However, if we were to optimize it by not scanning WAL records for every
replication slot's confirmed_flush_lsn (CFL), we could start with the lowest
CFL (min of all slots' CFLs) and scan till the end of WAL. The
binary_upgrade_validate_wal_logical_end function can return an array of LSNs
at which decodable WAL records are found. Then, use the CFLs of all other
slots and this array to determine if the slots have unconsumed WAL. Following
is an illustration of this idea:

1. Slots s1, s2, s3, s4, s5 with CFLs 100, 90, 110, 70, 80 respectively.
2. Min of all CFLs is 70, for slot s4.
3. Start scanning WAL from min CFL 70 for slot s4; say there is unconsumed
   WAL at LSNs {85, 89}.
4. Now, without scanning WAL for the rest of the slots, determine if they
   have unconsumed WAL.
5.1. CFL of slot s1 is 100 and there is no unconsumed WAL at or after LSN 100
     - look at the array of unconsumed WAL LSNs {85, 89}.
5.2. CFL of slot s2 is 90 and there is no unconsumed WAL at or after LSN 90
     - look at the array of unconsumed WAL LSNs {85, 89}.
5.3. CFL of slot s3 is 110 and there is no unconsumed WAL at or after LSN 110
     - look at the array of unconsumed WAL LSNs {85, 89}.
5.4. CFL of slot s4 is 70 and there is unconsumed WAL at or after LSN 70
     - look at the array of unconsumed WAL LSNs {85, 89}.
5.5. CFL of slot s5 is 80 and there is unconsumed WAL at or after LSN 80
     - look at the array of unconsumed WAL LSNs {85, 89}.

With this approach, the WAL is scanned only once as opposed to the current
approach the patch implements. Thoughts?

[1] bool xlog_is_record_decodable(uint8
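A purely illustrative SQL sketch of the per-slot check in steps 5.1 to 5.5 above (the actual check would of course be C code inside pg_upgrade, and binary_upgrade_validate_wal_logical_end's return type is an assumption here; hex pg_lsn literals are chosen to preserve the ordering of the decimal example):

    WITH decodable(lsn) AS (
        VALUES ('0/85'::pg_lsn), ('0/89'::pg_lsn)   -- LSNs found by the single WAL scan
    ), slots(slot_name, confirmed_flush_lsn) AS (
        VALUES ('s1', '0/100'::pg_lsn), ('s2', '0/90'::pg_lsn),
               ('s3', '0/110'::pg_lsn), ('s4', '0/70'::pg_lsn),
               ('s5', '0/80'::pg_lsn)
    )
    SELECT s.slot_name,
           EXISTS (SELECT 1 FROM decodable d
                   WHERE d.lsn >= s.confirmed_flush_lsn) AS has_unconsumed_wal
    FROM slots s
    ORDER BY s.slot_name;
    -- s1, s2, s3 => false; s4, s5 => true, matching steps 5.1 to 5.5.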
Re: pg_resetwal tests, logging, and docs update
On 26.09.23 17:19, Aleksander Alekseev wrote:
> Attached is an updated patch set where I have split the changes into
> smaller pieces. The last two patches still have some open questions about
> what certain constants mean etc. The other patches should be settled.
>
> The patches 0001..0005 seem to be ready and rather independent. I suggest
> merging them and continue discussing the rest of the patches.

I have committed 0001..0005, and also posted a separate patch to discuss and
correct the behavior of the -c option. I expect that we will carry over this
patch set to the next commit fest.
pg_resetwal regression: could not upgrade after 1d863c2504
Dear hackers,
(CC: Peter Eisentraut - committer of the problematic commit)

While developing a pg_upgrade patch, I found a candidate regression in pg_resetwal. It might have been caused by 1d863c2504. Is it really a regression, or am I missing something?

# Phenomenon

pg_resetwal with a relative path cannot be executed. It could be done at 7273945, but not at 1d863.

At 1d863:

```
$ pg_resetwal -n data_N1/
pg_resetwal: error: could not read permissions of directory "data_N1/": No such file or directory
```

At 7273945:

```
$ pg_resetwal -n data_N1/
Current pg_control values:

pg_control version number:            1300
Catalog version number:               202309251
...
```

# Environment

The attached script was executed on RHEL 7.9, gcc was 8.3.1. I used the meson build system with the following options:

meson setup -Dcassert=true -Ddebug=true -Dc_args="-ggdb -O0 -g3 -fno-omit-frame-pointer"

# My analysis

I found that the part below in GetDataDirectoryCreatePerm() returns false, which is the cause.

```
	/*
	 * If an error occurs getting the mode then return false.  The caller is
	 * responsible for generating an error, if appropriate, indicating that we
	 * were unable to access the data directory.
	 */
	if (stat(dataDir, &statBuf) == -1)
		return false;
```

Also, I found that the value of DataDir in main() holds a relative path. Based on that, the upcoming stat() may not be able to find the given location because the process has already moved inside the directory.

```
(gdb) break chdir
Breakpoint 1 at 0x4016f0
(gdb) run -n data_N1
...
Breakpoint 1, 0x778e1390 in chdir () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) print DataDir
$1 = 0x7fffe25c "data_N1"
(gdb) frame 1
#1  0x004028d7 in main (argc=3, argv=0x7fffdf58) at ../postgres/src/bin/pg_resetwal/pg_resetwal.c:348
348             if (chdir(DataDir) < 0)
(gdb) print DataDir
$2 = 0x7fffe25c "data_N1"
```

# How to fix

One alternative approach is to call chdir() several times. PSA the patch. (I'm not sure the commit should be reverted.)

# Appendix - How did I find it?

Originally, I found the issue when the attached script was executed. It creates two clusters and executes pg_upgrade, but it failed with the following output. (I also attached the whole output; please see result_*.out.)

```
Performing Consistency Checks
-
Checking cluster versions                                   ok
pg_resetwal: error: could not read permissions of directory "data_N1": No such file or directory
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

[Attachments: test.sh, result_7273945ca.out, result_1d863c2504.out, fix.patch]
Re: wal recycling problem
Yes, barman replication can keep up with the primary; WAL segments size is under max_wal_size (24Gb in our configuration).

Here is the pg_replication_slots view:

 barman_ge   physical   f   t   39409   1EE2/4900   reserved   f
 barman_be   physical   f   t   39434   1EE2/3D00   reserved   f

On the other hand there are 2 slots for logical replication which display status extended. I don't understand why, given that the confirmed_flush_lsn field is up to date. The restart_lsn remains frozen; for what reason?

 pgoutput │ logical │ 2667915 │ db019a00 │ f │ t │ 1880162 │ │ 68512101 │ 1ECA/37C3F1B8 │ 1EE2/4D6BDCF8 │ extended │ │ f │
 pgoutput │ logical │ 2668584 │ db038a00 │ f │ t │  363230 │ │ 68512101 │ 1ECA/37C3F1B8 │ 1EE2/4D6BDCF8 │ extended │ │ f │

Regards
Fabrice

On Thu, Sep 28, 2023 at 7:59 PM Christoph Moench-Tegeder wrote:
> ## Fabrice Chapuis (fabrice636...@gmail.com):
>
> > We have a cluster of 2 members (1 primary and 1 standby) with Postgres
> > version 14.9 and 2 barman server, slots are only configured for barman,
> > barman is version 3.7.
>
> The obvious question here is: can both of those barmans keep up with
> your database, or are you seeing WAL retention due to exactly these
> replication slots? (Check pg_replication_slots).
>
> Regards,
> Christoph
>
> --
> Spare Space
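For what it's worth, a generic query (not specific to this setup) to quantify how much WAL each slot is holding back, by comparing restart_lsn with the current write position on the primary:

    SELECT slot_name, slot_type, active, wal_status,
           restart_lsn, confirmed_flush_lsn,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
               AS wal_retained
    FROM pg_replication_slots
    ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;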
Re: [PGDOCS] change function linkend to refer to a more relevant target
> On 29 Sep 2023, at 06:51, Peter Smith wrote: > I found a link in the docs that referred to the stats "views" section, > instead of the more relevant (IMO) stats "functions" section. Agreed. This link was added in 2007 (in 7d3b7011b), and back in 7.3/7.4 days the functions were listed in the same section as the views, so the anchor was at the time pointing to the right section. In 2012 it was given its own section (in aebe989477) but the link wasn't updated. Thanks for the patch, I'll go ahead with it. -- Daniel Gustafsson
Re: pg_resetwal regression: could not upgrade after 1d863c2504
On 29.09.23 09:39, Hayato Kuroda (Fujitsu) wrote:
> pg_resetwal with relative path cannot be executed. It could be done at
> 7273945, but could not at 1d863.

Ok, I have reverted the offending patch.
how to manage Cirrus on personal repository
I have a private repository on GitHub where I park PostgreSQL development work. I also have Cirrus activated on that repository, to build those branches, using the existing Cirrus configuration. Now under the new system of limited credits that started in September, this blew through the credits about half-way through the month. Does anyone have an idea how to manage this better? Is there maybe a way to globally set "only trigger manually", or could we make one? I suppose one way to deal with this is to make a second repository and only activate Cirrus on that one, and push there on demand. Any ideas?
Re: Remove MSVC scripts from the tree
On 22.09.23 03:12, Michael Paquier wrote:
> It has been mentioned a few times now that, as Meson has been integrated in
> the tree, the next step would be to get rid of the custom scripts in
> src/tools/msvc/ and moving forward only support Meson when building with VS
> compilers. As far as I understand, nobody has sent a patch to do that yet,
> so here is one. Please find attached a patch to move the needle in this
> sense.

Your patch still leaves various mentions of Mkvcbuild.pm and Project.pm in
other files, including in

config/perl.m4
meson.build
src/bin/pg_basebackup/Makefile
src/bin/pgevent/meson.build
src/common/Makefile
src/common/meson.build
src/interfaces/libpq/Makefile
src/port/Makefile

A few of these comments are like "see $otherfile for the reason", which means
that if we delete $otherfile, we should move that information to a new site
somehow.
Re: On login trigger: take three
> On 28 Sep 2023, at 23:50, Alexander Korotkov wrote:
>
> I don't think I can reproduce the performance regression pointed out
> by Pavel Stehule [1].
> I can't confirm the measurable overhead.

Running the same pgbench command on my laptop, looking at the average
connection times and averaging that over five runs (low/avg/high), I see a
~5% increase over master with the patched version (compiled without
assertions and debug):

Patched event_triggers on:  6.858 ms/7.038 ms/7.434 ms
Patched event_triggers off: 6.601 ms/6.958 ms/7.539 ms
Master:                     6.676 ms/6.697 ms/6.760 ms

This is all quite unscientific with a lot of jitter, so grains of salt are to
be applied, but I find it odd that you don't see any measurable effect. Are
you seeing the same/similar connection times between master and with this
patch applied?

A few small comments on the patch:

+     prevent successful login to the system. Such bugs may be fixed by
+     restarting the system in single-user mode (as event triggers are

This paragraph should be reworded to recommend the GUC instead of single-user
mode (while retaining mention of single-user mode, just not as the primary
option).

+        Also, it's recommended to evade long-running queries in

s/evade/avoid/ perhaps?

Thanks for working on this!

--
Daniel Gustafsson
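A hedged sketch of the GUC-based recovery that the comment above suggests documenting. It assumes the patch's boolean GUC is named event_triggers (as the benchmark labels suggest) and that it can be applied at connection time; the trigger name is made up:

    -- Connect with event trigger firing disabled, for example:
    --     PGOPTIONS="-c event_triggers=false" psql -U postgres
    -- then remove or repair the offending trigger from that session:
    DROP EVENT TRIGGER faulty_login_trigger;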
Re: Testing autovacuum wraparound (including failsafe)
> On 27 Sep 2023, at 14:39, Masahiko Sawada wrote:
>
> I've attached new version patches. 0001 patch adds an option to
> background_psql to specify the timeout seconds, and 0002 patch is the
> main regression test patch.

-=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params)
+=item PostgreSQL::Test::BackgroundPsql->new(interactive, @params, timeout)

Looking at this I notice that I made a typo in 664d757531e; the =item line
should have "@psql_params" and not "@params". Perhaps you can fix that minor
thing while in there?

+	$timeout = $params{timeout} if defined $params{timeout};

I think this should be documented in the background_psql POD docs.

+   Not enabled by default it is resource intensive.

This sentence is missing a "because", it should read: "..by default *because*
it is.."

+# Bump the query timeout to avoid false negatives on slow test systems.
+my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;

Should we bump the timeout like this for all systems? I interpreted Noah's
comment such that it should be possible for slower systems to override, not
that it should be extended everywhere, but I might have missed something.

--
Daniel Gustafsson
Re: Testing autovacuum wraparound (including failsafe)
On Thu, 28 Sept 2023 at 03:55, Masahiko Sawada wrote:
>
> Sorry for the late reply.
>
> On Sun, Sep 3, 2023 at 2:48 PM Noah Misch wrote:
> >
> > On Wed, Jul 12, 2023 at 01:47:51PM +0200, Daniel Gustafsson wrote:
> > > > On 12 Jul 2023, at 09:52, Masahiko Sawada wrote:
> > > > Agreed. The timeout can be set by manually setting
> > > > PG_TEST_TIMEOUT_DEFAULT, but I bump it to 10 min by default. And it
> > > > now require setting PG_TET_EXTRA to run it.
> > >
> > > +# bump the query timeout to avoid false negatives on slow test syetems.
> > > typo: s/syetems/systems/
> > >
> > > +# bump the query timeout to avoid false negatives on slow test syetems.
> > > +$ENV{PG_TEST_TIMEOUT_DEFAULT} = 600;
> > > Does this actually work? Utils.pm read the environment variable at compile
> > > time in the BEGIN block so this setting won't be seen? A quick testprogram
> > > seems to confirm this but I might be missing something.
> >
> > The correct way to get a longer timeout is "IPC::Run::timer(4 *
> > $PostgreSQL::Test::Utils::timeout_default);". Even if changing env worked,
> > that would be removing the ability for even-slower systems to set timeouts
> > greater than 10min.
>
> Agreed.
>
> I've attached new version patches. 0001 patch adds an option to
> background_psql to specify the timeout seconds, and 0002 patch is the
> main regression test patch.

Few comments:

1) Should we have some validation for the inputs given:

+PG_FUNCTION_INFO_V1(consume_xids_until);
+Datum
+consume_xids_until(PG_FUNCTION_ARGS)
+{
+	FullTransactionId targetxid = FullTransactionIdFromU64((uint64) PG_GETARG_INT64(0));
+	FullTransactionId lastxid;
+
+	if (!FullTransactionIdIsNormal(targetxid))
+		elog(ERROR, "targetxid %llu is not normal", (unsigned long long) U64FromFullTransactionId(targetxid));

If not, it will take inputs like -1 and 999999999999999. Also the notice
messages might be confusing for the above values, as it will show a different
untilxid value like the below:

postgres=# SELECT consume_xids_until(999999999999999);
NOTICE:  consumed up to 0:1809 / 232830:2764472319

2) Should this be added after worker_spi, as we generally add it in
alphabetical order:

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fcd643f6f1..4054bde84c 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -10,6 +10,7 @@ subdir('libpq_pipeline')
 subdir('plsample')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
+subdir('xid_wraparound')
 subdir('test_bloomfilter')

3) Similarly here too:

index e81873cb5a..a4c845ab4a 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
 		  libpq_pipeline \
 		  plsample \
 		  spgist_name_ops \
+		  xid_wraparound \
 		  test_bloomfilter \

4) The following includes are not required: transam.h, fmgr.h, lwlock.h

+ * src/test/modules/xid_wraparound/xid_wraparound.c
+ *
+ * -
+ */
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/xact.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"

Regards,
Vignesh
Re: how to manage Cirrus on personal repository
> On 29 Sep 2023, at 11:13, Peter Eisentraut wrote: > Does anyone have an idea how to manage this better? Is there maybe a way to > globally set "only trigger manually", or could we make one? On my personal repo I only build via doing pull-requests, such that I can control when and what is built and rate-limit myself that way. Using the Github CLI client it's quite easy to script "push-and-build". Not sure if it's better, it's just what I got used to from doing personal CI on Github before we had Cirrus support in the tree. -- Daniel Gustafsson
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Fri, Sep 29, 2023 at 1:00 PM Bharath Rupireddy wrote: > > On Thu, Sep 28, 2023 at 6:08 PM Zhijie Hou (Fujitsu) > wrote: > > IMO, WAL scanning approach looks better. However, if were to optimize > it by not scanning WAL records for every replication slot > confirmed_flush_lsn (CFL), start with lowest CFL (min of all slots > CFL), and scan till the end of WAL. > Earlier, I also thought something like that but I guess it won't matter much as most of the slots will be up-to-date at shutdown time. That would mean we would read just one or two records. Personally, I feel it is better to build consensus on the WAL scanning approach, basically, is it okay to decide as the patch is currently doing or whether we should expose an API from the decode module as you are proposing? OTOH, if we want to go with other approach like adding field in pg_controldata then we don't need to deal with WAL record types at all. -- With Regards, Amit Kapila.
Re: how to manage Cirrus on personal repository
Hi, On Fri, 29 Sept 2023 at 13:24, Peter Eisentraut wrote: > > I have a private repository on GitHub where I park PostgreSQL > development work. I also have Cirrus activated on that repository, to > build those branches, using the existing Cirrus configuration. > > Now under the new system of limited credits that started in September, > this blew through the credits about half-way through the month. > > Does anyone have an idea how to manage this better? Is there maybe a > way to globally set "only trigger manually", or could we make one? You can create another repository / branch with only a .cirrus.yml file which will only have the 'trigger_type: manual' line. Then, you can go to your private repository's settings on Cirrus CI and set REPO_CI_CONFIG_GIT_URL=github.com/{user}/{repository}/.cirrus.yml@{branch} environment variable. This will write contents of the newly created .cirrus.yml file to your private repository's .cirrus.yml configuration while running the CI. You can look at the .cirrus.star file for more information. I just tested this on a public repository and it worked but I am not sure if something is different for private repositories. I hope these help. Regards, Nazir Bilal Yavuz Microsoft
Re: Synchronizing slots from primary to standby
On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand wrote: > > On 9/25/23 6:10 AM, shveta malik wrote: > > On Fri, Sep 22, 2023 at 3:48 PM Amit Kapila wrote: > >> > >> On Thu, Sep 21, 2023 at 9:16 AM shveta malik > >> wrote: > >>> > >>> On Tue, Sep 19, 2023 at 10:29 AM shveta malik > >>> wrote: > >>> > >>> Currently in patch001, synchronize_slot_names is a GUC on both primary > >>> and physical standby. This GUC tells which all logical slots need to > >>> be synced on physical standbys from the primary. Ideally it should be > >>> a GUC on physical standby alone and each physical standby should be > >>> able to communicate the value to the primary (considering the value > >>> may vary for different physical replicas of the same primary). The > >>> primary on the other hand should be able to take UNION of these values > >>> and let the logical walsenders (belonging to the slots in UNION > >>> synchronize_slots_names) wait for physical standbys for confirmation > >>> before sending those changes to logical subscribers. The intent is > >>> logical subscribers should never be ahead of physical standbys. > >>> > >> > >> Before getting into the details of 'synchronize_slot_names', I would > >> like to know whether we really need the second GUC > >> 'standby_slot_names'. Can't we simply allow all the logical wal > >> senders corresponding to 'synchronize_slot_names' to wait for just the > >> physical standby(s) (physical slot corresponding to such physical > >> standby) that have sent ' synchronize_slot_names'list? We should have > >> one physical standby slot corresponding to one physical standby. > >> > > > > yes, with the new approach (to be implemented next) where we plan to > > send synchronize_slot_names from each physical standby to primary, the > > standby_slot_names GUC should no longer be needed on primary. The > > physical standbys sending requests should automatically become the > > ones to be waited for confirmation on the primary. > > > > I think that standby_slot_names could be used to do some filtering (means > for which standby(s) we don't want the logical replication on the primary to > go > ahead and for which standby(s) one would allow it). > Isn't it implicit that the physical standby that has requested 'synchronize_slot_names' should be ahead of their corresponding logical walsenders? Otherwise, after the switchover to the new physical standby, the logical subscriptions won't work. > I think that removing the GUC would: > > - remove this flexibility > I think if required we can add such a GUC later as well. Asking users to set more parameters also makes the feature less attractive, so I am trying to see if we can avoid this GUC. > - probably open corner cases like: what if a standby is down? would that mean > that synchronize_slot_names not being send to the primary would allow the > decoding > on the primary to go ahead? > Good question. BTW, irrespective of whether we have 'standby_slot_names' parameters or not, how should we behave if standby is down? Say, if 'synchronize_slot_names' is only specified on standby then in such a situation primary won't be even aware that some of the logical walsenders need to wait. OTOH, one can say that users should configure 'synchronize_slot_names' on both primary and standby but note that this value could be different for different standby's, so we can't configure it on primary. -- With Regards, Amit Kapila.
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Bharath,

Thanks for giving your idea!

> > I think this approach can work, but I am not sure if it's better than other
> > approaches. Mainly because it has almost the same maintenance burden as the
> > current approach, i.e. we need to verify and update the check function each
> > time we add a new WAL record type.
>
> I think that's not a big problem if we have comments in
> replication/decode.h, access/rmgrlist.h, docs to categorize the new
> WAL records as decodable. Currently, the WAL record types adders will
> have to do certain things based on notes in comments or docs anyways.
>
> Another idea to enforce categorizing decodability of WAL records is to
> have a new RMGR API rm_is_record_decodable or such, the RMGR
> implementers will then add respective functions returning true/false
> if a given WAL record is decodable or not:
>     void        (*rm_decode) (struct LogicalDecodingContext *ctx,
>                               struct XLogRecordBuffer *buf);
>     bool        (*rm_is_record_decodable) (uint8 type);
> } RmgrData;
>
> PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL,
> NULL, NULL, xlog_is_record_decodable), then the
> xlog_is_record_decodable can look something like [1].
>
> This approach can also enforce/help custom RMGR implementers to define
> the decodability of the WAL records.

Yeah, the approach forces developers to check decodability. But the benefit
seems smaller than the effort required, because the function would be used
only by pg_upgrade. Could you tell me if you have another use case in mind?
We may be able to adopt it if we have...

Also, this approach cannot be backported. Anyway, let's see what senior
members say.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: pg_upgrade and logical replication
On Wed, Sep 27, 2023 at 9:14 AM Michael Paquier wrote: > > On Tue, Sep 26, 2023 at 09:40:48AM +0530, Amit Kapila wrote: > > On Mon, Sep 25, 2023 at 11:43 AM Michael Paquier > > wrote: > >> Sure, that's assuming that the publisher side is upgraded. > > > > At some point, user needs to upgrade publisher and subscriber could > > itself have some publications defined which means the downstream > > subscribers will have the same problem. > > Not always. I take it as a valid case that one may want to create a > logical setup only for the sake of an upgrade, and trashes the > publisher after a failover to an upgraded subscriber node after the > latter has done a sync up of the data that's been added to the > relations tracked by the publications while the subscriber was > pg_upgrade'd. > Such a use case is possible to achieve even without this patch. Sawada-San has already given an alternative to slightly tweak the steps mentioned by Julien to achieve it. Also, there are other ways to achieve it by slightly changing the steps. OTOH, it will create a problem for normal logical replication set up after upgrade as discused. -- With Regards, Amit Kapila.
Re: how to manage Cirrus on personal repository
On Fri, Sep 29, 2023 at 7:11 AM Daniel Gustafsson wrote: > > > On 29 Sep 2023, at 11:13, Peter Eisentraut wrote: > > > Does anyone have an idea how to manage this better? Is there maybe a way > > to globally set "only trigger manually", or could we make one? > > On my personal repo I only build via doing pull-requests, such that I can > control when and what is built and rate-limit myself that way. Using the > Github CLI client it's quite easy to script "push-and-build". Not sure if > it's > better, it's just what I got used to from doing personal CI on Github before > we > had Cirrus support in the tree. It is not a global configuration solution, but I just add an empty ci-os-only: tag to my commit messages so it doesn't trigger CI. I'm sure this is not what you are looking for, though - Melanie
Re: Eager page freeze criteria clarification
On Wed, Sep 27, 2023 at 7:09 PM Melanie Plageman wrote: > At the risk of seeming too execution-focused, I want to try and get more > specific. Here is a description of an example implementation to test my > understanding: > > In table-level stats, save two numbers: younger_than_cpt/older_than_cpt > storing the number of instances of unfreezing a page which is either > younger or older than the start of the most recent checkpoint at the > time of its unfreezing When I first read this, I thought this sounded right, except for the naming which doesn't mention freezing, but on further thought, this feels like it might not be the right thing. Why do I care how many super-old pages (regardless of how exactly we define super-old) I unfroze? I think what I care about is more like: how often do I have to unfreeze a recently-frozen page in order to complete a DML operation? For example, consider a table with a million pages, 10 of which are frozen. 1 was frozen a long time ago, 9 were frozen recently, and the other 999,990 are unfrozen. I update every page in the table. Well, now older_than_cpt = 1 and younger_than_cpt = 9, so the ratio of the two is terrible: 90% of the frozen pages I encountered were recently-frozen. But that's irrelevant. What matters is that if I had frozen less aggressively, I could have saved myself 9 unfreeze operations across a million page modifications. So actually I'm doing great: only 0.0009% of my page modifications required an unfreeze that I maybe should have avoided and didn't. Likewise, if all of the pages had been frozen, but only 9 were frozen recently, I'm doing exactly equally great. But if the number of recently frozen pages were higher, then I'd be doing less great. So I think comparing the number of young pages unfrozen to the number of old pages unfrozen is the wrong test, and comparing it instead to the number of pages modified is a better test. But I don't think it's completely right, either. Imagine a table with 100 recently frozen pages. I modify the table and unfreeze one of them. So, 100% of the pages that I unfroze were recently-frozen. Does this mean that we should back off and freeze less aggressively? Not really. It seems like I actually got this case 99% right. So another idea could be to look at what percentage of frozen pages I end up un-freezing shortly thereafter. That gets this example right. But it gets the earlier example with a million pages wrong. Maybe there's something to combining these ideas. If the percentage of page modifications (of clean pages, say, so that we don't skew the stats as much when a page is modified many times in a row) that have to un-freeze a recently-frozen page is low, then that means there's no real performance penalty for whatever aggressive freezing we're doing. Even if we un-freeze a zillion pages, if that happens over the course of modifying 100 zillion pages, it's not material. On the flip side, if the percentage of frozen pages that get unfrozen soon thereafter is low, then it feels worth doing even if most page modifications end up un-freezing a recently-frozen page, because on net we're still ending up with a lot of extra stuff frozen. Said differently, we're freezing too aggressively if un-freezing is both adding a significant expense (i.e. the foreground work we're doing often requires un-freezing ages) and ineffective (i.e. we don't end up with frozen pages left over). If it has one of those problems, it's still OK, but if it has both, we need to back off. 
Maybe that's still not quite right, but it's the best I've got this morning. > You would likely want to combine it with one of the other heuristics we > discussed. > > For example: > For a table with only 20% younger unfreezings, when vacuuming that page, > > if insert LSN - RedoRecPtr < insert LSN - page LSN > page is older than the most recent checkpoint start, so freeze it > regardless of whether or not it would emit an FPI > > What aggressiveness levels should there be? What should change at each > level? What criteria should pages have to meet to be subject to the > aggressiveness level? My guess is that, when we're being aggressive, we should be aiming to freeze all pages that can be completely frozen with no additional criterion. Any additional criterion that we add here is likely to mess up the insert-only table case, and I'm not really sure what it saves us from. On the other hand, when we're being non-aggressive, we might still sometimes want to freeze in situations where we historically haven't. Consider the following three examples: 1. pgbench_accounts table, standard run 2. pgbench_accounts table, highly skewed run 3. slowly growing table with a hot tail. records are inserted, updated heavily for a while, thereafter updated only very occasionally. Aggressive freezing could misfire in all of these cases, because all of them have some portion of the table where rows are being updated very frequently. But in the second and third cases, the
Re: Testing autovacuum wraparound (including failsafe)
On Fri, Sep 29, 2023 at 12:17:04PM +0200, Daniel Gustafsson wrote: > +# Bump the query timeout to avoid false negatives on slow test systems. > +my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default; > > Should we bump the timeout like this for all systems? I interpreted Noah's > comment such that it should be possible for slower systems to override, not > that it should be extended everywhere, but I might have missed something. This is the conventional way to do it. For an operation far slower than a typical timeout_default situation, the patch can and should dilate the default timeout like this. The patch version as of my last comment was extending the timeout but also blocking users from extending it further via PG_TEST_TIMEOUT_DEFAULT. The latest version restores PG_TEST_TIMEOUT_DEFAULT reactivity, resolving my comment.
Re: [PGDOCS] change function linkend to refer to a more relevant target
> On 29 Sep 2023, at 10:54, Daniel Gustafsson wrote: > >> On 29 Sep 2023, at 06:51, Peter Smith wrote: > >> I found a link in the docs that referred to the stats "views" section, >> instead of the more relevant (IMO) stats "functions" section. > > Agreed. This link was added in 2007 (in 7d3b7011b), and back in 7.3/7.4 days > the functions were listed in the same section as the views, so the anchor was > at the time pointing to the right section. In 2012 it was given its own > section (in aebe989477) but the link wasn't updated. > > Thanks for the patch, I'll go ahead with it. Applied to all supported branches, thanks! -- Daniel Gustafsson
Re: Annoying build warnings from latest Apple toolchain
Hi,

On 2023-09-28 22:53:09 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2023-09-28 19:20:27 -0700, Andres Freund wrote:
> >> Thus the easiest fix looks to be to use this:
> >> -    export_fmt = '-exported_symbols_list=@0@'
> >> +    export_fmt = '-Wl,-exported_symbols_list,@0@'
> >> I don't have anything older than Ventura to check though.
>
> I don't have meson installed on my surviving Catalina box, but
> I tried the equivalent thing in the Makefile universe:
>
> diff --git a/src/Makefile.shlib b/src/Makefile.shlib
> index f94d59d1c5..f2ed222cc7 100644
> --- a/src/Makefile.shlib
> +++ b/src/Makefile.shlib
> @@ -136,7 +136,7 @@ ifeq ($(PORTNAME), darwin)
>    BUILD.exports = $(AWK) '/^[^\#]/ {printf "_%s\n",$$1}' $< >$@
>    exports_file = $(SHLIB_EXPORTS:%.txt=%.list)
>    ifneq (,$(exports_file))
> -    exported_symbols_list = -exported_symbols_list $(exports_file)
> +    exported_symbols_list = -Wl,-exported_symbols_list,$(exports_file)
>    endif
> endif
>
> That builds and produces correctly-symbol-trimmed shlibs, so I'd
> say it's fine.

Thanks for testing! I'll go and push that to 16/HEAD then.

> (Perhaps we should apply the above to HEAD alongside the meson.build fix, to
> get more test coverage?)

The macOS BF animals seem to run Ventura, so I think it'd not really provide
additional coverage over what CI and your manual testing already have. So
probably not worth it from that angle?

> > Attached is the above change and a commit to change CI over to Sonoma. Not
> > sure when we should switch, but it seems useful to include for testing
> > purposes at the very least.
>
> No opinion on when to switch CI. Sonoma is surely pretty bleeding edge
> yet.

Yea, it does feel like that...

Greetings,

Andres Freund
Re: how to manage Cirrus on personal repository
Hi, On 2023-09-29 11:13:24 +0200, Peter Eisentraut wrote: > I have a private repository on GitHub where I park PostgreSQL development > work. I also have Cirrus activated on that repository, to build those > branches, using the existing Cirrus configuration. > > Now under the new system of limited credits that started in September, this > blew through the credits about half-way through the month. :( > Does anyone have an idea how to manage this better? Is there maybe a way to > globally set "only trigger manually", or could we make one? One way is to configure to run under a custom compute account. If we get credits from google, that might be realistic to provide to many hackers, it's not that expensive to do. Another thing is to work on our tests - we apparently waste *tremendous* amounts of time due to tap tests forking psql over and over. We definitely can provide a way to configure on a repo level which tests run automatically. I think that would be useful not just for turning things manual, but also the opposite, enabling tests that we don't want everybody to run to automatically run as part of cfbot. E.g. running mingw tests during cfbot while keeping it manual for everyone else. Maybe we should have two environment variables, which can be overwritten set on a repository level, one to make manual tasks run by default, and one the other way? > I suppose one way to deal with this is to make a second repository and only > activate Cirrus on that one, and push there on demand. You already can control which platforms are run via commit messages, fwiw. By adding, e.g.: ci-os-only: linux, macos Greetings, Andres Freund
Re: Eager page freeze criteria clarification
On Thu, Sep 28, 2023 at 12:03 AM Peter Geoghegan wrote: > But isn't the main problem *not* freezing when we could and > should have? (Of course the cost of freezing is very relevant, but > it's still secondary.) Perhaps this is all in how you look at it, but I don't see it this way. It's easy to see how to solve the "not freezing" problem: just freeze everything as often as possible. If that were cheap enough that we could just do it, then we'd just do it and be done here. The problem is that, at least in my opinion, that seems too expensive in some cases. I'm starting to believe that those cases are narrower than I once thought, but I think they do exist. So now, I'm thinking that maybe the main problem is identifying when you've got such a case, so that you know when you need to be less aggressive. > Won't the algorithm that you've sketched always think that > "unfreezing" pages doesn't affect recently frozen pages with such a > workload? Isn't the definition of "recently frozen" that emerges from > this algorithm not in any way related to the order delivery time, or > anything like that? You know, rather like vacuum_freeze_min_age. FWIW, I agree that vacuum_freeze_min_age sucks. I have been reluctant to endorse changes in this area mostly because I fear replacing one bad idea with another, not because I think that what we have now is particularly good. It's better to be wrong in the same way in every release than to have every release be equally wrong but in a different way. Also, I think the question of what "recently frozen" means is a good one, but I'm not convinced that it ought to relate to the order delivery time. If we insert into a table and 12-14 hours go buy before it's updated, it doesn't seem particularly bad to me if we froze that data meanwhile (regardless of what metric drove that freezing). Same thing if it's 2-4 hours. What seems bad to me is if we're constantly updating the table and vacuum comes sweeping through and freezing everything to no purpose over and over again and then it gets un-frozen a few seconds or minutes later. Now maybe that's the wrong idea. After all, as a percentage, the overhead is the same either way, regardless of whether we're talking about WAL volume or CPU cycles. But somehow it feels worse to make the same mistakes every few minutes or potentially even tens of seconds than it does to make the same mistakes every few hours. The absolute cost is a lot higher. > On a positive note, I like that what you've laid out freezes eagerly > when an FPI won't result -- this much we can all agree on. I guess > that that part is becoming uncontroversial. I don't think that we're going to be able to get away with freezing rows in a small, frequently-updated table just because no FPI will result. I think Melanie's results show that the cost is not negligible. But Andres's pseudocode algorithm, although it is more aggressive in that case, doesn't necessarily seem bad to me, because it still has some ability to hold off freezing in such cases if our statistics show that it isn't working out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Annoying build warnings from latest Apple toolchain
Andres Freund writes: > On 2023-09-28 22:53:09 -0400, Tom Lane wrote: >> (Perhaps we should apply the above to HEAD alongside the meson.build fix, to >> get more test coverage?) > The macos animals BF seem to run Ventura, so I think it'd not really provide > additional coverage that CI and your manual testing already has. So probably > not worth it from that angle? My thought was that if it's in the tree we'd get testing from non-buildfarm sources. FWIW, I'm going to update sifaka/indri to Sonoma before too much longer (they're already using Xcode 15.0 which is the Sonoma toolchain). I recently pulled longfin up to Ventura, and plan to leave it on that for the next year or so. I don't think anyone else is running macOS animals. regards, tom lane
Re: Annoying build warnings from latest Apple toolchain
On Thu Sep 28, 2023 at 5:22 PM CDT, Andres Freund wrote: Hi, On 2023-09-28 16:46:08 -0400, Tom Lane wrote: > There's still one duplicate warning > from the backend link: > > ld: warning: ignoring duplicate libraries: '-lpam' > > I'm a bit baffled why that's showing up; there's no obvious > double reference to pam. I think it's because multiple libraries/binaries depend on it. Meson knows how to deduplicate libraries found via pkg-config (presumably because that has enough information for a topological sort), but apparently not when they're found as "raw" libraries. Until now that was also just pretty harmless, so I understand not doing anything about it. I see a way to avoid the warnings, but perhaps it's better to ask the meson folks to put in a generic way of dealing with this. I wonder if this Meson PR[0] will help. [0]: https://github.com/mesonbuild/meson/pull/12276 -- Tristan Partin Neon (https://neon.tech)
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 7:55 AM Robert Haas wrote: > On Thu, Sep 28, 2023 at 12:03 AM Peter Geoghegan wrote: > > But isn't the main problem *not* freezing when we could and > > should have? (Of course the cost of freezing is very relevant, but > > it's still secondary.) > > Perhaps this is all in how you look at it, but I don't see it this > way. It's easy to see how to solve the "not freezing" problem: just > freeze everything as often as possible. If that were cheap enough that > we could just do it, then we'd just do it and be done here. The > problem is that, at least in my opinion, that seems too expensive in > some cases. I'm starting to believe that those cases are narrower than > I once thought, but I think they do exist. So now, I'm thinking that > maybe the main problem is identifying when you've got such a case, so > that you know when you need to be less aggressive. Assuming your concern is more or less limited to those cases where the same page could be frozen an unbounded number of times (once or almost once per VACUUM), then I think we fully agree. We ought to converge on the right behavior over time, but it's particularly important that we never converge on the wrong behavior instead. > > Won't the algorithm that you've sketched always think that > > "unfreezing" pages doesn't affect recently frozen pages with such a > > workload? Isn't the definition of "recently frozen" that emerges from > > this algorithm not in any way related to the order delivery time, or > > anything like that? You know, rather like vacuum_freeze_min_age. > > FWIW, I agree that vacuum_freeze_min_age sucks. I have been reluctant > to endorse changes in this area mostly because I fear replacing one > bad idea with another, not because I think that what we have now is > particularly good. It's better to be wrong in the same way in every > release than to have every release be equally wrong but in a different > way. > > Also, I think the question of what "recently frozen" means is a good > one, but I'm not convinced that it ought to relate to the order > delivery time. I don't think it should, either. The TPC-C scenario is partly interesting because it isn't actually obvious what the most desirable behavior is, even assuming that you had perfect information, and were not subject to practical considerations about the complexity of your algorithm. There doesn't seem to be perfect clarity on what the goal should actually be in such scenarios -- it's not like the problem is just that we can't agree on the best way to accomplish those goals with this specific workload. If performance/efficiency and performance stability are directly in tension (as they sometimes are), how much do you want to prioritize one or the other? It's not an easy question to answer. It's a value judgement as much as anything else. > If we insert into a table and 12-14 hours go buy before > it's updated, it doesn't seem particularly bad to me if we froze that > data meanwhile (regardless of what metric drove that freezing). Same > thing if it's 2-4 hours. What seems bad to me is if we're constantly > updating the table and vacuum comes sweeping through and freezing > everything to no purpose over and over again and then it gets > un-frozen a few seconds or minutes later. Right -- we agree here. I even think that it makes sense to freeze pages knowing for sure that the pages will be unfrozen on that sort of timeline (at least with a large and ever growing table like this). 
It may technically be less efficient, but not when you consider how everything else is affected by the disruptive impact of freezing a great deal of stuff all at once. (Of course it's also true that we don't really know what will happen, which is all the more reason to freeze eagerly.) > Now maybe that's the wrong idea. After all, as a percentage, the > overhead is the same either way, regardless of whether we're talking > about WAL volume or CPU cycles. But somehow it feels worse to make the > same mistakes every few minutes or potentially even tens of seconds > than it does to make the same mistakes every few hours. The absolute > cost is a lot higher. I agree. Another problem with the algorithm that Andres sketched is that it supposes that vacuum_freeze_min_age means something relevant -- that's how we decide whether or not freezing should count as "opportunistic". But there really isn't that much difference between (say) an XID age of 25 million and 50 million. At least not with a table like the TPC-C tables, where VACUUMs are naturally big operations that take place relatively infrequently. Assume a default vacuum_freeze_min_age of 50 million. How can it make sense to deem freezing a page "opportunistic" when its oldest XID has only attained an age of 25 million, if the subsequent unfreezing happens when that same XID would have attained an age of 75 million, had we not frozen it? And if you agree that it doesn't make sense, how can we compensate for this effect, as
Re: Remove distprep
Hi, On 2023-08-23 12:46:45 +0200, Peter Eisentraut wrote: > Subject: [PATCH v4] Remove distprep > > A PostgreSQL release tarball contains a number of prebuilt files, in > particular files produced by bison, flex, perl, and well as html and > man documentation. We have done this consistent with established > practice at the time to not require these tools for building from a > tarball. Some of these tools were hard to get, or get the right > version of, from time to time, and shipping the prebuilt output was a > convenience to users. > > Now this has at least two problems: > > One, we have to make the build system(s) work in two modes: Building > from a git checkout and building from a tarball. This is pretty > complicated, but it works so far for autoconf/make. It does not > currently work for meson; you can currently only build with meson from > a git checkout. Making meson builds work from a tarball seems very > difficult or impossible. One particular problem is that since meson > requires a separate build directory, we cannot make the build update > files like gram.h in the source tree. So if you were to build from a > tarball and update gram.y, you will have a gram.h in the source tree > and one in the build tree, but the way things work is that the > compiler will always use the one in the source tree. So you cannot, > for example, make any gram.y changes when building from a tarball. > This seems impossible to fix in a non-horrible way. I think it might be possible to fix in a non-horrible way, just that the effort doing so could be much better spent on other things. It's maybe also worth mentioning that this does *not* work reliably with vpath make builds either... > The make maintainer-clean target, whose job it is to remove the > prebuilt files in addition to what make distclean does, is now just an > alias to make distprep. (In practice, it is probably obsolete given > that git clean is available.) FWIW, I find a "full clean" target useful to be sure that we don't produce "untracked" build products. Do a full build, then run "full clean", then see what remains. > 88 files changed, 169 insertions(+), 409 deletions(-) It might be worthwhile to split this into a bit smaller chunks, e.g. depending on perl, bison, flex, and then separately the various makefile bits that are all over the tree. Greetings, Andres Freund
Re: Annoying build warnings from latest Apple toolchain
I wrote: > Andres Freund writes: >> On 2023-09-28 16:46:08 -0400, Tom Lane wrote: >>> Well, it's only important on platforms where we can't restrict >>> libpq.so from exporting all symbols. I don't know how close we are >>> to deciding that such cases are no longer interesting to worry about. >>> Makefile.shlib seems to know how to do it everywhere except Windows, >>> and I imagine we know how to do it over in the MSVC scripts. >> Hm, then I'd argue that we don't need to care about it anymore. The meson >> build does the necessary magic on windows, as do the current msvc scripts. > If we take that argument seriously, then I'm inclined to adjust my > upthread patch for Makefile.global.in so that it removes the extra > inclusions of libpgport/libpgcommon everywhere, not only macOS. > The rationale would be that it's not worth worrying about ABI > stability details on any straggler platforms. Looking closer, it's only since v16 that we have export list support on all officially-supported platforms. Therefore, I think the prudent thing to do in the back branches is use the patch I posted before, to suppress the duplicate -l switches only on macOS. In HEAD, I propose we simplify life by doing it everywhere, as attached. regards, tom lane diff --git a/src/Makefile.global.in b/src/Makefile.global.in index 18240b5fef..7b66590801 100644 --- a/src/Makefile.global.in +++ b/src/Makefile.global.in @@ -589,19 +589,27 @@ endif libpq = -L$(libpq_builddir) -lpq # libpq_pgport is for use by client executables (not libraries) that use libpq. -# We force clients to pull symbols from the non-shared libraries libpgport -# and libpgcommon rather than pulling some libpgport symbols from libpq just -# because libpq uses those functions too. This makes applications less -# dependent on changes in libpq's usage of pgport (on platforms where we -# don't have symbol export control for libpq). To do this we link to -# pgport before libpq. This does cause duplicate -lpgport's to appear -# on client link lines, since that also appears in $(LIBS). +# We used to use this to force libpgport and libpgcommon to be linked before +# libpq, ensuring that clients would pull symbols from those libraries rather +# than possibly getting them from libpq (and thereby having an unwanted +# dependency on which symbols libpq uses). However, now that we can prevent +# libpq from exporting those symbols on all platforms of interest, we don't +# worry about that anymore. The previous technique resulted in duplicate +# libraries in link commands, since those libraries also appear in $(LIBS). +# Some platforms warn about that, so avoiding those warnings is now more +# important. Hence, $(libpq_pgport) is now equivalent to $(libpq), but we +# still recommend using it for client executables in case some other reason +# appears to handle them differently. +libpq_pgport = $(libpq) + # libpq_pgport_shlib is the same idea, but for use in client shared libraries. +# We need those clients to use the shlib variants. (Ideally, users of this +# macro would strip libpgport and libpgcommon from $(LIBS), but no harm is +# done if they don't, since they will have satisfied all their references +# from these libraries.) 
ifdef PGXS -libpq_pgport = -L$(libdir) -lpgcommon -lpgport $(libpq) libpq_pgport_shlib = -L$(libdir) -lpgcommon_shlib -lpgport_shlib $(libpq) else -libpq_pgport = -L$(top_builddir)/src/common -lpgcommon -L$(top_builddir)/src/port -lpgport $(libpq) libpq_pgport_shlib = -L$(top_builddir)/src/common -lpgcommon_shlib -L$(top_builddir)/src/port -lpgport_shlib $(libpq) endif
Re: [HACKERS] pg_upgrade failed with error - ERROR: column "a" in child table must be marked NOT NULL
On Fri, Sep 29, 2023 at 09:16:35AM +0900, Michael Paquier wrote:
> You mean when upgrading from an instance of 9.6 or older as c30f177 is
> not there, right?

No - while upgrading from v15 to v16. I'm not clear on how we upgraded *to*
v15 without hitting the issue, nor how the "not null" got dropped...

> Anyway, it seems like the patch from [1] has no need to run this check
> when the old cluster's version is 10 or newer. And perhaps it should
> mention that this check could be removed from pg_upgrade once v10
> support is out of scope, in the shape of a comment.

You're still thinking of PRIMARY KEY as the only way to hit this, right? But
Ali Akbar already pointed out how to reproduce the problem with DROP NOT NULL
- which still applies to both v16 and v17.
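For readers following along, a hedged sketch of the kind of DROP NOT NULL sequence being referred to (the exact reproduction is in Ali Akbar's message; table names here are made up, and the behavior claim is only a rough illustration of the shape of the problem):

    CREATE TABLE parent (a int PRIMARY KEY);
    CREATE TABLE child () INHERITS (parent);

    -- On releases that don't track the child's NOT NULL as inherited from
    -- the parent's primary key, this succeeds:
    ALTER TABLE child ALTER COLUMN a DROP NOT NULL;

    -- A later binary-upgrade style re-attach (as pg_dump emits for
    -- pg_upgrade) then fails with the error in the subject:
    --   ALTER TABLE ONLY child INHERIT parent;
    --   ERROR:  column "a" in child table must be marked NOT NULL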
Re: POC, WIP: OR-clause support for indexes
I'm sorry I didn't write for a long time, but I really had a very difficult month; now I'm fully back to work.

*I was able to implement the patches to the end and moved the transformation of "OR" expressions to ANY.* I haven't seen a big difference between them yet: one has the transformation before calculating selectivity (v7.1-Replace-OR-clause-to-ANY.patch), the other after (v7.2-Replace-OR-clause-to-ANY.patch). Regression tests are passing, I don't see any problems with selectivity, nothing has fallen into a coredump, but I found some incorrect transformations. I have not found the reason for these inaccuracies, but, to be honest, they look unusual. They gave the error below.

In the patch, I don't like that I had to drag three libraries from parsing until I found a way around it. The advantage of this approach compared to the other one ([1]) is that at this stage all possible OR transformations are performed, compared to the patch where the transformation was done at the parsing stage. That is, here, for example, there are such optimizations in the transformation: I took the common element out of the bracket and the rest is converted to ANY, so, as noted by Peter Geoghegan, we do not get several bitmapscans but only one scan through the array.

postgres=# explain analyze SELECT p1.oid, p1.proname
FROM pg_proc as p1
WHERE prolang = 13 AND prolang=1 OR prolang = 13 AND prolang = 2 OR prolang = 13 AND prolang = 3;
                                               QUERY PLAN
---
 Seq Scan on pg_proc p1  (cost=0.00..151.66 rows=1 width=68) (actual time=1.167..1.168 rows=0 loops=1)
   Filter: ((prolang = '13'::oid) AND (prolang = ANY (ARRAY['1'::oid, '2'::oid, '3'::oid])))
   Rows Removed by Filter: 3302
 Planning Time: 0.146 ms
 Execution Time: 1.191 ms
(5 rows)

*It falls into a coredump for me on:*

explain analyze SELECT p1.oid, p1.proname
FROM pg_proc as p1
WHERE prolang = 13 OR prolang = 2 AND prolang = 2 OR prolang = 13;

I continue trying to move the transformation of "OR" expressions to the optimization stage; unfortunately I have not been able to figure out the coredump yet, but I saw an important thing: it is already necessary to process RestrictInfo expressions here. I corrected it. To be honest, despite some significant advantages in the fact that we are already processing pre-converted "OR" expressions (logical transformations have been performed and duplicates have been removed), I have big doubts about this approach. We already have quite a lot of objects at this stage that can refer to the RestrictInfo variable in RelOptInfo, and updating these links can be costly for us. By the way, right now I suspect that the current coredump appeared precisely because there is a link somewhere that refers to an un-updated RestrictInfo, but so far I can't find this place.

The coredump occurs at the query execution stage and looks like this:

Core was generated by `postgres: alena regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x5565f3ec4947 in ExecInitExprRec (node=0x5565f530b290, state=0x5565f53383d8, resv=0x5565f53383e0, resnull=0x5565f53383dd) at execExpr.c:1331 1331 Expr *arg = (Expr *) lfirst(lc); (gdb) bt #0 0x5565f3ec4947 in ExecInitExprRec (node=0x5565f530b290, state=0x5565f53383d8, resv=0x5565f53383e0, resnull=0x5565f53383dd) at execExpr.c:1331 #1 0x5565f3ec2708 in ExecInitQual (qual=0x5565f531d950, parent=0x5565f5337948) at execExpr.c:258 #2 0x5565f3f2f080 in ExecInitSeqScan (node=0x5565f5309700, estate=0x5565f5337700, eflags=32) at nodeSeqscan.c:172 #3 0x5565f3ee70c9 in ExecInitNode (node=0x5565f5309700, estate=0x5565f5337700, eflags=32) at execProcnode.c:210 #4 0x5565f3edbe3a in InitPlan (queryDesc=0x5565f53372f0, eflags=32) at execMain.c:968 #5 0x5565f3edabe3 in standard_ExecutorStart (queryDesc=0x5565f53372f0, eflags=32) at execMain.c:266 #6 0x5565f3eda927 in ExecutorStart (queryDesc=0x5565f53372f0, eflags=0) at execMain.c:145 #7 0x5565f419921e in PortalStart (portal=0x5565f52ace90, params=0x0, eflags=0, snapshot=0x0) at pquery.c:517 #8 0x5565f4192635 in exec_simple_query ( query_string=0x5565f5233af0 "SELECT p1.oid, p1.proname\nFROM pg_proc as p1\nWHERE prolang = 13 AND (probin IS NULL OR probin = '' OR probin = '-');") at postgres.c:1233 #9 0x5565f41976ef in PostgresMain (dbname=0x5565f526ad10 "regression", username=0x5565f526acf8 "alena") at postgres.c:4652 #10 0x5565f40b8417 in BackendRun (port=0x5565f525f830) at postmaster.c:4439 #11 0x5565f40b7ca3 in BackendStartup (port=0x5565f525f830) at postmaster.c:4167 #12 0x5565f40b40f1 in ServerLoop () at postmaster.c:1781 #13 0x5565f40b
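For readers following along, the rewrite being discussed can be tried by hand today; the table, column, and index below are invented purely for illustration, and the exact plans will vary:

    -- A chain of equality ORs over one column ...
    CREATE TABLE t (a int, b int);
    CREATE INDEX ON t (a);
    EXPLAIN SELECT * FROM t WHERE a = 1 OR a = 2 OR a = 3;

    -- ... and the ScalarArrayOpExpr form the patch would produce instead.
    -- As noted above, the planner can satisfy this with a single scan over
    -- the array rather than several OR'd scans:
    EXPLAIN SELECT * FROM t WHERE a = ANY (ARRAY[1, 2, 3]);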
[DOCS] HOT - correct claim about indexes not referencing old line pointers
Hello, While working on my talk for PGConf.NYC next week I came across this bullet in the docs on heap only tuples: > Old versions of updated rows can be completely removed during normal > operation, including SELECTs, instead of requiring periodic vacuum > operations. (This is possible because indexes do not reference their page > item identifiers.) But when a HOT update happens the entry in an (logically unchanged) index still points to the original heap tid, and that line item is updated with a pointer to the new line pointer in the same page. Assuming I'm understanding this correctly, attached is a patch correcting the description. Thanks, James Coleman v1-0001-Correct-HOT-docs-to-account-for-LP_REDIRECT.patch Description: Binary data
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Fri, Sep 29, 2023 at 10:45 AM James Coleman wrote: > Hello, > > While working on my talk for PGConf.NYC next week I came across this > bullet in the docs on heap only tuples: > > > Old versions of updated rows can be completely removed during normal > > operation, including SELECTs, instead of requiring periodic vacuum > > operations. (This is possible because indexes do not reference their page > > item identifiers.) > > But when a HOT update happens the entry in an (logically unchanged) > index still points to the original heap tid, and that line item is > updated with a pointer to the new line pointer in the same page. > > Assuming I'm understanding this correctly, attached is a patch > correcting the description. > > I think we want to somehow distinguish between the old tuple that is the root of the chain and old tuples that are not. This comment refers to pruning the chain and removing intermediate links in the chain that are no longer relevant because the root has been updated to point to the live tuple. In README.HOT, tuple 2 in the example after 1 points to 3. https://github.com/postgres/postgres/blob/ec99d6e9c87a8ff0f4805cc0c6c12cbb89c48e06/src/backend/access/heap/README.HOT#L109 David J.
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Fri, Sep 29, 2023 at 10:39 AM James Coleman wrote: > > Old versions of updated rows can be completely removed during normal > > operation, including SELECTs, instead of requiring periodic vacuum > > operations. (This is possible because indexes do not reference their page > > item identifiers.) > > But when a HOT update happens the entry in an (logically unchanged) > index still points to the original heap tid, and that line item is > updated with a pointer to the new line pointer in the same page. It's true that the original root heap tuple (which is never a heap-only tuple) must have its line pointer changed from LP_NORMAL to LP_REDIRECT the first time pruning takes place that affects its HOT chain. But I don't think that referring to the root item as something along the lines of "an obsolescent/old tuple's line pointer" is particularly helpful. Changing from LP_NORMAL to LP_REDIRECT during the initial prune isn't terribly different from changing an existing LP_REDIRECT's redirect-TID link during every subsequent prune. In both cases you're just updating where the first heap-only tuple begins. The really important point is that the TID (which maps to the root item of the HOT chain) has a decent chance of being stable over time, no matter how many versions the HOT chain churns through. And that that can break (or at least weaken) our dependence on VACUUM with some workloads. -- Peter Geoghegan
CREATE EXTENSION forces a library initialization - is it a bug?
Hi I had to fix plpgsql_check issue https://github.com/okbob/plpgsql_check/issues/155 The problem is the execution of _PG_init() at CREATE EXTENSION time. It is a problem for any extension that uses the plpgsql debug API, because it is quietly activated. Is it necessary? Regards Pavel
Re: CREATE EXTENSION forces a library initialization - is it a bug?
Pavel Stehule writes: > I had to fix plpgsql_check issue > https://github.com/okbob/plpgsql_check/issues/155 > The problem is in execution of _PG_init() in CREATE EXTENSION time. > It is a problem for any extension that uses plpgsql debug API, because it > is quietly activated. > Is it necessary? Yes, I think so. If the extension has any C functions, then when its script executes those CREATE FUNCTION commands then the underlying library will be loaded (so we can check that the library is loadable and the functions really exist). That's always happened and I do not think it is negotiable. regards, tom lane
Re: CREATE EXTENSION forces a library initialization - is it a bug?
pá 29. 9. 2023 v 20:14 odesílatel Tom Lane napsal: > Pavel Stehule writes: > > I had to fix plpgsql_check issue > > https://github.com/okbob/plpgsql_check/issues/155 > > > The problem is in execution of _PG_init() in CREATE EXTENSION time. > > > It is a problem for any extension that uses plpgsql debug API, because it > > is quietly activated. > > > Is it necessary? > > Yes, I think so. If the extension has any C functions, then when its > script executes those CREATE FUNCTION commands then the underlying > library will be loaded (so we can check that the library is loadable > and the functions really exist). That's always happened and I do not > think it is negotiable. > ok thank you for info > > regards, tom lane >
Re: Fix incorrect comment reference
On Mon, Jan 23, 2023 at 06:42:45PM -0500, James Coleman wrote: > On Mon, Jan 23, 2023 at 4:07 PM James Coleman wrote: > > > > On Mon, Jan 23, 2023 at 3:41 PM Robert Haas wrote: > > > > > > On Mon, Jan 23, 2023 at 3:19 PM James Coleman wrote: > > > > On Mon, Jan 23, 2023 at 1:26 PM Robert Haas > > > > wrote: > > > > > On Mon, Jan 23, 2023 at 8:31 AM James Coleman > > > > > wrote: > > > > > > See the attached for a simple comment fix -- the referenced > > > > > > generate_useful_gather_paths call isn't in grouping_planner it's in > > > > > > apply_scanjoin_target_to_paths. > > > > > > > > > > The intended reading of the comment is not clear. Is it telling you to > > > > > look at grouping_planner because that's where we > > > > > generate_useful_gather_paths, or is it telling you to look there to > > > > > see how we get the final target list together? If it's the former, > > > > > then your fix is correct. If the latter, it's fine as it is. > > > > > > > > > > The real answer is probably that some years ago both things happened > > > > > in that function. We've moved on from there, but I'm still not sure > > > > > what the most useful phrasing of the comment is. > > > > > > > > Yeah, almost certainly, and the comments just didn't keep up. > > > > > > > > Would you prefer something that notes both that the broader concern is > > > > happening via the grouping_planner() stage but still points to the > > > > proper callsite (so that people don't go looking for that confused)? > > > > > > I don't really have a strong view on what the best thing to do is. I > > > was just pointing out that the comment might not be quite so obviously > > > wrong as you were supposing. > > > > "Wrong" is certainly too strong; my apologies. > > > > I'm really just hoping to improve it for future readers to save them > > some confusion I had initially reading it. > > Updated patch attached. Patch applied. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:57 AM Peter Geoghegan wrote: > Assuming your concern is more or less limited to those cases where the > same page could be frozen an unbounded number of times (once or almost > once per VACUUM), then I think we fully agree. We ought to converge on > the right behavior over time, but it's particularly important that we > never converge on the wrong behavior instead. I think that more or less matches my current thinking on the subject. A caveat might be: If it were once per two vacuums rather than once per vacuum, that might still be an issue. But I agree with the idea that the case that matters is *repeated* wasteful freezing. I don't think freezing is expensive enough that individual instances of mistaken freezing are worth getting too stressed about, but as you say, the overall pattern does matter. > The TPC-C scenario is partly interesting because it isn't actually > obvious what the most desirable behavior is, even assuming that you > had perfect information, and were not subject to practical > considerations about the complexity of your algorithm. There doesn't > seem to be perfect clarity on what the goal should actually be in such > scenarios -- it's not like the problem is just that we can't agree on > the best way to accomplish those goals with this specific workload. > > If performance/efficiency and performance stability are directly in > tension (as they sometimes are), how much do you want to prioritize > one or the other? It's not an easy question to answer. It's a value > judgement as much as anything else. I think that's true. For me, the issue is what a user is practically likely to notice and care about. I submit that on a not-particularly-busy system, it would probably be fine to freeze aggressively in almost every situation, because you're only incurring costs you can afford to pay. On a busy system, it's more important to be right, or at least not too badly wrong. But even on a busy system, I think that when the time between data being written and being frozen is more than a few tens of minutes, it's very doubtful that anyone is going to notice the contribution that freezing makes to the overall workload. They're much more likely to notice an annoying autovacuum than they are to notice a bit of excess freezing that ends up getting reversed. But when you start cranking the time between writing data and freezing it down into the single-digit numbers of minutes, and even more if you push down to tens of seconds or less, now I think people are going to care more about useless freezing work than about long-term autovacuum risks. Because now their database is really busy so they care a lot about performance, and seemingly most of the data involved is ephemeral anyway. > Even if you're willing to assume that vacuum_freeze_min_age isn't just > an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age > is applied by VACUUM, at the point that it scans pages. If VACUUM were > infinitely fast, and new VACUUMs were launched constantly, then > vacuum_freeze_min_age (and this bucketing scheme) might make more > sense. But, you know, they're not. So whether or not VACUUM (with > Andres' algorithm) deems a page that it has frozen to have been > opportunistically frozen or not is greatly influenced by factors that > couldn't possibly be relevant. 
I'm not totally sure that I'm understanding what you're concerned about here, but I *think* that the issue you're worried about here is: if we have various rules that can cause freezing, let's say X Y and Z, and we adjust the aggressiveness of rule X based on the performance of rule Y, that would be stupid and might suck. Assuming that the previous sentence is a correct framing, let's take X to be "freezing based on the page LSN age" and Y to be "freezing based on vacuum_freeze_min_age". I think the problem scenario here would be if it turns out that, under some set of circumstances, Y freezes more aggressively than X. For example, suppose the user runs VACUUM FREEZE, effectively setting vacuum_freeze_min_age=0 for that operation. If the table is being modified at all, it's likely to suffer a bunch of unfreezing right afterward, which could cause us to decide to make future vacuums freeze less aggressively. That's not necessarily what we want, because evidently the user, at least at that moment in time, thought that previous freezing hadn't been aggressive enough. They might be surprised to find that flash-freezing the table inhibited future automatic freezing. Or suppose that they just have a very high XID consumption rate compared to the rate of modifications to this particular table, such that criteria related to vacuum_freeze_min_age tend to be satisfied a lot, and thus vacuums tend to freeze a lot no matter what the page LSN age is. This scenario actually doesn't seem like a problem, though. In this case the freezing criterion based on page LSN age is already not getting used, so it doesn't really matter wheth
Re: Fix incorrect comment reference
On Fri, Sep 29, 2023 at 2:26 PM Bruce Momjian wrote: > > On Mon, Jan 23, 2023 at 06:42:45PM -0500, James Coleman wrote: > > On Mon, Jan 23, 2023 at 4:07 PM James Coleman wrote: > > > > > > On Mon, Jan 23, 2023 at 3:41 PM Robert Haas wrote: > > > > > > > > On Mon, Jan 23, 2023 at 3:19 PM James Coleman wrote: > > > > > On Mon, Jan 23, 2023 at 1:26 PM Robert Haas > > > > > wrote: > > > > > > On Mon, Jan 23, 2023 at 8:31 AM James Coleman > > > > > > wrote: > > > > > > > See the attached for a simple comment fix -- the referenced > > > > > > > generate_useful_gather_paths call isn't in grouping_planner it's > > > > > > > in > > > > > > > apply_scanjoin_target_to_paths. > > > > > > > > > > > > The intended reading of the comment is not clear. Is it telling you > > > > > > to > > > > > > look at grouping_planner because that's where we > > > > > > generate_useful_gather_paths, or is it telling you to look there to > > > > > > see how we get the final target list together? If it's the former, > > > > > > then your fix is correct. If the latter, it's fine as it is. > > > > > > > > > > > > The real answer is probably that some years ago both things happened > > > > > > in that function. We've moved on from there, but I'm still not sure > > > > > > what the most useful phrasing of the comment is. > > > > > > > > > > Yeah, almost certainly, and the comments just didn't keep up. > > > > > > > > > > Would you prefer something that notes both that the broader concern is > > > > > happening via the grouping_planner() stage but still points to the > > > > > proper callsite (so that people don't go looking for that confused)? > > > > > > > > I don't really have a strong view on what the best thing to do is. I > > > > was just pointing out that the comment might not be quite so obviously > > > > wrong as you were supposing. > > > > > > "Wrong" is certainly too strong; my apologies. > > > > > > I'm really just hoping to improve it for future readers to save them > > > some confusion I had initially reading it. > > > > Updated patch attached. > > Patch applied. Thanks! Regards, James Coleman
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Fri, Sep 29, 2023 at 11:04 AM Peter Geoghegan wrote: > > But when a HOT update happens the entry in an (logically unchanged) > > index still points to the original heap tid, and that line item is > > updated with a pointer to the new line pointer in the same page. > > It's true that the original root heap tuple (which is never a > heap-only tuple) must have its line pointer changed from LP_NORMAL to > LP_REDIRECT the first time pruning takes place that affects its HOT > chain. But I don't think that referring to the root item as something > along the lines of "an obsolescent/old tuple's line pointer" is > particularly helpful. To be clear, the existing wording seems correct to me. Even heap-only tuples require line pointers. These line pointers are strictly guaranteed to never be *directly* referenced from indexes (if they are then we'll actually detect it and report data corruption on recent versions). The name "heap-only tuple" quite literally means that the tuple and its line pointer are only represented in the heap, and never in indexes. There is a related rule about what is allowed to happen to any heap-only tuple's line pointer: it can only change from LP_NORMAL to LP_UNUSED (never LP_DEAD or LP_REDIRECT). You can think of a heap-only tuple as "skipping the LP_DEAD step" that regular heap tuples must go through. We don't need LP_DEAD tombstones precisely because there cannot possibly be any references to heap-only tuples in indexes -- so we can't break index scans by going straight from LP_NORMAL to LP_UNUSED for heap-only tuple line pointers. -- Peter Geoghegan
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Fri, Sep 29, 2023 at 2:39 PM Peter Geoghegan wrote: > > On Fri, Sep 29, 2023 at 11:04 AM Peter Geoghegan wrote: > > > But when a HOT update happens the entry in an (logically unchanged) > > > index still points to the original heap tid, and that line item is > > > updated with a pointer to the new line pointer in the same page. > > > > It's true that the original root heap tuple (which is never a > > heap-only tuple) must have its line pointer changed from LP_NORMAL to > > LP_REDIRECT the first time pruning takes place that affects its HOT > > chain. But I don't think that referring to the root item as something > > along the lines of "an obsolescent/old tuple's line pointer" is > > particularly helpful. > > To be clear, the existing wording seems correct to me. Even heap-only > tuples require line pointers. These line pointers are strictly > guaranteed to never be *directly* referenced from indexes (if they are > then we'll actually detect it and report data corruption on recent > versions). The name "heap-only tuple" quite literally means that the > tuple and its line pointer are only represented in the heap, and never > in indexes. Hmm, to my reading the issue is that "old versions" doesn't say anything about "old HOT versions"; it seems to be describing what happens generally when a heap-only tuple is written -- which would include the first time a heap-only tuple is written. And when it's the first heap-only tuple the "old version" would be the original version, which would not be a heap-only tuple. I can work up a tweaked version of the patch that shows there are two paths here (original tuple is being updated versus an intermediate heap-only tuple is being updated); would you be willing to consider that? Thanks, James Coleman
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Fri, Sep 29, 2023 at 11:45 AM James Coleman wrote: > Hmm, to my reading the issue is that "old versions" doesn't say > anything about "old HOT versions"; it seems to be describing what > happens generally when a heap-only tuple is written -- which would > include the first time a heap-only tuple is written. I think that it's talking about what happens during opportunistic pruning, in particular what happens to HOT chains. (Though pruning does almost the same amount of useful work with non-heap-only tuples, so it's a bit unfortunate that the name "HOT pruning" seems to have stuck.) > And when it's the > first heap-only tuple the "old version" would be the original version, > which would not be a heap-only tuple. The docs say "Old versions of updated rows can be completely removed during normal operation". Opportunistic pruning removes dead heap-only tuples completely, and makes their line pointers LP_UNUSED right away. But it can also entail removing storage for the original root item heap tuple, and making its line pointer LP_REDIRECT right away (not LP_DEAD or LP_UNUSED) at most once in the life of each HOT chain. So yeah, we're not quite limited to removing storage for heap-only tuples when pruning a HOT chain. Does that distinction really matter, though? There isn't even any special case handling for it in pruneheap.c (we only have assertions that make sure that we're performing "valid transitions" for each tuple/line pointer). That is, we don't really care about the difference between calling ItemIdSetRedirect() for an LP_NORMAL item versus an existing LP_REDIRECT item at the code level (we just do it and let PageRepairFragmentation() clean things up). -- Peter Geoghegan
Re: Allow deleting enumerated values from an existing enumerated data type
On 2023-09-28 Th 14:46, Tom Lane wrote: > Andrew Dunstan writes: >> I wonder if we could have a boolean flag in pg_enum, indicating that >> setting an enum to that value was forbidden. > Yeah, but that still offers no coherent solution to the problem of what > happens if there's a table that already contains such a value. It doesn't > seem terribly useful to forbid new entries if you can't get rid of old ones. > Admittedly, a DISABLE flag would at least offer a chance at a > race-condition-free scan to verify that no such values remain in tables. > But as somebody already mentioned upthread, that wouldn't guarantee that > the value doesn't appear in non-leaf index pages. So basically you could > never get rid of the pg_enum row, short of a full dump and restore. or a reindex, I think, although getting the timing right would be messy. I agree the non-leaf index pages are rather pesky in dealing with this. I guess the alternative would be to create a new enum with the to-be-deleted value missing, and then alter the column type to the new enum type. For massive tables that would be painful. > We went through all these points years ago when the enum feature was > first developed, as I recall. Nobody thought that the ability to remove > an enum value was worth the amount of complexity it'd entail. That's quite true, and I accept my part in this history. But I'm not sure we were correct back then. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
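(To make the alternative mentioned above concrete, here is a rough sketch; the type, table, and column names are invented for the example, and on a large table the ALTER ... TYPE step is where the pain is:)

    CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy', 'obsolete');
    CREATE TABLE t (current_mood mood);

    -- Later, to retire 'obsolete':
    CREATE TYPE mood_new AS ENUM ('sad', 'ok', 'happy');
    ALTER TABLE t
        ALTER COLUMN current_mood TYPE mood_new
        USING current_mood::text::mood_new;  -- errors out if any row still holds 'obsolete'
    DROP TYPE mood;
    ALTER TYPE mood_new RENAME TO mood;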
Re: SHARED locks barging behaviour
On Tue, Jan 17, 2023 at 12:18:28PM -0500, Arul Ajmani wrote: > I'm trying to better understand the following barging behaviour with SHARED > locks. ... > Given there is a transaction waiting to acquire a FOR UPDATE lock, I was > surprised to see the second FOR SHARE transaction return immediately instead > of > waiting. I have two questions: > > 1) Could this barging behaviour potentially starve out the transaction waiting > to acquire the FOR UPDATE lock, if there is a continuous queue of transactions > that acquire a FOR SHARE lock briefly? Yes, see below. > 2) Assuming this is by design, I couldn't find (in code) where this explicit > policy choice is made. I was looking around LockAcquireExtended, but it seems > like the decision is made above this layer. Could someone more familiar with > this code point me at the right place? I know this is from January, but I do have an answer. First, looking at parser/gram.y, I see: | FOR SHARE { $$ = LCS_FORSHARE; } Looking for LCS_FORSHARE, I see in optimizer/plan/planner.c: case LCS_FORSHARE: return ROW_MARK_SHARE; Looking for ROW_MARK_SHARE, I see in executor/nodeLockRows.c: case ROW_MARK_SHARE: lockmode = LockTupleShare; Looking for LockTupleShare, I see in access/heap/heapam.c: else if (mode == LockTupleShare) { /* * If we're requesting Share, we can similarly avoid sleeping if * there's no update and no exclusive lock present. */ if (HEAP_XMAX_IS_LOCKED_ONLY(infomask) && !HEAP_XMAX_IS_EXCL_LOCKED(infomask)) { LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); /* * Make sure it's still an appropriate lock, else start over. * See above about allowing xmax to change. */ if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_data->t_infomask) || HEAP_XMAX_IS_EXCL_LOCKED(tuple->t_data->t_infomask)) goto l3; require_sleep = false; } } and this is basically saying that if the row is locked (HEAP_XMAX_IS_LOCKED_ONLY), but not exclusively locked (!HEAP_XMAX_IS_EXCL_LOCKED), then there is no need to sleep waiting for the lock. I hope that helps. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
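For anyone who would rather reproduce the behaviour than trace the code, a minimal sketch with three concurrent psql sessions (the table and values are invented for the example):

    -- setup
    CREATE TABLE accounts (id int PRIMARY KEY, balance int);
    INSERT INTO accounts VALUES (1, 100);

    -- session 1
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR SHARE;   -- locks the row in share mode

    -- session 2
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR UPDATE;  -- blocks behind session 1

    -- session 3
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR SHARE;   -- returns immediately, "barging"
                                                     -- ahead of the waiting FOR UPDATE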
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas wrote: > > Even if you're willing to assume that vacuum_freeze_min_age isn't just > > an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age > > is applied by VACUUM, at the point that it scans pages. If VACUUM were > > infinitely fast, and new VACUUMs were launched constantly, then > > vacuum_freeze_min_age (and this bucketing scheme) might make more > > sense. But, you know, they're not. So whether or not VACUUM (with > > Andres' algorithm) deems a page that it has frozen to have been > > opportunistically frozen or not is greatly influenced by factors that > > couldn't possibly be relevant. > > I'm not totally sure that I'm understanding what you're concerned > about here, but I *think* that the issue you're worried about here is: > if we have various rules that can cause freezing, let's say X Y and Z, > and we adjust the aggressiveness of rule X based on the performance of > rule Y, that would be stupid and might suck. Your summary is pretty close. There are a couple of specific nuances to it, though: 1. Anything that uses XID age or even LSN age necessarily depends on when VACUUM shows up, which itself depends on many other random things. With small to medium sized tables that don't really grow, it's perhaps reasonable to expect this to not matter. But with tables like the TPC-C order/order lines table, or even pgbench_history, the next VACUUM operation will reliably be significantly longer and more expensive than the last one, forever (ignoring the influence of aggressive mode, and assuming typical autovacuum settings). So VACUUMs get bigger and less frequent as the table grows. As the table continues to grow, at some point we reach a stage where many XIDs encountered by VACUUM will be significantly older than vacuum_freeze_min_age, while others will be significantly younger. And so whether we apply the vacuum_freeze_min_age rule (or some other age based rule) is increasingly a matter of random happenstance (i.e. is more and more due to when VACUUM happens to show up), and has less and less to do with what the workload signals we should do. This is a moving target, but (if I'm not mistaken) under the scheme described by Andres we're not even trying to compensate for that. Separately, I have a practical concern: 2. It'll be very hard to independently track the effectiveness of rules X, Y, and Z as a practical matter, because the application of each rule quite naturally influences the application of every other rule over time. They simply aren't independent things in any practical sense. Even if this wasn't an issue, I can't think of a reasonable cost model. Is it good or bad if "opportunistic freezing" results in unfreezing 50% of the time? AFAICT that's an *extremely* complicated question. You cannot just interpolate from the 0% case (definitely good) and the 100% case (definitely bad) and expect to get a sensible answer. You can't split the difference -- even if we allow ourselves to ignore tricky value judgement type questions. > Assuming that the previous sentence is a correct framing, let's take X > to be "freezing based on the page LSN age" and Y to be "freezing based > on vacuum_freeze_min_age". I think the problem scenario here would be > if it turns out that, under some set of circumstances, Y freezes more > aggressively than X. For example, suppose the user runs VACUUM FREEZE, > effectively setting vacuum_freeze_min_age=0 for that operation. 
If the > table is being modified at all, it's likely to suffer a bunch of > unfreezing right afterward, which could cause us to decide to make > future vacuums freeze less aggressively. That's not necessarily what > we want, because evidently the user, at least at that moment in time, > thought that previous freezing hadn't been aggressive enough. They > might be surprised to find that flash-freezing the table inhibited > future automatic freezing. I didn't think of that one myself, but it's a great example. > Or suppose that they just have a very high XID consumption rate > compared to the rate of modifications to this particular table, such > that criteria related to vacuum_freeze_min_age tend to be satisfied a > lot, and thus vacuums tend to freeze a lot no matter what the page LSN > age is. This scenario actually doesn't seem like a problem, though. In > this case the freezing criterion based on page LSN age is already not > getting used, so it doesn't really matter whether we tune it up or > down or whatever. It would have to be a smaller table, which I'm relatively unconcerned about. > > Okay then. I guess it's more accurate to say that we'll have a strong > > bias in the direction of freezing when an FPI won't result, though not > > an infinitely strong bias. We'll at least have something that can be > > thought of as an improved version of the FPI thing for 17, I think -- > > which is definitely significant progress. > > I do kind of wonder whether we're going to care about
Re: how to manage Cirrus on personal repository
On Sat, Sep 30, 2023 at 3:35 AM Nazir Bilal Yavuz wrote: > On Fri, 29 Sept 2023 at 13:24, Peter Eisentraut wrote: > > I have a private repository on GitHub where I park PostgreSQL > > development work. I also have Cirrus activated on that repository, to > > build those branches, using the existing Cirrus configuration. > > > > Now under the new system of limited credits that started in September, > > this blew through the credits about half-way through the month. [Replying to wrong person because I never saw Peter's original email, something must be jammed in the intertubes...] It's annoying, but on the bright side... if you're making it halfway through the month, I think that means there's a chance you'd have enough credit if we can depessimise the known problems with TAP query execution[1], and there are some more obviously stupid things too (sleep/poll for replication progress where PostgreSQL should offer an event-based wait-for-replay, running all the tests when only docs changed, running the regression tests fewer times, ...). [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJoEO33K%3DZynsH%3DxkiEyfBMZjOoqBK%2BgouBdTGW2-woig%40mail.gmail.com
Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)
Tomas Vondra writes: > On 9/27/23 15:38, Tom Lane wrote: >> Tomas Vondra writes: >>> Hmm, yeah. Which FreeBSD image did you install? armv7 or aarch64? >> https://download.freebsd.org/releases/arm64/aarch64/ISO-IMAGES/14.0/FreeBSD-14.0-BETA3-arm64-aarch64-RPI.img.xz > Thanks, that's the image I've used. This is really strange ... I've now laid my hands on a Pi 4B, and with that exact same SD card plugged in, I get the same results I did with the 3B+: pltcl regression tests pass, and so does the manual check with tclsh8.[67]. So it seems like the "different CPU" theory doesn't survive contact with reality either. I'm completely baffled, but I do notice that "clock scan" without a -format option is deprecated according to the Tcl man page. Maybe we should stop relying on deprecated behavior and put in a -format option? regards, tom lane
Re: document the need to analyze partitioned tables
On Wed, Sep 6, 2023 at 05:53:56AM +0200, Laurenz Albe wrote: > > We may have different mental models here. This relates to the part > > that I wasn't keen on in your patch, i.e: > > > > + The partitions of a partitioned table are normal tables and get > > processed > > + by autovacuum > > > > While I agree that the majority of partitions are likely to be > > relkind='r', which you might ordinarily consider a "normal table", you > > just might change your mind when you try to INSERT or UPDATE records > > that would violate the partition constraint. Some partitions might > > also be themselves partitioned tables and others might be foreign > > tables. That does not really matter much when it comes to what > > autovacuum does or does not do, but I'm not really keen to imply in > > our documents that partitions are "normal tables". > > Agreed, there are differences between partitions and normal tables. > And this is not the place in the documentation where we would like to > get into detail about the differences. > > Attached is the next version of my patch. I adjusted your patch to be shorter and clearer, patch attached. I am planning to apply this back to PG 11. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 9cf9d030a8..70b5576037 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -861,10 +861,16 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu -Partitioned tables are not processed by autovacuum. Statistics -should be collected by running a manual ANALYZE when it is -first populated, and again whenever the distribution of data in its -partitions changes significantly. +Partitioned tables do not directly store tuples and consequently do not +require autovacuum to perform automated VACUUM. +(Autovacuum does perform VACUUM on table +partitions just like other tables.) Unfortunately, this also means +that autovacuum doesn't run ANALYZE on partitioned +tables, and this can cause suboptimal plans for queries that reference +partitioned table statistics. You can work around this problem by +manually running ANALYZE on partitioned tables +when they are first populated, and again whenever the distribution +of data in their partitions changes significantly.
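As a concrete illustration of the recommendation in that paragraph (object names invented for the example), the manual ANALYZE and a quick check that it ran might look like:

    CREATE TABLE measurements (ts timestamptz, val numeric) PARTITION BY RANGE (ts);
    CREATE TABLE measurements_2023 PARTITION OF measurements
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

    -- ... load data ...

    ANALYZE measurements;   -- collects statistics for the partitioned (parent) table

    SELECT relname, last_analyze, last_autoanalyze
    FROM pg_stat_all_tables
    WHERE relname IN ('measurements', 'measurements_2023');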
Re: document the need to analyze partitioned tables
On Fri, 2023-09-29 at 18:08 -0400, Bruce Momjian wrote: > On Wed, Sep 6, 2023 at 05:53:56AM +0200, Laurenz Albe wrote: > > > We may have different mental models here. This relates to the part > > > that I wasn't keen on in your patch, i.e: > > > > > > + The partitions of a partitioned table are normal tables and get > > > processed > > > + by autovacuum > > > > > > While I agree that the majority of partitions are likely to be > > > relkind='r', which you might ordinarily consider a "normal table", you > > > just might change your mind when you try to INSERT or UPDATE records > > > that would violate the partition constraint. Some partitions might > > > also be themselves partitioned tables and others might be foreign > > > tables. That does not really matter much when it comes to what > > > autovacuum does or does not do, but I'm not really keen to imply in > > > our documents that partitions are "normal tables". > > > > Agreed, there are differences between partitions and normal tables. > > And this is not the place in the documentation where we would like to > > get into detail about the differences. > > > > Attached is the next version of my patch. > > I adjusted your patch to be shorter and clearer, patch attached. I am > planning to apply this back to PG 11. Thanks for looking at this. I am mostly fine with your version, but it does not directly state that autovacuum does not process partitioned tables. I think this should be clarified in the beginning. Yours, Laurenz Albe
Re: SHARED locks barging behaviour
On Fri, 2023-09-29 at 17:45 -0400, Bruce Momjian wrote: > On Tue, Jan 17, 2023 at 12:18:28PM -0500, Arul Ajmani wrote: > > I'm trying to better understand the following barging behaviour with SHARED > > locks. > ... > > Given there is a transaction waiting to acquire a FOR UPDATE lock, I was > > surprised to see the second FOR SHARE transaction return immediately > > instead of > > waiting. I have two questions: > > > > 1) Could this barging behaviour potentially starve out the transaction > > waiting > > to acquire the FOR UPDATE lock, if there is a continuous queue of > > transactions > > that acquire a FOR SHARE lock briefly? > > Yes, see below. > > > 2) Assuming this is by design, I couldn't find (in code) where this explicit > > policy choice is made. I was looking around LockAcquireExtended, but it > > seems > > like the decision is made above this layer. Could someone more familiar with > > this code point me at the right place? > > I know this from January, but I do have an answer. [...] You answer the question where this is implemented. But the more important question is whether this is intentional. This code was added by 0ac5ad5134f (introducing FOR KEY SHARE and FOR NO KEY UPDATE). My feeling is that it is not intentional that a continuous stream of share row locks can starve out an exclusive row lock, since PostgreSQL behaves differently with other locks. On the other hand, if nobody has complained about it in these ten years, perhaps it is just fine the way it is, if by design or not. Yours, Laurenz Albe
Re: how to manage Cirrus on personal repository
On 9/29/23 18:02, Thomas Munro wrote: On Sat, Sep 30, 2023 at 3:35 AM Nazir Bilal Yavuz wrote: On Fri, 29 Sept 2023 at 13:24, Peter Eisentraut wrote: I have a private repository on GitHub where I park PostgreSQL development work. I also have Cirrus activated on that repository, to build those branches, using the existing Cirrus configuration. Now under the new system of limited credits that started in September, this blew through the credits about half-way through the month. [Replying to wrong person because I never saw Peter's original email, something must be jammed in the intertubes...] It's annoying, but on the bright side... if you're making it halfway through the month, I think that means there's a chance you'd have enough credit if we can depessimise the known problems with TAP query execution[1], and there are some more obviously stupid things too (sleep/poll for replication progress where PostgreSQL should offer an event-based wait-for-replay, running all the tests when only docs changed, running the regression tests fewer times, ...). I also have Cirrus setup for my cloned Postgres repo and it is annoying that it will run CI whenever I push up new commits that I pulled from git.p.o. My strategy for this (on other projects) is to require branches that need CI to end in -ci. Since I use multiple CI services, I further allow -cic (Cirrus), -cig (Github Actions), etc. PRs are also included. That way, nothing runs through CI unless I want it to. Here's an example of how this works on Cirrus: # Build the branch if it is a pull request, or ends in -ci/-cic (-cic targets only Cirrus CI) only_if: $CIRRUS_PR != '' || $CIRRUS_BRANCH =~ '.*-ci$' || $CIRRUS_BRANCH =~ '.*-cic$' Regards, -David
Re: False "pg_serial": apparent wraparound” in logs
> I don't really understand what exactly the problem is, or how this fixes > it. But this doesn't feel right: As the repro shows, false reports of "pg_serial": apparent wraparound" messages are possible. For a very busy system which checkpoints frequently and makes heavy use of serializable isolation, this will flood the error logs and falsely cause alarm to the user. It also prevents the SLRU from being truncated. In my repro, I end up seeing the following, even though the SLRU does not wrap around: "LOG: could not truncate directory "pg_serial": apparent wraparound" > Firstly, isn't headPage == 0 also a valid value? We initialize headPage > to -1 when it's not in use. Yes. You are correct. This is wrong. > Secondly, shouldn't we set it to the page corresponding to headXid > rather than tailXid. > Thirdly, I don't think this code should have any business setting > latest_page_number directly. latest_page_number is set in > SimpleLruZeroPage(). Correct, after checking again, I do realize the patch is wrong. > Are we missing a call to SimpleLruZeroPage() somewhere? That is a good point. The initial idea was to advance the latest_page_number during SerialSetActiveSerXmin, but that approach is obviously wrong. When SerialSetActiveSerXmin is called for a new active serializable xmin, and at that point we don't need to keep any earlier transactions, should SimpleLruZeroPage be called to ensure there is a target page for the xid? I tried something like below, which fixes my repro, by calling SimpleLruZeroPage at the end of SerialSetActiveSerXmin. @@ -953,6 +953,8 @@ SerialGetMinConflictCommitSeqNo(TransactionId xid) static void SerialSetActiveSerXmin(TransactionId xid) { + int targetPage = SerialPage(xid); + LWLockAcquire(SerialSLRULock, LW_EXCLUSIVE); /* @@ -992,6 +994,9 @@ SerialSetActiveSerXmin(TransactionId xid) serialControl->tailXid = xid; + if (serialControl->headPage != targetPage) + SimpleLruZeroPage(SerialSlruCtl, targetPage); + LWLockRelease(SerialSLRULock); } Regards, Sami
Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)
Does the image lack a /etc/localtime file/link, but perhaps one of you did something to create it? This came up with the CI image: https://www.postgresql.org/message-id/flat/20230731191510.pebqeiuo2sbmlcfh%40awork3.anarazel.de Also mentioned at: https://wiki.tcl-lang.org/page/clock+scan
Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)
Thomas Munro writes: > Does the image lack a /etc/localtime file/link, but perhaps one of you > did something to create it? Hah! I thought it had to be some sort of locale effect, but I failed to think of that as a contributor :-(. My installation does have /etc/localtime, and removing it duplicates Tomas' syndrome. I also find that if I add "-gmt 1" to the clock invocation, it's happy with or without /etc/localtime. So I think we should modify the test case to use that to reduce its environmental sensitivity. Will go make it so. regards, tom lane
Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)
Thomas Munro writes: > This came up with the CI image: > https://www.postgresql.org/message-id/flat/20230731191510.pebqeiuo2sbmlcfh%40awork3.anarazel.de BTW, after re-reading that thread, I think the significant difference is that these FreeBSD images don't force you to select a timezone during setup, unlike what I recall seeing when installing x86_64 FreeBSD. You're not forced to run bsdconfig at all, and even if you do it doesn't make you enter the sub-menu where you can pick a timezone. I recall that I did do that while setting mine up, but I'll bet Tomas skipped it. I'm not sure at this point whether FreeBSD changed behavior since 13.x, or this is a difference between their preferred installation processes for x86 vs. ARM. But in any case, it's clearly easier to get into the no-/etc/localtime state with these systems than I thought before. regards, tom lane
Re: Eager page freeze criteria clarification
On Fri, Sep 29, 2023 at 11:27 AM Robert Haas wrote: > I think that's true. For me, the issue is what a user is practically > likely to notice and care about. I submit that on a > not-particularly-busy system, it would probably be fine to freeze > aggressively in almost every situation, because you're only incurring > costs you can afford to pay. On a busy system, it's more important to > be right, or at least not too badly wrong. But even on a busy system, > I think that when the time between data being written and being frozen > is more than a few tens of minutes, it's very doubtful that anyone is > going to notice the contribution that freezing makes to the overall > workload. They're much more likely to notice an annoying autovacuum > than they are to notice a bit of excess freezing that ends up getting > reversed. But when you start cranking the time between writing data > and freezing it down into the single-digit numbers of minutes, and > even more if you push down to tens of seconds or less, now I think > people are going to care more about useless freezing work than about > long-term autovacuum risks. Because now their database is really busy > so they care a lot about performance, and seemingly most of the data > involved is ephemeral anyway. I think that that's all true. As I believe I pointed out at least once in the past, my thinking about the practical requirements from users was influenced by this paper/talk: https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/ For the types of applications that Postgres is a natural fit for, most of the data is cold -- very cold ("freezing" cold, even). If you have a 50GB pgbench_accounts table, Postgres will always perform suboptimally. But so will SQL Server, Oracle, and MySQL. While pgbench makes a fine stress-test, for the most part its workload is highly unrealistic. And yet we seem to think that it's just about the most important benchmark of all. If we're not willing to get over even small regressions in pgbench, I fear we'll never make significant progress in this area. It's at least partly a cultural problem IMV. The optimal freezing strategy for pgbench_accounts is to never freeze, even once. It doesn't matter if you vary the distributions of the accounts table updates -- it's still inevitable that each and every row will get updated before too long. In fact, it's not just freezing that should always be avoided here -- the same applies to pruning by VACUUM. As I suggested in my last email, the lesson offered by the pgbench_accounts table seems to be "never VACUUM at all, except perhaps to advance relfrozenxid" (which shouldn't actually require freezing even one page). If you haven't tuned heap fill factor, then you might want to VACUUM a bit, at first. But, overall, vacuuming is bad. That is the logical though absurd conclusion. It completely flies in the face of practical experience. -- Peter Geoghegan
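(As an aside for anyone experimenting with these trade-offs: the pg_visibility extension gives a quick read on how much of a table is currently all-frozen. A sketch; the table name below is just a placeholder:)

    CREATE EXTENSION IF NOT EXISTS pg_visibility;

    VACUUM (FREEZE) some_table;   -- freezes as if vacuum_freeze_min_age were 0 for this run
    SELECT all_visible, all_frozen
    FROM pg_visibility_map_summary('some_table');

    -- Re-running the summary after later updates shows how much of that
    -- freezing work has since been undone.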
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Fri, Sep 29, 2023 at 4:06 PM Peter Geoghegan wrote: > > On Fri, Sep 29, 2023 at 11:45 AM James Coleman > wrote:my reading the issue is that "old versions" doesn't say > > anything about "old HOT versions; it seems to be describing what > > happens generally when a heap-only tuple is written -- which would > > include the first time a heap-only tuple is written. > > I think that it's talking about what happens during opportunistic > pruning, in particular what happens to HOT chains. (Though pruning > does almost the same amount of useful work with non-heap-only tuples, > so it's a bit unfortunate that the name "HOT pruning" seems to have > stuck.) That's very likely what the intention was. I read it again, and the same confusion still sticks out to me: it doesn't say anything explicitly about opportunistic pruning (I'm not sure if that term is "public docs" level, so that's probably fine), and it doesn't scope the claim to intermediate tuples in a HOT chain -- indeed the context is the HOT feature generally. This is why I discovered it: it says "indexes do not reference their page item identifiers", which is manifestly not true when talking about the root item, and in fact would defeat the whole purpose of HOT (at least in a old-to-new chain like Postgres uses). Assuming people can be convinced this is confusing (I realize you may not be yet), I see two basic options: 1. Update this to discuss both intermediate tuples and root items separately. This could entail either one larger paragraph or splitting such that instead of "two optimizations" we say "three" optimizations. 2. Change "old versions" to something like "intermediate versions in a series of updates". I prefer some form of (1) since it more fully describes the behavior, but we could tweak further for concision. > > And when it's the > > first heap-only tuple the "old version" would be the original version, > > which would not be a heap-only tuple. > > The docs say "Old versions of updated rows can be completely removed > during normal operation". Opportunistic pruning removes dead heap-only > tuples completely, and makes their line pointers LP_UNUSED right away. > But it can also entail removing storage for the original root item > heap tuple, and making its line pointer LP_REDIRECT right away (not > LP_DEAD or LP_UNUSED) at most once in the life of each HOT chain. So > yeah, we're not quite limited to removing storage for heap-only tuples > when pruning a HOT chain. Does that distinction really matter, though? Given pageinspect can show you the original tuple still exists and that the index still references it...I think it does. I suppose very few people go checking that out, of course, but I'd like to be precise. Regards, James Coleman
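(To make the pageinspect point above concrete, a sketch; the table name is invented, and the lp_flags codes come from src/include/storage/itemid.h: 0 = LP_UNUSED, 1 = LP_NORMAL, 2 = LP_REDIRECT, 3 = LP_DEAD:)

    CREATE EXTENSION IF NOT EXISTS pageinspect;
    CREATE TABLE hot_demo (id int PRIMARY KEY, val text);
    INSERT INTO hot_demo VALUES (1, 'a');
    UPDATE hot_demo SET val = 'b' WHERE id = 1;   -- HOT update: val is not indexed
    VACUUM hot_demo;                              -- prunes the page

    -- Item 1 should now be an LP_REDIRECT whose lp_off names the heap-only
    -- tuple's line pointer, while the index entry still carries TID (0,1):
    SELECT lp, lp_off, lp_flags, t_ctid
    FROM heap_page_items(get_raw_page('hot_demo', 0));

    SELECT itemoffset, ctid
    FROM bt_page_items('hot_demo_pkey', 1);       -- block 1 is the single leaf page here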
Re: [DOCS] HOT - correct claim about indexes not referencing old line pointers
On Fri, Sep 29, 2023 at 6:27 PM James Coleman wrote: > On Fri, Sep 29, 2023 at 4:06 PM Peter Geoghegan wrote: > > I think that it's talking about what happens during opportunistic > > pruning, in particular what happens to HOT chains. (Though pruning > > does almost the same amount of useful work with non-heap-only tuples, > > so it's a bit unfortunate that the name "HOT pruning" seems to have > > stuck.) > > That's very likely what the intention was. I read it again, and the > same confusion still sticks out to me: it doesn't say anything > explicitly about opportunistic pruning (I'm not sure if that term is > "public docs" level, so that's probably fine), and it doesn't scope > the claim to intermediate tuples in a HOT chain -- indeed the context > is the HOT feature generally. It doesn't mention opportunistic pruning by name, but it does say: "Old versions of updated rows can be completely removed during normal operation, including SELECTs, instead of requiring periodic vacuum operations." There is a strong association between HOT and pruning (particularly opportunistic pruning) in the minds of some hackers (and perhaps some users), because both features appeared together in 8.3, and both are closely related at the implementation level. It's nevertheless not quite accurate to say that HOT "provides two optimizations" -- since pruning (the second of the two bullet points) isn't fundamentally different for pages that don't have any HOT chains. Not at the level of the heap pages, at least (indexes are another matter). Explaining these sorts of distinctions through prose is very difficult. You really need diagrams for something like this IMV. Without that, the only way to make all of this less confusing is to avoid all discussion of pruning...but then you can't really make the point about breaking the dependency on VACUUM, which is a relatively important point -- one with real practical relevance. > This is why I discovered it: it says "indexes do not reference their > page item identifiers", which is manifestly not true when talking > about the root item, and in fact would defeat the whole purpose of HOT > (at least in a old-to-new chain like Postgres uses). Yeah, but...that's not what was intended. Obviously, the index hasn't changed, and we expect index scans to continue to give correct answers. So it is pretty strongly implied that it continues to point to something valid. > Assuming people can be convinced this is confusing (I realize you may > not be yet), I see two basic options: > > 1. Update this to discuss both intermediate tuples and root items > separately. This could entail either one larger paragraph or splitting > such that instead of "two optimizations" we say "three" optimizations. > > 2. Change "old versions" to something like "intermediate versions in a > series of updates". > > I prefer some form of (1) since it more fully describes the behavior, > but we could tweak further for concision. Bruce authored these docs. I was mostly just glad to have anything at all about HOT in the user-facing docs, quite honestly. -- Peter Geoghegan
Re: document the need to analyze partitioned tables
On Sat, Sep 30, 2023 at 12:39:43AM +0200, Laurenz Albe wrote: > On Fri, 2023-09-29 at 18:08 -0400, Bruce Momjian wrote: > > On Wed, Sep 6, 2023 at 05:53:56AM +0200, Laurenz Albe wrote: > > > > We may have different mental models here. This relates to the part > > > > that I wasn't keen on in your patch, i.e: > > > > > > > > + The partitions of a partitioned table are normal tables and get > > > > processed > > > > + by autovacuum > > > > > > > > While I agree that the majority of partitions are likely to be > > > > relkind='r', which you might ordinarily consider a "normal table", you > > > > just might change your mind when you try to INSERT or UPDATE records > > > > that would violate the partition constraint. Some partitions might > > > > also be themselves partitioned tables and others might be foreign > > > > tables. That does not really matter much when it comes to what > > > > autovacuum does or does not do, but I'm not really keen to imply in > > > > our documents that partitions are "normal tables". > > > > > > Agreed, there are differences between partitions and normal tables. > > > And this is not the place in the documentation where we would like to > > > get into detail about the differences. > > > > > > Attached is the next version of my patch. > > > > I adjusted your patch to be shorter and clearer, patch attached. I am > > planning to apply this back to PG 11. > > Thanks for looking at this. > > I am mostly fine with your version, but it does not directly state that > autovacuum does not process partitioned tables. I think this should be > clarified in the beginning. Very good point! Updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 9cf9d030a8..be1c522575 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -861,10 +861,16 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu -Partitioned tables are not processed by autovacuum. Statistics -should be collected by running a manual ANALYZE when it is -first populated, and again whenever the distribution of data in its -partitions changes significantly. +Partitioned tables do not directly store tuples and consequently +autovacuum does not VACUUM them. (Autovacuum does +perform VACUUM on table partitions just like other +tables.) Unfortunately, this also means that autovacuum doesn't +run ANALYZE on partitioned tables, and this +can cause suboptimal plans for queries that reference partitioned +table statistics. You can work around this problem by manually +running ANALYZE on partitioned tables when they +are first populated, and again whenever the distribution of data in +their partitions changes significantly.