Re: automating RangeTblEntry node support
On 18.02.24 00:06, Matthias van de Meent wrote:
> I'm not sure that the cleanup which is done when changing an RTE's rtekind is complete enough for this purpose. Things like inline_cte_walker change node->rtekind, which could leave residual junk data in fields that are currently dropped during serialization (as the rtekind specifically ignores those fields), but which would add overhead when the default omission is expected to handle these fields, as they could then contain junk. It looks like there is some care about zeroing now-unused fields, but I haven't checked that it covers all cases and fields to the extent required, so that removing this specialized serializer would have zero impact on size once the default omission patch is committed. An additional patch with a single function that clears junk fields from RTEs that changed kind would be appreciated: this is often hand-coded at the locations where the kind changes, but that's more sensitive to programmer error.

Yes, interesting idea. Or maybe an assert-like function that checks an existing structure for consistency. Or maybe both. I'll try this out.

In the meantime, if there are no remaining concerns, I propose to commit the first two patches:
- Remove custom Constraint node read/write implementations
- Remove custom _jumbleRangeTblEntry()
pg_restore option --clean
Hi, The --clean option of pg_restore drops each database object before it is recreated during the restore. However, dependencies such as foreign keys or views can prevent the object from being dropped. Is there a way to add the CASCADE option to force the deletion? Thanks for helping. Fabrice
Re: Switching XLog source from archive to streaming when primary available
On Tue, Feb 20, 2024 at 11:54 AM Japin Li wrote:
>
> On Tue, 20 Feb 2024 at 13:40, Bharath Rupireddy wrote:
> > On Mon, Feb 19, 2024 at 8:25 PM Japin Li wrote:
> >> [2]
> >> +# Ensure checkpoint doesn't come in our way
> >> +$primary->append_conf('postgresql.conf', qq(
> >> +min_wal_size = 2MB
> >> +max_wal_size = 1GB
> >> +checkpoint_timeout = 1h
> >> + autovacuum = off
> >> +));
> >>
> >> Keeping the same indentation might be better.
> >
> > The autovacuum line looks mis-indented in the patch file. However, I
> > now ran src/tools/pgindent/perltidyrc
> > src/test/recovery/t/041_wal_source_switch.pl on it.
>
> Thanks for updating the patch. It seems it still has the wrong indent.

Thanks. perltidyrc didn't complain about anything on v19. However, I kept the alignment the same as other TAP tests for multi-line append_conf. If that's not correct, I'll leave it to the committer to decide. PSA v20 patch.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

From beea9f0b8bbc76bb48dd1a5d64a5b52bafd09e6f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Tue, 20 Feb 2024 07:31:29 +
Subject: [PATCH v20] Allow standby to switch WAL source from archive to streaming

A standby typically switches to streaming replication (getting WAL from the primary) only when receiving WAL from the archive finishes (no more WAL left there) or fails for any reason. Reading WAL from the archive may not always be as efficient and fast as reading from the primary. This can be due to differences in disk types, IO costs, network latencies etc. All of these can impact the recovery performance on the standby and increase the replication lag on the primary. In addition, the primary keeps accumulating WAL needed for the standby while the standby reads WAL from the archive, because the standby's replication slot stays inactive. To avoid these problems, one can use this parameter to make the standby switch to streaming mode sooner.
This feature adds a new GUC that specifies the amount of time after which the standby attempts to switch the WAL source from WAL archive to streaming replication (getting WAL from the primary). However, the standby exhausts all the WAL present in pg_wal before switching. If the standby fails to switch to streaming mode, it falls back to archive mode.

Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  48
 doc/src/sgml/high-availability.sgml           |  15 ++-
 src/backend/access/transam/xlogrecovery.c     | 115 --
 src/backend/utils/misc/guc_tables.c           |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/include/access/xlogrecovery.h             |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/041_wal_source_switch.pl  | 110 +
 8 files changed, 287 insertions(+), 19 deletions(-)
 create mode 100644 src/test/recovery/t/041_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ffd711b7f2..2fbf1ad6e1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4872,6 +4872,54 @@ ANY num_sync (
+ streaming_replication_retry_interval (integer)
+
+ streaming_replication_retry_interval configuration parameter
+
+
+
+
+Specifies the amount of time after which the standby attempts to switch the WAL
+source from archive to streaming replication (i.e., getting WAL from the
+primary). However, the standby exhausts all the WAL present in the
+pg_wal directory before switching. If the standby
+fails to switch to streaming mode, it falls back to archive mode. If this
+parameter value is specified without units, it is taken as
+milliseconds. The default is 5min. With a lower value
+for this parameter, the standby makes frequent WAL source switch
+attempts. To avoid this, it is recommended to set a reasonable value.
+A setting of 0 disables the feature.
When disabled,
+the standby typically switches to streaming mode only after receiving WAL
+from the archive finishes (i.e., no more WAL left there) or fails for any
+reason. This parameter can only be set in the
+postgresql.conf file or on the server command
+line.
+
+
+ The standby may not always attempt to switch the source from WAL archive to
+ streaming replication at exact
+ streaming_replication_retry_interval intervals. For
+ example, if the parameter is set to 1min and
+ fetching a WAL file from the archive takes about 2min,
+ then the source switch attempt happens for the next WAL file after the
+ current WAL file fetched from the archive is fully applied.
+
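For anyone wanting to try the patch, the feature is driven entirely by the new parameter; a postgresql.conf fragment might look like this (the parameter exists only with this patch applied, and the name and default come from the patch, not from released PostgreSQL):

```
# Try switching from archive to streaming every minute instead of the
# default 5min; 0 disables the early-switch behavior entirely.
streaming_replication_retry_interval = '1min'
```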
Re: confirmed flush lsn seems to be move backward in certain error cases
On Sat, 17 Feb 2024 at 12:03, Amit Kapila wrote: > > On Fri, Feb 16, 2024 at 5:53 PM vignesh C wrote: > > > > > > After the insert operation is replicated to the subscriber, the > > subscriber will set the lsn value sent by the publisher in the > > replication origin (in my case it was 0/1510978). publisher will then > > send keepalive messages with the current WAL position in the publisher > > (in my case it was 0/15109B0), but subscriber will simply send this > > position as the flush_lsn to the publisher as there are no ongoing > > transactions. Then since the publisher is started, it will identify > > that publication does not exist and stop the walsender/apply worker > > process. When the apply worker is restarted, we will get the > > remote_lsn(in my case it was 0/1510978) of the origin and set it to > > origin_startpos. We will start the apply worker with this > > origin_startpos (origin's remote_lsn). This position will be sent as > > feedback to the walsender process from the below stack: > > run_apply_worker->start_apply->LogicalRepApplyLoop->send_feedback. > > It will use the following send_feedback function call of > > LogicalRepApplyLoop function as in below code here as nothing is > > received from walsender: > > LogicalRepApplyLoop function > > ... > > len = walrcv_receive(LogRepWorkerWalRcvConn, &buf, &fd); > > if (len != 0) > > { > >/* Loop to process all available data (without blocking). */ > >for (;;) > >{ > > CHECK_FOR_INTERRUPTS(); > > ... > >} > > } > > > > /* confirm all writes so far */ > > send_feedback(last_received, false, false); > > ... > > > > In send_feedback, we will set flushpos to replication origin's > > remote_lsn and send it to the walsender process. Walsender process > > will receive this information and set confirmed_flush in: > > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation > > > > Then immediately we are trying to stop the publisher instance, > > shutdown checkpoint process will be triggered. 
In this case:
> > confirmed_flush = 0/1510978 will be lesser than
> > last_saved_confirmed_flush = 0/15109B0 which will result in Assertion
> > failure.
> >
> > This issue is happening because we allow setting the confirmed_flush
> > to a backward position.
>
> I see your point.
>
> > There are a couple of ways to fix this:
> > a) One way is not to update the confirm_flush if the lsn sent is an
> > older value, like in Confirm_flush_dont_allow_backward.patch
> >
> > @@ -1839,7 +1839,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
> >   SpinLockAcquire(&MyReplicationSlot->mutex);
> > - MyReplicationSlot->data.confirmed_flush = lsn;
> + if (lsn > MyReplicationSlot->data.confirmed_flush)
> +  MyReplicationSlot->data.confirmed_flush = lsn;
>
>  /* if we're past the location required for bumping xmin, do so */
>  if (MyReplicationSlot->candidate_xmin_lsn != InvalidXLogRecPtr &&
> @@ -1904,7 +1905,8 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
>  else
>  {
>   SpinLockAcquire(&MyReplicationSlot->mutex);
> -  MyReplicationSlot->data.confirmed_flush = lsn;
> +  if (lsn > MyReplicationSlot->data.confirmed_flush)
> +   MyReplicationSlot->data.confirmed_flush = lsn;
>
> BTW, from which code path does it update the prior value of
> confirmed_flush?

The confirmed_flush is getting set in the else condition for this scenario.

> If it is through the else check, then can we see if
> it may change the confirm_flush to the prior position via the first
> code path? I am asking because in the first code path, we can even
> flush the re-treated value of confirm_flush LSN.

I was not able to find any scenario that sets a prior position with the first code path. I tried various scenarios like adding a delay in the walsender, adding a delay in the apply worker, restarting the instances, and various DML operations. It was always setting it either to the same value as before or to a greater value.

Regards,
Vignesh
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Feb 19, 2024 at 7:47 PM John Naylor wrote: > > On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada wrote: > > > > I think that vacuum and tidbitmap (and future users) would end up > > having the same max block size calculation. And it seems slightly odd > > layering to me that max-block-size-specified context is created on > > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by > > tidstore layer, and the passed context is used for leaves (if > > varlen-value is used) on radix tree layer. > > That sounds slightly more complicated than I was thinking of, but we > could actually be talking about the same thing: I'm drawing a > distinction between "used = must be detected / #ifdef'd" and "used = > actually happens to call allocation". I meant that the passed context > would _always_ be used for leaves, regardless of varlen or not. So > with fixed-length values short enough to live in child pointer slots, > that context would still be used for iteration etc. > > > Another idea is to create a > > max-block-size-specified context on the tidstore layer. That is, > > vacuum and tidbitmap pass a work_mem and a flag indicating whether the > > tidstore can use the bump context, and tidstore creates a (aset of > > bump) memory context with the calculated max block size and passes it > > to the radix tree. > > That might be a better abstraction since both uses have some memory limit. I've drafted this idea, and fixed a bug in tidstore.c. Here is the summary of updates from v62: - removed v62-0007 patch as we discussed - squashed v62-0006 and v62-0008 patches into 0003 patch - v63-0008 patch fixes a bug in tidstore. - v63-0009 patch is a draft idea of cleanup memory context handling. > > > As for using the bump memory context, I feel that we need to store > > iterator struct in aset context at least as it can be individually > > freed and re-created. Or it might not be necessary to allocate the > > iterator struct in the same context as radix tree. 
> > Okay, that's one thing I was concerned about. Since we don't actually > have a bump context yet, it seems simple to assume aset for non-nodes, > and if we do get it, we can adjust slightly. Anyway, this seems like a > good thing to try to clean up, but it's also not a show-stopper. > > On that note: I will be going on honeymoon shortly, and then to PGConf > India, so I will have sporadic connectivity for the next 10 days and > won't be doing any hacking during that time. Thank you for letting us know. Enjoy yourself! Regards -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v63-ART.tar.gz Description: GNU Zip compressed data
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Fri, Feb 9, 2024 at 1:12 PM Bertrand Drouvot wrote:
>
> I think "conflict" is an important topic and does contain several reasons. The
> slot "first" conflicts and then that leads to slot "invalidation".
>
> > They both are the same internally, so why
> > confuse the users?
>
> I don't think that would confuse the users; I do think that would be easier to
> check for conflicting slots.

I've added a separate column for invalidation reasons for now. I'll see what others think on this as time goes by.

> I did not look closely at the code, just played a bit with the patch and was able
> to produce something like:
>
> postgres=# select slot_name,slot_type,active,active_pid,wal_status,invalidation_reason from pg_replication_slots;
>  slot_name   | slot_type | active | active_pid | wal_status | invalidation_reason
> -------------+-----------+--------+------------+------------+---------------------
>  rep1        | physical  | f      |            | reserved   |
>  master_slot | physical  | t      | 1482441    | unreserved | wal_removed
> (2 rows)
>
> does that make sense to have an "active/working" slot "invalidated"?

Thanks. Can you please provide the steps to generate this error? Are you setting max_slot_wal_keep_size on the primary to generate "wal_removed"? Attached is the v5 patch set after rebasing and addressing review comments. Please review it further. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

v5-0001-Track-invalidation_reason-in-pg_replication_slots.patch Description: Binary data
v5-0002-Add-XID-based-replication-slot-invalidation.patch Description: Binary data
v5-0003-Track-inactive-replication-slot-information.patch Description: Binary data
v5-0004-Add-inactive_timeout-based-replication-slot-inval.patch Description: Binary data
Re: Patch: Add parse_type Function
On Tue, Feb 20, 2024 at 11:06 AM David E. Wheeler wrote:
>
> LOL, I missed that, just wrote it myself in the last hour. :-) v6 attached.

+SELECT to_regtypemod('interval nonesuch'); -- grammar error expected
+ERROR: syntax error at or near "nonesuch"
+LINE 1: SELECT to_regtypemod('interval nonesuch');
+ ^
+CONTEXT: invalid type name "interval nonesuch"
+SELECT to_regtypemod('year(4)'); -- grammar error expected
+ to_regtypemod
+---
+
+(1 row)
+

The second hint `-- grammar error expected` seems to contradict the results?
Re: Switching XLog source from archive to streaming when primary available
On Tue, 20 Feb 2024 at 13:40, Bharath Rupireddy wrote:
> On Mon, Feb 19, 2024 at 8:25 PM Japin Li wrote:
>> [2]
>> +# Ensure checkpoint doesn't come in our way
>> +$primary->append_conf('postgresql.conf', qq(
>> +min_wal_size = 2MB
>> +max_wal_size = 1GB
>> +checkpoint_timeout = 1h
>> + autovacuum = off
>> +));
>>
>> Keeping the same indentation might be better.
>
> The autovacuum line looks mis-indented in the patch file. However, I
> now ran src/tools/pgindent/perltidyrc
> src/test/recovery/t/041_wal_source_switch.pl on it.

Thanks for updating the patch. It seems it still has the wrong indent.

diff --git a/src/test/recovery/t/041_wal_source_switch.pl b/src/test/recovery/t/041_wal_source_switch.pl
index 082680bf4a..b5eddba1d5 100644
--- a/src/test/recovery/t/041_wal_source_switch.pl
+++ b/src/test/recovery/t/041_wal_source_switch.pl
@@ -18,9 +18,9 @@ $primary->init(
 # Ensure checkpoint doesn't come in our way
 $primary->append_conf(
 	'postgresql.conf', qq(
-min_wal_size = 2MB
-max_wal_size = 1GB
-checkpoint_timeout = 1h
+	min_wal_size = 2MB
+	max_wal_size = 1GB
+	checkpoint_timeout = 1h
 	autovacuum = off
 ));
 $primary->start;
@@ -85,7 +85,7 @@ my $offset = -s $standby->logfile;
 my $apply_delay = $retry_interval * 5;
 $standby->append_conf(
 	'postgresql.conf', qq(
-recovery_min_apply_delay = '${apply_delay}ms'
+	recovery_min_apply_delay = '${apply_delay}ms'
 ));
 $standby->start;
Re: Shared detoast Datum proposal
Hi, I did another round of self-review. Comments, variable names, and the order of function definitions are improved so that it can be read as smoothly as possible, so v6 is attached. -- Best Regards Andy Fan

>From f2e7772228e8a18027b9c29f10caba9c6570d934 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh"
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v6 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for a variable length of bits, Bitset is designed for a fixed length of bits; the fixed length must be specified at the bitset_init stage and stays unchanged for the whole lifespan. Because of this, some operations on a Bitset are simpler than on a Bitmapset. The bitset_clear function unsets all the bits but keeps the allocated memory; this capability is impossible for Bitmapset for some solid reasons, and this is the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c                 | 200 +-
 src/backend/nodes/outfuncs.c                  |  51 +
 src/include/nodes/bitmapset.h                 |  28 +++
 src/include/nodes/nodes.h                     |   4 +
 src/test/modules/test_misc/Makefile           |  11 +
 src/test/modules/test_misc/README             |   4 +-
 .../test_misc/expected/test_bitset.out        |   7 +
 src/test/modules/test_misc/meson.build        |  17 ++
 .../modules/test_misc/sql/test_bitset.sql     |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c        | 118 +++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list              |   1 +
 13 files changed, 441 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c
b/src/backend/nodes/bitmapset.c
index 65805d4527..40cfea2308 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1351,6 +1346,19 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	return -2;
 }
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
 *
@@ -1458,3 +1466,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 					  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset. the set will be round up to nwords;
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			wo
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Sat, Feb 17, 2024 at 10:27 AM Bharath Rupireddy wrote: > > On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis wrote: > > > > > Here, I'm with v23 patch set: > > > > Thank you, I'll look at these. > > Thanks. Here's the v24 patch set after rebasing. Ran pgperltidy on the new TAP test file added. Please see the attached v25 patch set. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v25-0001-Add-check-in-WALReadFromBuffers-against-requeste.patch Description: Binary data v25-0002-Add-test-module-for-verifying-read-from-WAL-buff.patch Description: Binary data v25-0003-Use-WALReadFromBuffers-in-more-places.patch Description: Binary data v25-0004-Demonstrate-reading-unflushed-WAL-directly-from-.patch Description: Binary data
Re: Add test module for verifying backtrace functionality
On Tue, Feb 13, 2024 at 2:11 AM Bharath Rupireddy wrote: > > Hi, > > Postgres has a good amount of code for dealing with backtraces - two > GUCs backtrace_functions and backtrace_on_internal_error, > errbacktrace; all of which use core function set_backtrace from > elog.c. I've not seen this code being tested at all, see code coverage > report - > https://coverage.postgresql.org/src/backend/utils/error/elog.c.gcov.html. > > I think adding a simple test module (containing no .c files) with only > TAP tests will help cover this code. I ended up having it as a > separate module under src/test/modules/test_backtrace as I was not > able to find an existing TAP file in src/test to add these tests. I'm > able to verify the backtrace related code with the attached patch > consistently. The TAP tests rely on the fact that the server emits > text "BACKTRACE: " to server logs before logging the backtrace, and > the backtrace contains the function name in which the error occurs. > I've turned off query statement logging (set log_statement = none, > log_min_error_statement = fatal) so that the tests get to see the > functions only in the backtrace. Although the CF bot is happy with the > attached patch > https://github.com/BRupireddy2/postgres/tree/add_test_module_for_bcktrace_functionality_v1, > there might be some more flakiness to it. > > Thoughts? Ran pgperltidy on the new TAP test file added. Please see the attached v2 patch. 
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com

From 2cfe9c87dea980efa2f319e9c2e873e83e561b9e Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Tue, 20 Feb 2024 05:48:41 +
Subject: [PATCH v2] Add test module for backtrace functionality
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 src/test/modules/test_backtrace/.gitignore    |  4 +
 src/test/modules/test_backtrace/Makefile      | 14 +++
 src/test/modules/test_backtrace/README        |  4 +
 src/test/modules/test_backtrace/meson.build   | 12 +++
 .../modules/test_backtrace/t/001_backtrace.pl | 98 +++
 7 files changed, 134 insertions(+)
 create mode 100644 src/test/modules/test_backtrace/.gitignore
 create mode 100644 src/test/modules/test_backtrace/Makefile
 create mode 100644 src/test/modules/test_backtrace/README
 create mode 100644 src/test/modules/test_backtrace/meson.build
 create mode 100644 src/test/modules/test_backtrace/t/001_backtrace.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..3967311048 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -13,6 +13,7 @@ SUBDIRS = \
 	libpq_pipeline \
 	plsample \
 	spgist_name_ops \
+	test_backtrace \
 	test_bloomfilter \
 	test_copy_callbacks \
 	test_custom_rmgrs \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..3b21b389af 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -12,6 +12,7 @@ subdir('libpq_pipeline')
 subdir('plsample')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
+subdir('test_backtrace')
 subdir('test_bloomfilter')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
diff --git a/src/test/modules/test_backtrace/.gitignore b/src/test/modules/test_backtrace/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_backtrace/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_backtrace/Makefile b/src/test/modules/test_backtrace/Makefile
new file mode 100644
index 00..b908864e89
--- /dev/null
+++ b/src/test/modules/test_backtrace/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/test_backtrace/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_backtrace
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_backtrace/README b/src/test/modules/test_backtrace/README
new file mode 100644
index 00..ae1366bd58
--- /dev/null
+++ b/src/test/modules/test_backtrace/README
@@ -0,0 +1,4 @@
+This directory doesn't actually contain any extension module.
+
+Instead it is a home for backtrace tests that we want to run using TAP
+infrastructure.
diff --git a/src/test/modules/test_backtrace/meson.build b/src/test/modules/test_backtrace/meson.build
new file mode 100644
index 00..4adffcc914
--- /dev/null
+++ b/src/test/modules/test_backtrace/meson.build
@@ -0,0 +1,12 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+tests += {
+  'name': 'test_backtrace',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_backtrace.pl',
+    ],
+  },
Re: partitioning and identity column
20.02.2024 07:57, Ashutosh Bapat wrote: Could you please name functions, which you suspect, for me to recheck them? Perhaps we should consider fixing all of such functions, in light of b0f7dd915 and d57b7cc33... Looks like the second commit has fixed all other places I knew except Identity related functions. So worth fixing identity related functions too. I see dropconstraint_internal() has two calls to check_stack_depth() back to back. The second one is not needed? Yeah, that's funny. It looks like such a double protection emerged because Alvaro protected the function (in b0f7dd915), which was waiting for adding check_stack_depth() in the other thread (resulted in d57b7cc33). Thank you for spending time on this! Best regards, Alexander
Re: logical decoding and replication of sequences, take 2
On Thu, Dec 21, 2023 at 6:47 PM Tomas Vondra wrote:
>
> On 12/19/23 13:54, Christophe Pettus wrote:
> > Hi,
> >
> > I wanted to hop in here on one particular issue:
> >
> >> On Dec 12, 2023, at 02:01, Tomas Vondra
> >> wrote:
> >> - desirability of the feature: Random IDs (UUIDs etc.) are likely a much
> >> better solution for distributed (esp. active-active) systems. But there
> >> are important use cases that are likely to keep using regular sequences
> >> (online upgrades of single-node instances, existing systems, ...).
> >
> > +1.
> >
> > Right now, the lack of sequence replication is a rather large
> > foot-gun on logical replication upgrades. Copying the sequences
> > over during the cutover period is doable, of course, but:
> >
> > (a) There's no out-of-the-box tooling that does it, so everyone has
> > to write some scripts just for that one function.
> >
> > (b) It's one more thing that extends the cutover window.
>
> I agree it's an annoying gap for this use case. But if this is the only
> use case, maybe a better solution would be to provide such tooling
> instead of adding it to the logical decoding?
>
> It might seem a bit strange if most data is copied by replication
> directly, while sequences need special handling, ofc.

One difference between the logical replication of tables and sequences is that with synchronous_commit (and synchronous_standby_names) we can guarantee that after failover a transaction's data is either replicated or not, whereas for sequences we can't guarantee that because of their non-transactional nature. Say there are two transactions, T1 and T2: it is possible that T1's entire table data and sequence data are committed and replicated, but only T2's sequence data is replicated. So, after failover to the logical subscriber in such a case, if one routes T2 again to the new node (as it was not successful previously), then it would needlessly perform the sequence changes again.
I don't know how much that matters, but that would probably be the difference between the replication of tables and sequences. I agree with your point above that for upgrades, some tool like pg_copysequence, where we can provide a way to copy sequence data to subscribers from the publisher, would suffice the need. -- With Regards, Amit Kapila.
Re: Switching XLog source from archive to streaming when primary available
On Mon, Feb 19, 2024 at 8:25 PM Japin Li wrote: > > > Strengthened tests a bit by using recovery_min_apply_delay to mimic > > standby spending some time fetching from archive. PSA v18 patch. > > Here are some minor comments: Thanks for taking a look at it. > [1] > +primary). However, the standby exhausts all the WAL present in pg_wal > > s|pg_wal|pg_wal|g Done. > [2] > +# Ensure checkpoint doesn't come in our way > +$primary->append_conf('postgresql.conf', qq( > +min_wal_size = 2MB > +max_wal_size = 1GB > +checkpoint_timeout = 1h > + autovacuum = off > +)); > > Keeping the same indentation might be better. The autovacuum line looks mis-indented in the patch file. However, I now ran src/tools/pgindent/perltidyrc src/test/recovery/t/041_wal_source_switch.pl on it. Please see the attached v19 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v19-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch Description: Binary data
Re: JIT compilation per plan node
On Tue, 20 Feb 2024 at 18:31, Tom Lane wrote: > FWIW, I seriously doubt that an extra walk of the plan tree is even > measurable compared to the number of cycles JIT compilation will > expend if it's called. So I don't buy your argument here. > We would be better off to do this in a way that's clean and doesn't > add overhead for non-JIT-enabled builds. The extra walk of the tree would need to be done for every plan, not just the ones where we do JIT. I'd rather find a way to not add this extra plan tree walk, especially since the vast majority of cases on an average instance won't be doing any JIT. David
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote: > Prefixing these with "initial_" is fine, IMO. That shows the > intention that these come from the slot's data before doing the > termination. So I'm OK with what's been proposed in v3. I was looking at that a second time, and just concluded that this is OK, so I've applied that down to 16, wordsmithing a bit the comments. > Both are the same to me, so I have no extra opinion to offer. ;) I've kept this one as-is, though. -- Michael signature.asc Description: PGP signature
Re: JIT compilation per plan node
David Rowley writes: > On Tue, 20 Feb 2024 at 05:26, Tomas Vondra > wrote: >> Wouldn't it be simpler to just build the plan as we do now, and then >> have an expression_tree_walker that walks the complete plan top-down, >> inspects the nodes, enables JIT where appropriate and so on? That can >> have arbitrary context, no problem with that. > Why walk the entire plan tree again to do something we could do when > building it in the first place? FWIW, I seriously doubt that an extra walk of the plan tree is even measurable compared to the number of cycles JIT compilation will expend if it's called. So I don't buy your argument here. We would be better off to do this in a way that's clean and doesn't add overhead for non-JIT-enabled builds. regards, tom lane
Re: JIT compilation per plan node
On Tue, 20 Feb 2024 at 05:26, Tomas Vondra wrote: > I doubt CreatePlanContext is a great way to achieve this. For one, it > breaks the long-standing custom that PlannerInfo is the first parameter, > usually followed by RelOptInfo, etc. CreatePlanContext is added to some > functions (but not all), which makes it ... unpredictable. I suggested this to Melih as a way to do this based on what Andy wrote in [1]. I agree with Andy that it's not great to add est_calls to every function in createplan.c. I feel that CreatePlanContext is a way to take the hit of adding a parameter once and hopefully never having to do it again to this degree. I wondered if PlannerInfo could be a field in CreatePlanContext. > FWIW it's not clear to me if/how this solves the problem with early > create_plan() calls for subplans. Or is it still the same? > > Wouldn't it be simpler to just build the plan as we do now, and then > have an expression_tree_walker that walks the complete plan top-down, > inspects the nodes, enables JIT where appropriate and so on? That can > have arbitrary context, no problem with that. Why walk the entire plan tree again to do something we could do when building it in the first place? Recursively walking trees isn't great from a performance point of view. It would be nice to avoid this if we can find some other way to handle subplans. I do have a few other reasons up my sleeve that subplan creation should be delayed until later, so maybe we should fix that to unblock those issues. > Considering we decide JIT pretty late anyway (long after costing and > other stuff that might affect the plan selection), the result should be > exactly the same, without the extensive createplan.c disruption ... 
> > (usual caveat: I haven't tried, maybe there's something that means this > can't work) It's not like we can look at the top-node's cost as a pre-check to skip the recursive step for cheap plans as it's perfectly valid that a node closer to the root of the plan tree have a lower total cost than that node's subnodes. e.g LIMIT 1. > > - The number of estimated calls are especially useful where a node is > > expected to be rescanned, such as Gather. Gather Merge, Memoize and Nested > > Loop. Knowing the estimated number of calls for a node allows us to rely on > > total cost multiplied by the estimated calls instead of only total cost for > > the node. > > Not sure I follow. Why would be these nodes (Gather, Gather Merge, ...) > more likely to be rescanned compared to other nodes? I think Melih is listing nodes that can change the est_calls. Any node can be rescanned, but only a subset of nodes can adjust the number of times they call their subnode vs how many times they themselves are called. > > - For each node, the planner considers if the node should be JITed. If the > > cost of the node * the number of estimated calls is greater than > > jit_above_cost, it's decided to be JIT compiled. Note that this changes > > the meaning of jit_above_cost, it's now a threshold for a single plan node > > and not the whole query. Additionally, this change in JIT consideration is > > only for JIT compilations. Inlining and optimizations continue to be for > > the whole query and based on the overall cost of the query. > > It's not clear to me why JIT compilation is decided for each node, while > the inlining/optimization is decided for the plan as a whole. I'm not > familiar with the JIT stuff, so maybe it's obvious to others ... This is a problem with LLVM, IIRC. The problem is it's a decision that has to be made for an entire compilation unit and it can't be decided at the expression level. 
This is pretty annoying as it's pretty hard to decide the best logic to use to enable optimisations and inlining :-( I think the best thing we could discuss right now is, is this the best way to fix the JIT costing problem. In [2] I did link to a complaint about the JIT costings. See [3]. The OP there wanted to keep the plan private, but I did get to see it and described the problem on the list. Also, I don't happen to think the decision about JITting per plan node is perfect as the node's total costs can be high for reasons other than the cost of evaluating expressions. Also, the number of times a given expression is evaluated can vary quite a bit based on when the expression is evaluated. For example, a foreign table scan that does most of the filtering remotely, but has a non-shippable expr that needs to be evaluated locally. The foreign scan might be very expensive, especially if lots of filtering is done by a Seq Scan and not many rows might make it back to the local server to benefit from JITting the non-shippable expression. A counter-example is the join condition of a non-parameterized nested loop. Those get evaluated n_outer_rows * n_inner_rows times. I think the savings JIT gives us on evaluation of expressions is going to be more closely tied to the number of times an expression is evaluated than the total
Re: Memory consumed by paths during partitionwise join planning
On Tue, Feb 20, 2024 at 8:19 AM Andrei Lepikhov wrote: > > On 19/2/2024 19:25, Ashutosh Bapat wrote: > > On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov > > wrote: > >> Live example: right now, I am working on the code like MSSQL has - a > >> combination of NestLoop and HashJoin paths and switching between them in > >> real-time. It requires both paths in the path list at the moment when > >> extensions are coming. Even if one of them isn't referenced from the > >> upper pathlist, it may still be helpful for the extension. > > > > There is no guarantee that every path presented to add_path will be > > preserved. Suboptimal paths are freed as and when add_path discovers > > that they are suboptimal. So I don't think an extension can rely on > > existence of a path. But having a refcount makes it easy to preserve > > the required paths by referencing them. > I don't insist, just provide my use case. It would be ideal if you would > provide some external routines for extensions that allow for sticking > the path in pathlist even when it has terrible cost estimation. With refcounts you can reference it and store it somewhere other than pathlist. The path won't be lost until it is dereferenced. RelOptInfo::Pathlist is for optimal paths. > > > >> > >> > > IIUC, you are suggesting that instead of planning each > > partition/partitionwise join, we only create paths with the strategies > > which were found to be optimal with previous partitions. That's a good > > heuristic but it won't work if partition properties - statistics, > > indexes etc. differ between groups of partitions. > Sure, but the "Symmetry" strategy assumes that on the scope of a > thousand partitions, especially with parallel append involved, it > doesn't cause sensible performance degradation if we find a bit > suboptimal path in a small subset of partitions. Does it make sense? 
> As I see, when people use 10-100 partitions for the table, they usually > strive to keep indexes symmetrical for all partitions. > I agree that we need something like that. In order to do that, we need machinery to prove that all partitions have similar properties. Once that is proved, we can skip creating paths for similar partitions. But that's out of scope of this work and complements it. -- Best Wishes, Ashutosh Bapat
Re: logical decoding and replication of sequences, take 2
On Fri, Feb 16, 2024 at 1:57 AM Tomas Vondra wrote: > For me, the part that I feel most uneasy about is the decoding while the > snapshot is still being built (and can flip to consistent snapshot > between the relfilenode creation and sequence change, confusing the > logic that decides which changes are transactional). > > It seems "a bit weird" that we either keep the "simple" logic that may > end up with incorrect "non-transactional" result, but happens to then > work fine because we immediately discard the change. > > But it still feels better than the alternative, which requires us to > start decoding stuff (relfilenode creation) before building a proper > snapshot is consistent, which we didn't do before - or at least not in > this particular way. While I don't have a practical example where it > would cause trouble now, I have a nagging feeling it might easily cause > trouble in the future by making some new features harder to implement. I don't understand the issues here well enough to comment. Is there a good write-up someplace I can read to understand the design here? Is the rule that changes are transactional if and only if the current transaction has assigned a new relfilenode to the sequence? Why does the logic get confused if the state of the snapshot changes? My naive reaction is that it kinda sounds like you're relying on two different mistakes cancelling each other out, and that might be a bad idea, because maybe there's some situation where they don't. But I don't understand the issue well enough to have an educated opinion at this point. -- Robert Haas EDB: http://www.enterprisedb.com
Re: partitioning and identity column
On Mon, Feb 19, 2024 at 8:30 PM Alexander Lakhin wrote: > > Hello Ashutosh, > > 19.02.2024 15:17, Ashutosh Bapat wrote: > > > >> Functions ATExecAddIdentity() and ATExecDropIdentity() are recursive too, > >> so I think they can be exploited as well. > > not just Identity related functions, but many other functions in > > tablecmds.c have that problem as I mentioned earlier. > > > > Could you please name functions, which you suspect, for me to recheck them? > Perhaps we should consider fixing all of such functions, in light of > b0f7dd915 and d57b7cc33... Looks like the second commit has fixed all other places I knew except Identity related functions. So worth fixing identity related functions too. I see dropconstraint_internal() has two calls to check_stack_depth() back to back. The second one is not needed? -- Best Wishes, Ashutosh Bapat
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Tue, Feb 20, 2024 at 5:04 AM Ian Lawrence Barwick wrote: > > > With the addition of "pg_sync_replication_slots()", there is now a use-case > for > including "dbname" in "primary_conninfo" and the docs have changed from > stating [1]: > > Do not specify a database name in the primary_conninfo string. > > to [2]: > > For replication slot synchronization (see Section 48.2.3), it is also > necessary to specify a valid dbname in the primary_conninfo string. This > will > only be used for slot synchronization. It is ignored for streaming. > > [1] > https://www.postgresql.org/docs/16/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY > [2] > https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY > > However, when setting up a standby (with the intent of creating a logical > standby) with pg_basebackup, providing the -R/-write-recovery-conf option > results in a "primary_conninfo" string being generated without a "dbname" > parameter, which requires a manual change should one intend to use > "pg_sync_replication_slots()". > > I can't see any reason for continuing to omit "dbname", so suggest it should > only continue to be omitted for 16 and earlier; see attached patch. > > Note that this does mean that if the conninfo string provided to pg_basebackup > does not contain "dbname", the generated "primary_conninfo" GUC will default > to > "dbname=replication". 
> When I tried, it defaulted to user name: Default case connection string: primary_conninfo = 'user=KapilaAm passfile=''C:\Users\kapilaam\AppData\Roaming/postgresql/pgpass.conf'' channel_binding=disable dbname=KapilaAm port=5432 sslmode=disable sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable' When I specified -d "dbname = postgres" during backup: primary_conninfo = 'user=KapilaAm passfile=''C:\Users\kapilaam\AppData\Roaming/postgresql/pgpass.conf'' channel_binding=disable dbname=KapilaAm port=5432 sslmode=disable sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable' I think it makes sense to include dbname in the connection string programmatically (with some option) for slot sync functionality but it is not clear to me if there is any impact of the same when the standby is not set up to sync up logical slots. I haven't checked but if there is a way to distinguish the case where we write dbname only when specified by the user then that would be great but this is worth considering even without that. -- With Regards, Amit Kapila.
Re: RFC: Logging plan of the running query
On Fri, Feb 16, 2024 at 12:29 AM Andres Freund wrote: > If we went with something like tht approach, I think we'd have to do something > like redirecting node->ExecProcNode to a wrapper, presumably from within a > CFI. That wrapper could then implement the explain support, without slowing > down the normal execution path. That's an annoying complication; maybe there's some better way to handle this. But I think we need to do something different than what the patch does currently because... > > It's really hard for me to accept that the heavyweight lock problem > > for which the patch contains a workaround is the only one that exists. > > I can't see any reason why that should be true. > > I suspect you're right. ...I think the current approach is just plain dead, because of this issue. We can't take an approach that creates an unbounded number of unclear reentrancy issues and then insert hacks one by one to cure them (or hack around them, more to the point) as they're discovered. The premise has to be that we only allow logging the query plan at points where we know it's safe, rather than, as at present, allowing it in places that are unsafe and then trying to compensate with code elsewhere. That's not likely to ever be as stable as we want PostgreSQL to be. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Optimize planner memory consumption for huge arrays
On 20/2/2024 04:51, Tom Lane wrote: Tomas Vondra writes: On 2/19/24 16:45, Tom Lane wrote: Tomas Vondra writes: For example, I don't think we expect selectivity functions to allocate long-lived objects, right? So maybe we could run them in a dedicated memory context, and reset it aggressively (after each call). Here's a quick and probably-incomplete implementation of that idea. I've not tried to study its effects on memory consumption, just made sure it passes check-world. Thanks for the sketch. The trick with the planner_tmp_cxt_depth especially looks interesting. I think we should design small memory contexts - one per scalable direction of memory utilization, like selectivity or partitioning (appending ?). My coding experience shows that short-lived GEQO memory context forces people to learn on Postgres internals more precisely :). -- regards, Andrei Lepikhov Postgres Professional
Re: WIP Incremental JSON Parser
On 2024-01-26 Fr 12:15, Andrew Dunstan wrote: On 2024-01-24 We 13:08, Robert Haas wrote: Maybe you should adjust your patch to dump the manifests into the log file with note(). Then when cfbot runs on it you can see exactly what the raw file looks like. Although I wonder if it's possible that the manifest itself is OK, but somehow it gets garbled when uploaded to the server, either because the client code that sends it or the server code that receives it does something that isn't safe in 32-bit mode. If we hypothesize an off-by-one error or a buffer overrun, that could possibly explain how one field got garbled while the rest of the file is OK. Yeah, I thought earlier today I was on the track of an off by one error, but I was apparently mistaken, so here's the same patch set with an extra patch that logs a bunch of stuff, and might let us see what's upsetting the cfbot. Well, that didn't help a lot, but meanwhile the CFBot seems to have decided in the last few days that it's now happy, so full steam ahead! ;-) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: POC, WIP: OR-clause support for indexes
On 20/2/2024 11:03, jian he wrote: Neither the code comments nor the commit message really explain the design idea here. That's unfortunate, principally because it makes review difficult. I'm very skeptical about the idea of using JumbleExpr for any part of this. It seems fairly expensive, and it might produce false matches. If expensive is OK, then why not just use equal()? If it's not, then this probably isn't really OK either. But in any case there should be comments explaining why this strategy was chosen. The above message (https://postgr.es/m/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com) seems still not answered. How can we evaluate whether JumbleExpr is expensive or not? I used this naive script to test, but didn't find a big difference when enable_or_transformation is ON or OFF. First, I am open to discussion here. But IMO, equal() operation is quite expensive by itself. We should use the hash table approach to avoid quadratic behaviour when looking for similar clauses in the 'OR' list. Moreover, we use equal() in many places: selectivity estimations, proving of fitting the index, predtest, etc. So, by reducing the clause list, we eliminate many calls of the equal() routine, too. `leftop operator rightop` the operator can also be volatile. Do we need to check (op_volatile(opno) == PROVOLATILE_VOLATILE) within transformBoolExprOr? As usual, could you provide a test case to discuss it more objectively? -- regards, Andrei Lepikhov Postgres Professional
Re: speed up a logical replica setup
Hi, On Tue, 20 Feb 2024 at 06:59, Euler Taveira wrote: > > On Mon, Feb 19, 2024, at 7:22 AM, Shlok Kyal wrote: > > I have reviewed the v21 patch. And found an issue. > > Initially I started the standby server with a new postgresql.conf file > (not the default postgresql.conf that is present in the instance). > pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf" > > And I have made 'max_replication_slots = 1' in new postgresql.conf and > made 'max_replication_slots = 0' in the default postgresql.conf file. > Now when we run pg_createsubscriber on standby we get error: > pg_createsubscriber: error: could not set replication progress for the > subscription "pg_createsubscriber_5_242843": ERROR: cannot query or > manipulate replication origin when max_replication_slots = 0 > > > That's by design. See [1]. The max_replication_slots parameter is used as the > maximum number of subscriptions on the server. > > NOTICE: dropped replication slot "pg_createsubscriber_5_242843" on publisher > pg_createsubscriber: error: could not drop publication > "pg_createsubscriber_5" on database "postgres": ERROR: publication > "pg_createsubscriber_5" does not exist > pg_createsubscriber: error: could not drop replication slot > "pg_createsubscriber_5_242843" on database "postgres": ERROR: > replication slot "pg_createsubscriber_5_242843" does not exist > > > That's a bug and should be fixed. > > [1] > https://www.postgresql.org/docs/current/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SUBSCRIBER > I think you misunderstood the issue I reported. 
My main concern is that the standby server is using different 'postgresql.conf' file (the default file) after : + /* Start subscriber and wait until accepting connections */ + pg_log_info("starting the subscriber"); + if (!dry_run) + start_standby_server(pg_ctl_path, opt.subscriber_dir, server_start_log); But we initially started the standby server (before running the pg_createsubscriber) with a new postgresql.conf file (different from the default file. for this example created inside the 'new_path' folder). pg_ctl -D ../standby -l standby.log start -o "-c config_file=../new_path/postgresql.conf" So, when we run the pg_createsubscriber, all the initial checks will be run on the standby server started using the new postgresql.conf file. But during pg_createsubscriber, it will restart the standby server using the default postgresql.conf file. And this might create some issues. Thanks and Regards, Shlok Kyal
Re: Optimize planner memory consumption for huge arrays
On 19/2/2024 20:47, Tomas Vondra wrote: On 9/8/23 07:11, Lepikhov Andrei wrote: Just for comparison, without partitioning:

elems      1      1E1      1E2     1E3      1E4
master:    12kB   14kB     37kB    266kB    2.5MB
patched:   12kB   11.5kB   13kB    24kB     141kB

These improvements look pretty nice, considering how simple the patch seems to be. I can't even imagine how much memory we'd need with even more partitions (say, 1000) if 100 partitions means 274MB. BTW when releasing memory in scalararraysel, wouldn't it be good to also free the elem_values/elem_nulls? I haven't tried and maybe it's not that significant amount. Agree. Added into the next version of the patch. Moreover, I see a slight planning speedup. Looking into the reason for that, I discovered that it is because sometimes the planner utilizes the same memory piece for the next array element. It finds this piece more quickly than before that optimization. -- regards, Andrei Lepikhov Postgres Professional

From 2e89dc8b743953068174c777d7a014e1ea71f659 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Tue, 20 Feb 2024 11:05:51 +0700
Subject: [PATCH] Utilize memory in scalararraysel in a more optimal manner.

When estimating selectivity of an array operation, free the memory right after working out a single element. This reduces peak memory consumption on massive arrays.
---
 src/backend/utils/adt/selfuncs.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cea777e9d4..e5bad75ec1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -281,6 +281,7 @@ eqsel_internal(PG_FUNCTION_ARGS, bool negate)
 		selec = var_eq_non_const(&vardata, operator, collation, other,
 								 varonleft, negate);

+	pfree(other);
 	ReleaseVariableStats(vardata);

 	return selec;
@@ -1961,15 +1962,15 @@ scalararraysel(PlannerInfo *root,
 		{
 			List	   *args;
 			Selectivity s2;
-
-			args = list_make2(leftop,
-							  makeConst(nominal_element_type,
-										-1,
-										nominal_element_collation,
-										elmlen,
-										elem_values[i],
-										elem_nulls[i],
-										elmbyval));
+			Const	   *c = makeConst(nominal_element_type,
+									  -1,
+									  nominal_element_collation,
+									  elmlen,
+									  elem_values[i],
+									  elem_nulls[i],
+									  elmbyval);
+
+			args = list_make2(leftop, c);
 			if (is_join_clause)
 				s2 = DatumGetFloat8(FunctionCall5Coll(&oprselproc,
 													  clause->inputcollid,
@@ -1985,7 +1986,8 @@ scalararraysel(PlannerInfo *root,
 													  ObjectIdGetDatum(operator),
 													  PointerGetDatum(args),
 													  Int32GetDatum(varRelid)));
-
+			list_free(args);
+			pfree(c);
 			if (useOr)
 			{
 				s1 = s1 + s2 - s1 * s2;
@@ -2004,6 +2006,9 @@ scalararraysel(PlannerInfo *root,
 		if ((useOr ? isEquality : isInequality) &&
 			s1disjoint >= 0.0 && s1disjoint <= 1.0)
 			s1 = s1disjoint;
+
+		pfree(elem_values);
+		pfree(elem_nulls);
 	}
 	else if (rightop && IsA(rightop, ArrayExpr) &&
 			 !((ArrayExpr *) rightop)->multidims)
@@ -2052,6 +2057,7 @@ scalararraysel(PlannerInfo *root,
 													  ObjectIdGetDat
Re: POC, WIP: OR-clause support for indexes
On Mon, Feb 19, 2024 at 4:35 PM Andrei Lepikhov wrote: > > In attachment - v17 for both patches. As I see it, the only general > explanation of the idea is not addressed. I'm not sure how deeply we > should explain it. > On Tue, Nov 28, 2023 at 5:04 AM Robert Haas wrote: > > On Mon, Nov 27, 2023 at 3:02 AM Andrei Lepikhov > wrote: > > On 25/11/2023 08:23, Alexander Korotkov wrote: > > > I think patch certainly gets better in this aspect. One thing I can't > > > understand is why do we use home-grown code for resolving > > > hash-collisions. You can just define custom hash and match functions > > > in HASHCTL. Even if we need to avoid repeated JumbleExpr calls, we > > > still can save pre-calculated hash value into hash entry and use > > > custom hash and match. This doesn't imply us to write our own > > > collision-resolving code. > > > > Thanks, it was an insightful suggestion. > > I implemented it, and the code has become shorter (see attachment). > > Neither the code comments nor the commit message really explain the > design idea here. That's unfortunate, principally because it makes > review difficult. > > I'm very skeptical about the idea of using JumbleExpr for any part of > this. It seems fairly expensive, and it might produce false matches. > If expensive is OK, then why not just use equal()? If it's not, then > this probably isn't really OK either. But in any case there should be > comments explaining why this strategy was chosen. The above message (https://postgr.es/m/CA%2BTgmoZCgP6FrBQEusn4yaWm02XU8OPeoEMk91q7PRBgwaAkFw%40mail.gmail.com) seems still not answered. How can we evaluate whether JumbleExpr is expensive or not? I used this naive script to test, but didn't find a big difference when enable_or_transformation is ON or OFF. 
`
create table test_1_100 as (
    select (random()*1000)::int x, (random()*1000) y
    from generate_series(1, 1_000_000) i);

explain(costs off, analyze)
select * from test
where x = 1 or x + 2 = 3 or x + 3 = 4 or x + 4 = 5 or x + 5 = 6
   or x + 6 = 7 or x + 7 = 8 or x + 8 = 9 or x + 9 = 10 or x + 10 = 11
   or x + 11 = 12 or x + 12 = 13 or x + 13 = 14 or x + 14 = 15 or x + 15 = 16
   or x + 16 = 17 or x + 17 = 18 or x + 18 = 19 or x + 19 = 20 or x + 20 = 21
   or x + 21 = 22 or x + 22 = 23 or x + 23 = 24 or x + 24 = 25 or x + 25 = 26
   or x + 26 = 27 or x + 27 = 28 or x + 28 = 29 or x + 29 = 30 or x + 30 = 31
\watch i=0.1 c=10
`

`leftop operator rightop` the operator can also be volatile. Do we need to check (op_volatile(opno) == PROVOLATILE_VOLATILE) within transformBoolExprOr?
Re: Synchronizing slots from primary to standby
On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada wrote: > > Some comments not related to the patch but to the existing code: > > --- > It might have already been discussed but is the > src/backend/replication/logical the right place for the slotsync.c? If > it's independent of logical decoding/replication, is under > src/backend/replication could be more appropriate? > This point has not been discussed, so thanks for raising it. I think the reasoning behind keeping it in logical is that this file contains the code for logical slot syncing and a worker doing that. As it is mostly about logical slot syncing so there is an argument to keep it under logical. In the future, we may need to extend this functionality to have a per-db slot sync worker as well in which case it will probably be again somewhat related to logical slots. Having said that, there is an argument to keep it under replication as well because the functionality it provides is for physical standbys. -- With Regards, Amit Kapila.
Re: Synchronizing slots from primary to standby
On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada wrote: > > > I've reviewed the v91 patch. Here are random comments: Thanks for the comments. > --- > /* > * Checks the remote server info. > * > - * We ensure that the 'primary_slot_name' exists on the remote server and the > - * remote server is not a standby node. > + * Check whether we are a cascading standby. For non-cascading standbys, it > + * also ensures that the 'primary_slot_name' exists on the remote server. > */ > > IIUC what the validate_remote_info() does doesn't not change by this > patch, so the previous comment seems to be clearer to me. > > --- > if (remote_in_recovery) > + { > + /* > +* If we are a cascading standby, no need to check further for > +* 'primary_slot_name'. > +*/ > ereport(ERROR, > errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot synchronize replication slots from a > standby server")); > + } > + else > + { > + boolprimary_slot_valid; > > - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > - Assert(!isnull); > + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > + Assert(!isnull); > > - if (!primary_slot_valid) > - ereport(ERROR, > - errcode(ERRCODE_INVALID_PARAMETER_VALUE), > - errmsg("bad configuration for slot synchronization"), > - /* translator: second %s is a GUC variable name */ > - errdetail("The replication slot \"%s\" specified by > \"%s\" does not exist on the primary server.", > - PrimarySlotName, "primary_slot_name")); > + if (!primary_slot_valid) > + ereport(ERROR, > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("bad configuration for slot synchronization"), > + /* translator: second %s is a GUC variable name */ > + errdetail("The replication slot \"%s\" specified > by \"%s\" does not exist on the primary server.", > + PrimarySlotName, "primary_slot_name")); > + } > > I think it's a refactoring rather than changes required by the > slotsync worker. 
We can do that in a separate patch but why do we need > this change in the first place? In v90, this refactoring was made due to the fact that validate_remote_info() was supposed to behave differently for SQL function and slot-sync worker. SQL-function was supposed to ERROR out while the worker was supposed to LOG and become no-op. And thus the change was needed. In v91, we made this functionality same i.e. both sql function and worker will error out but missed to remove this refactoring. Thanks for catching this, I will revert it in the next version. To match the refactoring, I made the comment change too, will revert that as well. > --- > +ValidateSlotSyncParams(ERROR); > + > /* Load the libpq-specific functions */ > load_file("libpqwalreceiver", false); > > -ValidateSlotSyncParams(); > +(void) CheckDbnameInConninfo(); > > Is there any reason why we move where to check the parameters? Earlier DBname verification was done inside ValidateSlotSyncParams() and thus it was needed to load 'libpqwalreceiver' before we call this function. Now we have moved dbname verification in a separate call and thus the above order got changed. ValidateSlotSyncParams() is a common function used by SQL function and worker. Earlier slot sync worker was checking all the GUCs after starting up and was exiting each time any GUC was invalid. It was suggested that it would be better to check the GUCs before starting the slot sync worker in the postmaster itself, making the ValidateSlotSyncParams() move to postmaster (see SlotSyncWorkerAllowed). But it was not a good idea to load libpq in postmaster and thus we moved libpq related verification out of ValidateSlotSyncParams(). This resulted in the above change. I hope it answers your query. thanks Shveta
Re: POC, WIP: OR-clause support for indexes
On 19/2/2024 19:53, Ranier Vilela wrote: v17-0002 1) move the vars *arrayconst and *dest, to after if, to avoid makeNode (palloc). + Const *arrayconst; + ScalarArrayOpExpr *dest; + + pd = (PredicatesData *) lfirst(lc); + if (pd->elems == NIL) + /* The index doesn't participate in this operation */ + continue; + arrayconst = lsecond_node(Const, saop->args); + dest = makeNode(ScalarArrayOpExpr); Thanks for the review! I'm not sure I understand you clearly. Does the patch in attachment fix the issue you raised? -- regards, Andrei Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 56b04541db..1545876e2f 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -1284,7 +1284,7 @@ build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, ArrayType *arrayval = NULL; ArrayExpr *arr = NULL; Const *arrayconst = lsecond_node(Const, saop->args); - ScalarArrayOpExpr *dest = makeNode(ScalarArrayOpExpr); + ScalarArrayOpExpr dest; pd = (PredicatesData *) lfirst(lc); if (pd->elems == NIL) @@ -1302,10 +1302,10 @@ build_paths_for_SAOP(PlannerInfo *root, RelOptInfo *rel, RestrictInfo *rinfo, arr->multidims = false; /* Compose new SAOP, partially covering the source one */ - memcpy(dest, saop, sizeof(ScalarArrayOpExpr)); - dest->args = list_make2(linitial(saop->args), arr); + memcpy(&dest, saop, sizeof(ScalarArrayOpExpr)); + dest.args = list_make2(linitial(saop->args), arr); - clause = (Expr *) estimate_expression_value(root, (Node *) dest); + clause = (Expr *) estimate_expression_value(root, (Node *) &dest); /* * Create new RestrictInfo. It maybe more heavy than just copy node,
Re: Patch: Add parse_type Function
On Feb 19, 2024, at 21:58, Erik Wienhold wrote: > See the patch I wrote for my benchmarks. But it's pretty easy anyway to > cut down parse_type() ;) LOL, I missed that, just wrote it myself in the last hour. :-) v6 attached. > But you don't actually need reformat_type() in pgTAP. You can just get > the type OID and modifier of the want_type and have_type and compare > those. Then use format_type() for the error message. Looks a bit > cleaner to me than doing the string comparison. Fair. > On second thought, I guess comparing the reformatted type names is > necessary in order to have a uniform API on older Postgres releases > where pgTAP has to provide its own to_regtypmod() based on typmodin > functions. Maybe. Worth playing with. >> For the latter, it could easily be an example in the docs. > > Can be mentioned right under format_type(). Well I included it in the to_regtypemod docs here, but could do either. Best, David v6-0001-Add-to_regtypemod-SQL-function.patch Description: Binary data
Re: Patch: Add parse_type Function
On 2024-02-19 23:59 +0100, David E. Wheeler wrote: > On Feb 19, 2024, at 15:47, Tom Lane wrote: > > >> 1. Add a to_regtypmod() for those who just want the typemod. > > > > Seems like there's a good case for doing that. > > I’ll work on that. See the patch I wrote for my benchmarks. But it's pretty easy anyway to cut down parse_type() ;) > > I'm less thrilled about that, mainly because I can't think of a good > > name for it ("format_type_string" is certainly not that). Is the > > use-case for this functionality really strong enough that we need to > > provide it as a single function rather than something assembled out > > of spare parts? > > Not essential for pgTAP, no, as we can just use the parts. It was the > typmod bit that was missing. But you don't actually need reformat_type() in pgTAP. You can just get the type OID and modifier of the want_type and have_type and compare those. Then use format_type() for the error message. Looks a bit cleaner to me than doing the string comparison. On second thought, I guess comparing the reformatted type names is necessary in order to have a uniform API on older Postgres releases where pgTAP has to provide its own to_regtypmod() based on typmodin functions. > On Feb 19, 2024, at 17:13, Tom Lane wrote: > > > After some time ruminating, a couple of possibilities occurred to > > me: reformat_type(text) canonical_type_name(text) > > I was just thinking about hitting the thesaurus entry for “canonical”, > but I quite like reformat_type. It’s super clear and draws the > parallel to format_type() more clearly. Will likely steal the name for > pgTAP if we don’t add it to core. :-) > > > I'm still unconvinced about that, though. > > I guess the questions are: > > * What are the other use cases for it? Can't think of any right now other than this are-type-names-the-same check. Perhaps also for pretty-printing user-provided type names where we don't take the type info from e.g. pg_attribute. > * How obvious is it how to do it? 
> > For the latter, it could easily be an example in the docs. Can be mentioned right under format_type(). -- Erik
Re: Synchronizing slots from primary to standby
On Mon, Feb 19, 2024 at 9:59 PM shveta malik wrote: > > On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila wrote: > > > > Few comments on 0001 > > Thanks for the feedback. > > > > > 1. I think it is better to error out when the valid GUC or option is > > not set in ensure_valid_slotsync_params() and > > ensure_valid_remote_info() instead of waiting. And we shouldn't start > > the worker in the first place if not all required GUCs are set. This > > will additionally simplify the code a bit. > > Sure, removed 'ensure' functions. Moved the validation checks to the > postmaster before starting the slot sync worker. > > > 2. > > +typedef struct SlotSyncWorkerCtxStruct > > { > > - /* prevents concurrent slot syncs to avoid slot overwrites */ > > + pid_t pid; > > + bool stopSignaled; > > bool syncing; > > + time_t last_start_time; > > slock_t mutex; > > -} SlotSyncCtxStruct; > > +} SlotSyncWorkerCtxStruct; > > > > I think we don't need to change the name of this struct as this can be > > used both by the worker and the backend. We can probably add the > > comment to indicate that all the fields except 'syncing' are used by > > slotsync worker. > > Modified. > > > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't > > be changed to SlotSyncWorkerShmemInit(). > > Modified. > > > 4. > > +ReplSlotSyncWorkerMain(int argc, char *argv[]) > > { > > ... > > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); > > ... > > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); > > ... > > } > > > > Do we need two separate callbacks? Can't we have just one (say > > slotsync_failure_callback) that cleans additional things in case of > > slotsync worker? And, if we need both those callbacks then please add > > some comments for both and why one is before_shmem_exit() and the > > other is on_shmem_exit(). > > I think we can merge these now. 
Earlier 'on_shmem_exit' was needed to > avoid race-condition between startup and slot sync worker process to > drop 'i' slots on promotion. Now we do not have any such scenario. > But I need some time to analyze it well. Will do it in the next > version. > > > In addition to the above, I have made a few changes in the comments > > and code (cosmetic code changes). Please include those in the next > > version if you find them okay. You need to rename .txt file to remove > > .txt and apply atop v90-0001*. > > Sure, included these. > > Please find the patch001 attached. I've reviewed the v91 patch. Here are some random comments: --- /* * Checks the remote server info. * - * We ensure that the 'primary_slot_name' exists on the remote server and the - * remote server is not a standby node. + * Check whether we are a cascading standby. For non-cascading standbys, it + * also ensures that the 'primary_slot_name' exists on the remote server. */ IIUC, what validate_remote_info() does isn't changed by this patch, so the previous comment seems clearer to me. --- if (remote_in_recovery) + { + /* +* If we are a cascading standby, no need to check further for +* 'primary_slot_name'.
+*/ ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot synchronize replication slots from a standby server")); + } + else + { + boolprimary_slot_valid; - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); - Assert(!isnull); + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); + Assert(!isnull); - if (!primary_slot_valid) - ereport(ERROR, - errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("bad configuration for slot synchronization"), - /* translator: second %s is a GUC variable name */ - errdetail("The replication slot \"%s\" specified by \"%s\" does not exist on the primary server.", - PrimarySlotName, "primary_slot_name")); + if (!primary_slot_valid) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + /* translator: second %s is a GUC variable name */ + errdetail("The replication slot \"%s\" specified by \"%s\" does not exist on the primary server.", + PrimarySlotName, "primary_slot_name")); + } I think it's a refactoring rather than changes required by the slotsync worker. We can do that in a separate patch but why do we need this change in the first place? --- +ValidateSlotSyncParams(ERROR); + /* Load the libpq-specific functions */ load_file("libpqwalreceiver", false); -ValidateSlotSyncParams(); +(void) CheckDbnameInConninfo(); Is there any reason why we move where to check the parameters? Some comments not related to the patch but to the existing code: --- It might have alre
Re: Memory consumed by paths during partitionwise join planning
On 19/2/2024 19:25, Ashutosh Bapat wrote: On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov wrote: Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths and switching between them in real time. It requires both paths to be in the path list at the moment the extension gets involved. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension. There is no guarantee that every path presented to add_path will be preserved. Suboptimal paths are freed as and when add_path discovers that they are suboptimal. So I don't think an extension can rely on the existence of a path. But having a refcount makes it easy to preserve the required paths by referencing them. I don't insist; I'm just describing my use case. It would be ideal if you could provide some external routines for extensions that allow keeping a path in the pathlist even when it has a terrible cost estimate. About partitioning: as I discovered while looking into planning issues connected to partitions, the painful problem is the rule by which we try the full range of possible paths for each partition. With indexes, this quickly increases the optimization work. IMO, a 'symmetrical' approach could help here: restrict the scope of possible paths for upcoming partitions based on which paths were filtered out for previously planned partitions. filter or free? Filter. I meant that Postgres tries to apply IndexScan, BitmapScan, IndexOnlyScan, and other strategies across the partition indexes, and the optimizer spends a lot of time doing that. So why not introduce a symmetrical strategy and exclude some scan types from the search based on the path-building experience with previous partitions of the same table: if you have dozens of partitions, is it beneficial for the system to find a slightly more optimal IndexScan on one partition while having SeqScans on the 999 others?
IIUC, you are suggesting that instead of planning each partition/partitionwise join, we only create paths with the strategies which were found to be optimal with previous partitions. That's a good heuristic, but it won't work if partition properties - statistics, indexes, etc. - differ between groups of partitions. Sure, but the "symmetry" strategy assumes that at the scale of a thousand partitions, especially with a parallel append involved, finding a slightly suboptimal path for a small subset of partitions doesn't cause noticeable performance degradation. Does that make sense? As I see it, when people use 10-100 partitions for a table, they usually strive to keep indexes symmetrical across all partitions. -- regards, Andrei Lepikhov Postgres Professional
Support boolcol IS [NOT] UNKNOWN in partition pruning
While working on 4c2369ac5, I noticed there's close to as much code to disallow BooleanTests in the form of "IS UNKNOWN" and "IS NOT UNKNOWN" in partition pruning as it would take to allow pruning to work for these. The attached makes it work. David From b9f3ff909652c96f1f0dced9e1165ffa8c93c7f1 Mon Sep 17 00:00:00 2001 From: David Rowley Date: Tue, 20 Feb 2024 14:56:49 +1300 Subject: [PATCH v1] Support partition pruning on boolcol IS [NOT] UNKNOWN While working on 4c2369ac5, I noticed we went out of our way not to support clauses on boolean partitioned tables in the form of "IS UNKNOWN" and "IS NOT UNKNOWN". It's almost as much code to disallow this as it is to allow it, so let's allow it. --- src/backend/partitioning/partprune.c | 68 +++ src/test/regress/expected/partition_prune.out | 56 --- src/test/regress/sql/partition_prune.sql | 7 ++ 3 files changed, 91 insertions(+), 40 deletions(-) diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c index ccb9a579b3..5382c24fb9 100644 --- a/src/backend/partitioning/partprune.c +++ b/src/backend/partitioning/partprune.c @@ -200,7 +200,7 @@ static PartClauseMatchStatus match_boolean_partition_clause(Oid partopfamily, Expr *clause, Expr *partkey, Expr **outconst, - bool *noteq); + bool *notclause); static void partkey_datum_from_expr(PartitionPruneContext *context, Expr *expr, int stateidx, Datum *value, bool *isnull); @@ -1798,13 +1798,14 @@ match_clause_to_partition_key(GeneratePruningStepsContext *context, Oid partopfamily = part_scheme->partopfamily[partkeyidx], partcoll = part_scheme->partcollation[partkeyidx]; Expr *expr; - boolnoteq; + boolnotclause; /* * Recognize specially shaped clauses that match a Boolean partition key. 
*/ boolmatchstatus = match_boolean_partition_clause(partopfamily, clause, - partkey, &expr, ¬eq); + partkey, &expr, + ¬clause); if (boolmatchstatus == PARTCLAUSE_MATCH_CLAUSE) { @@ -1818,7 +1819,7 @@ match_clause_to_partition_key(GeneratePruningStepsContext *context, * punt it off to gen_partprune_steps_internal() to generate pruning * steps. */ - if (noteq) + if (notclause) { List *new_clauses; List *or_clause; @@ -1836,9 +1837,8 @@ match_clause_to_partition_key(GeneratePruningStepsContext *context, else { /* -* We only expect match_boolean_partition_clause to match for -* IS_NOT_TRUE and IS_NOT_FALSE. IS_NOT_UNKNOWN is not -* supported. +* We only expect match_boolean_partition_clause to return +* PARTCLAUSE_MATCH_CLAUSE for IS_NOT_TRUE and IS_NOT_FALSE. */ Assert(false); } @@ -1876,6 +1876,15 @@ match_clause_to_partition_key(GeneratePruningStepsContext *context, return PARTCLAUSE_MATCH_CLAUSE; } + else if (boolmatchstatus == PARTCLAUSE_MATCH_NULLNESS) + { + /* +* Handle IS UNKNOWN and IS NOT UNKNOWN. These just logically +* translate to IS NULL and IS NOT NULL. +*/ + *clause_is_not_null = notclause; + return PARTCLAUSE_MATCH_NULLNESS; + } else if (IsA(clause, OpExpr) && list_length(((OpExpr *) clause)->args) == 2) { @@ -3650,21 +3659,21 @@ perform_pruning_combine_step(PartitionPruneConte
Re: Thoughts about NUM_BUFFER_PARTITIONS
Hi Heikki Linnakangas, Searching the git log, I found this commit: https://github.com/postgres/postgres/commit/3acc10c997f916f6a741d0b4876126b7b08e3892 , but I don't see an email discussing it. As the commit log tells us, we don't know exactly how large a value is optimal, and I believe it's more flexible to make it a parameter. Thank you very much, tomas.vondra, for explaining the relationship; I see that MAX_SIMUL_LWLOCKS was just doubled in this commit - is there a more appropriate ratio between them? ``` commit 3acc10c997f916f6a741d0b4876126b7b08e3892 Author: Robert Haas Date: Thu Oct 2 13:58:50 2014 -0400 Increase the number of buffer mapping partitions to 128. Testing by Amit Kapila, Andres Freund, and myself, with and without other patches that also aim to improve scalability, seems to indicate that this change is a significant win over the current value and over smaller values such as 64. It's not clear how high we can push this value before it starts to have negative side-effects elsewhere, but going this far looks OK. ``` wenhui qiu wrote on Tue, 20 Feb 2024 at 09:36: > Hi Japlin Li >Thank you for such important information ! Got it > > Japin Li wrote on Mon, 19 Feb 2024 at 10:26: > >> >> On Mon, 19 Feb 2024 at 00:56, Tomas Vondra >> wrote: >> > On 2/18/24 03:30, Li Japin wrote: >> >> >> >> I find it seems need to change MAX_SIMUL_LWLOCKS if we enlarge the >> NUM_BUFFER_PARTITIONS, >> >> I didn’t find any comments to describe the relation between >> MAX_SIMUL_LWLOCKS and >> >> NUM_BUFFER_PARTITIONS, am I missing someghing? >> > >> > IMHO the relationship is pretty simple - MAX_SIMUL_LWLOCKS needs to be >> > higher than NUM_BUFFER_PARTITIONS, so that the backend can acquire all >> > the partition locks if needed. >> > >> >> Thanks for the explanation! Got it. >> >> > There's other places that acquire a bunch of locks, and all of them need >> > to be careful not to exceed MAX_SIMUL_LWLOCKS. For example gist has >> > GIST_MAX_SPLIT_PAGES. >> > >> >
Re: Thoughts about NUM_BUFFER_PARTITIONS
Hi Japlin Li, Thank you for such important information! Got it. Japin Li wrote on Mon, 19 Feb 2024 at 10:26: > > On Mon, 19 Feb 2024 at 00:56, Tomas Vondra > wrote: > > On 2/18/24 03:30, Li Japin wrote: > >> > >> I find it seems need to change MAX_SIMUL_LWLOCKS if we enlarge the > NUM_BUFFER_PARTITIONS, > >> I didn’t find any comments to describe the relation between > MAX_SIMUL_LWLOCKS and > >> NUM_BUFFER_PARTITIONS, am I missing someghing? > > > > IMHO the relationship is pretty simple - MAX_SIMUL_LWLOCKS needs to be > > higher than NUM_BUFFER_PARTITIONS, so that the backend can acquire all > > the partition locks if needed. > > > > Thanks for the explanation! Got it. > > > There's other places that acquire a bunch of locks, and all of them need > > to be careful not to exceed MAX_SIMUL_LWLOCKS. For example gist has > > GIST_MAX_SPLIT_PAGES. > > > > > > regards >
Re: speed up a logical replica setup
On Mon, Feb 19, 2024, at 7:22 AM, Shlok Kyal wrote: > I have reviewed the v21 patch. And found an issue. > > Initially I started the standby server with a new postgresql.conf file > (not the default postgresql.conf that is present in the instance). > pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf" > > And I have made 'max_replication_slots = 1' in new postgresql.conf and > made 'max_replication_slots = 0' in the default postgresql.conf file. > Now when we run pg_createsubscriber on standby we get error: > pg_createsubscriber: error: could not set replication progress for the > subscription "pg_createsubscriber_5_242843": ERROR: cannot query or > manipulate replication origin when max_replication_slots = 0 That's by design. See [1]. The max_replication_slots parameter is used as the maximum number of subscriptions on the server. > NOTICE: dropped replication slot "pg_createsubscriber_5_242843" on publisher > pg_createsubscriber: error: could not drop publication > "pg_createsubscriber_5" on database "postgres": ERROR: publication > "pg_createsubscriber_5" does not exist > pg_createsubscriber: error: could not drop replication slot > "pg_createsubscriber_5_242843" on database "postgres": ERROR: > replication slot "pg_createsubscriber_5_242843" does not exist That's a bug and should be fixed. [1] https://www.postgresql.org/docs/current/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SUBSCRIBER -- Euler Taveira EDB https://www.enterprisedb.com/
Re: serial not accepted as datatype in ALTER TABLE ... ALTER COLUMN
Ashutosh Bapat writes: > On Sun, Feb 18, 2024 at 1:59 PM Andy Fan wrote: >> >> >> I tried your idea with the attatchment, it is still in a drafted state >> but it can be used as a prove-of-concept and for better following >> communicating. Just one point needs to metion is serial implies >> "default value" + "not null" constaint. So when we modify a column into >> serial, we need to modify the 'NULL value' and only to the default value >> at the RewriteTable stage. >> > > I am surprised that this requires changes in ReWrite. I thought adding > NOT NULL constraint and default value commands would be done by > transformColumnDefinition(). But I haven't looked at the patch close > enough. Hmm, I think this depends on how to handle the NULL values before the RewriteTable stage. Consider an example like this: \pset null (null) create table t(a int); insert into t select 1; insert into t select; postgres=# select * from t; a 1 (null) (2 rows) Since the serial type implies "not null" + "default value", shall we raise an error or fill the value with the "default" value? The patch chose the latter way, which needs changes at the RewriteTable stage, but now I think raising an error directly is an option as well. -- Best Regards Andy Fan
Re: Have pg_basebackup write "dbname" in "primary_conninfo"?
On Tue, 20 Feb 2024 at 00:34, Ian Lawrence Barwick wrote: > With the addition of "pg_sync_replication_slots()", there is now a use-case > for > including "dbname" in "primary_conninfo" and the docs have changed from > stating [1]: > > Do not specify a database name in the primary_conninfo string. > > to [2]: > > For replication slot synchronization (see Section 48.2.3), it is also > necessary to specify a valid dbname in the primary_conninfo string. This > will > only be used for slot synchronization. It is ignored for streaming. > > [1] > https://www.postgresql.org/docs/16/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY > [2] > https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY Sounds like that documentation should be updated in the same way as was done for pg_basebackup/pg_receivewal in commit cca97ce6a665. When considering middleware/proxies having dbname in there can be useful even for older PG versions. > I can't see any reason for continuing to omit "dbname", so suggest it should > only continue to be omitted for 16 and earlier; see attached patch. Yeah, that seems like a good change. Though, I'm wondering if the version check is actually necessary.
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
On Mon, Feb 19, 2024 at 09:49:24AM +, Bertrand Drouvot wrote: > On Mon, Feb 19, 2024 at 01:45:16PM +0530, Bharath Rupireddy wrote: >> Prefix 'initial_' makes the variable names a bit longer, I think we >> can just use effective_xmin, catalog_effective_xmin and restart_lsn, >> the code updating then only when if (!terminated) tells one that they >> aren't updated every time. > > I'm not sure about that. I prefer to see meaningfull names instead of having > to read the code where they are used. Prefixing these with "initial_" is fine, IMO. That shows the intention that these come from the slot's data before doing the termination. So I'm OK with what's been proposed in v3. >> This needs a bit more info as to why and how effective_xmin, >> catalog_effective_xmin and restart_lsn can move ahead after signaling >> a backend and before the signalled backend reports back. > > I'm not sure of the added value of such extra details in this comment and if > the comment would be easy to maintain. I've the feeling that knowing it's > possible > is enough here. Happy to hear what others think about it too. Documenting that the risk exists rather than all the possible reasons why this could happen is surely more maintainable. In short, I'm OK with what the patch does, just telling that it is possible. >> +Assert(!(conflict_prev != RS_INVAL_NONE && terminated && >> + conflict_prev != conflict)); >> >> It took a while for me to understand the above condition, can we >> simplify it like below using De Morgan's laws for better readability? >> >> Assert((conflict_prev == RS_INVAL_NONE) || !terminated || >> (conflict_prev == conflict)); > > I don't have a strong opinon on this, looks like a matter of taste. Both are the same to me, so I have no extra opinion to offer. ;) -- Michael signature.asc Description: PGP signature
Have pg_basebackup write "dbname" in "primary_conninfo"?
Hi With the addition of "pg_sync_replication_slots()", there is now a use-case for including "dbname" in "primary_conninfo" and the docs have changed from stating [1]: Do not specify a database name in the primary_conninfo string. to [2]: For replication slot synchronization (see Section 48.2.3), it is also necessary to specify a valid dbname in the primary_conninfo string. This will only be used for slot synchronization. It is ignored for streaming. [1] https://www.postgresql.org/docs/16/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY [2] https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY However, when setting up a standby (with the intent of creating a logical standby) with pg_basebackup, providing the -R/--write-recovery-conf option results in a "primary_conninfo" string being generated without a "dbname" parameter, which requires a manual change should one intend to use "pg_sync_replication_slots()". I can't see any reason for continuing to omit "dbname", so I suggest it should only continue to be omitted for 16 and earlier; see attached patch. Note that this does mean that if the conninfo string provided to pg_basebackup does not contain "dbname", the generated "primary_conninfo" GUC will default to "dbname=replication". It would be easy enough to suppress this, but AFAICS there's no way to tell if it was explicitly supplied by the user, in which case it would be surprising if it were omitted from the final "primary_conninfo" string. The only other place where GenerateRecoveryConfig() is called is pg_rewind; I can't see any adverse effects from the proposed change. But it's perfectly possible there's something blindingly obvious I'm overlooking.
Regards Ian Barwick diff --git a/src/fe_utils/recovery_gen.c b/src/fe_utils/recovery_gen.c index 2585f11939..de463f3956 100644 --- a/src/fe_utils/recovery_gen.c +++ b/src/fe_utils/recovery_gen.c @@ -49,12 +49,16 @@ GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot) { /* Omit empty settings and those libpqwalreceiver overrides. */ if (strcmp(opt->keyword, "replication") == 0 || - strcmp(opt->keyword, "dbname") == 0 || strcmp(opt->keyword, "fallback_application_name") == 0 || (opt->val == NULL) || (opt->val != NULL && opt->val[0] == '\0')) continue; + /* Omit "dbname" in PostgreSQL 16 and earlier. */ + if (strcmp(opt->keyword, "dbname") == 0 && + PQserverVersion(pgconn) < MINIMUM_VERSION_FOR_DBNAME_PARAM) + continue; + /* Separate key-value pairs with spaces */ if (conninfo_buf.len != 0) appendPQExpBufferChar(&conninfo_buf, ' '); diff --git a/src/include/fe_utils/recovery_gen.h b/src/include/fe_utils/recovery_gen.h index ca2c4800d0..740d90c9d8 100644 --- a/src/include/fe_utils/recovery_gen.h +++ b/src/include/fe_utils/recovery_gen.h @@ -20,6 +20,12 @@ */ #define MINIMUM_VERSION_FOR_RECOVERY_GUC 12 +/* + * from version 17, there is a use-case for adding the dbname parameter + * to the generated "primary_conninfo" value + */ +#define MINIMUM_VERSION_FOR_DBNAME_PARAM 17 + extern PQExpBuffer GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot); extern void WriteRecoveryConfig(PGconn *pgconn, char *target_dir,
Re: speed up a logical replica setup
On Mon, Feb 19, 2024, at 6:47 AM, Peter Eisentraut wrote: > Some review of the v21 patch: Thanks for checking. > - commit message > > Mention pg_createsubscriber in the commit message title. That's the > most important thing that someone doing git log searches in the future > will be looking for. Right. Done. > - doc/src/sgml/ref/allfiles.sgml > > Move the new entry to alphabetical order. Done. > > - doc/src/sgml/ref/pg_createsubscriber.sgml > > + > + The pg_createsubscriber should be run at > the target > + server. The source server (known as publisher server) should accept > logical > + replication connections from the target server (known as subscriber > server). > + The target server should accept local logical replication connection. > + > > "should" -> "must" ? Done. > + > + Options > > Sort options alphabetically. Done. > It would be good to indicate somewhere which options are mandatory. I'll add this information in the option description. AFAICT the synopsis kind of indicates it. > + > + Examples > > I suggest including a pg_basebackup call into this example, so it's > easier for readers to get the context of how this is supposed to be > used. You can add that pg_basebackup in this example is just an example > and that other base backups can also be used. We can certainly add it, but isn't creating a standby out of scope here? I will make sure to include references to pg_basebackup and the "Setting up a Standby Server" section. > - doc/src/sgml/reference.sgml > > Move the new entry to alphabetical order. Done. > - src/bin/pg_basebackup/Makefile > > Move the new sections to alphabetical order. Done. > - src/bin/pg_basebackup/meson.build > > Move the new sections to alphabetical order. Done. > > - src/bin/pg_basebackup/pg_createsubscriber.c > > +typedef struct CreateSubscriberOptions > +typedef struct LogicalRepInfo > > I think these kinds of local-use struct don't need to be typedef'ed. > (Then you also don't need to update typdefs.list.) Done.
> +static void > +usage(void) > > Sort the options alphabetically. Are you referring to s/options/functions/? > +static char * > +get_exec_path(const char *argv0, const char *progname) > > Can this not use find_my_exec() and find_other_exec()? It is indeed using it. I created this function because it needs to run the same code path twice (pg_ctl and pg_resetwal). > +int > +main(int argc, char **argv) > > Sort the options alphabetically (long_options struct, getopt_long() > argument, switch cases). Done. > - .../t/040_pg_createsubscriber.pl > - .../t/041_pg_createsubscriber_standby.pl > > These two files could be combined into one. Done. > +# Force it to initialize a new cluster instead of copying a > +# previously initdb'd cluster. > > Explain why? Ok. It needs a new cluster because it will have a different system identifier so we can make sure the target cluster is a copy of the source server. > +$node_s->append_conf( > + 'postgresql.conf', qq[ > +log_min_messages = debug2 > > Is this setting necessary for the test? No. It is here as a debugging aid. Better to include it in a separate patch. There are a few messages that I don't intend to include in the final patch. All of these modifications will be included in the next patch. I'm finishing to integrate patches proposed by Hayato [1] and some additional fixes and refactors. [1] https://www.postgresql.org/message-id/TYCPR01MB12077A8421685E5515DE408EEF5512%40TYCPR01MB12077.jpnprd01.prod.outlook.com -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Injection points: some tools to wait and wake
On Mon, Feb 19, 2024 at 02:28:04PM +, Bertrand Drouvot wrote: > +CREATE FUNCTION injection_points_wake() > > what about injection_points_wakeup() instead? Sure. > +/* Shared state information for injection points. */ > +typedef struct InjectionPointSharedState > +{ > + /* protects accesses to wait_counts */ > + slock_t lock; > + > + /* Counter advancing when injection_points_wake() is called */ > + int wait_counts; > > If slock_t protects "only" one counter, then what about using pg_atomic_uint64 > or pg_atomic_uint32 instead? And btw do we need wait_counts at all? (see > comment > number 4) We could, indeed, even if we use more than one counter. Still, I would be tempted to keep it in case more data is added to this structure as that would make for less stuff to do while being normally non-critical. This sentence may sound pedantic, though.. > + * SQL function for waking a condition variable > > waking up instead? Okay. > I'm wondering if this loop and wait_counts are needed, why not doing something > like (and completely get rid of wait_counts)? > > " > ConditionVariablePrepareToSleep(&inj_state->wait_point); > ConditionVariableSleep(&inj_state->wait_point, injection_wait_event); > ConditionVariableCancelSleep(); > " > > It's true that the comment above ConditionVariableSleep() mentions that: Perhaps not, but it encourages good practices around the use of condition variables, and we need to track all that in shared memory anyway. Ashutosh has argued in favor of the approach taken by the patch in the original thread when I've sent a version doing exactly what you are saying now to not track a state in shmem. -- Michael signature.asc Description: PGP signature
Re: Injection points: some tools to wait and wake
On Mon, Feb 19, 2024 at 11:54:20AM +0300, Andrey M. Borodin wrote: > 1. injection_points_wake() will wake all of waiters. But it's not > suitable for complex tests. I think there must be a way to wake only > specific waiter by injection point name. I don't disagree with that, but I don't have a strong argument for implementing that until there is an explicit need for it in core. It is also possible to plug in your own module, outside of core, if you are looking for something more specific. The backend APIs allow that. > 2. Alexander Korotkov's stopevents could be used in isolation > tests. This kind of tests is perfect for describing complex race > conditions. (as a side note, I'd be happy if we could have > primary\standby in isolation tests too) This requires plugging is more into src/test/isolation/, with multiple connection strings. This has been suggested in the past. > 3. Can we have some Perl function for this? > +# Wait until the checkpointer is in the middle of the restart point > +# processing, relying on the custom wait event generated in the > +# wait callback used in the injection point previously attached. > +ok( $node_standby->poll_query_until( > + 'postgres', > + qq[SELECT count(*) FROM pg_stat_activity > + WHERE backend_type = 'checkpointer' AND wait_event = > 'injection_wait' ;], > + '1'), > + 'checkpointer is waiting in restart point' > +) or die "Timed out while waiting for checkpointer to run restart point"; > > Perhaps something like > $node->do_a_query_and_wait_for_injection_point_observed(sql,injection_point_name); I don't see why not. But I'm not sure how much I need to plug in into the main modules yet. > 4. Maybe I missed it, but I'd like to see a guideline on how to name > injection points. I don't think we have any of that, or even that we need one. In short, I'm OK to be more consistent with the choice of ginbtree.c and give priority to it and make it more official in the docs. > 5. 
In many cases we need to have injection point under critical > section. I propose to have a "prepared injection point". See [0] for > example in v2-0003-Test-multixact-CV-sleep.patch > + INJECTION_POINT_PREPARE("GetNewMultiXactId-done"); > + > START_CRIT_SECTION(); > > + INJECTION_POINT_RUN_PREPARED(); I don't see how that's different from a wait/wake logic? The only thing you've changed is to stop a wait when a point is detached and you want to make the stop conditional. Plugging in a condition variable is more flexible than a hardcoded sleep in terms of wait, while being more responsive. > 7. This is extremely distant, but some DBMSs allow to enable > injection points by placing files on the filesystem. That would > allow to test something during recovery when no SQL interface is > present. Yeah, I could see why you'd want to do something like that. If a use case pops up, that can surely be discussed. -- Michael signature.asc Description: PGP signature
Re: 035_standby_logical_decoding unbounded hang
On Fri, Feb 16, 2024 at 06:37:38AM +, Bertrand Drouvot wrote: > On Thu, Feb 15, 2024 at 12:48:16PM -0800, Noah Misch wrote: > > On Wed, Feb 14, 2024 at 03:31:16PM +, Bertrand Drouvot wrote: > > > What about creating a sub, say wait_for_restart_lsn_calculation() in > > > Cluster.pm > > > and then make use of it in create_logical_slot_on_standby() and above? > > > (something > > > like wait_for_restart_lsn_calculation-v1.patch attached). > > > > Waiting for restart_lsn is just a prerequisite for calling > > pg_log_standby_snapshot(), so I wouldn't separate those two. > > Yeah, makes sense. > > > If we're > > extracting a sub, I would move the pg_log_standby_snapshot() call into the > > sub > > and make the API like one of these: > > > > $standby->wait_for_subscription_starting_point($primary, $slot_name); > > $primary->log_standby_snapshot($standby, $slot_name); > > > > Would you like to finish the patch in such a way? > > Sure, please find attached standby-slot-test-2-race-v2.patch doing so. I used > log_standby_snapshot() as it seems more generic to me. Thanks. Pushed at commit 0e16281.
Re: Avoid switching between system-user and system-username in the doc
On Mon, Feb 19, 2024 at 06:00:11PM +0100, Jelte Fennema-Nio wrote:
> On Mon, 19 Feb 2024 at 09:52, Bertrand Drouvot wrote:
>> Please find attached a tiny patch to clean that up.
>
> LGTM

Looks like a mistake from me in efb6f4a4f9b6, will fix and backpatch for consistency.
--
Michael
Re: Streaming read-ready sequential scan code
On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman wrote: > > There is an outstanding question about where to allocate the > PgStreamingRead object for sequential scans I've written three alternative implementations of the actual streaming read user for sequential scan which handle the question of where to allocate the streaming read object and how to handle changing scan direction in different ways. Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation - Allocates the streaming read object in initscan(). Since we do not know the scan direction at this time, if the scan ends up not being a forwards scan, the streaming read object must later be freed -- so this will sometimes allocate a streaming read object it never uses. - Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection - Must maintain a "fallback" codepath that does not use the streaming read API Option B) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_heapgettup_alloc_forward_only - Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context. It only allocates the streaming read object if it is a forwards scan. It frees the streaming read object if the scan direction is later changed. - Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection - Must maintain a "fallback" codepath that does not use the streaming read API Option C) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_all_dir_stream - Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context. 
- All scan directions support streaming. To do this, the scan direction has to be tracked and we must check if the direction has changed on every heapgettup[_pagemode]() invocation to avoid returning wrong results.
- No "fallback" codepath as all heap sequential scans will use the streaming read API

As you can see, each option has pros and cons. I'm interested in what others think about which we should choose.

- Melanie
Re: Possible to trigger autovacuum?
On Mon, Feb 19, 2024 at 03:15:29PM -0600, Chris Cleveland wrote:
> Is it possible to launch an autovacuum from within an extension?
>
> I'm developing an index access method. After the index gets built it needs
> some cleanup and optimization. I'd prefer to do this in the
> amvacuumcleanup() method so it can happen periodically and asynchronously.
>
> I could fire up a background worker to do the job, but it would be a lot
> simpler to call please_launch_autovacuum_right_now();

The autovacuum launcher can be stopped in its nap with signals, like a
SIGHUP. So you could rely on that to force a job to happen on a given
database based on the timing you're aiming for.
--
Michael
Re: Patch: Add parse_type Function
On Feb 19, 2024, at 15:47, Tom Lane wrote: >> 1. Add a to_regtypmod() for those who just want the typemod. > > Seems like there's a good case for doing that. I’ll work on that. > I'm less thrilled about that, mainly because I can't think of > a good name for it ("format_type_string" is certainly not that). > Is the use-case for this functionality really strong enough that > we need to provide it as a single function rather than something > assembled out of spare parts? Not essential for pgTAP, no, as we can just use the parts. It was the typmod bit that was missing. On Feb 19, 2024, at 17:13, Tom Lane wrote: > After some time ruminating, a couple of possibilities occurred to > me: > reformat_type(text) > canonical_type_name(text) I was just thinking about hitting the thesaurus entry for “canonical”, but I quite like reformat_type. It’s super clear and draws the parallel to format_type() more clearly. Will likely steal the name for pgTAP if we don’t add it to core. :-) > I'm still unconvinced about that, though. I guess the questions are: * What are the other use cases for it? * How obvious is it how to do it? For the latter, it could easily be an example in the docs. Best, David
Re: Add last_commit_lsn to pg_stat_database
On Mon, Feb 19, 2024 at 10:26:43AM +0100, Tomas Vondra wrote: > On 2/19/24 07:57, Michael Paquier wrote: > > On Sun, Feb 18, 2024 at 02:28:06AM +0100, Tomas Vondra wrote: >>> 1) Do we really need to modify RecordTransactionCommitPrepared and >>> XactLogCommitRecord to return the LSN of the commit record? Can't we >>> just use XactLastRecEnd? It's good enough for advancing >>> replorigin_session_origin_lsn, why wouldn't it be good enough for this? >> >> Or XactLastCommitEnd? > > But that's not set in RecordTransactionCommitPrepared (or twophase.c in > general), and the patch seems to need that. Hmm. Okay. > > The changes in AtEOXact_PgStat() are not really > > attractive for what's a static variable in all the backends. > > I'm sorry, I'm not sure what "changes not being attractive" means :-( It just means that I am not much a fan of the signature changes done for RecordTransactionCommit() and AtEOXact_PgStat_Database(), adding a dependency to a specify LSN. Your suggestion to switching to XactLastRecEnd should avoid that. > - stats_fetch_consistency=cache: always the same min/max value > > - stats_fetch_consistency=none: min != max > > Which would suggest you're right and this should be VOLATILE and not > STABLE. But then again - the existing pg_stat_get_db_* functions are all > marked as STABLE, so (a) is that correct and (b) why should this > function be marked different? This can look like an oversight of 5891c7a8ed8f to me. I've skimmed through the threads related to this commit and messages around [1] explain why this GUC exists and why we have both behaviors, but I'm not seeing a discussion about the volatibility of these functions. The current situation is not bad either, the default makes these functions stable, and I'd like to assume that users of this GUC know what they do. Perhaps Andres or Horiguchi-san can comment on that. https://www.postgresql.org/message-id/382908.1616343...@sss.pgh.pa.us -- Michael signature.asc Description: PGP signature
Re: Patch: Add parse_type Function
I wrote:
> I'm less thrilled about that, mainly because I can't think of
> a good name for it ("format_type_string" is certainly not that).

After some time ruminating, a couple of possibilities occurred to me:

    reformat_type(text)
    canonical_type_name(text)

> Is the use-case for this functionality really strong enough that
> we need to provide it as a single function rather than something
> assembled out of spare parts?

I'm still unconvinced about that, though.

			regards, tom lane
Re: Optimize planner memory consumption for huge arrays
Tomas Vondra writes: > On 2/19/24 16:45, Tom Lane wrote: >> Tomas Vondra writes: >>> For example, I don't think we expect selectivity functions to allocate >>> long-lived objects, right? So maybe we could run them in a dedicated >>> memory context, and reset it aggressively (after each call). >> That could eliminate a whole lot of potential leaks. I'm not sure >> though how much it moves the needle in terms of overall planner >> memory consumption. > I'm not sure about that either, maybe not much - for example it would > not help with the two other memory usage patches (which are related to > SpecialJoinInfo and RestrictInfo, outside selectivity functions). > It was an ad hoc thought, inspired by the issue at hand. Maybe it would > be possible to find similar "boundaries" in other parts of the planner. Here's a quick and probably-incomplete implementation of that idea. I've not tried to study its effects on memory consumption, just made sure it passes check-world. The main hazard here is that something invoked inside clause selectivity might try to cache a data structure for later use. However, there are already places that do that kind of thing, and they already explicitly switch into the planner_cxt, because otherwise they fail under GEQO. (If we do find places that need fixing for this, they were probably busted under GEQO already.) Perhaps it's worth updating the comments at those places, but I didn't bother in this first cut. 
regards, tom lane diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index c949dc1866..669acd7315 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -24,6 +24,7 @@ #include "statistics/statistics.h" #include "utils/fmgroids.h" #include "utils/lsyscache.h" +#include "utils/memutils.h" #include "utils/selfuncs.h" /* @@ -693,6 +694,7 @@ clause_selectivity_ext(PlannerInfo *root, Selectivity s1 = 0.5; /* default for any unhandled clause type */ RestrictInfo *rinfo = NULL; bool cacheable = false; + MemoryContext saved_cxt; if (clause == NULL) /* can this still happen? */ return s1; @@ -756,6 +758,21 @@ clause_selectivity_ext(PlannerInfo *root, clause = (Node *) rinfo->clause; } + /* + * Run all the remaining work in the short-lived planner_tmp_cxt, which + * we'll reset at the bottom. This allows selectivity-related code to not + * be too concerned about leaking memory. + */ + saved_cxt = MemoryContextSwitchTo(root->planner_tmp_cxt); + + /* + * This function can be called recursively, in which case we'd better not + * reset planner_tmp_cxt until we exit the topmost level. Use of + * planner_tmp_cxt_depth also makes it safe for other places to use and + * reset planner_tmp_cxt in the same fashion. 
+ */ + root->planner_tmp_cxt_depth++; + if (IsA(clause, Var)) { Var *var = (Var *) clause; @@ -967,6 +984,12 @@ clause_selectivity_ext(PlannerInfo *root, rinfo->outer_selec = s1; } + /* Exit and clean up the planner_tmp_cxt */ + MemoryContextSwitchTo(saved_cxt); + Assert(root->planner_tmp_cxt_depth > 0); + if (--root->planner_tmp_cxt_depth == 0) + MemoryContextReset(root->planner_tmp_cxt); + #ifdef SELECTIVITY_DEBUG elog(DEBUG4, "clause_selectivity: s1 %f", s1); #endif /* SELECTIVITY_DEBUG */ diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index be4e182869..44b9555dbf 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -643,6 +643,17 @@ subquery_planner(PlannerGlobal *glob, Query *parse, root->plan_params = NIL; root->outer_params = NULL; root->planner_cxt = CurrentMemoryContext; + /* a subquery can share the main query's planner_tmp_cxt */ + if (parent_root) + { + root->planner_tmp_cxt = parent_root->planner_tmp_cxt; + Assert(parent_root->planner_tmp_cxt_depth == 0); + } + else + root->planner_tmp_cxt = AllocSetContextCreate(root->planner_cxt, + "Planner temp context", + ALLOCSET_DEFAULT_SIZES); + root->planner_tmp_cxt_depth = 0; root->init_plans = NIL; root->cte_plan_ids = NIL; root->multiexpr_params = NIL; @@ -6514,6 +6525,10 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid) root->glob = glob; root->query_level = 1; root->planner_cxt = CurrentMemoryContext; + root->planner_tmp_cxt = AllocSetContextCreate(root->planner_cxt, + "Planner temp context", + ALLOCSET_DEFAULT_SIZES); + root->planner_tmp_cxt_depth = 0; root->wt_param_id = -1; root->join_domains = list_make1(makeNode(JoinDomain)); @@ -6552,7 +6567,10 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid) * trust the index contents but use seqscan-and-sort. */ if (lc == NULL)/* not in the list? 
*/ + { + MemoryContextDelete(root->planner_tmp_cxt); return true; /* use sort */ + } /* * Rather than doing all the pushups that would be needed to use @@ -658
Possible to trigger autovacuum?
Is it possible to launch an autovacuum from within an extension? I'm developing an index access method. After the index gets built it needs some cleanup and optimization. I'd prefer to do this in the amvacuumcleanup() method so it can happen periodically and asynchronously. I could fire up a background worker to do the job, but it would be a lot simpler to call please_launch_autovacuum_right_now(); -- Chris Cleveland 312-339-2677 mobile
Re: Patch: Add parse_type Function
"David E. Wheeler" writes: > The only way I can think of to avoid that is to: > 1. Add a to_regtypmod() for those who just want the typemod. Seems like there's a good case for doing that. > 2. Add a format_type_string() function that returns a string, the equivalent > of this function but in C: > CREATE FUNCTION format_type_string(text) RETURNS TEXT AS $$ >SELECT format_type(to_regtype($1), to_regtypmod($1)) > $$; I'm less thrilled about that, mainly because I can't think of a good name for it ("format_type_string" is certainly not that). Is the use-case for this functionality really strong enough that we need to provide it as a single function rather than something assembled out of spare parts? regards, tom lane
Re: Why is pq_begintypsend so slow?
Hi, On 2024-02-19 10:02:52 +0900, Sutou Kouhei wrote: > In <20240218200906.zvihkrs46yl2j...@awork3.anarazel.de> > "Re: Why is pq_begintypsend so slow?" on Sun, 18 Feb 2024 12:09:06 -0800, > Andres Freund wrote: > > >> [1] > >> https://www.postgresql.org/message-id/flat/20240215.153421.96888103784986803.kou%40clear-code.com#34df359e6d255795d16814ce138cc995 > > >> Do we need one FunctionCallInfoBaseData for each attribute? > >> How about sharing one FunctionCallInfoBaseData by all > >> attributes like [1]? > > > > That makes no sense to me. You're throwing away most of the possible gains > > by > > having to update the FunctionCallInfo fields on every call. You're saving > > neglegible amounts of memory at a substantial runtime cost. > > The number of updated fields of your approach and [1] are > same: > > Your approach: 6 (I think that "fcinfo->isnull = false" is > needed though.) > > + fcinfo->args[0].value = CStringGetDatum(string); > + fcinfo->args[0].isnull = false; > + fcinfo->args[1].value = > ObjectIdGetDatum(attr->typioparam); > + fcinfo->args[1].isnull = false; > + fcinfo->args[2].value = > Int32GetDatum(attr->typmod); > + fcinfo->args[2].isnull = false; > > [1]: 6 (including "fcinfo->isnull = false") > > + fcinfo->flinfo = flinfo; > + fcinfo->context = escontext; > + fcinfo->isnull = false; > + fcinfo->args[0].value = CStringGetDatum(str); > + fcinfo->args[1].value = ObjectIdGetDatum(typioparam); > + fcinfo->args[2].value = Int32GetDatum(typmod); If you want to do so you can elide the isnull assignments in my approach just as well as yours. Assigning not just the value but also flinfo and context is overhead. But you can't elide assigning flinfo and context, which is why reusing one FunctionCallInfo isn't going to win I don't think you necessarily need to assign fcinfo->isnull on every call, these functions aren't allowed to return NULL IIRC. And if they do we'd error out, so it could only happen once. 
> I don't have a strong opinion how to implement this > optimization as I said above. It seems that you like your > approach. So I withdraw [1]. Could you complete this > optimization? Can we proceed making COPY format extensible > without this optimization? It seems that it'll take a little > time to complete this optimization because your patch is > still WIP. And it seems that you can work on it after making > COPY format extensible. I don't think optimizing this aspect needs to block making copy extensible. I don't know how much time/energy I'll have to focus on this in the near term. I really just reposted this because the earlier patches were relevant for the discussion in another thread. If you want to pick the COPY part up, feel free. Greetings, Andres Freund
Re: PGC_SIGHUP shared_buffers?
Hi, On 2024-02-19 13:54:01 -0500, Joe Conway wrote: > On 2/19/24 13:13, Andres Freund wrote: > > On 2024-02-19 09:19:16 -0500, Joe Conway wrote: > > > Couldn't we scale the rounding, e.g. allow small allocations as we do now, > > > but above some number always round? E.g. maybe >= 2GB round to the nearest > > > 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, > > > etc? > > > > That'd make the translation considerably more expensive. Which is important, > > given how common an operation this is. > > > Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my > intent was that the rounding be done/enforced when setting the GUC value > which surely cannot be that often. It'd be used for something like WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; If CHUNK_SIZE isn't a compile time constant this gets a good bit more expensive. A lot more, if implemented naively (i.e as actual modulo/division operations, instead of translating to shifts and masks). Greetings, Andres Freund
Re: Proposal: Adjacent B-Tree index
On Mon, 19 Feb 2024 at 18:48, Dilshod Urazov wrote:
>
> - Motivation
>
> A regular B-tree index provides efficient mapping of key values to tuples
> within a table. However, if you have two tables connected in some way, a
> regular B-tree index may not be efficient enough. In this case, you would
> need to create an index for each table. The purpose will become clearer if we
> consider a simple example which is the main use-case as I see it.

I'm not sure why two indexes are not sufficient here. PostgreSQL can already do merge joins, which would have the same effect of hitting the same location in the index at the same time between all tables, without the additional overhead of having to scan two tables' worth of indexes in VACUUM.

> During the vacuum of A an index tuple pointing to a dead tuple in A should be
> cleaned as well as all index tuples for the same key.

This is definitely not correct. If I have this schema

table test (id int primary key, b text unique)
table test_ref(test_id int references test(id))

and if an index would contain entries for both test and test_ref, it can't just remove all test_ref entries because an index entry with the same id was removed: the entry could've been removed because (e.g.) test's b column was updated, thus inserting a new index entry for the new HOT-chain's TID.

> would suffice for this new semantics.

With the provided explanation I don't think this is a great idea.

Kind regards,

Matthias van de Meent.
Re: PGC_SIGHUP shared_buffers?
On 2/19/24 13:13, Andres Freund wrote: On 2024-02-19 09:19:16 -0500, Joe Conway wrote: Couldn't we scale the rounding, e.g. allow small allocations as we do now, but above some number always round? E.g. maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, etc? That'd make the translation considerably more expensive. Which is important, given how common an operation this is. Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my intent was that the rounding be done/enforced when setting the GUC value which surely cannot be that often. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGC_SIGHUP shared_buffers?
Hi, On 2024-02-19 09:19:16 -0500, Joe Conway wrote: > On 2/18/24 15:35, Andres Freund wrote: > > On 2024-02-18 17:06:09 +0530, Robert Haas wrote: > > > How many people set shared_buffers to something that's not a whole > > > number of GB these days? > > > > I'd say the vast majority of postgres instances in production run with less > > than 1GB of s_b. Just because numbers wise the majority of instances are > > running on small VMs and/or many PG instances are running on one larger > > machine. There are a lot of instances where the total available memory is > > less than 2GB. > > > > > I mean I bet it happens, but in practice if you rounded to the nearest GB, > > > or even the nearest 2GB, I bet almost nobody would really care. I think > > > it's > > > fine to be opinionated here and hold the line at a relatively large > > > granule, > > > even though in theory people could want something else. > > > > I don't believe that at all unfortunately. > > Couldn't we scale the rounding, e.g. allow small allocations as we do now, > but above some number always round? E.g. maybe >= 2GB round to the nearest > 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, > etc? That'd make the translation considerably more expensive. Which is important, given how common an operation this is. Greetings, Andres Freund
Proposal: Adjacent B-Tree index
- Motivation A regular B-tree index provides efficient mapping of key values to tuples within a table. However, if you have two tables connected in some way, a regular B-tree index may not be efficient enough. In this case, you would need to create an index for each table. The purpose will become clearer if we consider a simple example which is the main use-case as I see it. - Example We need to store a graph. So we create a table for nodes CREATE TABLE Nodes ( id SERIAL PRIMARY KEY, label VARCHAR(255) ); and a table for edges CREATE TABLE Edges ( label VARCHAR(255), source INTEGER REFERENCES Nodes(id), target INTEGER REFERENCES Nodes(id) ); In order to efficiently traverse a graph we would have an index for Nodes.id which created automatically in this case and an index for Edges.source. - Tweaked B-Tree We could optimize cases like the former by modifying PostgreSQL btree index to allow it to index 2 tables simultaneously. Semantically it would be UNIQUE index for attribute x of table A and an index for attribute y in table B. In the non-deduplicated case index tuple pointing to a tuple in A should be marked by a flag. In the deduplicated case first TID in an array can be interpreted as a link to A. During the vacuum of A an index tuple pointing to a dead tuple in A should be cleaned as well as all index tuples for the same key. We can reach this by clearing all index tuples after the dead one until we come to index tuple marked by a flag with different key. Or we can enforce deduplication in such index. In the example with a graph such index would provide PRIMARY KEY for Nodes and the index for Edges.Source. The query SELECT * FROM Nodes WHERE id = X; will use this index and take into account only a TID in Nodes (so this would be marked index tuple or first TID in a posting list). The query SELECT * FROM Edges WHERE source = X; conversely will ignore links to Nodes. 
-- Syntax

I believe that

CREATE TABLE Nodes (
    id SERIAL PRIMARY KEY ADJACENT,
    label VARCHAR(255)
);

CREATE TABLE Edges (
    label VARCHAR(255),
    source INTEGER REFERENCES ADJACENT Nodes(id),
    target INTEGER REFERENCES Nodes(id)
);

would suffice for this new semantics.

--
Dilshod Urazov
Re: Optimize planner memory consumption for huge arrays
On 2/19/24 16:45, Tom Lane wrote: > Tomas Vondra writes: >> Considering there are now multiple patches improving memory usage during >> planning with partitions, perhaps it's time to take a step back and >> think about how we manage (or rather not manage) memory during query >> planning, and see if we could improve that instead of an infinite >> sequence of ad hoc patches? > > +1, I've been getting an itchy feeling about that too. I don't have > any concrete proposals ATM, but I quite like your idea here: > >> For example, I don't think we expect selectivity functions to allocate >> long-lived objects, right? So maybe we could run them in a dedicated >> memory context, and reset it aggressively (after each call). > > That could eliminate a whole lot of potential leaks. I'm not sure > though how much it moves the needle in terms of overall planner > memory consumption. I'm not sure about that either, maybe not much - for example it would not help with the two other memory usage patches (which are related to SpecialJoinInfo and RestrictInfo, outside selectivity functions). It was an ad hoc thought, inspired by the issue at hand. Maybe it would be possible to find similar "boundaries" in other parts of the planner. I keep thinking about how compilers/optimizers typically have separate optimizations passes, maybe that's something we might leverage ... > I've always supposed that the big problem was data structures > associated with rejected Paths, but I might be wrong. Is there some > simple way we could get a handle on where the most memory goes while > planning? > I suspect this might have changed thanks to partitioning - it's not a coincidence most of the recent memory usage improvements address cases with many partitions. 
As for how to analyze the memory usage - maybe there's a simpler way, but what I did recently was adding simple instrumentation into memory contexts, recording pointer/size/backtrace for each request, and then a script that aggregated that into a "currently allocated" report with information about "source" of the allocation. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Avoid switching between system-user and system-username in the doc
On Mon, 19 Feb 2024 at 09:52, Bertrand Drouvot wrote: > Please find attached a tiny patch to clean that up. LGTM
Re: Patch: Add parse_type Function
On Feb 18, 2024, at 15:55, Erik Wienhold wrote:
>> The overhead of parse_type_and_format can be related to higher planning
>> time. PL/pgSQL can assign composite without usage FROM clause.
>
> Thanks, didn't know that this makes a difference. In that case both
> variants are on par.

Presumably this is a side-effect of the use of a RECORD here, which requires a FROM clause in pure SQL, yes? The only way I can think of to avoid that is to:

1. Add a to_regtypmod() for those who just want the typmod.

2. Add a format_type_string() function that returns a string, the equivalent of this function but in C:

CREATE FUNCTION format_type_string(text) RETURNS TEXT AS $$
    SELECT format_type(to_regtype($1), to_regtypmod($1))
$$ LANGUAGE SQL;

We could also keep the record-returning function for users who want both the regtype and the regtypmod at once, but with the other two I consider it less essential.

Thoughts?

Best,

David
Re: JIT compilation per plan node
Hi Melih, On 1/2/24 20:50, Melih Mutlu wrote: > Hi hackers, > > After discussing this with David offlist, I decided to reinitiate this > discussion that has already been raised and discussed several times in the > past. [1] [2] > > Currently, if JIT is enabled, the decision for JIT compilation is purely > tied to the total cost of the query. The number of expressions to be JIT > compiled is not taken into consideration, however the time spent JITing > also depends on that number. This may cause the cost of JITing to become > too much that it hurts rather than improves anything. > An example case would be that you have many partitions and run a query that > touches one, or only a few, of those partitions. If there is no partition > pruning done in planning time, all 1000 partitions are JIT compiled while > most will not even be executed. > > Proposed patch (based on the patch from [1]) simply changes consideration > of JIT from plan level to per-plan-node level. Instead of depending on the > total cost, we decide whether to perform JIT on a node or not by > considering only that node's cost. This allows us to only JIT compile plan > nodes with high costs. 
> > Here is a small test case to see the issue and the benefit of the patch: > > CREATE TABLE listp(a int, b int) PARTITION BY LIST(a); > SELECT 'CREATE TABLE listp'|| x || ' PARTITION OF listp FOR VALUES IN > ('||x||');' FROM generate_Series(1,1000) x; \gexec > INSERT INTO listp SELECT 1,x FROM generate_series(1,1000) x; > > EXPLAIN (VERBOSE, ANALYZE) SELECT COUNT(*) FROM listp WHERE b < 0; > > master jit=off: > Planning Time: 25.113 ms > Execution Time: 315.896 ms > > master jit=on: > Planning Time: 24.664 ms > JIT: >Functions: 9008 >Options: Inlining false, Optimization false, Expressions true, Deforming > true >Timing: Generation 290.705 ms (Deform 108.274 ms), Inlining 0.000 ms, > Optimization 165.991 ms, Emission 3232.775 ms, Total 3689.472 ms > Execution Time: 1612.817 ms > > patch jit=on: > Planning Time: 24.055 ms > JIT: >Functions: 17 >Options: Inlining false, Optimization false, Expressions true, Deforming > true >Timing: Generation 1.463 ms (Deform 0.232 ms), Inlining 0.000 ms, > Optimization 0.766 ms, Emission 11.609 ms, Total 13.837 ms > Execution Time: 299.721 ms > Thanks for the updated / refreshed patch. I think one of the main challenges this patch faces is that there's a couple old threads with previous attempts, and the current thread simply builds on top of them, without explaining stuff fully. But people either don't realize that, or don't have time to read old threads just in case, so can't follow some of the decisions :-( I think it'd be good to maybe try to explain some of the problems and solutions more thoroughly, or at least point to the relevant places in the old threads ... > > A bit more on what this patch does: > - It introduces a new context to keep track of the number of estimated > calls and if JIT is decided for each node that the context applies. AFAIK this is an attempt to deal with passing the necessary information while constructing the plan, which David originally tried [1] doing by passing est_calls during create_plan ... 
I doubt CreatePlanContext is a great way to achieve this. For one, it breaks the long-standing custom that PlannerInfo is the first parameter, usually followed by RelOptInfo, etc. CreatePlanContext is added to some functions (but not all), which makes it ... unpredictable. FWIW it's not clear to me if/how this solves the problem with early create_plan() calls for subplans. Or is it still the same?

Wouldn't it be simpler to just build the plan as we do now, and then have an expression_tree_walker that walks the complete plan top-down, inspects the nodes, enables JIT where appropriate and so on? That can have arbitrary context, no problem with that. Considering we decide JIT pretty late anyway (long after costing and other stuff that might affect the plan selection), the result should be exactly the same, without the extensive createplan.c disruption ... (usual caveat: I haven't tried, maybe there's something that means this can't work)

> - The number of estimated calls are especially useful where a node is
> expected to be rescanned, such as Gather, Gather Merge, Memoize and Nested
> Loop. Knowing the estimated number of calls for a node allows us to rely on
> total cost multiplied by the estimated calls instead of only total cost for
> the node.

Not sure I follow. Why would these nodes (Gather, Gather Merge, ...) be more likely to be rescanned compared to other nodes?

> - For each node, the planner considers if the node should be JITed. If the
> cost of the node * the number of estimated calls is greater than
> jit_above_cost, it's decided to be JIT compiled. Note that this changes
> the meaning of jit_above_cost, it's now a threshold for a single plan node
> and not the whole query. Additionally, this change in JIT consideration is
Re: Optimize planner memory consumption for huge arrays
Tomas Vondra writes: > Considering there are now multiple patches improving memory usage during > planning with partitions, perhaps it's time to take a step back and > think about how we manage (or rather not manage) memory during query > planning, and see if we could improve that instead of an infinite > sequence of ad hoc patches? +1, I've been getting an itchy feeling about that too. I don't have any concrete proposals ATM, but I quite like your idea here: > For example, I don't think we expect selectivity functions to allocate > long-lived objects, right? So maybe we could run them in a dedicated > memory context, and reset it aggressively (after each call). That could eliminate a whole lot of potential leaks. I'm not sure though how much it moves the needle in terms of overall planner memory consumption. I've always supposed that the big problem was data structures associated with rejected Paths, but I might be wrong. Is there some simple way we could get a handle on where the most memory goes while planning? regards, tom lane
Re: numeric_big in make check?
Dean Rasheed writes:
> On 19 Feb 2024, at 12:48, Tom Lane wrote:
>> Or we could just flush it. It's never detected a bug, and I think
>> you'd find that it adds zero code coverage (or if not, we could
>> fix that in a far more surgical and less expensive manner).

> Off the top of my head, I can't say to what extent that's true, but it
> wouldn't surprise me if at least some of the tests added in the last 4
> commits to touch that file aren't covered by tests elsewhere. Indeed
> that certainly looks like the case for 18a02ad2a5. I'm sure those
> tests could be pared down though.

I thought I'd try to acquire some actual facts here, so I compared the code coverage shown by "make check" as of HEAD, versus "make check" after adding numeric_big to parallel_schedule. I saw the following lines of numeric.c as being covered in the second run and not the first:

numeric():
  1285          || !NUMERIC_IS_SHORT(num)))
  1293      new->choice.n_long.n_sign_dscale = NUMERIC_SIGN(new) |
  1294          ((uint16) dscale & NUMERIC_DSCALE_MASK);

div_var_fast():
  9185              idivisor = idivisor * NBASE + var2->digits[1];
  9186              idivisor_weight--;

sqrt_var():
  10073     res_ndigits = res_weight + 1 - (-rscale - 1) / DEC_DIGITS;

Pretty poor return on investment for the runtime consumed. I don't object to adding something to numeric.sql that's targeted at adding coverage for these (or anyplace else that's not covered), but let's not just throw cycles at the problem.

Oddly, there were a few lines in numeric_poly_combine and int8_avg_combine that were hit in the first run and not the second. Apparently our tests of parallel aggregation aren't as reproducible as one could wish.

regards, tom lane
Re: Add trim_trailing_whitespace to editorconfig file
On Fri, 16 Feb 2024 at 11:45, Peter Eisentraut wrote:
> I have committed that one.

Thanks :)

> v3-0002-Require-final-newline-in-.po-files.patch
>
> The .po files are imported from elsewhere, so I'm not sure this is going
> to have the desired effect. Perhaps it's worth cleaning up, but it
> would require more steps.

Okay, yeah, that would need to be changed at the source then. Removed this change from the newly attached patchset, as well as updating .editorconfig to have "insert_final_newline = unset" for .po files.

> v3-0003-Bring-editorconfig-in-line-with-gitattributes.patch
>
> I question whether we need to add rules to .editorconfig about files
> that are generated or imported from elsewhere, since those are not meant
> to be edited.

I agree that it's not strictly necessary to have .editorconfig match .gitattributes for files that are not meant to be edited by hand. But I don't really see a huge downside either, apart from having a few extra lines in .editorconfig. And adding these lines does have a few benefits:
1. It makes it easy to ensure that .editorconfig and .gitattributes stay in sync.
2. If someone opens a file that they are not supposed to edit by hand and then saves it, no changes are made, as opposed to suddenly introducing some whitespace changes.

Attached is a new patchset with the first commit split into three separate commits, which configure:
1. Files meant to be edited by hand
2. Output test files (maybe edited by hand)
3. Imported/autogenerated files

The first one is definitely the most useful to me personally.

From 419d2d0e45d40a4cddb942fbc63c3c54f4dd479d Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio Date: Mon, 19 Feb 2024 16:03:31 +0100 Subject: [PATCH v4 2/5] Include test output files in .editorconfig

Most of the time these will be copy-pasted from a test run, but if there's a trivial change someone might want to make the change by hand. By adding them to .editorconfig, we make sure that doing so won't mess up the formatting.
--- .editorconfig | 9 + 1 file changed, 9 insertions(+) diff --git a/.editorconfig b/.editorconfig index 0e7c2048f9..92a16bd5de 100644 --- a/.editorconfig +++ b/.editorconfig @@ -27,3 +27,12 @@ trim_trailing_whitespace = false [src/backend/catalog/sql_features.txt] trim_trailing_whitespace = false + +# Test output files that contain extra whitespace +[*.out] +trim_trailing_whitespace = false +insert_final_newline = unset + +[src/interfaces/ecpg/test/expected/*] +trim_trailing_whitespace = false +insert_final_newline = unset -- 2.34.1 From b05173b619c45d9ad859da50bfa78d1fc626fe3d Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio Date: Wed, 14 Feb 2024 17:19:42 +0100 Subject: [PATCH v4 1/5] Handle blank-at-eof/blank-at-eol in .editorconfig too For many files in the repo our .gitattributes file is configured to complain about trailing whitespace at the end of a line, as well as not including a final newline at the end of the file. This updates our .editorconfig file, to make editors and IDEs fix these issues automatically on save in the same way for files that are intended to be edited by hand. 
--- .editorconfig | 15 +++ 1 file changed, 15 insertions(+) diff --git a/.editorconfig b/.editorconfig index d69a3d1dc4..0e7c2048f9 100644 --- a/.editorconfig +++ b/.editorconfig @@ -1,5 +1,9 @@ root = true +[*] +trim_trailing_whitespace = true +insert_final_newline = true + [*.{c,h,l,y,pl,pm}] indent_style = tab indent_size = tab @@ -12,3 +16,14 @@ indent_size = 1 [*.xsl] indent_style = space indent_size = 2 + +# Certain data files that contain special whitespace, and other special cases +[*.data] +trim_trailing_whitespace = false +insert_final_newline = unset + +[contrib/pgcrypto/sql/pgp-armor.sql] +trim_trailing_whitespace = false + +[src/backend/catalog/sql_features.txt] +trim_trailing_whitespace = false base-commit: e1b7fde418f2c0ba4ab0d9fbfa801ef62d96397b -- 2.34.1 From 97fe71d7769a3e4191adedbfe4092afec696ea7b Mon Sep 17 00:00:00 2001 From: Jelte Fennema-Nio Date: Thu, 15 Feb 2024 10:23:19 +0100 Subject: [PATCH v4 4/5] Add note about keeping .editorconfig and .gitattributes in sync Now that they are in sync, hopefully this reminder makes sure we keep them that way. Automation to detect them being out of sync seems excessive. --- .editorconfig | 1 + .gitattributes | 1 + 2 files changed, 2 insertions(+) diff --git a/.editorconfig b/.editorconfig index c742e0f844..3f5c4e0ef8 100644 --- a/.editorconfig +++ b/.editorconfig @@ -1,3 +1,4 @@ +# IMPORTANT: When updating this file, also update .gitattributes to match. root = true [*] diff --git a/.gitattributes b/.gitattributes index e9ff4a56bd..7923fc3387 100644 --- a/.gitattributes +++ b/.gitattributes @@ -1,3 +1,4 @@ +# IMPORTANT: When updating this file, also update .editorconfig to match. * whitespace=space-before-tab,trailing-space *.[chly] whitespace=space-before-tab,trailing-space,indent-with-non-tab,tabwidth=4 *.pl whitespace=space-before-tab,tra
Use streaming read API in ANALYZE
Hi,

I worked on using the currently proposed streaming read API [1] in ANALYZE. The patch is attached. 0001 is the not-yet-merged streaming read API code changes that can be applied to master, 0002 is the actual code.

The blocks to analyze are obtained by using the streaming read API now.
- Since the streaming read API is already doing prefetching, I removed the #ifdef USE_PREFETCH code from acquire_sample_rows().
- Changed 'while (BlockSampler_HasMore(&bs))' to 'while (nblocks)' because the prefetch mechanism in the streaming read API will advance 'bs' before returning buffers.
- Removed BlockNumber and BufferAccessStrategy from the declaration of scan_analyze_next_block(), passing pgsr (PgStreamingRead) instead of them.

I counted syscalls of analyzing a ~5GB table. It can be seen that the patched version did ~1300 fewer read calls.

Patched:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.67    0.012128           0     29809           pwrite64
 36.96    0.011299           0     28594           pread64
 23.24    0.007104           0     27611           fadvise64

Master (21a71648d3):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 38.94    0.016457           0     29816           pwrite64
 36.79    0.015549           0     29850           pread64
 23.91    0.010106           0     29848           fadvise64

Any kind of feedback would be appreciated.
[1]: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com -- Regards, Nazir Bilal Yavuz Microsoft From 509e55997c084f831fcbcb46cabe79d4f97aa93e Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 22 Jul 2023 17:31:54 +1200 Subject: [PATCH v1 1/2] Streaming read API changes that are not committed to master yet Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com --- src/include/storage/bufmgr.h | 22 + src/include/storage/streaming_read.h | 52 ++ src/backend/storage/Makefile | 2 +- src/backend/storage/aio/Makefile | 14 + src/backend/storage/aio/meson.build | 5 + src/backend/storage/aio/streaming_read.c | 472 ++ src/backend/storage/buffer/bufmgr.c | 592 +++ src/backend/storage/buffer/localbuf.c| 14 +- src/backend/storage/meson.build | 1 + src/tools/pgindent/typedefs.list | 2 + 10 files changed, 953 insertions(+), 223 deletions(-) create mode 100644 src/include/storage/streaming_read.h create mode 100644 src/backend/storage/aio/Makefile create mode 100644 src/backend/storage/aio/meson.build create mode 100644 src/backend/storage/aio/streaming_read.c diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index d51d46d3353..a38f1acb37a 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -14,6 +14,7 @@ #ifndef BUFMGR_H #define BUFMGR_H +#include "port/pg_iovec.h" #include "storage/block.h" #include "storage/buf.h" #include "storage/bufpage.h" @@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount; #define BUFFER_LOCK_SHARE 1 #define BUFFER_LOCK_EXCLUSIVE 2 +/* + * Maximum number of buffers for multi-buffer I/O functions. This is set to + * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum. 
+ */ +#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ) /* * prototypes for functions in bufmgr.c @@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockNum, ReadBufferMode mode, BufferAccessStrategy strategy, bool permanent); +extern Buffer PrepareReadBuffer(BufferManagerRelation bmr, +ForkNumber forkNum, +BlockNumber blockNum, +BufferAccessStrategy strategy, +bool *foundPtr); +extern void CompleteReadBuffers(BufferManagerRelation bmr, +Buffer *buffers, +ForkNumber forknum, +BlockNumber blocknum, +int nblocks, +bool zero_on_error, +BufferAccessStrategy strategy); extern void ReleaseBuffer(Buffer buffer); extern void UnlockReleaseBuffer(Buffer buffer); extern bool BufferIsExclusiveLocked(Buffer buffer); @@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer); extern bool ConditionalLockBufferForCleanup(Buffer buffer); extern bool IsBufferCleanupOK(Buffer buffer); extern bool HoldingBufferPinThatDelaysRecovery(void); +extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode); extern bool BgBufferSync(struct WritebackContext *wb_context); +extern void LimitAdditionalPins(uint32 *additional_pins); +extern void LimitAdditionalLocalPins(uint32 *additional_pins); + /* in buf_init.c */ extern void InitBufferPool(void); extern Size
Re: partitioning and identity column
Hello Ashutosh,

19.02.2024 15:17, Ashutosh Bapat wrote:
>> Functions ATExecAddIdentity() and ATExecDropIdentity() are recursive too, so I think they can be exploited as well.
> not just identity-related functions; many other functions in tablecmds.c have that problem, as I mentioned earlier.

Could you please name the functions you suspect, so that I can recheck them? Perhaps we should consider fixing all such functions, in light of b0f7dd915 and d57b7cc33...

Best regards, Alexander
Re: Switching XLog source from archive to streaming when primary available
On Mon, 19 Feb 2024 at 18:36, Bharath Rupireddy wrote: > On Wed, Jan 31, 2024 at 6:30 PM Bharath Rupireddy > wrote: >> >> Needed a rebase due to commit 776621a (conflict in >> src/test/recovery/meson.build for new TAP test file added). Please >> find the attached v17 patch. > > Strengthened tests a bit by using recovery_min_apply_delay to mimic > standby spending some time fetching from archive. PSA v18 patch. Here are some minor comments: [1] +primary). However, the standby exhausts all the WAL present in pg_wal s|pg_wal|pg_wal|g [2] +# Ensure checkpoint doesn't come in our way +$primary->append_conf('postgresql.conf', qq( +min_wal_size = 2MB +max_wal_size = 1GB +checkpoint_timeout = 1h + autovacuum = off +)); Keeping the same indentation might be better.
Re: Injection points: some tools to wait and wake
Hi, On Mon, Feb 19, 2024 at 04:51:45PM +0900, Michael Paquier wrote: > On Mon, Feb 19, 2024 at 03:01:40PM +0900, Michael Paquier wrote: > > 0002 is a polished version of the TAP test that makes use of this > > facility, providing coverage for the bug fixed by 7863ee4def65 > > (reverting this commit causes the test to fail), where a restart point > > runs across a promotion request. The trick is to stop the > > checkpointer in the middle of a restart point and issue a promotion > > in-between. > > The CF bot has been screaming at this one on Windows because the > process started with IPC::Run::start was not properly finished, so > attached is an updated version to bring that back to green. Thanks for the patch, that's a very cool feature! Random comments: 1 === +CREATE FUNCTION injection_points_wake() what about injection_points_wakeup() instead? 2 === +/* Shared state information for injection points. */ +typedef struct InjectionPointSharedState +{ + /* protects accesses to wait_counts */ + slock_t lock; + + /* Counter advancing when injection_points_wake() is called */ + int wait_counts; If slock_t protects "only" one counter, then what about using pg_atomic_uint64 or pg_atomic_uint32 instead? And btw do we need wait_counts at all? (see comment number 4) 3 === + * SQL function for waking a condition variable waking up instead? 4 === + for (;;) + { + int new_wait_counts; + + SpinLockAcquire(&inj_state->lock); + new_wait_counts = inj_state->wait_counts; + SpinLockRelease(&inj_state->lock); + + if (old_wait_counts != new_wait_counts) + break; + ConditionVariableSleep(&inj_state->wait_point, injection_wait_event); + } I'm wondering if this loop and wait_counts are needed, why not doing something like (and completely get rid of wait_counts)? 
" ConditionVariablePrepareToSleep(&inj_state->wait_point); ConditionVariableSleep(&inj_state->wait_point, injection_wait_event); ConditionVariableCancelSleep(); " It's true that the comment above ConditionVariableSleep() mentions that: " * This should be called in a predicate loop that tests for a specific exit * condition and otherwise sleeps " but is it needed in our particular case here? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGC_SIGHUP shared_buffers?
On 2/18/24 15:35, Andres Freund wrote: On 2024-02-18 17:06:09 +0530, Robert Haas wrote: How many people set shared_buffers to something that's not a whole number of GB these days? I'd say the vast majority of postgres instances in production run with less than 1GB of s_b. Just because numbers wise the majority of instances are running on small VMs and/or many PG instances are running on one larger machine. There are a lot of instances where the total available memory is less than 2GB. I mean I bet it happens, but in practice if you rounded to the nearest GB, or even the nearest 2GB, I bet almost nobody would really care. I think it's fine to be opinionated here and hold the line at a relatively large granule, even though in theory people could want something else. I don't believe that at all unfortunately. Couldn't we scale the rounding, e.g. allow small allocations as we do now, but above some number always round? E.g. maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, etc? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Speeding up COPY TO for uuids and arrays
On Sat, 2024-02-17 at 12:24 -0800, Andres Freund wrote:
> On 2024-02-17 17:48:23 +0100, Laurenz Albe wrote:
> > - Patch 0001 speeds up pq_begintypsend with a custom macro.
> > That brought the binary copy down to 3.7 seconds, which is a
> > speed gain of 15%.
>
> Nice win, but I think we can do a bit better than this. Certainly it shouldn't
> be a macro, given the multiple evaluation risks. I don't think we actually
> need to zero those bytes, in fact that seems more likely to hide bugs than
> prevent them.
>
> I have an old patch series improving performance in this area.

The multiple evaluation danger could be worked around with an automatic variable, or the macro could just carry a warning (like appendStringInfoCharMacro). But considering the thread in [1], I guess I'll retract that patch; I'm sure the more invasive improvement you have in mind will do better.

> > - Patch 0002 speeds up uuid_out by avoiding the overhead of
> > a StringInfo. This brings text mode COPY to 19.4 seconds,
> > which is a speed gain of 21%.
>
> Nice!
>
> I wonder if we should move the core part for converting to hex to numutils.c,
> we already have code for the inverse. There does seem to be further
> optimization potential in the conversion, and that seems better done somewhere
> central rather than in one type's output function. OTOH, it might not be worth
> it, given the need to add the dashes.

I thought about it, but then decided not to do that. Calling a function that converts the bytes to hex and then adding the hyphens will slow down processing, and I think the code savings would be minimal at best.

> > - Patch 0003 speeds up array_out a bit by avoiding some zero-byte
> > writes. The measured speed gain is under 2%.
>
> Makes sense.

Thanks for your interest and approval. The attached patches are the renamed, but unmodified, patches 2 and 3 from before.
Yours, Laurenz Albe [1]: https://postgr.es/m/CAMkU%3D1whbRDUwa4eayD9%2B59K-coxO9senDkPRbTn3cg0pUz4AQ%40mail.gmail.com From de87d249e3f55c38062e563821af9fa32d15e341 Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Sat, 17 Feb 2024 17:23:39 +0100 Subject: [PATCH v1 2/3] Speed up uuid_out Since the size of the string representation of an uuid is fixed, there is no benefit in using a StringInfo. Avoiding the overhead makes the function substantially faster. --- src/backend/utils/adt/uuid.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/src/backend/utils/adt/uuid.c b/src/backend/utils/adt/uuid.c index 73dfd711c7..b48bbf01e1 100644 --- a/src/backend/utils/adt/uuid.c +++ b/src/backend/utils/adt/uuid.c @@ -53,10 +53,11 @@ uuid_out(PG_FUNCTION_ARGS) { pg_uuid_t *uuid = PG_GETARG_UUID_P(0); static const char hex_chars[] = "0123456789abcdef"; - StringInfoData buf; + char *buf, *p; int i; - initStringInfo(&buf); + buf = palloc(2 * UUID_LEN + 5); + p = buf; for (i = 0; i < UUID_LEN; i++) { int hi; @@ -68,16 +69,17 @@ uuid_out(PG_FUNCTION_ARGS) * ("-"). Therefore, add the hyphens at the appropriate places here. */ if (i == 4 || i == 6 || i == 8 || i == 10) - appendStringInfoChar(&buf, '-'); + *p++ = '-'; hi = uuid->data[i] >> 4; lo = uuid->data[i] & 0x0F; - appendStringInfoChar(&buf, hex_chars[hi]); - appendStringInfoChar(&buf, hex_chars[lo]); + *p++ = hex_chars[hi]; + *p++ = hex_chars[lo]; } + *p = '\0'; - PG_RETURN_CSTRING(buf.data); + PG_RETURN_CSTRING(buf); } /* -- 2.43.2 From e8346ea88785a763d2bd3f99800ae928b7469f64 Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Sat, 17 Feb 2024 17:24:19 +0100 Subject: [PATCH v1 3/3] Small speedup for array_out Avoid writing zero bytes where it is not necessary. This offers only a small, but measurable speed gain for larger arrays. 
--- src/backend/utils/adt/arrayfuncs.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index f3fee54e37..306c5062f7 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -1200,7 +1200,7 @@ array_out(PG_FUNCTION_ARGS) p = retval; #define APPENDSTR(str) (strcpy(p, (str)), p += strlen(p)) -#define APPENDCHAR(ch) (*p++ = (ch), *p = '\0') +#define APPENDCHAR(ch) (*p++ = (ch)) if (needdims) APPENDSTR(dims_str); @@ -1222,10 +1222,9 @@ array_out(PG_FUNCTION_ARGS) char ch = *tmp; if (ch == '"' || ch == '\\') - *p++ = '\\'; -*p++ = ch; + APPENDCHAR('\\'); +APPENDCHAR(ch); } - *p = '\0'; APPENDCHAR('"'); } else @@ -1248,6 +1247,8 @@ array_out(PG_FUNCTION_ARGS) j = i; } while (j != -1); + *p = '\0'; + #undef APPENDSTR #undef APPENDCHAR -- 2.43.2
Re: numeric_big in make check?
> > On 19 Feb 2024, at 12:48, Tom Lane wrote: > > > > Or we could just flush it. It's never detected a bug, and I think > > you'd find that it adds zero code coverage (or if not, we could > > fix that in a far more surgical and less expensive manner). > Off the top of my head, I can't say to what extent that's true, but it wouldn't surprise me if at least some of the tests added in the last 4 commits to touch that file aren't covered by tests elsewhere. Indeed that certainly looks like the case for 18a02ad2a5. I'm sure those tests could be pared down though. Regards, Dean
Re: Optimize planner memory consumption for huge arrays
On 9/8/23 07:11, Lepikhov Andrei wrote:
> On Wed, Sep 6, 2023, at 8:09 PM, Ashutosh Bapat wrote:
>> Hi Lepikhov,
>>
>> Thanks for using my patch and I am glad that you found it useful.
>>
>> On Mon, Sep 4, 2023 at 10:56 AM Lepikhov Andrei wrote:
>>>
>>> Hi, hackers,
>>>
>>> Looking at the planner behaviour with the memory consumption patch [1], I
>>> figured out that arrays increase memory consumption by the optimizer
>>> significantly. See init.sql in attachment.
>>> The point here is that the planner does small memory allocations for each
>>> element during estimation. As a result, it looks like the planner consumes
>>> about 250 bytes for each integer element.
>>
>> I guess the numbers you mentioned in init.sql are total memory used by
>> the planner (as reported by the patch in the thread) when planning
>> that query and not memory consumed by Const nodes themselves. Am I
>> right? I think the measurements need to be explained better and also
>> the realistic scenario you are trying to optimize.
>
> Yes, it is the total memory consumed by the planner - I used the numbers
> generated by your patch [1]. I had been increasing the number of elements in
> the array to exclude the memory consumed by the planner for other purposes.
> As you can see, the array with 1 element consumes 12kB of memory, 1E4
> elements - 2.6 MB. All of that memory increment is related only to the
> enlargement of this array. (2600-12)/10 = 260 bytes. So, I make a conclusion:
> each 4-byte element produces a consumption of 260 bytes of memory.
> I obtained this scenario from a user complaint - they had strict
> restrictions on memory usage and were stuck in this unusual memory usage case.
>
>> I guess the reason you think that partitioning will increase the
>> memory consumed is because each partition will have the clause
>> translated for it. Selectivity estimation for each partition will
>> create those many Const nodes and hence consume memory. Am I right?
>
> Yes.
>>> Can you please measure the memory consumed with and without your
>>> patch.
>
> Done. See test case and results in 'init_parts.sql' in attachment. Short
> summary below. I varied the number of elements from 1 to 1E4 and partitions
> from 1 to 100. As you can see, partitioning adds a lot of memory consumption
> by itself. But we see an effect from the patch also.
>
> master:
> elems       1     1E1     1E2     1E3     1E4
> parts
> 1        28kB    50kB   0.3MB   2.5MB    25MB
> 10       45kB   143kB   0.6MB   4.8MB    47MB
> 100     208kB   125kB   3.3MB    27MB   274MB
>
> patched:
> elems       1     1E1     1E2     1E3     1E4
> parts
> 1        28kB    48kB  0.25MB   2.2MB  22.8MB
> 10       44kB   100kB   313kB   2.4MB  23.7MB
> 100     208kB   101kB   0.9MB   3.7MB  32.4MB
>
> Just for comparison, without partitioning:
> elems       1     1E1     1E2     1E3     1E4
> master:  12kB    14kB    37kB   266kB   2.5MB
> patched: 12kB  11.5kB    13kB    24kB   141kB

These improvements look pretty nice, considering how simple the patch seems to be. I can't even imagine how much memory we'd need with even more partitions (say, 1000) if 100 partitions means 274MB.

BTW when releasing memory in scalararraysel, wouldn't it be good to also free the elem_values/elem_nulls? I haven't tried, and maybe it's not that significant an amount.

Considering there are now multiple patches improving memory usage during planning with partitions, perhaps it's time to take a step back and think about how we manage (or rather don't manage) memory during query planning, and see if we could improve that instead of an infinite sequence of ad hoc patches?

Our traditional attitude is to not manage memory, and rely on the memory context to not be very long-lived. And that used to be fine, but partitioning clearly changed the equation, increasing the amount of allocated memory etc. I don't think we want to stop relying on memory contexts for planning in general - memory contexts are obviously very convenient etc. But maybe we could identify "stages" in the planning and release the memory more aggressively in those?
For example, I don't think we expect selectivity functions to allocate long-lived objects, right? So maybe we could run them in a dedicated memory context, and reset it aggressively (after each call). Ofc, I'm not suggesting this patch should be responsible for doing this. >>> It is maybe not a problem most of the time. However, in the case of >>> partitions, memory consumption multiplies by each partition. Such a corner >>> case looks weird, but the fix is simple. So, why not? >> >> With vectorized operations becoming a norm these days, it's possible >> to have thousands of element in array of an ANY or IN clause. Also >> will be common to have thousands of partitions. But I think what we >> need to d
Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning
On Mon, Feb 19, 2024 at 4:35 AM Tomas Vondra wrote:
>
> Hi,
>
> After taking a look at the patch optimizing SpecialJoinInfo allocations,
> I decided to take a quick look at this one too. I don't have any
> specific comments on the code yet, but it seems quite a bit more complex
> than the other patch ... it's introducing a HTAB into the optimizer,
> surely that has costs too.

Thanks for looking into this too.

>
> I started by doing the same test as with the other patch, comparing
> master to the two patches (applied independently and both). And I got
> about this (in MB):
>
> tables   master   sjinfo    rinfo     both
> ------------------------------------------
> 2            37       36       34       33
> 3           138      129      122      113
> 4           421      376      363      318
> 5          1495     1254     1172      931
>
> Unlike the SpecialJoinInfo patch, I haven't seen any reduction in
> planning time for this one.

Yeah. That agreed with my observation as well.

>
> The reduction in allocated memory is nice. I wonder what's allocating
> the remaining memory, and what we'd have to do to reduce it further.

Please see the reply on the SpecialJoinInfo thread. Other than the current patches, we require a memory-efficient Relids implementation. I have shared some ideas in the slides I shared in the other thread, but haven't spent time experimenting with any ideas there.

>
> However, this is a somewhat extreme example - it's joining 5 tables,
> each with 1000 partitions, using a partition-wise join. It reduces the
> amount of memory, but the planning time is still quite high (and
> essentially the same as without the patch). So it's not like it'd make
> them significantly more practical ... do we have any other ideas/plans
> how to improve that?

Yuya has been working on reducing planning time [1]. He has some significant improvements in that area based on my experiments. But those patches are complex and still WIP.

>
> AFAIK we don't expect this to improve "regular" cases with modest number
> of tables / partitions, etc. But could it make them slower in some case?
> AFAIR, my experiments did not show any degradation in regular cases with modest number of tables/partitions. The variation in planning time was with the usual planning time variations. [1] https://www.postgresql.org/message-id/flat/CAExHW5uVZ3E5RT9cXHaxQ_DEK7tasaMN%3DD6rPHcao5gcXanY5w%40mail.gmail.com#112b3e104e0f9e39eb007abe075aae20 -- Best Wishes, Ashutosh Bapat
Re: Experiments with Postgres and SSL
I've been asked to take a look at this thread and review some patches, and the subject looks interesting enough, so here I am. On Thu, 19 Jan 2023 at 04:16, Greg Stark wrote: > I had a conversation a while back with Heikki where he expressed that > it was annoying that we negotiate SSL/TLS the way we do since it > introduces an extra round trip. Aside from the performance > optimization I think accepting standard TLS connections would open the > door to a number of other opportunities that would be worth it on > their own. I agree that this would be very nice. > Other things it would open the door to in order from least > controversial to most > > * Hiding Postgres behind a standard SSL proxy terminating SSL without > implementing the Postgres protocol. I think there is also the option "hiding Postgres behind a standard SNI-based SSL router that does not terminate SSL", as that's arguably a more secure way to deploy any SSL service than SSL-terminating proxies. > * "Service Mesh" type tools that hide multiple services behind a > single host/port ("Service Mesh" is just a new buzzword for "proxy"). People proxying PostgreSQL seems fine, and enabling better proxying seems reasonable. > * Browser-based protocol implementations using websockets for things > like pgadmin or other tools to connect directly to postgres using > Postgres wire protocol but using native SSL implementations. > > * Postgres could even implement an HTTP based version of its protocol > and enable things like queries or browser based tools using straight > up HTTP requests so they don't need to use websockets. > > * Postgres could implement other protocols to serve up data like > status queries or monitoring metrics, using HTTP based standard > protocols instead of using our own protocol. I don't think we should be trying to serve anything HTTP-like, even with a ten-foot pole, on a port that we serve the PostgreSQL wire protocol on. 
If someone wants to multiplex the PostgreSQL wire protocol on the same port that serves HTTPS traffic, they're welcome to do so with their own proxy, but I'd rather we keep the PostgreSQL server's socket handling fundamentally incapable of servicing protocols primarily used in web browsers on the same socket that handles normal psql data connections. PostgreSQL may have its own host-based authentication with HBA, but I'd rather not have to depend on it to filter incoming connections between valid psql connections and people trying to grab the latest monitoring statistics at some HTTP endpoint - I'd rather use my trusty firewall that can already limit access to specific ports very efficiently without causing undue load on the database server. Matthias van de Meent Neon (https://neon.tech)
Re: serial not accepted as datatype in ALTER TABLE ... ALTER COLUMN
On Sun, Feb 18, 2024 at 1:59 PM Andy Fan wrote: > > > Hi Ashutosh, > > > data_type is described on that page as "Data type of the new column, > > or new data type for an existing column." but CREATE TABLE > > documentation [2] redirects data_type to [3], which mentions serial. > > The impression created by the documentation is that the second statement > > above is a valid statement and should not throw an error; instead it should > > change the data type of the column (and create the required sequence). > > I didn't find a reason not to support it; if there is one, I > think it is better to have some explanation in the documentation. > > > In code ATPrepAlterColumnType() calls typenameTypeIdAndMod(), whereas > > transformColumnDefinition() (called for ALTER TABLE ... ADD COLUMN and > > CREATE TABLE) handles "serial" data type separately. Looks like we are > > missing a call to transformColumnDefinition() in > > transformAlterTableStmt() under case AT_AlterColumnType. > > I tried your idea with the attachment; it is still in a draft state > but it can be used as a proof of concept and to aid further > discussion. Just one point to mention: serial implies a > "default value" + "not null" constraint. So when we modify a column to > serial, we need to replace any existing NULL values with the default value > at the table-rewrite stage. > I am surprised that this requires changes in ReWrite. I thought adding the NOT NULL constraint and default value commands would be done by transformColumnDefinition(). But I haven't looked at the patch closely enough. -- Best Wishes, Ashutosh Bapat
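For readers following along: "serial" is not a real type but shorthand that transformColumnDefinition() expands into an integer column, an owned sequence, a nextval() default, and a NOT NULL constraint. Below is a minimal Python sketch of the statements an ALTER COLUMN ... TYPE serial would effectively have to be rewritten into. The expansion follows PostgreSQL's documented serial behaviour, but the function and the exact statement ordering are illustrative, not the patch's actual code:

```python
def expand_serial(table: str, column: str) -> list[str]:
    """Return the SQL statements that 'serial' is shorthand for, per the
    PostgreSQL documentation on serial types.  Sequence naming follows
    the usual <table>_<column>_seq convention."""
    seq = f"{table}_{column}_seq"
    return [
        f"CREATE SEQUENCE {seq}",
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE integer",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL",
        f"ALTER TABLE {table} ALTER COLUMN {column} "
        f"SET DEFAULT nextval('{seq}')",
        f"ALTER SEQUENCE {seq} OWNED BY {table}.{column}",
    ]

stmts = expand_serial("t", "id")
```

The point made in the thread — that serial implies the default and the NOT NULL constraint, not merely a type change — corresponds to everything after the plain TYPE statement above.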
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
On Sun, Feb 18, 2024 at 10:55 PM Tomas Vondra wrote: > > Hi, > > I took a quick look at this patch today. I certainly agree with the > intent to reduce the amount of memory during planning, assuming it's not > overly disruptive. And I think this patch is fairly localized and looks > sensible. Thanks for looking at the patch-set. > > That being said I'm not a big fan of using a local variable on stack and > filling it. I'd probably go with the usual palloc/pfree, because that > makes it much easier to use - the callers would not be responsible for > allocating the SpecialJoinInfo struct. Sure, it's a little bit of > overhead, but with the AllocSet caching I doubt it's measurable. You are suggesting that instead of declaring a local variable of type SpecialJoinInfo, build_child_join_sjinfo() should palloc() and return SpecialJoinInfo which will be freed by free_child_sjinfo() (free_child_sjinfo_members in the patch). I am fine with that. > > I did put this through check-world on amd64/arm64, with valgrind, > without any issue. I also tried the scripts shared by Ashutosh in his > initial message (with some minor fixes, adding MEMORY to explain etc). > > The results with the 20240130 patches are like this: >
> tables   master   patched
> --------------------------
>      2     40.8      39.9
>      3    151.7     142.6
>      4    464.0     418.5
>      5   1663.9    1419.5
>
> That's certainly a nice improvement, and it even reduces the amount of > time for planning (the 5-table join goes from 18s to 17s on my laptop). > That's nice, although 17 seconds for planning is not ... great. > > That being said, the amount of remaining memory needed by planning is > still pretty high - we save ~240MB for a join of 5 tables, but we still > need ~1.4GB. Yes, this is a bit of an extreme example, and it probably is not > very common to join 5 tables with 1000 partitions each ... Yes. > Do we know what are the other places consuming the 1.4GB of memory? Please find the breakdown at [1]. 
Investigating further, I found that memory consumed by Bitmapsets is not explicitly visible through that breakdown. But Bitmapsets consume a lot of memory in this case. I shared Google Slides with raw numbers in [2]. The slides can be found at https://docs.google.com/presentation/d/1S9BiAADhX-Fv9tDbx5R5Izq4blAofhZMhHcO1c-wzfI/edit#slide=id.p. According to that measurement, Bitmapset objects created while planning a 5-way join between partitioned tables with 1000 partitions each consumed 769.795 MiB as compared to 19.728 KiB consumed by Bitmapsets when planning a 5-way non-partitioned join. See slide 3 for detailed numbers. > Considering my recent thread about scalability, where malloc() turned > out to be one of the main culprits, I wonder if maybe there's a lot to > gain by reducing the memory usage ... Our attitude to memory usage is > that it doesn't really matter if we keep it allocated for a bit, because > we'll free it shortly. And that may be true for "modest" memory usage, > but with 1.4GB that doesn't seem great, and the malloc overhead can be > pretty bad. > That probably explains why we see some modest speedups because of my memory saving patches. Thanks for sharing it. [1] https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com [2] https://www.postgresql.org/message-id/CAExHW5t0cPNjJRPRtt%3D%2Bc5SiTeBPHvx%3DSd2n8EO%2B7XdVuE8_YQ%40mail.gmail.com -- Best Wishes, Ashutosh Bapat
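The shape of the saving discussed in this subthread — freeing each transient child SpecialJoinInfo as soon as its join path is built, instead of letting thousands of them survive to the end of planning — can be illustrated with a self-contained Python sketch. The struct, its member sizes, and the partition counts below are invented stand-ins; PostgreSQL's palloc/pfree behaviour is only being mimicked, not reproduced:

```python
import tracemalloc

class ChildSJInfo:
    """Stand-in for a per-child SpecialJoinInfo: a small struct plus
    separately allocated member lists (relid sets and the like)."""
    def __init__(self, nrels: int):
        self.left_relids = list(range(nrels))
        self.right_relids = list(range(nrels))

def plan_children(nchildren: int, keep_all: bool) -> int:
    """Return peak traced memory while 'planning' nchildren child joins.
    keep_all=True mimics keeping every per-child struct until planning
    ends; keep_all=False mimics build-use-free, as the patch proposes."""
    tracemalloc.start()
    kept = []
    for _ in range(nchildren):
        sjinfo = ChildSJInfo(nrels=64)
        if keep_all:
            kept.append(sjinfo)
        # else: sjinfo is dropped at rebinding, i.e. freed immediately
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

peak_keep = plan_children(1000, keep_all=True)
peak_free = plan_children(1000, keep_all=False)
assert peak_free < peak_keep  # freeing per child keeps the peak low
```

With 1000 children the retained variant peaks orders of magnitude higher, which is the same qualitative effect as the MiB-vs-KiB Bitmapset numbers above, though the absolute numbers here mean nothing.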
Re: Synchronizing slots from primary to standby
On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila wrote: > > Few comments on 0001 Thanks for the feedback. > > 1. I think it is better to error out when the valid GUC or option is > not set in ensure_valid_slotsync_params() and > ensure_valid_remote_info() instead of waiting. And we shouldn't start > the worker in the first place if not all required GUCs are set. This > will additionally simplify the code a bit. Sure, removed 'ensure' functions. Moved the validation checks to the postmaster before starting the slot sync worker. > 2. > +typedef struct SlotSyncWorkerCtxStruct > { > - /* prevents concurrent slot syncs to avoid slot overwrites */ > + pid_t pid; > + bool stopSignaled; > bool syncing; > + time_t last_start_time; > slock_t mutex; > -} SlotSyncCtxStruct; > +} SlotSyncWorkerCtxStruct; > > I think we don't need to change the name of this struct as this can be > used both by the worker and the backend. We can probably add the > comment to indicate that all the fields except 'syncing' are used by > slotsync worker. Modified. > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't > be changed to SlotSyncWorkerShmemInit(). Modified. > 4. > +ReplSlotSyncWorkerMain(int argc, char *argv[]) > { > ... > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); > ... > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); > ... > } > > Do we need two separate callbacks? Can't we have just one (say > slotsync_failure_callback) that cleans additional things in case of > slotsync worker? And, if we need both those callbacks then please add > some comments for both and why one is before_shmem_exit() and the > other is on_shmem_exit(). I think we can merge these now. Earlier 'on_shmem_exit' was needed to avoid race-condition between startup and slot sync worker process to drop 'i' slots on promotion. Now we do not have any such scenario. But I need some time to analyze it well. Will do it in the next version. 
> In addition to the above, I have made a few changes in the comments > and code (cosmetic code changes). Please include those in the next > version if you find them okay. You need to rename .txt file to remove > .txt and apply atop v90-0001*. Sure, included these. Please find the patch001 attached. I will rebase the rest of the patches and post them tomorrow. thanks Shveta v91-0001-Add-a-new-slotsync-worker.patch Description: Binary data
Re: POC, WIP: OR-clause support for indexes
On Mon, Feb 19, 2024 at 05:35, Andrei Lepikhov <a.lepik...@postgrespro.ru> wrote: > On 16/2/2024 19:54, jian he wrote: > > After setting these parameters, overall enable_or_transformation ON > > performs better. > > sorry for the noise. > Don't worry, at least we know a weak point of partial paths estimation. > > so now I didn't find any corner case where enable_or_transformation is > > ON performs worse than when it's OFF. > > > > +typedef struct OrClauseGroupEntry > > +{ > > + OrClauseGroupKey key; > > + > > + Node *node; > > + List *consts; > > + Oid scalar_type; > > + List *exprs; > > +} OrClauseGroupEntry; > > > > I found that the field `scalar_type` was never used. > Thanks, fixed. > Not that it will make a big difference, but it would be good to avoid, I think. v17-0002 1) Move the vars *arrayconst and *dest to after the if, to avoid makeNode() (palloc): + Const *arrayconst; + ScalarArrayOpExpr *dest; + + pd = (PredicatesData *) lfirst(lc); + if (pd->elems == NIL) + /* The index doesn't participate in this operation */ + continue; + arrayconst = lsecond_node(Const, saop->args); + dest = makeNode(ScalarArrayOpExpr); best regards, Ranier Vilela
Re: Memory consumed by paths during partitionwise join planning
On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov wrote: > Live example: right now, I am working on the code like MSSQL has - a > combination of NestLoop and HashJoin paths and switching between them in > real-time. It requires both paths in the path list at the moment when > extensions are coming. Even if one of them isn't referenced from the > upper pathlist, it may still be helpful for the extension. There is no guarantee that every path presented to add_path will be preserved. Suboptimal paths are freed as and when add_path discovers that they are suboptimal. So I don't think an extension can rely on the existence of a path. But having a refcount makes it easy to preserve the required paths by referencing them. > > >> About partitioning. As I discovered planning issues connected to > >> partitions, the painful problem is a rule, according to which we are > >> trying to use all nomenclature of possible paths for each partition. > >> With indexes, it quickly increases optimization work. IMO, this can help > >> a 'symmetrical' approach, which could restrict the scope of possible > >> pathways for upcoming partitions if we filter some paths in a set of > >> previously planned partitions. > > > > filter or free? > Filter. > I meant that Postgres tries to apply IndexScan, BitmapScan, > IndexOnlyScan, and other strategies, iterating over the partition > indexes. The optimizer spends a lot of time doing that. So, why not > introduce a symmetrical strategy and exclude from the search some > types of scan, based on the path-building experience of previous > partitions of the same table: if you have dozens of partitions, is it > beneficial for the system to find a slightly more optimal IndexScan on one > partition while having SeqScans on 999 others? IIUC, you are suggesting that instead of planning each partition/partitionwise join, we only create paths with the strategies which were found to be optimal with previous partitions. 
That's a good heuristic but it won't work if partition properties - statistics, indexes etc. differ between groups of partitions. -- Best Wishes, Ashutosh Bapat
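The proposed heuristic, and the caveat about partitions whose properties differ, can be sketched in a few lines of Python. The strategy names and cost numbers are invented for illustration; the real planner's path generation is of course far more involved:

```python
STRATEGIES = ["SeqScan", "IndexScan", "BitmapScan", "IndexOnlyScan"]

def cheapest(costs: dict) -> str:
    return min(costs, key=costs.get)

def plan_partitions(per_partition_costs: list, symmetric: bool) -> list:
    """Pick a scan strategy per partition.  With symmetric=True,
    partitions after the first only consider strategies that already
    won on an earlier partition, skipping the rest of the search."""
    allowed = set(STRATEGIES)
    chosen = []
    for costs in per_partition_costs:
        if symmetric and chosen:
            candidates = {s: c for s, c in costs.items() if s in allowed}
        else:
            candidates = costs
        winner = cheapest(candidates)
        if symmetric:
            allowed = {winner}  # restrict the search for later partitions
        chosen.append(winner)
    return chosen

# 1000 similar partitions: SeqScan wins everywhere, and the symmetric
# search considers only SeqScan after the first partition.
uniform = [{"SeqScan": 10, "IndexScan": 50, "BitmapScan": 40,
            "IndexOnlyScan": 45}] * 1000
assert plan_partitions(uniform, symmetric=True) == ["SeqScan"] * 1000

# But if one partition has very different statistics (say, an index
# that would win there), the heuristic misses it -- the caveat above.
skewed = uniform[:1] + [{"SeqScan": 99, "IndexScan": 1}] + uniform[2:]
full = plan_partitions(skewed, symmetric=False)
fast = plan_partitions(skewed, symmetric=True)
assert full[1] == "IndexScan" and fast[1] == "SeqScan"
```

The two assertions at the end capture both sides of the exchange: a large planning-effort saving when partitions are alike, and a potentially suboptimal plan when one partition's statistics or indexes differ.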
Re: partitioning and identity column
On Thu, Feb 15, 2024 at 11:30 PM Alexander Lakhin wrote: > > Hello Ashutosh, > > 24.01.2024 09:34, Ashutosh Bapat wrote: > > > >>> There's another thing I found. The file isn't using > >>> check_stack_depth() in the function which traverse inheritance > >>> hierarchies. This isn't just a problem of the identity related > >>> function but most of the functions in that file. Do you think it's > >>> worth fixing it? > >> I suppose the number of inheritance levels is usually not a problem for > >> stack depth? > >> > > Practically it should not. I would rethink the application design if > > it requires so many inheritance or partition levels. But functions in > > optimizer like try_partitionwise_join() and set_append_rel_size() call > > > > /* Guard against stack overflow due to overly deep inheritance tree. */ > > check_stack_depth(); > > > > I am fine if we want to skip this. > > I've managed to reach stack overflow inside ATExecSetIdentity() with > the following script: > (echo "CREATE TABLE tp0 (a int PRIMARY KEY, > b int GENERATED ALWAYS AS IDENTITY) PARTITION BY RANGE (a);"; > for ((i=1;i<=8;i++)); do >echo "CREATE TABLE tp$i PARTITION OF tp$(( $i - 1 )) > FOR VALUES FROM ($i) TO (100) PARTITION BY RANGE (a);"; > done; > echo "ALTER TABLE tp0 ALTER COLUMN b SET GENERATED BY DEFAULT;") | psql > >psql.log > > (with max_locks_per_transaction = 400 in the config) > > It runs about 15 minutes for me and ends with: > Program terminated with signal SIGSEGV, Segmentation fault. 
> > #0 0x55a8ced20de9 in LWLockAcquire (lock=0x7faec200b900, > mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1169 > 1169{ > (gdb) bt > #0 0x55a8ced20de9 in LWLockAcquire (lock=0x7faec200b900, > mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1169 > #1 0x55a8cea0342d in WALInsertLockAcquire () at xlog.c:1389 > #2 XLogInsertRecord (rdata=0x55a8cf1ccee8 , > fpw_lsn=fpw_lsn@entry=1261347512, flags=0 '\000', > num_fpi=num_fpi@entry=0, topxid_included=false) at xlog.c:817 > #3 0x55a8cea1396e in XLogInsert (rmid=rmid@entry=11 '\v', > info=) at xloginsert.c:524 > #4 0x55a8ce9c1541 in _bt_insertonpg (rel=0x7faeb8478c98, > heaprel=0x7faecf63d378, > itup_key=itup_key@entry=0x55a8d5064678, buf=3210, cbuf=cbuf@entry=0, > stack=stack@entry=0x55a8d1063d08, > itup=0x55a8d5064658, itemsz=16, > newitemoff=, postingoff=0, split_only_page= out>) at nbtinsert.c:1389 > #5 0x55a8ce9bf9a7 in _bt_doinsert (rel=, > rel@entry=0x7faeb8478c98, itup=, > itup@entry=0x55a8d5064658, checkUnique=, > checkUnique@entry=UNIQUE_CHECK_YES, indexUnchanged=, > heapRel=, heapRel@entry=0x7faecf63d378) at nbtinsert.c:260 > #6 0x55a8ce9c92ad in btinsert (rel=0x7faeb8478c98, values= out>, isnull=, > ht_ctid=0x55a8d50643cc, heapRel=0x7faecf63d378, checkUnique=UNIQUE_CHECK_YES, > indexUnchanged=, > indexInfo=) at nbtree.c:205 > #7 0x55a8cea41391 in CatalogIndexInsert > (indstate=indstate@entry=0x55a8d0fc03e8, heapTuple=, > heapTuple@entry=0x55a8d50643c8, updateIndexes=) at > indexing.c:170 > #8 0x55a8cea4172c in CatalogTupleUpdate > (heapRel=heapRel@entry=0x7faecf63d378, otid=0x55a8d50643cc, > tup=tup@entry=0x55a8d50643c8) at indexing.c:324 > #9 0x55a8ceb18173 in ATExecSetIdentity (rel=0x7faeab1288a8, > colName=colName@entry=0x55a8d0fbc2b8 "b", > def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, > recursing=) at tablecmds.c:8307 > #10 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab127f28, > colName=colName@entry=0x55a8d0fbc2b8 "b", > def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, 
recurse=true, > recursing=) at tablecmds.c:8337 > #11 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab1275a8, > colName=colName@entry=0x55a8d0fbc2b8 "b", > def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, > recursing=) at tablecmds.c:8337 > #12 0x55a8ceb18251 in ATExecSetIdentity (rel=0x7faeab126c28, > colName=colName@entry=0x55a8d0fbc2b8 "b", > def=def@entry=0x55a8d1063918, lockmode=lockmode@entry=8, recurse=true, > recursing=) at tablecmds.c:8337 > ... > > Functions ATExecAddIdentity() and ATExecDropIdentity() are recursive too, > so I think they can be exploited as well. Not just identity-related functions; many other functions in tablecmds.c have that problem, as I mentioned earlier. -- Best Wishes, Ashutosh Bapat
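For reference, the guard being discussed — PostgreSQL's check_stack_depth(), which raises a clean "stack depth limit exceeded" error at the top of a recursive routine instead of letting the process overflow its stack and segfault — can be mimicked in a short sketch. The depth limit, error class, and tree walker below are illustrative stand-ins, not the tablecmds.c code:

```python
MAX_DEPTH = 200  # illustrative stand-in for the max_stack_depth budget

class StackDepthError(RuntimeError):
    pass

def check_stack_depth(depth: int) -> None:
    """Fail cleanly instead of overflowing, as PostgreSQL's
    check_stack_depth() does at the top of recursive routines."""
    if depth > MAX_DEPTH:
        raise StackDepthError("stack depth limit exceeded")

def set_identity_recurse(tree: dict, rel: str, depth: int = 0) -> int:
    """Walk a partition hierarchy the way ATExecSetIdentity() recurses
    into children; returns the number of relations visited."""
    check_stack_depth(depth)
    visited = 1
    for child in tree.get(rel, []):
        visited += set_identity_recurse(tree, child, depth + 1)
    return visited

# A linear chain of partitions, like the tp0 -> tp1 -> ... script above.
deep_chain = {f"tp{i}": [f"tp{i+1}"] for i in range(500)}
try:
    set_identity_recurse(deep_chain, "tp0")
    raise AssertionError("expected the guard to fire")
except StackDepthError:
    pass  # a clean error instead of a SIGSEGV
```

The missing piece in the ATExecSetIdentity() case is exactly the one-line check at the top of the recursive function.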
Re: Synchronizing slots from primary to standby
On Mon, Feb 19, 2024 at 9:46 AM shveta malik wrote: > > Okay I see. Thanks for pointing it out. Here are the patches > addressing your comments. Changes are in patch001, rest are rebased. > Few comments on 0001 1. I think it is better to error out when the valid GUC or option is not set in ensure_valid_slotsync_params() and ensure_valid_remote_info() instead of waiting. And we shouldn't start the worker in the first place if not all required GUCs are set. This will additionally simplify the code a bit. 2. +typedef struct SlotSyncWorkerCtxStruct { - /* prevents concurrent slot syncs to avoid slot overwrites */ + pid_t pid; + bool stopSignaled; bool syncing; + time_t last_start_time; slock_t mutex; -} SlotSyncCtxStruct; +} SlotSyncWorkerCtxStruct; I think we don't need to change the name of this struct as this can be used both by the worker and the backend. We can probably add the comment to indicate that all the fields except 'syncing' are used by slotsync worker. 3. Similar to above, function names like SlotSyncShmemInit() shouldn't be changed to SlotSyncWorkerShmemInit(). 4. +ReplSlotSyncWorkerMain(int argc, char *argv[]) { ... + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); ... + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); ... } Do we need two separate callbacks? Can't we have just one (say slotsync_failure_callback) that cleans additional things in case of slotsync worker? And, if we need both those callbacks then please add some comments for both and why one is before_shmem_exit() and the other is on_shmem_exit(). In addition to the above, I have made a few changes in the comments and code (cosmetic code changes). Please include those in the next version if you find them okay. You need to rename .txt file to remove .txt and apply atop v90-0001*. -- With Regards, Amit Kapila. 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 3080d4aa53..790799c164 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -463,9 +463,9 @@ static void InitPostmasterDeathWatchHandle(void); /* * Slot sync worker allowed to start up? * - * If we are on a hot standby, slot sync parameter is enabled, and it is - * the first time of worker's launch, or enough time has passed since the - * worker was launched last, then it is allowed to be started. + * We allow to start the slot sync worker when we are on a hot standby, + * sync_replication_slots is enabled, and it is the first time of worker's + * launch, or enough time has passed since the worker was launched last. */ #define SlotSyncWorkerAllowed()\ (sync_replication_slots && pmState == PM_HOT_STANDBY && \ diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index 40ab87bce1..0e0f3cdf76 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -1052,9 +1052,9 @@ ensure_valid_slotsync_params(void) /* * Re-read the config file. * - * If any of the slot sync GUCs have changed, exit the worker and - * let it get restarted by the postmaster. The worker to be exited for - * restart purpose only if the caller passed restart as true. + * Exit if any of the slot sync GUCs have changed. The postmaster will + * restart it. The worker to be exited for restart purpose only if the + * caller passed restart as true. */ static void slotsync_reread_config(bool restart) @@ -1295,9 +1295,9 @@ ReplSlotSyncWorkerMain(int argc, char *argv[]) dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); /* -* Connect to the database specified by user in primary_conninfo. We need -* a database connection for walrcv_exec to work. Please see comments atop -* libpqrcv_exec. +* Connect to the database specified by the user in primary_conninfo. 
We +* need a database connection for walrcv_exec to work which we use to fetch +* slot information from the remote node. See comments atop libpqrcv_exec. */ InitPostgres(dbname, InvalidOid, NULL, InvalidOid, 0, NULL); @@ -1310,12 +1310,11 @@ ReplSlotSyncWorkerMain(int argc, char *argv[]) appendStringInfo(&app_name, "%s", "slotsyncworker"); /* -* Establish the connection to the primary server for slots +* Establish the connection to the primary server for slot * synchronization. */ wrconn = walrcv_connect(PrimaryConnInfo, false, false, false, - app_name.data, - &err); + app_name.data, &err); pfree(app_name.data); if (!wrconn) @@ -1332,7 +1331,7 @@ ReplSlotSyncWorkerMain(int argc, char *argv[]) */ ensure_valid_remote_info(wrconn); - /* Main
Re: numeric_big in make check?
Daniel Gustafsson writes: > numeric_big has been left out of parallel_schedule, requiring EXTRA_TESTS to > run it, since it went in back in 1999 (AFAICT it was even the reason EXTRA_TESTS > was invented). The original commit states that it's huge, and it probably > was. > Today it runs faster than many tests we have in parallel_schedule, even on > slow > hardware like my ~5 year old laptop. Repeated runs in CI at various parallel > groups place it at 50% of the runtime of triggers.sql and 25% of brin.sql. > To make sure it's executed and doesn't silently break, is it time to add this to > the regular make check? Or we could just flush it. It's never detected a bug, and I think you'd find that it adds zero code coverage (or if not, we could fix that in a far more surgical and less expensive manner). regards, tom lane
Re: Add an option to skip loading missing publication to avoid logical replication failure
On Mon, 19 Feb 2024 at 12:48, vignesh C wrote: > > Hi, > > Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the > logical replication in certain cases. This can happen as the apply > worker will get restarted after SET PUBLICATION, the apply worker will > use the existing slot and replication origin corresponding to the > subscription. Now, it is possible that before restart the origin has > not been updated and the WAL start location points to a location prior > to where PUBLICATION pub exists which can lead to such an error. Once > this error occurs, apply worker will never be able to proceed and will > always return the same error. > > There was discussion on this and Amit had posted a patch to handle > this at [2]. Amit's patch does continue using a historic snapshot but > ignores publications that are not found for the purpose of computing > RelSyncEntry attributes. We won't mark such an entry as valid till all > the publications are loaded without anything missing. This means we > won't publish operations on tables corresponding to that publication > till we found such a publication and that seems okay. > I have added an option skip_not_exist_publication to enable this > operation only when skip_not_exist_publication is specified as true. > There is no change in default behavior when skip_not_exist_publication > is specified as false. I have updated the patch to now include changes for pg_dump, added few tests, describe changes and added documentation changes. The attached v2 version patch has the changes for the same. Regards, Vignesh From c0f97fc671db390239ca5cb4c224cb3d80a0e22e Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Mon, 19 Feb 2024 10:20:02 +0530 Subject: [PATCH v2 1/2] Skip loading the publication if the publication does not exist. Skip loading the publication if the publication does not exist. 
--- src/backend/replication/pgoutput/pgoutput.c | 28 +++-- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index 998f92d671..f7b6d0384d 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -82,7 +82,7 @@ static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx, static bool publications_valid; -static List *LoadPublications(List *pubnames); +static List *LoadPublications(List *pubnames, bool *skipped); static void publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue); static void send_relation_and_attrs(Relation relation, TransactionId xid, @@ -1703,9 +1703,13 @@ pgoutput_shutdown(LogicalDecodingContext *ctx) /* * Load publications from the list of publication names. + * + * Here, we just skip the publications that don't exist yet. 'skipped' + * will be true if we find any publication from the given list that doesn't + * exist. 
*/ static List * -LoadPublications(List *pubnames) +LoadPublications(List *pubnames, bool *skipped) { List *result = NIL; ListCell *lc; @@ -1713,9 +1717,12 @@ LoadPublications(List *pubnames) foreach(lc, pubnames) { char *pubname = (char *) lfirst(lc); - Publication *pub = GetPublicationByName(pubname, false); + Publication *pub = GetPublicationByName(pubname, true); - result = lappend(result, pub); + if (pub) + result = lappend(result, pub); + else + *skipped = true; } return result; @@ -1994,7 +2001,7 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) } /* Validate the entry */ - if (!entry->replicate_valid) + if (!entry->replicate_valid || !publications_valid) { Oid schemaId = get_rel_namespace(relid); List *pubids = GetRelationPublications(relid); @@ -2011,6 +2018,7 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) bool am_partition = get_rel_relispartition(relid); char relkind = get_rel_relkind(relid); List *rel_publications = NIL; + bool skipped_pub = false; /* Reload publications if needed before use. */ if (!publications_valid) @@ -2021,9 +2029,15 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) list_free_deep(data->publications); data->publications = NIL; } - data->publications = LoadPublications(data->publication_names); + data->publications = LoadPublications(data->publication_names, &skipped_pub); MemoryContextSwitchTo(oldctx); - publications_valid = true; + + /* + * We don't consider the publications to be valid till we have + * information of all the publications. + */ + if (!skipped_pub) +publications_valid = true; } /* -- 2.34.1 From e9ff5304b2a3a0fccfb9084ff076a0465f28edbe Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Mon, 19 Feb 2024 10:23:05 +0530 Subject: [PATCH v2 2/2] Added an option skip_not_exist_publication which will skip loading the publication, if the publication does not exist. Added an option skip_not_exist_publication which will skip
Re: table inheritance versus column compression and storage settings
On Fri, Feb 16, 2024 at 11:54 PM Tom Lane wrote: > > I wrote: > > I find it surprising that the committed patch does not touch > > pg_dump. Is it really true that pg_dump dumps situations with > > differing compression/storage settings accurately already? > > It's worse than I thought. Run "make installcheck" with > today's HEAD, then: > > $ pg_dump -Fc regression >r.dump > $ createdb r2 > $ pg_restore -d r2 r.dump > pg_restore: error: could not execute query: ERROR: column "a" inherits > conflicting storage methods > HINT: To resolve the conflict, specify a storage method explicitly. > Command was: CREATE TABLE public.stchild4 ( > a text > ) > INHERITS (public.stparent1, public.stparent2); > ALTER TABLE ONLY public.stchild4 ALTER COLUMN a SET STORAGE MAIN; > > > pg_restore: error: could not execute query: ERROR: relation > "public.stchild4" does not exist > Command was: ALTER TABLE public.stchild4 OWNER TO postgres; > > pg_restore: error: could not execute query: ERROR: relation > "public.stchild4" does not exist > Command was: COPY public.stchild4 (a) FROM stdin; > pg_restore: warning: errors ignored on restore: 3 Thanks for the test. Let's call this Problem1. I expected src/bin/pg_upgrade/t/002_pg_upgrade.pl to fail in this case since it will execute similar steps as you did. And it actually does, except that it uses binary-upgrade mode. In that mode, INHERITed tables are dumped in a different manner -- For binary upgrade, set up inheritance this way. ALTER TABLE ONLY "public"."stchild4" INHERIT "public"."stparent1"; ALTER TABLE ONLY "public"."stchild4" INHERIT "public"."stparent2"; ... snip ... ALTER TABLE ONLY "public"."stchild4" ALTER COLUMN "a" SET STORAGE MAIN; that does not lead to the conflict and pg_upgrade does not fail. 
> > > What I'd intended to compare was the results of the query added to the > regression tests: > > regression=# SELECT attrelid::regclass, attname, attstorage FROM pg_attribute > WHERE (attrelid::regclass::name like 'stparent%' > OR attrelid::regclass::name like 'stchild%') > and attname = 'a' > ORDER BY 1, 2; > attrelid | attname | attstorage > ---+-+ > stparent1 | a | p > stparent2 | a | x > stchild1 | a | p > stchild3 | a | m > stchild4 | a | m > stchild5 | a | x > stchild6 | a | m > (7 rows) > > r2=# SELECT attrelid::regclass, attname, attstorage FROM pg_attribute > WHERE (attrelid::regclass::name like 'stparent%' > OR attrelid::regclass::name like 'stchild%') > and attname = 'a' > ORDER BY 1, 2; > attrelid | attname | attstorage > ---+-+ > stparent1 | a | p > stchild1 | a | p > stchild3 | a | m > stparent2 | a | x > stchild5 | a | p > stchild6 | a | m > (6 rows) > > So not only does stchild4 fail to restore altogether, but stchild5 > ends with the wrong attstorage. With binary-upgrade dump and restore stchild5 gets the correct storage value. Looks like we need a test which pg_dump s regression database and restores it without going through pg_upgrade. I think the fix is easy one. Dump the STORAGE and COMPRESSION clauses with CREATE TABLE for local attributes. Those for inherited attributes will be dumped separately. But that will not fix an existing problem described below. Let's call it Problem2. 
With HEAD at commit 57f59396bb51953bb7b957780c7f1b7f67602125 (almost a month back) $ createdb regression $ psql -d regression #create table par1 (a text storage plain); #create table par2 (a text storage plain); #create table chld (a text) inherits (par1, par2); NOTICE: merging multiple inherited definitions of column "a" NOTICE: merging column "a" with inherited definition -- parent storages conflict after child creation #alter table par1 alter column a set storage extended; #SELECT attrelid::regclass, attname, attstorage FROM pg_attribute WHERE (attrelid::regclass::name like 'par%' OR attrelid::regclass::name like 'chld%') and attname = 'a' ORDER BY 1, 2; attrelid | attname | attstorage --+-+ par1 | a | x par2 | a | p chld | a | x (3 rows) $ createdb r2 $ pg_dump -Fc regression > /tmp/r.dump $ pg_restore -d r2 /tmp/r.dump pg_restore: error: could not execute query: ERROR: inherited column "a" has a storage parameter conflict DETAIL: EXTENDED versus PLAIN Command was: CREATE TABLE public.chld ( a text ) INHERITS (public.par1, public.par2); pg_restore: error: could not execute query: ERROR: relation "public.chld" does not exist Command was: ALTER TABLE public.chld OWNER TO ashutosh; pg_restore: error: could not execute query: ERROR: relation "public.chld" does not exist Command was: COPY public.chld (a) FROM stdin; pg_restore: warning: errors ignored on restore: 3 Fixing this requires that we dump ALTER TABLE ... ALTER COLUMN SET STORAGE and COMPRESSION commands after all the tables (at least children) have been created.
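The fix described above amounts to a two-phase dump ordering: emit every CREATE TABLE first, and defer all ALTER TABLE ... SET STORAGE (and SET COMPRESSION) commands until the whole inheritance tree exists, so a child's CREATE can no longer trip over a storage conflict introduced on its parents after the child was created. A hedged Python sketch of that ordering follows; the table representation and the emitted SQL are illustrative, not pg_dump's actual code:

```python
def dump_tables(tables: list) -> list:
    """Emit CREATE TABLE statements first, then all SET STORAGE
    commands, so a child's CREATE is never evaluated against storage
    settings its parents acquired only later."""
    creates, alters = [], []
    for t in tables:
        inh = f" INHERITS ({', '.join(t['inherits'])})" if t["inherits"] else ""
        creates.append(f"CREATE TABLE {t['name']} (a text){inh};")
        for col, storage in t.get("storage", {}).items():
            alters.append(
                f"ALTER TABLE {t['name']} ALTER COLUMN {col} "
                f"SET STORAGE {storage};")
    return creates + alters

# The par1/par2/chld reproducer from above: chld is created while both
# parents still have default storage, and the conflicting settings are
# applied only afterwards -- mirroring the order that worked originally.
schema = [
    {"name": "par1", "inherits": [], "storage": {"a": "EXTENDED"}},
    {"name": "par2", "inherits": [], "storage": {"a": "PLAIN"}},
    {"name": "chld", "inherits": ["par1", "par2"], "storage": {}},
]
stmts = dump_tables(schema)
last_create = max(i for i, s in enumerate(stmts) if s.startswith("CREATE"))
first_alter = min(i for i, s in enumerate(stmts) if s.startswith("ALTER"))
assert last_create < first_alter  # every CREATE precedes every ALTER
```

This mirrors what binary-upgrade mode already achieves by emitting INHERIT and SET STORAGE as separate ALTER commands after table creation.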
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada wrote: > > I think that vacuum and tidbitmap (and future users) would end up > having the same max block size calculation. And it seems slightly odd > layering to me that max-block-size-specified context is created on > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by > tidstore layer, and the passed context is used for leaves (if > varlen-value is used) on radix tree layer. That sounds slightly more complicated than I was thinking of, but we could actually be talking about the same thing: I'm drawing a distinction between "used = must be detected / #ifdef'd" and "used = actually happens to call allocation". I meant that the passed context would _always_ be used for leaves, regardless of varlen or not. So with fixed-length values short enough to live in child pointer slots, that context would still be used for iteration etc. > Another idea is to create a > max-block-size-specified context on the tidstore layer. That is, > vacuum and tidbitmap pass a work_mem and a flag indicating whether the > tidstore can use the bump context, and tidstore creates a (aset of > bump) memory context with the calculated max block size and passes it > to the radix tree. That might be a better abstraction since both uses have some memory limit. > As for using the bump memory context, I feel that we need to store > iterator struct in aset context at least as it can be individually > freed and re-created. Or it might not be necessary to allocate the > iterator struct in the same context as radix tree. Okay, that's one thing I was concerned about. Since we don't actually have a bump context yet, it seems simple to assume aset for non-nodes, and if we do get it, we can adjust slightly. Anyway, this seems like a good thing to try to clean up, but it's also not a show-stopper. 
On that note: I will be going on honeymoon shortly, and then to PGConf India, so I will have sporadic connectivity for the next 10 days and won't be doing any hacking during that time. Andres, did you want to take a look at the radix tree patch 0003? Aside from the above possible cleanup, most of it should be stable.
Re: Switching XLog source from archive to streaming when primary available
On Wed, Jan 31, 2024 at 6:30 PM Bharath Rupireddy wrote:
>
> Needed a rebase due to commit 776621a (conflict in
> src/test/recovery/meson.build for new TAP test file added). Please
> find the attached v17 patch.

Strengthened the tests a bit by using recovery_min_apply_delay to mimic the standby spending some time fetching from the archive. PSA v18 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

v18-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
Description: Binary data
Re: speed up a logical replica setup
Hi,

I have reviewed the v21 patch and found an issue.

Initially I started the standby server with a new postgresql.conf file (not the default postgresql.conf present in the instance):

pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf"

I set 'max_replication_slots = 1' in the new postgresql.conf and 'max_replication_slots = 0' in the default postgresql.conf file. Now when we run pg_createsubscriber on the standby we get an error:

pg_createsubscriber: error: could not set replication progress for the subscription "pg_createsubscriber_5_242843": ERROR:  cannot query or manipulate replication origin when max_replication_slots = 0
NOTICE:  dropped replication slot "pg_createsubscriber_5_242843" on publisher
pg_createsubscriber: error: could not drop publication "pg_createsubscriber_5" on database "postgres": ERROR:  publication "pg_createsubscriber_5" does not exist
pg_createsubscriber: error: could not drop replication slot "pg_createsubscriber_5_242843" on database "postgres": ERROR:  replication slot "pg_createsubscriber_5_242843" does not exist

I observed that when we run the pg_createsubscriber command, it stops the standby instance (running with the non-default configuration) and restarts it, now with the default postgresql.conf, where max_replication_slots = 0, so pg_createsubscriber fails with the error given above. I have added the script file with which we can reproduce this issue. Similar issues can happen with other configuration settings such as port, etc.

The possible solutions would be:
1) Allow running pg_createsubscriber only if the standby is initially stopped. I observed that pg_logical_createsubscriber also uses this approach.
2) Read the GUCs via the SHOW command and restore them when the server restarts.

I would prefer the 1st solution.
Thanks and Regards,
Shlok Kyal

sudo pkill -9 postgres
rm -rf ../primary ../standby ../new_path
rm -rf primary.log standby.log
./initdb -D ../primary
cat << EOF >> ../primary/postgresql.conf
wal_level = 'logical'
EOF
./pg_ctl -D ../primary -l primary.log start
./psql -d postgres -c "CREATE table t1 (c1 int);"
./psql -d postgres -c "Insert into t1 values(1);"
./psql -d postgres -c "Insert into t1 values(2);"
./psql -d postgres -c "INSERT into t1 values(3);"
./psql -d postgres -c "INSERT into t1 values(4);"
./pg_basebackup -h localhost -X stream -v -W -R -D ../standby/

# postgresql.conf file in new location
mkdir ../new_path
cat << EOF >> ../new_path/postgresql.conf
max_replication_slots = 1
port = 9000
EOF

# postgresql.conf file in default location
cat << EOF >> ../standby/postgresql.conf
max_replication_slots = 0
port = 9000
EOF

./pg_ctl -D ../standby -l standby.log start -o "-c config_file=../new_path/postgresql.conf"
./pg_createsubscriber -D ../standby -S 'host=localhost port=9000 dbname=postgres' -P 'host=localhost port=5432 dbname=postgres' -d postgres -r
Re: Transaction timeout
On Mon, 19 Feb 2024 at 17:14, Andrey M. Borodin wrote:
>> On 18 Feb 2024, at 22:16, Andrey M. Borodin wrote:
>>
>> But it seems a little strange that session 3 did not fail at all
>
> It was only coincidence. Any test that verifies FATALing out in 100ms can
> fail, see new failure here [0].
>
> In a nearby thread Michael is proposing injection points that can wait and
> be awakened. So I propose the following course of action:
> 1. Remove all tests that involve a pg_stat_activity check of a FATALed session
> (any permutation with the checker_sleep step)
> 2. Add idle_in_transaction_session_timeout, statement_timeout and
> transaction_timeout tests when the injection points feature gets committed.

+1

> Alexander, what do you think?
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Hi,

On Mon, Feb 19, 2024 at 01:45:16PM +0530, Bharath Rupireddy wrote:
> On Mon, Feb 19, 2024 at 11:44 AM Michael Paquier wrote:
> >
> > > Yeah, comments added in v3.
> >
> > The contents look rather OK, I may do some word-smithing for both.
>
> Here are some comments on v3:

Thanks for looking at it!

> 1.
> +XLogRecPtr	initial_effective_xmin = InvalidXLogRecPtr;
> +XLogRecPtr	initial_catalog_effective_xmin = InvalidXLogRecPtr;
> +XLogRecPtr	initial_restart_lsn = InvalidXLogRecPtr;
>
> Prefix 'initial_' makes the variable names a bit longer, I think we
> can just use effective_xmin, catalog_effective_xmin and restart_lsn;
> the code updating them only when if (!terminated) tells one that they
> aren't updated every time.

I'm not sure about that. I prefer to see meaningful names instead of having to read the code where they are used.

> 2.
> +	/*
> +	 * We'll release the slot's mutex soon, so it's possible that
> +	 * those values change since the process holding the slot has been
> +	 * terminated (if any), so record them here to ensure we would
> +	 * report the slot as obsolete correctly.
> +	 */
>
> This needs a bit more info as to why and how effective_xmin,
> catalog_effective_xmin and restart_lsn can move ahead after signaling
> a backend and before the signalled backend reports back.

I'm not sure of the added value of such extra details in this comment, or whether the comment would be easy to maintain. I have the feeling that knowing it's possible is enough here. Happy to hear what others think about it too.

> 3.
> +	/*
> +	 * Assert that the conflict cause recorded previously before we
> +	 * terminate the process did not change now for any reason.
> +	 */
> +	Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
> +			 conflict_prev != conflict));
>
> It took a while for me to understand the above condition, can we
> simplify it like below using De Morgan's laws for better readability?
> > Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
> >        (conflict_prev == conflict));

I don't have a strong opinion on this; it looks like a matter of taste.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: speed up a logical replica setup
Some review of the v21 patch:

- commit message

Mention pg_createsubscriber in the commit message title. That's the most important thing that someone doing git log searches in the future will be looking for.

- doc/src/sgml/ref/allfiles.sgml

Move the new entry to alphabetical order.

- doc/src/sgml/ref/pg_createsubscriber.sgml

+   The pg_createsubscriber should be run at the target
+   server. The source server (known as publisher server) should accept logical
+   replication connections from the target server (known as subscriber server).
+   The target server should accept local logical replication connection.

"should" -> "must" ?

+  Options

Sort the options alphabetically. It would be good to indicate somewhere which options are mandatory.

+  Examples

I suggest including a pg_basebackup call into this example, so it's easier for readers to get the context of how this is supposed to be used. You can add that pg_basebackup in this example is just an example and that other base backups can also be used.

- doc/src/sgml/reference.sgml

Move the new entry to alphabetical order.

- src/bin/pg_basebackup/Makefile

Move the new sections to alphabetical order.

- src/bin/pg_basebackup/meson.build

Move the new sections to alphabetical order.

- src/bin/pg_basebackup/pg_createsubscriber.c

+typedef struct CreateSubscriberOptions
+typedef struct LogicalRepInfo

I think these kinds of local-use structs don't need to be typedef'ed. (Then you also don't need to update typedefs.list.)

+static void
+usage(void)

Sort the options alphabetically.

+static char *
+get_exec_path(const char *argv0, const char *progname)

Can this not use find_my_exec() and find_other_exec()?

+int
+main(int argc, char **argv)

Sort the options alphabetically (long_options struct, getopt_long() argument, switch cases).

- .../t/040_pg_createsubscriber.pl
- .../t/041_pg_createsubscriber_standby.pl

These two files could be combined into one.

+# Force it to initialize a new cluster instead of copying a
+# previously initdb'd cluster.

Explain why?

+$node_s->append_conf(
+	'postgresql.conf', qq[
+log_min_messages = debug2

Is this setting necessary for the test?
numeric_big in make check?
numeric_big has been left out of parallel_schedule, requiring EXTRA_TESTS to run it, since it went in back in 1999 (AFAICT it was even the reason EXTRA_TESTS was invented). The original commit states that it's huge, and it probably was. Today it runs faster than many tests we have in parallel_schedule, even on slow hardware like my ~5 year old laptop. Repeated runs in CI at various parallel groups place it at 50% of the runtime of triggers.sql and 25% of brin.sql.

To make sure it's executed and doesn't silently break, is it time to add it to the regular make check?

--
Daniel Gustafsson
Re: Add last_commit_lsn to pg_stat_database
On 2/19/24 07:57, Michael Paquier wrote:
> On Sun, Feb 18, 2024 at 02:28:06AM +0100, Tomas Vondra wrote:
>> Thanks for the updated patch. I don't have a clear opinion on the
>> feature and whether this is the way to implement it, but I have two
>> simple questions.
>
> Some users I know of would be really happy to be able to get this
> information for each database in a single view, so that feels natural
> to plug the information into pg_stat_database.

OK

>> 1) Do we really need to modify RecordTransactionCommitPrepared and
>> XactLogCommitRecord to return the LSN of the commit record? Can't we
>> just use XactLastRecEnd? It's good enough for advancing
>> replorigin_session_origin_lsn, why wouldn't it be good enough for this?
>
> Or XactLastCommitEnd?

But that's not set in RecordTransactionCommitPrepared (or twophase.c in general), and the patch seems to need that.

> The changes in AtEOXact_PgStat() are not really
> attractive for what's a static variable in all the backends.

I'm sorry, I'm not sure what "changes not being attractive" means :-(

>> 2) Earlier in this thread it was claimed the function returning the
>> last_commit_lsn is STABLE, but I have to admit it's not clear to me why
>> that's the case :-( All the pg_stat_get_db_* functions are marked as
>> stable, so I guess it's thanks to the pgstat system ...
>
> The consistency of the shared stats data depends on
> stats_fetch_consistency. The default of "cache" makes sure that the
> values don't change across a scan, until the end of a transaction.
>
> I have not paid much attention about that until now, but note that it
> would not be the case of "none" where the data is retrieved each time
> it is requested. So it seems to me that these functions should be
> actually volatile, not stable, because they could deliver different
> values across the same scan as an effect of the design behind
> pgstat_fetch_entry() and a non-default stats_fetch_consistency. I may
> be missing something, of course.

Right.
If I do this:

create or replace function get_db_lsn(oid) returns pg_lsn as $$
declare
  v_lsn pg_lsn;
begin
  select last_commit_lsn into v_lsn from pg_stat_database where datid = $1;
  return v_lsn;
end;
$$ language plpgsql;

select min(l), max(l)
  from (select i, get_db_lsn(16384) AS l
          from generate_series(1,10) s(i)) foo;

and run this concurrently with a pgbench on the same database (with OID 16384), I get:

- stats_fetch_consistency=cache: always the same min/max value
- stats_fetch_consistency=none: min != max

Which would suggest you're right and this should be VOLATILE and not STABLE. But then again, the existing pg_stat_get_db_* functions are all marked as STABLE, so (a) is that correct and (b) why should this function be marked differently?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Transaction timeout
> On 18 Feb 2024, at 22:16, Andrey M. Borodin wrote:
>
> But it seems a little strange that session 3 did not fail at all

It was only coincidence. Any test that verifies FATALing out in 100ms can fail, see new failure here [0].

In a nearby thread Michael is proposing injection points that can wait and be awakened. So I propose the following course of action:
1. Remove all tests that involve a pg_stat_activity check of a FATALed session (any permutation with the checker_sleep step)
2. Add idle_in_transaction_session_timeout, statement_timeout and transaction_timeout tests when the injection points feature gets committed.

Alexander, what do you think?

Best regards, Andrey Borodin.

[0] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-02-18%2022%3A23%3A45
Re: System username in pg_stat_activity
Hi,

On Fri, Feb 16, 2024 at 09:41:41PM +0100, Magnus Hagander wrote:
> On Fri, Feb 16, 2024 at 8:55 PM Andres Freund wrote:
> >
> > Hi,
> >
> > On 2024-01-12 17:16:53 +0100, Magnus Hagander wrote:
> > > On Thu, Jan 11, 2024 at 5:55 PM Bertrand Drouvot wrote:
> > > > On Thu, Jan 11, 2024 at 02:24:58PM +0100, Magnus Hagander wrote:
> > > > > On Wed, Jan 10, 2024 at 3:12 PM Bertrand Drouvot wrote:
> > > > > >
> > > > > > If we go the 2 fields way, then what about auth_identity and
> > > > > > auth_method then?
> > > > >
> > > > > Here is an updated patch based on this idea.
> > > >
> > > > Thanks!
> > > >
> > > > +      auth_method text
> > > > +
> > > > +      The authentication method used for authenticating the
> > > > +      connection, or NULL for background processes.
> > > >
> > > > I'm wondering if it would make sense to populate it for parallel
> > > > workers too. I think it's doable thanks to d951052, but I'm not sure
> > > > it's worth it (one could join based on the leader_pid though). OTOH
> > > > that would be consistent with how the SYSTEM_USER behaves with
> > > > parallel workers (it's populated).
> > >
> > > I guess one could conceptually argue that "authentication happens in
> > > the leader". But we do populate it with the other user records, and
> > > it'd be weird if this one was excluded.
> > >
> > > The tricky thing is that pgstat_bestart() is called long before we
> > > deserialize the data. But from what I can tell it should be safe to
> > > change it per the attached? That should be AFAICT an extremely short
> > > window of time longer before we report it, not enough to matter.
> >
> > I don't like that one bit. The whole subsystem initialization dance
> > already is quite complicated, particularly for pgstat; we shouldn't
> > make it more complicated. Especially not when the initialization is
> > moved quite a bit away from all the other calls.
> >
> > Besides just that, I also don't think delaying visibility of the worker
> > in pg_stat_activity until parallel worker initialization has completed
> > is good, that's not all cheap work.
> >
> > Maybe I am missing something, but why aren't we just getting the value
> > from the leader's entry, instead of copying it?

Good point!

> The answer to that would be "because I didn't think of it" :)

I'm in the same boat ;-)

> Were you thinking of something like the attached?

Doing it that way looks good to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com