Re: Synchronizing slots from primary to standby
On Tue, Oct 31, 2023 at 7:16 AM Hayato Kuroda (Fujitsu) wrote:
>
> > > 2.
> > > Currently ctx->failover is set only in the pgoutput_startup(), but not sure it is OK.
> > > Can we change the parameter in CreateDecodingContext() or similar functions?
> > >
> > > Because IIUC it means that only slots which have pgoutput can wait. Other
> > > output plugins must understand the change and set the failover flag as well -
> > > I felt it is not good. E.g., you might forget to enable the parameter in
> > > test_decoding.
> > >
> > > Regarding the two_phase parameter, setting it on the plugin layer is good
> > > because it quite affects the output. As for failover, it is not related with
> > > the content, so all slots should be enabled.
> > >
> > > I think CreateDecodingContext() or StartupDecodingContext() is the common path.
> > > Or, is it out of scope for now?
> >
> > Currently, the failover field is part of the options list in the
> > StartReplicationCmd. This gives some level of flexibility such that only
> > plugins that are interested in this need to handle it. The options list
> > is only deparsed by plugins. If we move it outside of the options list,
> > this sort of changes the protocol for START_REPLICATION and will
> > impact all plugins.
> > But I agree with your larger point that we need to do it in such a way that
> > other plugins do not unintentionally change the 'failover' behaviour
> > of the originally created slot.
> > Maybe I can code it in such a way that, only if the failover option is
> > specified in the list of options passed as part of START_REPLICATION,
> > will it change the originally created slot's 'failover' flag, by adding
> > another flag "failover_opt_given". Plugins that set this will be able
> > to change the failover flag of the slot, while plugins that do not
> > support this will not set it, and the failover flag of the created slot
> > will remain unchanged.
> > What do you think?
>
> May be OK, but I came up with a corner case: what if an external plugin
> already has a streaming option named 'failover'? What should happen then?
> Has the option name been reserved?

Sorry, your question is not clear to me. Did you intend to say that the value
of the existing streaming option could be 'failover'?

--
With Regards,
Amit Kapila.
Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)
On Fri, Sep 8, 2023 at 3:22 PM Ashutosh Bapat wrote:
>
> On Fri, Jul 28, 2023 at 3:16 PM Ashutosh Bapat wrote:
> >
> > Hi Tom, Richard,
> >
> > On Mon, Jul 24, 2023 at 8:17 AM Richard Guo wrote:
> > >
> > > Thanks for pushing it!
> >
> > With this fix, I saw a noticeable increase in the memory consumption of
> > the planner. I was running the experiments mentioned in [1]. The reason is
> > that the Bitmapsets created by bms_union() are not freed during planning,
> > and when there are thousands of child joins involved, there are a large
> > number of such bitmapsets and they take up a large amount of memory.
> >
> > Attached 0002 patch fixes the memory consumption by calculating
> > appinfos only once and using them twice. The numbers look like below:
> >
> > Number of tables joined | without patch | with patch
> > ------------------------+---------------+-----------
> >                       2 |      40.8 MiB |   40.3 MiB
> >                       3 |     151.6 MiB |  146.9 MiB
> >                       4 |     463.9 MiB |  445.5 MiB
> >                       5 |    1663.9 MiB | 1563.3 MiB
> >
> > The memory consumption is prominent at a higher number of joins, as that
> > exponentially increases the number of child joins.
> >
> > Attached setup.sql and queries.sql and patch 0001 were used to measure
> > memory, similar to [1].
> >
> > I don't think it's a problem with the patch itself. We should be able
> > to use Bitmapset APIs similar to what the patch is doing. But there's a
> > problem with our Bitmapset implementation. It's not space efficient
> > for thousands of partitions. A quick calculation reveals this.
> >
> > If the number of partitions is 1000, the matching partitions will
> > usually be 1000, 2000, 3000 and so on. Thus the bitmapset representing
> > the relids will be {b 1000, 2000, 3000, ...}. To represent a single
> > 1000th bit, the current Bitmapset implementation will allocate 1000/8 =
> > 125 bytes of memory. A 5-way join will require 125 * 5 = 625 bytes of
> > memory. This is even true for lower relid numbers, since they will be
> > 1000 bits away, e.g. (1, 1001, 2001, 3001, ...).
> > So 1000 such child joins require 625 KiB of memory. Doing this as many
> > times as the number of joins, we get 120 * 625 KiB = 75 MiB, which is
> > close to the memory difference we see above.
> >
> > Even if we allocate a list to hold 5 integers, it will not take 625
> > bytes. I think we need to improve our Bitmapset representation to be
> > efficient in such cases. Of course, whatever representation we choose
> > has to be space efficient for a small number of tables as well and
> > should gel well with our planner logic. So I guess some kind of
> > dynamic representation which changes the underlying layout based on
> > the contents is required. I have looked up past hackers threads to see
> > if this has been discussed previously.
> >
> > I don't think this is the thread to discuss it, and also I am not
> > planning to work on it in the immediate future. But I thought I would
> > mention it where I found it.
> >
> > [1] https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com
>
> Adding this small patch to the commitfest in case somebody finds it
> worth fixing this specific memory consumption.

With a new subject. Rebased patches. 0001 is the same as the squashed version
of the patches at
https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1csrjcmyyj8pd...@mail.gmail.com.
0002 is the actual fix described earlier.

--
Best Wishes,
Ashutosh Bapat

From e46e9cfe528ecbb04b17c21bb79c55cb8e23289b Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Wed, 26 Jul 2023 12:08:55 +0530
Subject: [PATCH 2/2] Reuse child_relids bitmapset in partitionwise joinrels

Bitmapsets containing child relids consume large memory when thousands of
partitions are involved. Create them once, use them multiple times and free
them up once their use is over.
Ashutosh Bapat --- src/backend/optimizer/path/joinrels.c | 10 ++ src/backend/optimizer/util/relnode.c | 18 +++--- src/include/optimizer/pathnode.h | 3 ++- 3 files changed, 11 insertions(+), 20 deletions(-) diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c index 015a0b3cbe..e4835c10fc 100644 --- a/src/backend/optimizer/path/joinrels.c +++ b/src/backend/optimizer/path/joinrels.c @@ -1545,6 +1545,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, RelOptInfo *child_joinrel; AppendRelInfo **appinfos; int nappinfos; + Bitmapset *child_relids = NULL; if (joinrel->partbounds_merged) { @@ -1640,9 +1641,8 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, child_rel2->relids); /* Find the AppendRelInfo structures */ - appinfos = find_appinfos_by_relids(root, - bms_union(child_rel1->relids, - child_rel2->relids), + child_relids =
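The back-of-the-envelope estimate in the message above (125 bytes per sparse relid, 625 bytes for a 5-way child join, ~75 MB across ~120 joinrels) can be reproduced with a short script. This is a sketch of the byte-per-8-bits accounting used in the mail, not of PostgreSQL's actual Bitmapset allocator (which rounds up to whole bitmapwords):

```python
def bms_bytes(members):
    """Approximate bytes to hold a bitmap whose highest set bit is
    max(members), at 1 byte per 8 bits, as in the mail's estimate."""
    return max(members) // 8

# A 5-way child join over 1000-partition tables: relids ~1000 apart.
child_join = [1000, 2000, 3000, 4000, 5000]
per_join = bms_bytes(child_join)   # 625 bytes, matching 125 * 5
per_level = 1000 * per_join        # 1000 such child joins: 625,000 bytes
total = 120 * per_level            # ~120 joinrels: 75,000,000 bytes (~75 MB)

# A flat array of the same five relids would be tiny by comparison.
flat = 5 * 4                       # five 4-byte integers: 20 bytes

print(per_join, per_level, total, flat)
```

This is why the thread suggests a dynamic representation: the cost tracks the highest bit set, not the number of members.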
Re: A performance issue with Memoize
On 30/10/2023 14:55, Richard Guo wrote:
> On Thu, Oct 26, 2023 at 12:07 PM Andrei Lepikhov <a.lepik...@postgrespro.ru> wrote:
> > Have you thought about the case fixed with commit 1db5667? As I see,
> > that bugfix still isn't covered by regression tests. Could your
> > approach of reusing a PARAM_EXEC slot break that case?
>
> Hm, I don't think so. The issue fixed by commit 1db5667 was caused by
> sharing PARAM_EXEC slots between different levels of NestLoop. AFAICS
> it's safe to share PARAM_EXEC slots within the same level of NestLoop.
> The change here is about sharing PARAM_EXEC slots between a subquery's
> subplan_params and outer-relation variables, which happens within the
> same level of NestLoop.
> ...
> Did you notice a case that the change here breaks?
>
> Hi Tom, could you share your insights on this issue and the proposed fix?

I think your patch works correctly so far. I mentioned commit 1db5667
because, as I see it, the origin of the problem was parallel workers. I have
thought about pushing Memoize down to a parallel worker and couldn't work out
whether such a solution would be correct. Sorry if I disturbed you in vain.

--
regards, Andrei Lepikhov
Postgres Professional
Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning
On Thu, Sep 14, 2023 at 4:39 PM Ashutosh Bapat wrote:
>
> The patch set is thus
>
> 0001 - patch used to measure memory used during planning
>
> 0002 - Patch to free intermediate Relids computed by
> adjust_child_relids_multilevel(). I didn't test memory consumption for
> multi-level partitioning, but this is a clear improvement. In that
> function we free the AppendRelInfos array, which is as many pointers long
> as the number of relids. So it doesn't make sense not to free the Relids,
> which can be {largest relid}/8 bytes long at least.
>
> 0003 - patch to save and reuse a commuted RestrictInfo. This patch by
> itself shows a small memory saving (3%) in the query below, where the
> same clause is commuted twice. The query does not contain any
> partitioned tables.
>
> create table t2 (a int primary key, b int, c int);
> create index t2_a_b on t2(a, b);
> select * from t2 where 10 = a;
>
> Memory used without patch: 13696 bytes
> Memory used with patch: 13264 bytes
>
> 0004 - Patch which implements the hash table of hash tables described
> above, and also code to avoid repeated RestrictInfo list translations.
>
> I will add this patchset to the next commitfest.
>
> --
> Best Wishes,
> Ashutosh Bapat

PFA rebased patches. Nothing changes in 0002, 0003 and 0004. 0001 is the
squashed version of the latest patch set at
https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1csrjcmyyj8pd...@mail.gmail.com.

--
Best Wishes,
Ashutosh Bapat

From de823e9cf0b03f21f8de51b49347df58ecb27c5a Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Thu, 31 Aug 2023 19:02:23 +0530
Subject: [PATCH 2/4] Free intermediate Relids created in
 adjust_child_relids_multilevel()

adjust_child_relids_multilevel() creates Relids for all the intermediate
stages in a partition hierarchy. These relid sets are not required in the
end. Relids, i.e. Bitmapsets, take reasonably large space when thousands of
partitions are involved. Hence free these intermediate relid sets.
Ashutosh Bapat
---
 src/backend/optimizer/util/appendinfo.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index f456b3b0a4..4008e52de2 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -591,6 +591,8 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
 {
 	AppendRelInfo **appinfos;
 	int			nappinfos;
+	Relids		tmp_relids = NULL;
+	Relids		child_relids;
 
 	/*
 	 * If the given relids set doesn't contain any of the parent relids, it
@@ -599,13 +601,14 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
 	if (!bms_overlap(relids, parentrel->relids))
 		return relids;
 
+	tmp_relids = relids;
 	/* Recurse if immediate parent is not the top parent. */
 	if (childrel->parent != parentrel)
 	{
 		if (childrel->parent)
-			relids = adjust_child_relids_multilevel(root, relids,
-													childrel->parent,
-													parentrel);
+			tmp_relids = adjust_child_relids_multilevel(root, relids,
+														childrel->parent,
+														parentrel);
 		else
 			elog(ERROR, "childrel is not a child of parentrel");
 	}
@@ -613,11 +616,15 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
 	/* Now translate for this child. */
 	appinfos = find_appinfos_by_relids(root, childrel->relids,
 									   &nappinfos);
-	relids = adjust_child_relids(relids, nappinfos, appinfos);
+	child_relids = adjust_child_relids(tmp_relids, nappinfos, appinfos);
+
+	/* Free any intermediate Relids created. */
+	if (tmp_relids != relids)
+		bms_free(tmp_relids);
 
 	pfree(appinfos);
 
-	return relids;
+	return child_relids;
 }
 
 /*
-- 
2.25.1

From 69cfd8bd6ea38a99a715268423d92fa4179f7cc7 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Mon, 4 Sep 2023 14:56:21 +0530
Subject: [PATCH 3/4] Save commuted RestrictInfo for later use

A commuted RestrictInfo may be produced as many times as the number of
indexes it is used for. It's the same RestrictInfo always.
Save some CPU and memory by saving the result in the original RestrictInfo. Ashutosh Bapat --- src/backend/optimizer/util/appendinfo.c | 6 ++ src/backend/optimizer/util/restrictinfo.c | 22 +- src/include/nodes/pathnodes.h | 9 + 3 files changed, 36 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c index 4008e52de2..53e1233dce 100644 --- a/src/backend/optimizer/util/appendinfo.c +++ b/src/backend/optimizer/util/appendinfo.c @@ -492,6 +492,12 @@ adjust_appendrel_attrs_mutator(Node *node, newinfo->left_mcvfreq = -1; newinfo->right_mcvfreq = -1; + /* + * Wipe out commuted parent RestrictInfo. The caller will compute + * commuted clause if required. + */ + newinfo->comm_rinfo = NULL; + return (Node *) newinfo; } diff --git a/src/backend/optimizer/util/restrictinfo.c b/src/backend/optimizer/util/restrictinfo.c index
Re: small erreport bug over partitioned table pgrowlocks module
On Tue, 31 Oct 2023 at 13:18, David Rowley wrote: > Here's a patch that puts the relam check last. I've pushed that patch. David
Re: always use runtime checks for CRC-32C instructions
On Mon, Oct 30, 2023 at 01:48:29PM -0700, Jeff Davis wrote:
> I assume you are concerned about the call going through a function
> pointer? If so, is it possible that setting a flag and then branching
> would be better?
>
> Also, if it's a concern, should we also consider making an inlineable
> version of pg_comp_crc32c_sse42()?

I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:

* slicing-by-8: ~3.08s
* proposed patches applied (runtime check): ~2.44s
* only CRC intrinsics implementation compiled: ~2.42s
* forced inlining: ~2.38s

Avoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement. In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
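For reference, the percentages quoted above follow directly from the four run times; a quick check (the timings are from this message, the helper name is mine):

```python
def improvement_pct(base, improved):
    """Relative run-time improvement of `improved` over `base`, in percent."""
    return (base - improved) / base * 100

slicing_by_8 = 3.08     # seconds
runtime_check = 2.44
intrinsics_only = 2.42
forced_inline = 2.38

# runtime check vs slicing-by-8, dropping the check, forced inlining
print(f"{improvement_pct(slicing_by_8, runtime_check):.1f}%")    # 20.8%
print(f"{improvement_pct(runtime_check, intrinsics_only):.1f}%") # 0.8%
print(f"{improvement_pct(intrinsics_only, forced_inline):.1f}%") # 1.7%
```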
Re: Not deleted mentions of the cleared path
On Mon, Oct 30, 2023 at 7:31 PM Alena Rybakina wrote:
>
> I have already written about the problem of InvalidPath [0] appearing. I
> investigated this and found an error in the add_path() function: when we
> reject a path, we free up the memory of the path, but do not delete the
> various mentions of it (for example, in the ancestor relation, as in the
> example below).

I agree that what you observed is true - add_path() may free a path while
it's still referenced from some lower rels. For instance, when creating
ordered paths, we may use the input path unchanged without copying if it's
already well ordered, and it might be freed afterwards if it fails when
competing in add_path(). But this doesn't seem to be a problem in practice.
We will not access these references from the lower rels. I'm not sure if this
is an issue that we need to fix, or one we need to live with. But I do think
it deserves some explanation in the comment of add_path().

Thanks
Richard
Re: Is this a problem in GenericXLogFinish()?
On Mon, Oct 30, 2023 at 7:13 PM Robert Haas wrote:
>
> On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila wrote:
> > > Hmm. So my question is: do we need the cleanup lock on the write
> > > buffer even if there are no tuples, and even if the primary bucket and
> > > the write bucket are the same?
> >
> > Yes, we need it to exclude any concurrent in-progress scans that could
> > return incorrect tuples during the bucket squeeze operation.
>
> Amit, thanks for weighing in, but I'm not convinced. I thought we only
> ever used a cleanup lock on the main bucket page to guard against
> concurrent scans.

My understanding is the same. It is possible that my wording was not clear.
Let me try to clarify again. Michael asked: "do we need the cleanup lock on
the write buffer even if there are no tuples, and even if primary bucket and
the write bucket are the same?" My reading of his question was: do we need a
cleanup lock even if the primary bucket and write bucket are the same, which
means the write bucket will be the first page in the chain and we need a
cleanup lock on it. I think the second condition (no tuples) seems irrelevant
here, as whether it is true or false we need a cleanup lock.

> Here you seem to be saying that we need a cleanup
> lock on some page that may be an overflow page somewhere in the middle
> of the chain, and that doesn't seem right to me.

Sorry, I don't intend to say this.

--
With Regards,
Amit Kapila.
Re: Add new option 'all' to pg_stat_reset_shared()
At Mon, 30 Oct 2023 14:15:53 +0530, Bharath Rupireddy wrote in
> > > I imagine there are cases where people want to initialize all of them
> > > at the same time in addition to initializing one at a time.
> >
> > FWIW, I fairly often wanted it, but forgot about that:p
>
> Isn't calling pg_stat_reset_shared() for all stats types helping you
> out? Is there any problem with it? Can you be more specific about the
> use-case?

Oh. Sorry, it's my bad. pg_stat_reset_shared() is sufficient.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: COPY TO (FREEZE)?
Hi,

Zhang Mingli
www.hashdata.xyz

On Oct 31, 2023 at 03:55 +0800, Bruce Momjian wrote:
>
> Sure, updated patch attached.

LGTM.
RE: Synchronizing slots from primary to standby
Dear Ajin,

Thanks for your reply!

> On Thu, Oct 26, 2023 at 6:08 PM Hayato Kuroda (Fujitsu) wrote:
> >
> > Dear Shveta,
> >
> > > PFA v25 patch set. The changes are:
> >
> > Thanks for making the patch! It seems that there are lots of comments, so
> > I can put some high-level comments for 0001.
> > Sorry if there are duplicated comments.
> >
> > 1.
> > The patch seemed not to consider the case where the failover option
> > between the replication slot and the subscription differ. Currently the
> > slot option will be overwritten by the subscription one.
> >
> > Actually, I'm not sure what specification is better. Regarding two_phase,
> > 2PC will be decoded only when both of the settings are true. Should we
> > follow that?
>
> But this is the intention, we want the Alter subscription to be able
> to change the failover behaviour of the slot.

I had not understood how two_phase is enabled. I found that
slot->data.two_phase is overwritten in CreateDecodingContext(), so the
failover option now follows two_phase, right? (I think the overwriting of
data.failover should also be done at CreateDecodingContext().)

> > 2.
> > Currently ctx->failover is set only in the pgoutput_startup(), but not
> > sure it is OK.
> > Can we change the parameter in CreateDecodingContext() or similar
> > functions?
> >
> > Because IIUC it means that only slots which have pgoutput can wait. Other
> > output plugins must understand the change and set the failover flag as
> > well - I felt it is not good. E.g., you might forget to enable the
> > parameter in test_decoding.
> >
> > Regarding the two_phase parameter, setting it on the plugin layer is good
> > because it quite affects the output. As for failover, it is not related
> > with the content, so all slots should be enabled.
> >
> > I think CreateDecodingContext() or StartupDecodingContext() is the common
> > path.
> > Or, is it out of scope for now?
> Currently, the failover field is part of the options list in the
> StartReplicationCmd. This gives some level of flexibility such that only
> plugins that are interested in this need to handle it. The options list
> is only deparsed by plugins. If we move it outside of the options list,
> this sort of changes the protocol for START_REPLICATION and will
> impact all plugins.
> But I agree with your larger point that we need to do it in such a way that
> other plugins do not unintentionally change the 'failover' behaviour
> of the originally created slot.
> Maybe I can code it in such a way that, only if the failover option is
> specified in the list of options passed as part of START_REPLICATION,
> will it change the originally created slot's 'failover' flag, by adding
> another flag "failover_opt_given". Plugins that set this will be able
> to change the failover flag of the slot, while plugins that do not
> support this will not set it, and the failover flag of the created slot
> will remain unchanged.
> What do you think?

May be OK, but I came up with a corner case: what if an external plugin
already has a streaming option named 'failover'? What should happen then? Has
the option name been reserved?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: "38.10.10. Shared Memory and LWLocks" may require a clarification
On Mon, Oct 30, 2023 at 04:10:17PM +0300, Aleksander Alekseev wrote:
> I didn't like the commit message. Here is the corrected patch. Sorry
> for the noise.

Sounds pretty much OK to me. Thanks!

--
Michael
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
On Mon, Oct 30, 2023 at 12:47:41PM -0700, Andres Freund wrote:
> I think the problem with these variables is that they're a really messy state
> machine - something this patch doesn't meaningfully improve IMO.

Okay. Yes, this is my root issue as well. We're at the stage where we should
reduce the possible set of combinations and assumptions we're inventing
because people can do undocumented stuff, then perhaps refactor the code on
top of that (say, if one combination of the two booleans is not possible,
switch to a three-state enum rather than 2 bools, etc).

> > This configuration was possible when recovering from a base backup taken
> > by pg_basebackup without -R. Note that the documentation requires at
> > least to set recovery.signal to restore from a backup, but the startup
> > process was not making this policy explicit.
>
> Maybe I just didn't check the right place, but from what I saw, this, at
> most, is implied, rather than explicitly stated.

See the doc reference here:
https://www.postgresql.org/message-id/zubm6bnqneh7l...@paquier.xyz
So it kind of implies it, but it also mentions restore_command. It's like
Schrödinger's cat, yes and no at the same time.

> With -X ... we have all the necessary WAL locally, how does the workload on
> the primary matter? If you pass --no-slot, pg_basebackup might fail to fetch
> the necessary wal, but then you'd also have gotten an error.
>
> [...]
>
> Right now running pg_basebackup with -X stream, without --write-recovery-conf,
> gives you a copy of a cluster that will come up correctly as a distinct
> instance.
>
> [...]
>
> I also just don't think that it's always desirable to create a new timeline.

Yeah. Another argument I was mentioning to Robert is that we may want to just
treat the case where you have a backup_label without any signal files the
same as crash recovery, replaying all the local pg_wal/, and nothing else.
For example, something like the attached should make sure that InArchiveRecovery=true should never be set if ArchiveRecoveryRequested is not set. The attached would still cause redo to complain on a "WAL ends before end of online backup" if not all the WAL is here (reason behind the tweak of 010_pg_basebackup.pl, but the previous tweak to pg_rewind's 008_min_recovery_point.pl is not required here). Attached is the idea I had in mind, in terms of code, FWIW. -- Michael From fbd9fa407c04799ad4401d3ba5c6b67a0d922631 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 31 Oct 2023 10:14:01 +0900 Subject: [PATCH] Force crash recovery with backup_label and no .signal files --- src/backend/access/transam/xlogrecovery.c| 29 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 3 +- doc/src/sgml/backup.sgml | 4 ++- 3 files changed, 28 insertions(+), 8 deletions(-) diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index c6156a..de5787d7e8 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -133,6 +133,8 @@ static TimeLineID curFileTLI; * currently performing crash recovery using only XLOG files in pg_wal, but * will switch to using offline XLOG archives as soon as we reach the end of * WAL in pg_wal. + * + * InArchiveRecovery should never be set without ArchiveRecoveryRequested. */ bool ArchiveRecoveryRequested = false; bool InArchiveRecovery = false; @@ -595,13 +597,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr, List *tablespaces = NIL; /* - * Archive recovery was requested, and thanks to the backup label - * file, we know how far we need to replay to reach consistency. Enter - * archive recovery directly. + * If archive recovery was requested, and we know how far we need to + * replay to reach consistency thanks to the backup label file, then + * enter archive recovery directly in this case. 
+ * + * If archive recovery was not requested, then do crash recovery and + * replay all the local WAL. This still checks that all the WAL up + * to backupEndRequired has been replayed. This case is useful when + * restoring from a standalone base backup, taken with pg_basebackup + * --wal-method=stream, for example. */ - InArchiveRecovery = true; - if (StandbyModeRequested) - EnableStandbyMode(); + if (ArchiveRecoveryRequested) + { + InArchiveRecovery = true; + if (StandbyModeRequested) +EnableStandbyMode(); + } /* * When a backup_label file is present, we want to roll forward from @@ -1591,6 +1602,12 @@ ShutdownWalRecovery(void) */ if (ArchiveRecoveryRequested) DisownLatch(>recoveryWakeupLatch); + + /* + * InArchiveRecovery should never have been set without + * ArchiveRecoveryRequested. + */ + Assert(ArchiveRecoveryRequested || !InArchiveRecovery); } /* diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl index
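The refactoring direction floated earlier in this thread — replacing the (ArchiveRecoveryRequested, InArchiveRecovery) boolean pair with a three-state value so the invalid combination cannot even be expressed — can be sketched as follows. This is purely illustrative; the names are hypothetical and this is not the attached patch:

```python
from enum import Enum

class RecoveryState(Enum):
    """Hypothetical three-state replacement for the two recovery booleans.
    The fourth combination (requested=False, in_archive=True) is the one
    the Assert in the patch above forbids; here it simply cannot be
    represented."""
    CRASH_ONLY = 1   # ArchiveRecoveryRequested=F, InArchiveRecovery=F
    REQUESTED = 2    # ArchiveRecoveryRequested=T, InArchiveRecovery=F
    IN_ARCHIVE = 3   # ArchiveRecoveryRequested=T, InArchiveRecovery=T

def archive_recovery_requested(state: RecoveryState) -> bool:
    return state is not RecoveryState.CRASH_ONLY

def in_archive_recovery(state: RecoveryState) -> bool:
    return state is RecoveryState.IN_ARCHIVE

# The invariant Assert(ArchiveRecoveryRequested || !InArchiveRecovery)
# holds for every representable state, by construction.
for s in RecoveryState:
    assert archive_recovery_requested(s) or not in_archive_recovery(s)
```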
Re: Synchronizing slots from primary to standby
On Thu, Oct 26, 2023 at 6:08 PM Hayato Kuroda (Fujitsu) wrote:
>
> Dear Shveta,
>
> > PFA v25 patch set. The changes are:
>
> Thanks for making the patch! It seems that there are lots of comments, so
> I can put some high-level comments for 0001.
> Sorry if there are duplicated comments.
>
> 1.
> The patch seemed not to consider the case where the failover option between
> the replication slot and the subscription differ. Currently the slot option
> will be overwritten by the subscription one.
>
> Actually, I'm not sure what specification is better. Regarding two_phase,
> 2PC will be decoded only when both of the settings are true. Should we
> follow that?

But this is the intention, we want the Alter subscription to be able to
change the failover behaviour of the slot.

> 2.
> Currently ctx->failover is set only in the pgoutput_startup(), but not sure
> it is OK.
> Can we change the parameter in CreateDecodingContext() or similar functions?
>
> Because IIUC it means that only slots which have pgoutput can wait. Other
> output plugins must understand the change and set the failover flag as well -
> I felt it is not good. E.g., you might forget to enable the parameter in
> test_decoding.
>
> Regarding the two_phase parameter, setting it on the plugin layer is good
> because it quite affects the output. As for failover, it is not related with
> the content, so all slots should be enabled.
>
> I think CreateDecodingContext() or StartupDecodingContext() is the common
> path.
> Or, is it out of scope for now?

Currently, the failover field is part of the options list in the
StartReplicationCmd. This gives some level of flexibility such that only
plugins that are interested in this need to handle it. The options list is
only deparsed by plugins. If we move it outside of the options list, this
sort of changes the protocol for START_REPLICATION and will impact all
plugins.
But I agree with your larger point that we need to do it in such a way that
other plugins do not unintentionally change the 'failover' behaviour of the
originally created slot. Maybe I can code it in such a way that, only if the
failover option is specified in the list of options passed as part of
START_REPLICATION, will it change the originally created slot's 'failover'
flag, by adding another flag "failover_opt_given". Plugins that set this will
be able to change the failover flag of the slot, while plugins that do not
support this will not set it, and the failover flag of the created slot will
remain unchanged. What do you think?

regards,
Ajin Cherian
Fujitsu Australia
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
On Mon, Oct 30, 2023 at 10:32:28AM -0600, Roberto Mello wrote:
> I realize the original use of "touch" is a valid shortcut for what I
> suggest above, however that will be less clear for the not-so-un*x-inclined
> users of Postgres, while for some it'll be downright confusing, IMHO. It
> also provides the advantage of being crystal clear on what needs to be done
> to fix the problem.

Indeed, "touch" may be better in this path if we'd throw an ERROR to enforce
a given policy, and that's more consistent with the rest of the area.

--
Michael
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
On Mon, Oct 30, 2023 at 01:55:13PM -0400, Robert Haas wrote:
> I would encourage some caution here.

Thanks for chiming in here.

> In a vacuum, I'm in favor of this, and for the same reasons as you,
> namely, that the huge pile of Booleans that we use to control recovery
> is confusing, and it's difficult to make sure that all the code paths
> are adequately tested, and I think some of the things that actually
> work here are not documented.

Yep, same feeling here.

> But in practice, I think there is a possibility of something like this
> backfiring very hard. Notice that the first two people who commented
> on the thread saw the error and immediately removed backup_label even
> though that's 100% wrong. It shows how utterly willing users are to
> remove backup_label for any reason or no reason at all. If we convert
> cases where things would have worked into cases where people nuke
> backup_label and then it appears to work, we're going to be worse off
> in the long run, no matter how crazy the idea of removing backup_label
> may seem to us.

As far as I know, there's one paragraph in the docs that implies this mode
without giving an actual hint about whether it is OK or not, so shrug:
https://www.postgresql.org/docs/devel/continuous-archiving.html#BACKUP-TIPS
"As with base backups, the easiest way to produce a standalone hot backup is
to use the pg_basebackup tool. If you include the -X parameter when calling
it, all the write-ahead log required to use the backup will be included in
the backup automatically, and no special action is required to restore the
backup."

And a few lines down we imply to use restore_command, something that we check
is set only if recovery.signal is set. See additionally
validateRecoveryParameters(), where the comments imply that InArchiveRecovery
would be set only when there's a restore command.
As you're telling me, and I've considered that as an option as well, perhaps
we should just consider the presence of a backup_label file with no .signal
files as a synonym of crash recovery? In the recovery path, currently the
essence of the problem is when we set InArchiveRecovery=true while
ArchiveRecoveryRequested=false, meaning that we perform archive recovery even
though it was not requested, and that does not really make sense. The rest of
the code sort of implies that this is not a supported combination.

So basically, my suggestion here is to just replay WAL up to the end of
what's in your local pg_wal/ and hope for the best, without TLI jumps; beyond
that, we'd do nothing. Doing a pg_basebackup -X stream followed by a restart
would work fine with that, because all the WAL is here.

A point of contention is whether we'd better be stricter about satisfying
backupEndPoint in such a case, but the redo code only wants to do something
here when ArchiveRecoveryRequested is set (aka there's a .signal file set),
and we would not want a TLI jump at the end of recovery, so I don't see an
argument for caring about backupEndPoint in this case.

At the end, I'm OK as long as ArchiveRecoveryRequested=false with
InArchiveRecovery=true does not exist anymore, because it's much easier to
get what's going on with the redo path, IMHO. (I have a patch at hand to show
the idea, will post it with a reply to Andres' message.)

> Also, Andres just recently mentioned to me that he uses this procedure
> of starting a server with a backup_label but no recovery.signal or
> standby.signal file regularly, and thinks other people do too. I was
> surprised, since I've never done that, except maybe when I was a noob
> and didn't have a clue. But Andres is far from a noob.

At this stage, that's basically at your own risk, as the code thinks it's OK
to force what's basically archive-recovery-without-being-it. So it basically
works, but it can also easily backfire, as well..
-- Michael signature.asc Description: PGP signature
Re: small ereport bug over partitioned table pgrowlocks module
On Tue, 31 Oct 2023 at 13:00, jian he wrote: > BEGIN; > CREATE TABLE fk_parted_pk (a int PRIMARY KEY) PARTITION BY LIST (a); > SELECT * FROM pgrowlocks('fk_parted_pk'); > ERROR: only heap AM is supported > > error should be the following part: > if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > ereport(ERROR, > (errcode(ERRCODE_WRONG_OBJECT_TYPE), > errmsg("\"%s\" is a partitioned table", > RelationGetRelationName(rel)), > errdetail("Partitioned tables do not contain rows."))); Yeah. Seems that 4b8266415 didn't look closely enough at the other error messages and mistakenly put the relam check first instead of last. Here's a patch that puts the relam check last. David diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c index a04e187ec4..adbc8279c3 100644 --- a/contrib/pgrowlocks/pgrowlocks.c +++ b/contrib/pgrowlocks/pgrowlocks.c @@ -81,10 +81,6 @@ pgrowlocks(PG_FUNCTION_ARGS) relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname)); rel = relation_openrv(relrv, AccessShareLock); - if (rel->rd_rel->relam != HEAP_TABLE_AM_OID) - ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("only heap AM is supported"))); - if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), @@ -96,6 +92,10 @@ pgrowlocks(PG_FUNCTION_ARGS) (errcode(ERRCODE_WRONG_OBJECT_TYPE), errmsg("\"%s\" is not a table", RelationGetRelationName(rel)))); + else if (rel->rd_rel->relam != HEAP_TABLE_AM_OID) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("only heap AM is supported"))); /* * check permissions: must have SELECT on table or be in
Re: Infinite Interval
On Mon, Oct 30, 2023 at 6:01 PM Ashutosh Bapat wrote: > > > Here's my version of commit message > > ``` > Support Infinite interval values > > Interval datatype uses the same input and output representation for > infinite intervals as other datatypes representing time that support > infinity. An interval larger than any other interval is represented by > the string literal 'infinity' or '+infinity'. An interval which is smaller > than any other interval is represented as '-infinity'. Internally, > positive infinity is represented as the maximum values supported by all > the member types of the Interval data structure, and negative infinity is > represented as the minimum values set in all the members. INTERVAL_NOBEGIN > and INTERVAL_NOEND macros can be used to set an Interval structure to > negative and positive infinity respectively. INTERVAL_IS_NOBEGIN and > INTERVAL_IS_NOEND macros are used to test for the respective values. > The INTERVAL_NOT_FINITE macro is used to test whether a given Interval > value is infinite. > > Implementation of all known operators now handles infinite interval > values, along with operations related to BRIN indexes, windowing and > selectivity. Regression tests are added to test these implementations. > > If a user has stored the interval values '-2147483648 months -2147483648 > days -9223372036854775807 us' and '2147483647 months 2147483647 days > 9223372036854775806 us' in PostgreSQL versions 16 or earlier, those > values will turn into '-infinity' and 'infinity' respectively after > upgrading to v17. These values are outside the documented range > supported by the interval datatype and thus there's almost no possibility > of this occurrence. But it will be good to watch for these values > during upgrade. > ``` > the message is plain enough. I can understand it. thanks!
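As a rough illustration of the representation described in the commit message, here is a hedged Python sketch (the real code is C; the int32/int64 field widths and the (month, day, time) ordering are assumptions about the Interval struct, not taken from the patch):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

# negative infinity: minimum values in all members; positive: maxima
INTERVAL_NOBEGIN = (INT32_MIN, INT32_MIN, INT64_MIN)  # (month, day, time in us)
INTERVAL_NOEND = (INT32_MAX, INT32_MAX, INT64_MAX)

def interval_is_nobegin(iv):
    return iv == INTERVAL_NOBEGIN

def interval_is_noend(iv):
    return iv == INTERVAL_NOEND

def interval_not_finite(iv):
    return interval_is_nobegin(iv) or interval_is_noend(iv)

# every ordinary interval sorts strictly between the two sentinels
assert INTERVAL_NOBEGIN < (0, 0, 0) < INTERVAL_NOEND
```

The nice property of using the extreme member values as sentinels is visible in the final assertion: ordinary comparison logic already orders the infinities correctly.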
small ereport bug over partitioned table pgrowlocks module
hi. ereport bug over partitioned table in pgrowlocks. BEGIN; CREATE TABLE fk_parted_pk (a int PRIMARY KEY) PARTITION BY LIST (a); SELECT * FROM pgrowlocks('fk_parted_pk'); ERROR: only heap AM is supported error should be the following part: if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), errmsg("\"%s\" is a partitioned table", RelationGetRelationName(rel)), errdetail("Partitioned tables do not contain rows.")));
Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}
On Mon, Oct 30, 2023 at 03:14:16PM +0100, hubert depesz lubaczewski wrote: > some things will definitely break, but that's 100% OK. The change seems > needed, and I can update my parser to deal with it :) Thanks for the input. I was looking yesterday to see if this code was available somewhere, but couldn't find it.. Until this morning: https://gitlab.com/depesz/explain.depesz.com.git And.. It looks like things would become better if we change "shared/local" to "shared", because the parsing code seems to have an issue once you add a '/'. All the fields in I/O Timings are considered as part of a Node, and they're just included in the output. Now, pasting a plan that includes "shared/local" drops the string entirely from the output result, so some information is lost. In short, imagine that we have the following string in a node: I/O Timings: shared/local write=23.77 This would show up like that, meaning that the context where the write/read timings happened is lost: I/O Timings: write=23.77 If we switch back to "shared", the context would be kept around. Of course, this does not account for all the parsers that may be out there, but at least that's something. -- Michael signature.asc Description: PGP signature
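To make the failure mode concrete, here is a hedged Python sketch of a parser of the kind described (an assumption for illustration -- the real site's parser is Perl and lives in the repository linked above): the context label is expected to be a plain word, so "shared/local" fails the match and only the key=value tokens survive.

```python
import re

def parse_io_timings(line):
    """Parse an 'I/O Timings:' line into (context, {name: value}).

    A hypothetical parser: the optional context must be a bare word, so
    a slash in 'shared/local' makes the context match fail and the label
    is silently dropped, keeping only the name=number tokens.
    """
    body = line.removeprefix("I/O Timings:").strip()
    m = re.match(r"^([a-z]+)\s+((?:\w+=[\d.]+\s*)+)$", body)
    timings = dict(re.findall(r"(\w+)=([\d.]+)", body))
    if m is None:
        return None, timings      # "shared/local": context lost
    return m.group(1), timings    # "shared": context preserved
```

With "shared" the context survives; with "shared/local" it is dropped while the numbers still parse, which matches the behavior described in the message.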
Re: Introduce a new view for checkpointer related stats
On Mon, Oct 30, 2023 at 11:59:10AM +0530, Bharath Rupireddy wrote: > Oh, thanks for taking care of this. While at it, I noticed that > there's no coverage for pg_stat_reset_shared('recovery_prefetch') and > XLogPrefetchResetStats() > https://coverage.postgresql.org/src/backend/access/transam/xlogprefetcher.c.gcov.html. > Most of the recovery_prefetch code is covered with recovery/streaming > related tests, but the reset stats part is missing. So, I've added > coverage for it in the attached 0002. Indeed, good catch. I didn't notice the hole in the coverage reports. I have merged 0001 and 0002 together, and applied them as of 5b2147d9fcc1. -- Michael signature.asc Description: PGP signature
Re: Why is DEFAULT_FDW_TUPLE_COST so insanely low?
On Tue, 31 Oct 2023 at 03:01, Bruce Momjian wrote: > I think you just go and change it. Your number is better than what we > have, and if someone wants to suggest a better number, we can change it > later. I did some more experimentation on the actual costs of getting a tuple from a foreign server. Using the attached setup, I did: postgres=# explain (analyze, timing off) SELECT * FROM t; QUERY PLAN Seq Scan on t (cost=0.00..144248.48 rows=10000048 width=4) (actual rows=10000000 loops=1) Planning Time: 0.077 ms Execution Time: 385.978 ms postgres=# explain (analyze, timing off) SELECT * FROM ft; QUERY PLAN --- Foreign Scan on ft (cost=100.00..244348.96 rows=10000048 width=4) (actual rows=10000000 loops=1) Planning Time: 0.126 ms Execution Time: 8335.392 ms So, let's take the first query and figure out the total cost per millisecond of execution time. We can then multiply that by the execution time of the 2nd query to calculate what we might expect the costs to be for the foreign table scan based on how long it took compared to the local table scan. postgres=# select 144248.48/385.978*8335.392; ?column? - 3115119.5824740270171341280 So, the above number is what we expect the foreign table scan to cost with the assumption that the cost per millisecond is about right for the local scan. We can then calculate how much we'll need to charge for a foreign tuple by subtracting the total cost of that query from our calculated value to calculate how much extra we need to charge, in total, then divide that by the number of tuples to get the actual foreign tuple cost for this query. postgres=# select (3115119.58-244348.96)/10000000; ?column? 0.287077062000 This is on an AMD 3990x running Linux 6.5 kernel. I tried the same on an Apple M2 mini and got: postgres=# select 144247.77/257.763*3052.084; ?column? - 1707988.7759402241595200680 postgres=# select (1707988.78-244347.54)/10000000; ?column? 0.146364124000 So the actual foreign tuple cost on the M2 seems about half of what it is on the Zen 2 machine. 
Based on this, I agree with my original analysis that setting DEFAULT_FDW_TUPLE_COST to 0.2 is about right. Of course, this is a loopback onto localhost so remote networks likely would benefit from higher values, but based on this 0.01 is far too low and we should change it to at least 0.2. I'd be happy if anyone else would like to try the same experiment to see if there's some other value of DEFAULT_FDW_TUPLE_COST that might suit better. David setup_fdw.sql Description: Binary data
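The extrapolation above can be reproduced outside psql. Here is a small Python check of the arithmetic for the AMD numbers (assuming both scans returned 10 million tuples, which is what makes the quoted per-tuple figure come out):

```python
# Figures quoted in the message (AMD 3990x run)
local_cost, local_ms = 144248.48, 385.978        # Seq Scan on t
foreign_cost, foreign_ms = 244348.96, 8335.392   # Foreign Scan on ft
n_tuples = 10_000_000                            # assumed tuple count

# Scale the local scan's cost-per-millisecond up to the foreign scan's runtime
cost_per_ms = local_cost / local_ms
expected_foreign_cost = cost_per_ms * foreign_ms      # ~3115119.58

# The shortfall, spread over the tuples, is the implied per-tuple cost
extra_per_tuple = (expected_foreign_cost - foreign_cost) / n_tuples  # ~0.287
```

This is the calculation that motivates raising DEFAULT_FDW_TUPLE_COST from 0.01 to roughly 0.2.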
Re: Trigger violates foreign key constraint
On Mon, Oct 30, 2023 at 2:50 PM David G. Johnston < david.g.johns...@gmail.com> wrote: > On Tue, Oct 3, 2023 at 12:52 AM Laurenz Albe > wrote: > >> On Mon, 2023-10-02 at 09:49 -0400, Tom Lane wrote: >> > This is by design: triggers operate at a lower level than >> > foreign keys, so an ill-conceived trigger can break an FK constraint. >> > That's documented somewhere, though maybe not visibly enough. >> >> Not having found any documentation, I propose the attached caution. >> >> > I dislike scaring the user like this without providing any context on what > conditions or actions are problematic. > > The ON DELETE and ON UPDATE clauses of foreign keys are implemented as > system triggers on the referenced table that invoke additional delete or > update commands on the referencing table. The final outcome of these > additional commands are not checked - it is the responsibility of the DBA > to ensure that the user triggers on the referencing table actually remove > the rows they are requested to remove, or update to NULL any referencing > foreign key columns. In particular, before row triggers that return NULL > will prevent the delete/update from occurring and thus result in a violated > foreign key constraint. > > Add sgml as needed, note the original patch missed adding "" > to PostgreSQL. > > Additionally, the existing place this is covered is here: """ Trigger functions invoked by per-statement triggers should always return NULL. Trigger functions invoked by per-row triggers can return a table row (a value of type HeapTuple) to the calling executor, if they choose. A row-level trigger fired before an operation has the following choices: It can return NULL to skip the operation for the current row. This instructs the executor to not perform the row-level operation that invoked the trigger (the insertion, modification, or deletion of a particular table row). 
For row-level INSERT and UPDATE triggers only, the returned row becomes the row that will be inserted or will replace the row being updated. This allows the trigger function to modify the row being inserted or updated. A row-level BEFORE trigger that does not intend to cause either of these behaviors must be careful to return as its result the same row that was passed in (that is, the NEW row for INSERT and UPDATE triggers, the OLD row for DELETE triggers). """ We should probably add a note pointing back to the DDL chapter that more concisely says: "Note: If this table also contains any foreign key constraints with on update or on delete clauses, then a failure to return the same row that was passed in for update and delete triggers is going to result in broken referential integrity for the affected row." I do like "broken referential integrity" from the original patch over "violated foreign key constraint" - so that needs to be substituted in for the final part of my earlier proposal if we go with its more detailed wording. My issue with "violated" is that it sounds like the system is going to catch it at the end - broken doesn't have the same implication. David J.
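The mechanism being documented can be illustrated with a small Python toy model (not real PostgreSQL behavior, just the control flow: the FK's ON DELETE CASCADE acts as a system trigger issuing deletes on the referencing table, and a user BEFORE ROW trigger that "returns NULL" skips each of those deletes):

```python
parent = {1, 2}
child = [{"pid": 1}, {"pid": 2}]

def before_delete_returns_null(row):
    return None  # like a plpgsql BEFORE ROW trigger ending in RETURN NULL

def delete_parent(pid, before_row_trigger):
    """Toy ON DELETE CASCADE: try to delete the referencing child rows,
    then remove the parent row regardless of what the triggers did."""
    global child
    survivors = []
    for row in child:
        if row["pid"] == pid and before_row_trigger(row) is None:
            survivors.append(row)   # trigger skipped the delete: row survives
        elif row["pid"] != pid:
            survivors.append(row)
    child = survivors
    parent.discard(pid)

delete_parent(1, before_delete_returns_null)
orphans = [r for r in child if r["pid"] not in parent]
# orphans == [{'pid': 1}]: referential integrity is now broken
```

The parent row is gone, the child row referencing it is not -- exactly the "broken referential integrity" the proposed note warns about.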
Re: PGDOCS - add more links in the pub/sub reference pages
Thanks for pushing! == Kind Regards, Peter Smith. Fujitsu Australia
Re: Trigger violates foreign key constraint
On Tue, Oct 3, 2023 at 12:52 AM Laurenz Albe wrote: > On Mon, 2023-10-02 at 09:49 -0400, Tom Lane wrote: > > This is by design: triggers operate at a lower level than > > foreign keys, so an ill-conceived trigger can break an FK constraint. > > That's documented somewhere, though maybe not visibly enough. > > Not having found any documentation, I propose the attached caution. > > I dislike scaring the user like this without providing any context on what conditions or actions are problematic. The ON DELETE and ON UPDATE clauses of foreign keys are implemented as system triggers on the referenced table that invoke additional delete or update commands on the referencing table. The final outcome of these additional commands is not checked - it is the responsibility of the DBA to ensure that the user triggers on the referencing table actually remove the rows they are requested to remove, or update to NULL any referencing foreign key columns. In particular, before row triggers that return NULL will prevent the delete/update from occurring and thus result in a violated foreign key constraint. Add sgml as needed, note the original patch missed adding "" to PostgreSQL. David J.
Re: Trigger violates foreign key constraint
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: not tested Spec compliant: not tested Documentation: tested, passed It seems like people have been talking about this problem since 2010 (https://stackoverflow.com/questions/3961825/foreign-keys-in-postgresql-can-be-violated-by-trigger), and we finally got our act together and put it in the official documentation today, only 13 years later. The new status of this patch is: Ready for Committer
Re: always use runtime checks for CRC-32C instructions
On Mon, Oct 30, 2023 at 12:39:23PM -0400, Tom Lane wrote: > On the one hand, I agree that we need to keep the complexity from > getting out of hand. On the other hand, I wonder if this approach > isn't optimizing for the wrong case. How many machines that PG 17 > will ever be run on in production will lack SSE 4.2 (for Intel) > or ARMv8 instructions (on the ARM side)? For the CRC instructions in use today, I wouldn't be surprised if that number is pretty small, but for newer or optional instructions (like ARM's PMULL), I don't think we'll be so lucky. Even if we do feel comfortable assuming the presence of SSE 4.2, etc., we'll likely still need to add runtime checks for future optimizations. > It seems like a shame > to be burdening these instructions with a subroutine call for the > benefit of long-obsolete hardware versions. Maybe that overhead > is negligible, but it doesn't sound like you tried to measure it. When I went to measure this, I noticed that my relatively new x86 machine with a relatively new version of gcc uses the runtime check. I then skimmed through a few dozen buildfarm machines and found that, of all x86 and ARM machines that supported the specialized CRC instructions, only one ARM machine did not use the runtime check. Of course, this is far from a scientific data point, but it seems to indicate that the runtime check is the norm. (I still need to measure it.) > Anyway, I agree that the cost of a one-time-per-process probe should > be negligible; it's the per-use cost that I worry about. If you can > do some measurements proving that that worry is ill-founded, then > I'm good with test-first. Will do. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
On Mon, 30 Oct 2023 at 23:48, Amit Kapila wrote: > > On Fri, Oct 27, 2023 at 3:23 AM David Rowley wrote: > > * parallel.c in HandleParallelMessages(): > > * applyparallelworker.c in HandleParallelApplyMessages(): > > Both the above calls are used to handle ERROR/NOTICE messages from > parallel workers as you have also noticed. The comment atop > initReadOnlyStringInfo() clearly states that it is used in the > performance-critical path. So, is it worth changing these places? In > the future, this may pose the risk of this API being used > inconsistently. I'm ok to leave those ones out. But just a note on the performance side: if we go around needlessly doing palloc/memcpy then we'll be flushing possibly useful cachelines out and causing slowdowns elsewhere. That's a pretty hard thing to quantify, but something to keep in mind. David
Re: always use runtime checks for CRC-32C instructions
On Mon, 2023-10-30 at 12:39 -0400, Tom Lane wrote: > It seems like a shame > to be burdening these instructions with a subroutine call for the > benefit of long-obsolete hardware versions. It's already doing a call to pg_comp_crc32c_sse42() regardless, right? I assume you are concerned about the call going through a function pointer? If so, is it possible that setting a flag and then branching would be better? Also, if it's a concern, should we also consider making an inlineable version of pg_comp_crc32c_sse42()? Regards, Jeff Davis
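The two dispatch styles being compared can be sketched in Python (purely illustrative: zlib.crc32 is plain CRC-32 standing in for real CRC-32C implementations, and the probe result is an assumption):

```python
import zlib

def crc_sw(data):   # stand-in for the software slice-by-8 implementation
    return zlib.crc32(data)

def crc_hw(data):   # stand-in for the SSE 4.2 / ARMv8 instruction path
    return zlib.crc32(data)

def cpu_has_crc_instructions():
    return True     # one-time runtime probe; result assumed here

# style 1: resolved once into a function "pointer", indirect call per use
pg_comp_crc32c = crc_hw if cpu_has_crc_instructions() else crc_sw

# style 2: cached flag plus a branch at each call site
HAVE_CRC_INSTRUCTIONS = cpu_has_crc_instructions()
def crc32c(data):
    return crc_hw(data) if HAVE_CRC_INSTRUCTIONS else crc_sw(data)
```

In C the distinction matters more than it does here: the flag-and-branch form leaves the call target visible to the compiler (and potentially inlineable), while the function-pointer form forces an indirect call on every use.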
Re: MERGE ... RETURNING
On Fri, 2023-10-27 at 15:46 +0100, Dean Rasheed wrote: > > One fiddly part is resolving the shift/reduce conflicts in the > grammar. Specifically, on seeing "RETURNING expr when ...", there is > ambiguity over whether the "when" is a column alias or the start of > the next merge action. I've resolved that by assigning a slightly > higher precedence to an expression without an alias, so WHEN is > assumed to not be an alias. It seems pretty ugly though (in terms of > having to duplicate so much code), and I'd be interested to know if > there's a neater way to do it. Can someone else comment on whether this is a reasonable solution to the grammar problem? > From a usability perspective, I'm still somewhat sceptical about this > approach. It's a much more verbose syntax, and it gets quite tedious > having to repeat the RETURNING list for every action, and keep them > in > sync. If we go with the single RETURNING-clause-at-the-end approach, how important is it that the action can be a part of an arbitrary expression? Perhaps something closer to your original proposal would be a good compromise (sorry to backtrack yet again...)? It couldn't be used in an arbitrary expression, but that also means that it couldn't end up in the wrong kind of expression. Regards, Jeff Davis
Adding ordering to list of available extensions
Please find attached a patch to provide some basic ordering to the system views pg_available_extensions and pg_available_extension_versions. It is sorely tempting to add ORDER BYs to many of the other views in that file, but I understand that would be contentious as there are reasons for not adding an ORDER BY. However, in the case of pg_available_extensions, it's a very, very small resultset, with an obvious default ordering, and extremely unlikely to be a part of a larger complex query. It's much more likely people like myself are just doing a "SELECT * FROM pg_available_extensions" and then get annoyed at the random ordering. Cheers, Greg show_extensions_in_natural_order.pg.patch Description: Binary data
Re: COPY TO (FREEZE)?
On Mon, Oct 30, 2023 at 02:29:05PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > That is a good point. I reviewed more of the messages and added > > capitalization where appropriate, patch attached. > > This is starting to look pretty good. I have one more thought, > as long as we're touching all these messages anyway: how about > s/FOO available only in CSV mode/FOO requires CSV mode/ ? > That's both shorter and less telegraphic, as it's not omitting the verb. Sure, updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index d12ba96497..18ecc69c33 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -224,6 +224,7 @@ COPY { table_name [ ( COPY FREEZE on a partitioned table. + This option is only allowed in COPY FROM. Note that all other sessions will immediately be able to see the data diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index c5d7d78645..cfad47b562 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -671,7 +671,7 @@ ProcessCopyOptions(ParseState *pstate, if (!opts_out->csv_mode && opts_out->quote != NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY quote available only in CSV mode"))); + errmsg("COPY QUOTE requires CSV mode"))); if (opts_out->csv_mode && strlen(opts_out->quote) != 1) ereport(ERROR, @@ -687,7 +687,7 @@ ProcessCopyOptions(ParseState *pstate, if (!opts_out->csv_mode && opts_out->escape != NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY escape available only in CSV mode"))); + errmsg("COPY ESCAPE requires CSV mode"))); if (opts_out->csv_mode && strlen(opts_out->escape) != 1) ereport(ERROR, @@ -698,46 +698,52 @@ ProcessCopyOptions(ParseState *pstate, if (!opts_out->csv_mode && (opts_out->force_quote || opts_out->force_quote_all)) ereport(ERROR, 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force quote available only in CSV mode"))); + errmsg("COPY FORCE_QUOTE requires CSV mode"))); if ((opts_out->force_quote || opts_out->force_quote_all) && is_from) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force quote only available using COPY TO"))); + errmsg("COPY FORCE_QUOTE cannot be used with COPY FROM"))); /* Check force_notnull */ if (!opts_out->csv_mode && opts_out->force_notnull != NIL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force not null available only in CSV mode"))); + errmsg("COPY FORCE_NOT_NULL requires CSV mode"))); if (opts_out->force_notnull != NIL && !is_from) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force not null only available using COPY FROM"))); +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("COPY FORCE_NOT_NULL cannot be used with COPY TO"))); /* Check force_null */ if (!opts_out->csv_mode && opts_out->force_null != NIL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force null available only in CSV mode"))); + errmsg("COPY FORCE_NULL requires CSV mode"))); if (opts_out->force_null != NIL && !is_from) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force null only available using COPY FROM"))); +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("COPY FORCE_NULL cannot be used with COPY TO"))); /* Don't allow the delimiter to appear in the null string. */ if (strchr(opts_out->null_print, opts_out->delim[0]) != NULL) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY delimiter must not appear in the NULL specification"))); +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("COPY delimiter character must not appear in the NULL specification"))); /* Don't allow the CSV quote char to appear in the null string. 
*/ if (opts_out->csv_mode && strchr(opts_out->null_print, opts_out->quote[0]) != NULL) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("CSV quote character must not appear in the NULL specification"))); + /* Check freeze */ + if (opts_out->freeze && !is_from) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("COPY FREEZE cannot be used with COPY TO"))); + if (opts_out->default_print) { if (!is_from) diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out index 95ec7363af..c4178b9c07 100644 --- a/src/test/regress/expected/copy2.out +++ b/src/test/regress/expected/copy2.out @@ -83,17 +83,17 @@ ERROR: cannot specify DELIMITER in BINARY mode COPY x to stdin (format BINARY, null
Re: Some performance degradation in REL_16 vs REL_15
Hi, On 2023-10-30 15:28:53 +0300, Anton A. Melnikov wrote: > For REL_16_STABLE at 7cc2f59dd the average TPS was: 2020+-70, > for REL_10_STABLE at c18c12c98 - 2260+-70 > > The percentage difference was approximately 11%. > Please see the 16vs10.png picture with the graphical representation of the > data obtained. > Also there are the raw data in the raw_data_s21.txt. > > In a few days I hope to perform the additional measurements that were mentioned > above in this letter. > It would be interesting to establish the reason for this difference. And I > would be very grateful > if you could advise me what other settings can be tweaked. There's really no point in comparing performance with assertions enabled (leaving aside assertions that cause extreme performance differences, making development harder). We very well might have added assertions making things more expensive, without affecting performance in optimized/non-assert builds. Greetings, Andres Freund
Re: Row pattern recognition
On Tue, Oct 24, 2023 at 7:49 PM Tatsuo Ishii wrote: > I am impressed by the simple NFA implementation. Thanks! > It would be nicer if it > could be implemented without using recursion. Yeah. If for some reason we end up going with a bespoke implementation, I assume we'd just convert the algorithm to an iterative one and optimize it heavily. But I didn't want to do that too early, since it'd probably make it harder to add new features... and anyway my goal is still to try to reuse src/backend/regex eventually. > > SELECT aid, first_value(aid) OVER w, > > count(*) OVER w > > FROM pgbench_accounts > > WINDOW w AS ( > > PARTITION BY bid > > ORDER BY aid > > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > > AFTER MATCH SKIP PAST LAST ROW > > INITIAL > > PATTERN (START UP+) > > DEFINE > > START AS TRUE, > > UP AS aid > PREV(aid) > > ); > > I ran this against your patch. It failed around > 60k rows. Nice, that's actually more frames than I expected. Looks like I have similar results here with my second test query (segfault at ~58k rows). Thanks, --Jacob
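For readers who want to experiment, here is a hedged, iterative Python toy for the quoted pattern -- PATTERN (START UP+) with UP defined as the current value being greater than PREV -- returning the length of a match anchored at the first row (0 when UP+ never matches). It only illustrates that this particular automaton needs no recursion; the real patch matches within window frames and is far more general.

```python
def match_start_up_plus(values):
    # START matches any single row; UP+ then needs at least one strictly
    # rising step, extended greedily while values keep rising
    if len(values) < 2 or values[1] <= values[0]:
        return 0
    n = 2
    while n < len(values) and values[n] > values[n - 1]:
        n += 1
    return n

match_start_up_plus([1, 2, 3, 2])          # -> 3 (rows 1, 2, 3 match)
match_start_up_plus([3, 1])                # -> 0 (no rising step for UP+)
match_start_up_plus(list(range(100_000)))  # -> 100000, no recursion depth limit
```

The last call is the interesting one: a recursive matcher would blow the stack at this depth (much like the ~58k-row segfault mentioned above), while the loop handles it trivially.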
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
Hi, On 2023-10-30 16:08:50 +0900, Michael Paquier wrote: > From 26a8432fe3ab8426e7797d85d19b0fe69d3384c9 Mon Sep 17 00:00:00 2001 > From: Michael Paquier > Date: Mon, 30 Oct 2023 16:02:52 +0900 > Subject: [PATCH v4] Require recovery.signal or standby.signal when reading a > backup_file > > Historically, the startup process uses two static variables to control > if archive recovery should happen, when either recovery.signal or > standby.signal are defined in the data folder at the beginning of > recovery: I think the problem with these variables is that they're a really messy state machine - something this patch doesn't meaningfully improve IMO. > This configuration was possible when recovering from a base backup taken > by pg_basebackup without -R. Note that the documentation requires at > least to set recovery.signal to restore from a backup, but the startup > process was not making this policy explicit. Maybe I just didn't check the right place, but from what I saw, this is, at most, implied, rather than explicitly stated. > In most cases, one would have been able to complete recovery, but that's a > matter of luck, really, as it depends on the workload of the origin server. With -X ... we have all the necessary WAL locally, how does the workload on the primary matter? If you pass --no-slot, pg_basebackup might fail to fetch the necessary WAL, but then you'd also have gotten an error. I agree with Robert that this would be a good error check on a green field, but I am less convinced it's going to help more than hurt now. Right now running pg_basebackup with -X stream, without --write-recovery-conf, gives you a copy of a cluster that will come up correctly as a distinct instance. With this change applied, you need to know that the way to avoid the existing FATAL about restore_command at startup (when recovery.signal exists but restore_command isn't set) is to set "restore_command = false", something we don't explain anywhere afaict. 
We should lessen the need to ever use restore_command, not increase it. It also seems risky to have people get used to restore_command = false, because that effectively disables detection of other timelines etc. But, this method does force a new timeline - which will be the same on each clone of the database... I also just don't think that it's always desirable to create a new timeline. Greetings, Andres Freund
Re: trying again to get incremental backup
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund wrote: > After playing with this for a while, I don't see a reason for wal_summarize_mb > from a memory usage POV at least. Cool! Thanks for testing. > I wonder if there are use cases that might like to consume WAL summaries > before the next checkpoint? For those wal_summarize_mb likely wouldn't be a > good control, but they might want to request a summary file to be created at > some point? It's possible. I actually think it's even more likely that there are use cases that will also want the WAL summarized, but in some different way. For example, you might want a summary that would give you the LSN or approximate LSN where changes to a certain block occurred. Such a summary would be way bigger than these summaries and therefore, at least IMHO, a lot less useful for incremental backup, but it could be really useful for something else. Or you might want summaries that focus on something other than which blocks got changed, like what relations were created or destroyed, or only changes to certain kinds of relations or relation forks, or whatever. In a way, you can even think of logical decoding as a kind of WAL summarization, just with a very different set of goals from this one. I won't be too surprised if the next hacker wants something that is different enough from what this does that it doesn't make sense to share mechanism, but if by chance they want the same thing but dumped a bit more frequently, well, that can be done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: trying again to get incremental backup
Hi, On 2023-10-30 10:45:03 -0400, Robert Haas wrote: > On Tue, Oct 24, 2023 at 12:08 PM Robert Haas wrote: > > Note that whether to remove summaries is a separate question from > > whether to generate them in the first place. Right now, I have > > wal_summarize_mb controlling whether they get generated in the first > > place, but as I noted in another recent email, that isn't an entirely > > satisfying solution. > > I did some more research on this. My conclusion is that I should > remove wal_summarize_mb and just have a GUC summarize_wal = on|off > that controls whether the summarizer runs at all. There will be one > summary file per checkpoint, no matter how far apart checkpoints are > or how large the summary gets. Below I'll explain the reasoning; let > me know if you disagree. > What I describe above would be a bad plan if it were realistically > possible for a summary file to get so large that it might run the > machine out of memory either when producing it or when trying to make > use of it for an incremental backup. This seems to be a somewhat > difficult scenario to create. So far, I haven't been able to generate > WAL summary files more than a few tens of megabytes in size, even when > summarizing 50+ GB of WAL per summary file. One reason why it's hard > to produce large summary files is because, for a single relation fork, > the WAL summary size converges to 1 bit per modified block when the > number of modified blocks is large. This means that, even if you have > a terabyte sized relation, you're looking at no more than perhaps 20MB > of summary data no matter how much of it gets modified. Now, somebody > could have a 30TB relation and then if they modify the whole thing > they could have the better part of a gigabyte of summary data for that > relation, but if you've got a 30TB table you probably have enough > memory that that's no big deal. 
I'm not particularly worried about the rewriting-30TB-table case - that'd also generate >= 30TB of WAL most of the time. Which realistically is going to trigger a few checkpoints, even on very big instances. > But, what if you have multiple relations? I initialized pgbench with a > scale factor of 3 and also with 3 partitions and did a 1-hour > run. I got 4 checkpoints during that time and each one produced an > approximately 16MB summary file. Hm, I assume the pgbench run will be fairly massively bottlenecked on IO, due to having to read data from disk, lots of full page write and having to write out lots of data? I.e. we won't do all that many transactions during the 1h? > To get a 1GB+ WAL summary file, you'd need to modify millions of relation > forks, maybe tens of millions, and most installations aren't even going to > have that many relation forks, let alone be modifying them all frequently. I tried to find bad cases for a bit - and I am not worried. I wrote a pgbench script to create 10k single-row relations in each script, ran that with 96 clients, checkpointed, and ran a pgbench script that updated the single row in each table. After creation of the relation WAL summarizer uses LOG: level: 1; Wal Summarizer: 378433680 total in 43 blocks; 5628936 free (66 chunks); 372804744 used and creates a 26MB summary file. After checkpoint & updates WAL summarizer uses: LOG: level: 1; Wal Summarizer: 369205392 total in 43 blocks; 5864536 free (26 chunks); 363340856 used and creates a 26MB summary file. Sure, 350MB ain't nothing, but simply just executing \dt in the database created by this makes the backend use 260MB after. Which isn't going away, whereas WAL summarizer drops its memory usage soon after. > But I think that's sufficiently niche that the current patch shouldn't > concern itself with such cases. 
If we find that they're common enough > to worry about, we might eventually want to do something to mitigate > them, but whether that thing looks anything like wal_summarize_mb > seems pretty unclear. So I conclude that it's a mistake to include > that GUC as currently designed and propose to replace it with a > Boolean as described above. After playing with this for a while, I don't see a reason for wal_summarize_mb from a memory usage POV at least. I wonder if there are use cases that might like to consume WAL summaries before the next checkpoint? For those wal_summarize_mb likely wouldn't be a good control, but they might want to request a summary file to be created at some point? Greetings, Andres Freund
Re: COPY TO (FREEZE)?
Bruce Momjian writes: > That is a good point. I reviewed more of the messages and added > capitalization where appropriate, patch attached. This is starting to look pretty good. I have one more thought, as long as we're touching all these messages anyway: how about s/FOO available only in CSV mode/FOO requires CSV mode/ ? That's both shorter and less telegraphic, as it's not omitting the verb. regards, tom lane
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
On Mon, Oct 30, 2023 at 3:09 AM Michael Paquier wrote: > I have been reviewing the patch, and applied portions of it as of > dc5bd388 and 1ffdc03c and they're quite independent pieces. After > that, the remaining bits of the patch to change the behavior is now > straight-forward. I have written a commit message for it, while on > it, as per the attached. I would encourage some caution here. In a vacuum, I'm in favor of this, and for the same reasons as you, namely, that the huge pile of Booleans that we use to control recovery is confusing, and it's difficult to make sure that all the code paths are adequately tested, and I think some of the things that actually work here are not documented. But in practice, I think there is a possibility of something like this backfiring very hard. Notice that the first two people who commented on the thread saw the error and immediately removed backup_label even though that's 100% wrong. It shows how utterly willing users are to remove backup_label for any reason or no reason at all. If we convert cases where things would have worked into cases where people nuke backup_label and then it appears to work, we're going to be worse off in the long run, no matter how crazy the idea of removing backup_label may seem to us. Also, Andres just recently mentioned to me that he uses this procedure of starting a server with a backup_label but no recovery.signal or standby.signal file regularly, and thinks other people do too. I was surprised, since I've never done that, except maybe when I was a noob and didn't have a clue. But Andres is far from a noob. -- Robert Haas EDB: http://www.enterprisedb.com
Re: trying again to get incremental backup
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak wrote: > If that is still an area open for discussion: wouldn't it be better to > just specify LSN as it would allow resyncing standby across major lag > where the WAL to replay would be enormous? Given that we had > primary->standby where standby would be stuck on some LSN, right now > it would be: > 1) calculate backup manifest of desynced 10TB standby (how? using > which tool?) - even if possible, that means reading 10TB of data > instead of just putting a number, isn't it? > 2) backup primary with such incremental backup >= LSN > 3) copy the incremental backup to standby > 4) apply it to the impaired standby > 5) restart the WAL replay As you may be able to tell from the flurry of posts and new patch sets, I'm trying hard to sort out the remaining open items that pertain to this patch set, and I'm now back to thinking about this one. TL;DR: I think the idea has some potential, but there are some pitfalls that I'm not sure how to address. I spent some time looking at how we currently use the data from the backup manifest. Currently, we do two things with it. First, when we're backing up each file, we check whether it's present in the backup manifest and, if not, we back it up in full. This actually feels fairly poor. If it makes any difference at all, then presumably the underlying algorithm is buggy and needs to be fixed. Maybe that should be ripped out altogether or turned into some kind of sanity check that causes a big explosion if it fails. Second, we check whether the WAL ranges reported by the client match up with the timeline history of the server (see PrepareForIncrementalBackup). This set of sanity checks seems fairly important to me, and I'd regret discarding them. I think there's some possibility that they might catch user error, like where somebody promotes multiple standbys and maybe they even get the same timeline on more than one of them, and then confusion might ensue. 
I also think that there's a real possibility that they might make it easier to track down bugs in my code, even if those bugs aren't necessarily timeline-related. If (or more realistically when) somebody ends up with a corrupted cluster after running pg_combinebackup, we're going to need to figure out whether that corruption is the result of bugs (and if so where they are) or user error (and if so what it was). The most obvious ways of ending up with a corrupted cluster are (1) taking an incremental backup against a prior backup that is not in the history of the server from which the backup is taken or (2) combining an incremental backup with the wrong prior backup, so whatever sanity checks we can have that will tend to prevent those kinds of mistakes seem like a really good idea. And those kinds of checks seem relevant here, too. Consider that it wouldn't be valid to use pg_combinebackup to fast-forward a standby server if the incremental backup's backup-end-LSN preceded the standby server's minimum recovery point. Imagine that you have a standby whose last checkpoint's redo location was at LSN 2/48. Being the enterprising DBA that you are, you make a note of that LSN and go take an incremental backup based on it. You then stop the standby server and try to apply the incremental backup to fast-forward the standby. Well, it's possible that in the meanwhile the standby actually caught up, and now has a minimum recovery point that follows the backup-end-LSN of your incremental backup. In that case, you can't legally use that incremental backup to fast-forward that standby, but no code I've yet written would be smart enough to figure that out. Or, maybe you (or some other DBA on your team) got really excited and actually promoted that standby meanwhile, and now it's not even on the same timeline any more. 
In the "normal" case where you take an incremental backup based on an earlier base backup, these kinds of problems are detectable, and it seems to me that if we want to enable this kind of use case, it would be pretty smart to have a plan to detect similar mistakes here. I don't, currently, but maybe there is one. Another practical problem here is that, right now, pg_combinebackup doesn't have an in-place mode. It knows how to take a bunch of input backups and write out an output backup, but that output backup needs to go into a new, fresh directory (or directories plural, if there are user-defined tablespaces). I had previously considered adding such a mode, but the idea I had at the time wouldn't have worked for this case. I imagined that someone might want to run "pg_combinebackup --in-place full incr" and clobber the contents of the incr directory with the output, basically discarding the incremental backup you took in favor of a full backup that could have been taken at the same point in time. But here, you'd want to clobber the *first* input to pg_combinebackup, not the last one, so if we want to add something like this, the UI needs some thought. One thing that I find
Re: always use runtime checks for CRC-32C instructions
Nathan Bossart writes:
> The aforementioned other thread [0] aims to further optimize this code by
> using another instruction that requires additional configure and/or runtime
> checks. $SUBJECT has been in the back of my mind for a while, but given
> proposals to add further complexity to this code, I figured it might be a
> good time to propose this simplification. Specifically, I think we
> shouldn't worry about trying to compile only the special intrinsics
> versions, and instead always try to build both and choose the appropriate
> one at runtime.

On the one hand, I agree that we need to keep the complexity from getting out of hand. On the other hand, I wonder if this approach isn't optimizing for the wrong case. How many machines that PG 17 will ever be run on in production will lack SSE 4.2 (for Intel) or ARMv8 instructions (on the ARM side)? It seems like a shame to be burdening these instructions with a subroutine call for the benefit of long-obsolete hardware versions. Maybe that overhead is negligible, but it doesn't sound like you tried to measure it. I'm not quite sure what to propose instead, though. I thought for a little bit about a configure switch to select "test first" or "pedal to the metal". But in practice, package builders would probably have to select the conservative "test first" option; and we know that the vast majority of modern installations use prebuilt packages, so it's not clear that this answer would help many people. Anyway, I agree that the cost of a one-time-per-process probe should be negligible; it's the per-use cost that I worry about. If you can do some measurements proving that that worry is ill-founded, then I'm good with test-first. regards, tom lane
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
On Mon, Oct 30, 2023 at 1:09 AM Michael Paquier wrote: > > I have been reviewing the patch, and applied portions of it as of > dc5bd388 and 1ffdc03c and they're quite independent pieces. After > that, the remaining bits of the patch to change the behavior is now > straight-forward. I have written a commit message for it, while on > it, as per the attached. > A suggestion for the hint message in an effort to improve readability: "If you are restoring from a backup, ensure \"%s/recovery.signal\" or \"%s/standby.signal\" is present and add required recovery options." I realize the original use of "touch" is a valid shortcut for what I suggest above, however that will be less clear for the not-so-un*x-inclined users of Postgres, while for some it'll be downright confusing, IMHO. It also provides the advantage of being crystal clear on what needs to be done to fix the problem. Roberto
Re: CRC32C Parallel Computation Optimization on ARM
On Fri, Oct 27, 2023 at 07:01:10AM +, Xiang Gao wrote:
> On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>>> We consider that a runtime check needs to be done in any scenario.
>>> Here we only confirm that the compilation can be successful.
>>> A runtime check will be done when choosing which algorithm.
>>> You can think of us as merging USE_ARMV8_VMULL and
>>> USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
>>
>> Oh. Looking again, I see that we are using a runtime check for ARM in all
>> cases with this patch. If so, maybe we should just remove
>> USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
>> USE_ARMV8_CRC32C always do the runtime check). I suspect there are other
>> opportunities to simplify things, too.
>
> Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.

Thanks. I went ahead and split this prerequisite part out to a separate thread [0] since it's sort-of unrelated to your proposal here. It's not really a prerequisite, but I do think it will simplify things a bit.

[0] https://postgr.es/m/20231030161706.GA3011%40nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
always use runtime checks for CRC-32C instructions
This is an offshoot of the "CRC32C Parallel Computation Optimization on ARM" thread [0]. I intend for this to be a prerequisite patch set. Presently, for the SSE 4.2 and ARMv8 CRC instructions used in the CRC32C code for WAL records, etc., we first check if the intrinsics are available with the default compiler flags. If so, we only bother compiling the implementation that uses those intrinsics. If not, we also check whether the intrinsics are available with some extra CFLAGS, and if they are, we compile both the implementation that uses the intrinsics as well as a fallback implementation that doesn't require any special instructions. Then, at runtime, we check what's available in the hardware and choose the appropriate CRC32C implementation. The aforementioned other thread [0] aims to further optimize this code by using another instruction that requires additional configure and/or runtime checks. $SUBJECT has been in the back of my mind for a while, but given proposals to add further complexity to this code, I figured it might be a good time to propose this simplification. Specifically, I think we shouldn't worry about trying to compile only the special intrinsics versions, and instead always try to build both and choose the appropriate one at runtime. AFAICT the trade-offs aren't too bad. With some simple testing, I see that the runtime check occurs once at startup, so I don't anticipate any noticeable performance impact. I suppose each process might need to do the check in EXEC_BACKEND builds, but even so, I suspect the difference is negligible. I also see that the SSE 4.2 runtime check requires the CPUID instruction, so we wouldn't use the intrinsics for hardware that supports SSE 4.2 but not CPUID. However, I'm not sure such hardware even exists. Wikipedia says that CPUID was introduced in 1993 [1], and meson.build appears to omit the CPUID check when determining which CRC32C implementation to use. 
Furthermore, meson.build alludes to problems with some of the CPUID-related checks:

  # XXX: The configure.ac check for __cpuid() is broken, we don't copy that
  # here. To prevent problems due to two detection methods working, stop
  # checking after one.

Are there any other reasons that we should try to avoid the runtime check when possible?

I've attached two patches. 0001 adds a debug message to the SSE 4.2 runtime check that matches the one already present for the ARMv8 check. This message just notes whether the runtime check found that the special CRC instructions are available. 0002 is a first attempt at $SUBJECT. I've tested it on both x86 and ARM, and it seems to work as intended. You'll notice that I'm still checking for the intrinsics with the default compiler flags first. I didn't see any strong reason to change this, and doing so allows us to avoid sending extra CFLAGS when possible.

Thoughts?

[0] https://postgr.es/m/DB9PR08MB6991329A73923BF8ED4B3422F5DBA%40DB9PR08MB6991.eurprd08.prod.outlook.com
[1] https://en.wikipedia.org/wiki/CPUID

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

From 964afc75976cce8d712d97f346d2b8eea9c1f1ee Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 28 Oct 2023 22:12:45 -0500
Subject: [PATCH v1 1/2] add debug message

---
 src/port/pg_crc32c_sse42_choose.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 41ff4a35ad..3689c38e92 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -30,10 +30,15 @@
 #include "port/pg_crc32c.h"
 
+#ifndef FRONTEND
+#include "utils/elog.h"
+#endif
+
 static bool
 pg_crc32c_sse42_available(void)
 {
 	unsigned int exx[4] = {0, 0, 0, 0};
+	bool		result;
 
 #if defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
@@ -43,7 +48,13 @@ pg_crc32c_sse42_available(void)
 #error cpuid instruction not available
 #endif
 
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
+	result = ((exx[2] & (1 << 20)) != 0);	/* SSE 4.2 */
+
+#ifndef FRONTEND
+	elog(DEBUG1, "using sse42 crc32 hardware = %d", result);
+#endif
+
+	return result;
 }
 
 /*
-- 
2.37.1 (Apple Git-137.1)

From 7b8efae00327728000f7650a513ab0fd4fb15cd5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Sat, 28 Oct 2023 21:16:19 -0500
Subject: [PATCH v1 2/2] always use runtime checks for sse4.2/armv8 crc32c
 code

---
 configure                    | 122 ++-
 configure.ac                 | 104 ++---
 meson.build                  |  31 +++--
 src/include/pg_config.h.in   |   6 --
 src/include/port/pg_crc32c.h |  19 +-
 src/port/meson.build         |   2 -
 src/tools/msvc/Solution.pm   |   2 -
 7 files changed, 80 insertions(+), 206 deletions(-)

diff --git a/configure b/configure
index cfd968235f..47372dcd18 100755
--- a/configure
+++ b/configure
@@ -17885,28 +17885,6 @@ fi
 fi
 
-# Are we targeting a
Re: Add semi-join pushdown to postgres_fdw
Hi, Alexander!

Thank you for working on this. I believe this is a very interesting patch, which significantly improves our FDW-based distributed facilities. This is why I decided to review this.

On Fri, Jan 20, 2023 at 11:00 AM Alexander Pyhalov wrote:
> Tomas Vondra wrote 2023-01-19 20:49:
> > I took a quick look at the patch. It needs a rebase, although it applies
> > fine using patch.
> >
> > A couple minor comments:
> >
> > 1) addl_conds seems a bit hard to understand, I'd use either the full
> > wording (additional_conds) or maybe extra_conds
>
> Renamed to additional_conds.
>
> > 2) some of the lines got quite long, and need a wrap
>
> Split some of them. Not sure if it's enough.
>
> > 3) unknown_subquery_rels name is a bit misleading - AFAIK it's the rels
> > that can't be referenced from upper rels (per what the .h says). So they
> > are known, but hidden. Is there a better name?
>
> Renamed to hidden_subquery_rels. These are rels, which can't be referred
> to from upper join levels.
>
> > 4) joinrel_target_ok() needs a better comment, explaining *when* the
> > reltarget is safe for pushdown. The conditions are on the same row, but
> > the project style is to break after '&&'.
>
> Added comment. It seems to be a rephrasing of lower comment in
> joinrel_target_ok().

+	/*
+	 * We can't push down join if its reltarget is not safe
+	 */
+	if (!joinrel_target_ok(root, joinrel, jointype, outerrel, innerrel))
 		return false;

As I understand it, the joinrel_target_ok() function does meaningful checks only for semi joins and always returns false for all other kinds of joins. I think we should call it only for semi joins and name the function accordingly.

+	fpinfo->unknown_subquery_rels = bms_union(fpinfo_o->unknown_subquery_rels,
+											  fpinfo_i->unknown_subquery_rels);

Should the comment before this code block be revised? 
+		case JOIN_SEMI:
+			fpinfo->joinclauses = list_concat(fpinfo->joinclauses,
+											  fpinfo_i->remote_conds);
+			fpinfo->joinclauses = list_concat(fpinfo->joinclauses,
+											  fpinfo->remote_conds);
+			fpinfo->remote_conds = list_copy(fpinfo_o->remote_conds);
+			fpinfo->unknown_subquery_rels = bms_union(fpinfo->unknown_subquery_rels,
+													  innerrel->relids);
+			break;

I think that the comment before the switch() should definitely be revised.

+	Relids		hidden_subquery_rels;	/* relids, which can't be referred to
+										 * from upper relations */

Could this definition contain the positive part? Can't be referred to from upper relations, but used internally for semi joins (or something like that)?

Also, I think the machinery around the append_conds could be somewhat simpler if we turn them into a list (list of strings). I think that should make the code clearer and also save us some memory allocations.

In [1] you've referenced the cases when your patch can't push down semi-joins. It doesn't seem impossible to handle these cases, but that would make the patch much more complicated. I'm OK to continue with a simpler patch to handle the majority of cases. Could you please add the cases, which can't be pushed down with the current patch, to the test suite?

Links
1. https://www.postgresql.org/message-id/816fa8b1bc2da09a87484d1ef239a332%40postgrespro.ru

--
Regards,
Alexander Korotkov
Re: POC, WIP: OR-clause support for indexes
On Mon, Oct 30, 2023 at 6:40 AM Robert Haas wrote:
> I agree with that, with some caveats, mainly that the reverse is to
> some extent also true. Maybe not completely, because arguably the
> ANY() formulation should just be straight-up easier to deal with, but
> in principle, the two are equivalent and it shouldn't matter which
> representation we pick.

I recently looked into MySQL's handling of these issues, which is more mature and better documented than what we can do. EXPLAIN ANALYZE will show an IN() list as if the query had been written as a list of ORs, even though it can efficiently execute an index scan that uses IN()/"OR var = constant" lists. So I agree with what you said here. It is perhaps just an accident of history that we're talking about converting to a ScalarArrayOpExpr, rather than talking about converting to some other clause type that we associate with OR lists. The essential point is that there ought to be one clause type that is easier to deal with.

> But practically, it may, and we need to be sure that we don't put in
> place a translation that is theoretically a win but in practice leads
> to large regressions. Avoiding regressions here is more important than
> capturing all the possible gains. A patch that wins in some scenarios
> and does nothing in others can be committed; a patch that wins in even
> more scenarios but causes serious regressions in some cases probably
> can't.

I agree. Most of the really big wins here will come from simple transformations. I see no reason why we can't take an incremental approach. In fact I think we almost have to do so, since as I understand it the transformations are just infeasible in certain extreme cases.

--
Peter Geoghegan
Re: trying again to get incremental backup
While reviewing this thread today, I realized that I never responded to this email. That was inadvertent; my apologies. On Wed, Jun 14, 2023 at 4:34 PM Matthias van de Meent wrote: > Nice, I like this idea. Cool. > Skimming through the 7th patch, I see claims that FSM is not fully > WAL-logged and thus shouldn't be tracked, and so it indeed doesn't > track those changes. > I disagree with that decision: we now have support for custom resource > managers, which may use the various forks for other purposes than > those used in PostgreSQL right now. It would be a shame if data is > lost because of the backup tool ignoring forks because the PostgreSQL > project itself doesn't have post-recovery consistency guarantees in > that fork. So, unless we document that WAL-logged changes in the FSM > fork are actually not recoverable from backup, regardless of the type > of contents, we should still keep track of the changes in the FSM fork > and include the fork in our backups or only exclude those FSM updates > that we know are safe to ignore. I'm not sure what to do about this problem. I don't think any data would be *lost* in the scenario that you mention; what I think would happen is that the FSM forks would be backed up in their entirety even if they were owned by some other table AM or index AM that was WAL-logging all changes to whatever it was storing in that fork. So I think that there is not a correctness issue here but rather an efficiency issue. It would still be nice to fix that somehow, but I don't see how to do it. It would be easy to make the WAL summarizer stop treating the FSM as a special case, but there's no way for basebackup_incremental.c to know whether a particular relation fork is for the heap AM or some other AM that handles WAL-logging differently. It can't for example examine pg_class; it's not connected to any database, let alone every database. 
So we have to either trust that the WAL for the FSM is correct and complete in all cases, or assume that it isn't in any case. And the former doesn't seem like a safe or wise assumption given how the heap AM works. I think the reality here is unfortunately that we're missing a lot of important infrastructure to really enable a multi-table-AM world. The heap AM, and every other table AM, should include a metapage so we can tell what we're looking at just by examining the disk files. Relation forks don't scale and should be replaced with some better system that does. We should have at least two table AMs in core that are fully supported and do truly useful things. Until some of that stuff (and probably a bunch of other things) get sorted out, out-of-core AMs are going to have to remain second-class citizens to some degree. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_rewind WAL segments deletion pitfall
Hi,

On Wed, 18 Oct 2023 at 08:50, torikoshia wrote:
>
> I have very minor questions on the regression tests mainly regarding the
> consistency with other tests for pg_rewind:

Please find attached a new version of the patch. It addresses all your comments.

Regards,
--
Alexander Kukushkin

From 08d2cac946b92f05d410d75ed026ebf6587b6671 Mon Sep 17 00:00:00 2001
From: Alexander Kukushkin
Date: Mon, 30 Oct 2023 16:23:15 +0200
Subject: [PATCH] Be more picky with WAL segment deletion in pg_rewind

Make pg_rewind to be a bit wiser in terms of creating filemap:
preserve on the target all WAL segments that contain records between
the last common checkpoint and the point of divergence.

Co-authored-by: Polina Bungina
---
 src/bin/pg_rewind/filemap.c                   | 62 +-
 src/bin/pg_rewind/filemap.h                   |  4 ++
 src/bin/pg_rewind/parsexlog.c                 | 22 +++
 src/bin/pg_rewind/pg_rewind.c                 |  3 +
 src/bin/pg_rewind/t/010_keep_recycled_wals.pl | 64 +++
 5 files changed, 153 insertions(+), 2 deletions(-)
 create mode 100644 src/bin/pg_rewind/t/010_keep_recycled_wals.pl

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ecadd69dc5..ebaebfb149 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -64,6 +64,27 @@ static file_entry_t *lookup_filehash_entry(const char *path);
 static int	final_filemap_cmp(const void *a, const void *b);
 static bool check_file_excluded(const char *path, bool is_source);
 
+typedef struct skipwal_t {
+	const char *path;
+	uint32		status;
+} skipwal_t;
+
+#define SH_PREFIX		keepwalhash
+#define SH_ELEMENT_TYPE	skipwal_t
+#define SH_KEY_TYPE		const char *
+#define SH_KEY			path
+#define SH_HASH_KEY(tb, key)	hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b)		(strcmp(a, b) == 0)
+#define SH_SCOPE		static inline
+#define SH_RAW_ALLOCATOR	pg_malloc0
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static keepwalhash_hash *keepwalhash = NULL;
+
+static bool keepwalhash_entry_exists(const char *path);
+
 /*
  * Definition of one element part of an exclusion list, used to exclude
  * contents when rewinding. "name" is the name of the file or path to
@@ -207,6 +228,35 @@ lookup_filehash_entry(const char *path)
 	return filehash_lookup(filehash, path);
 }
 
+/* Initialize a hash table to store WAL file names that must be kept */
+void
+keepwalhash_init(void)
+{
+	keepwalhash = keepwalhash_create(FILEHASH_INITIAL_SIZE, NULL);
+}
+
+/* Prevent a given file deletion during rewind */
+void
+insert_keepwalhash_entry(const char *path)
+{
+	skipwal_t  *entry;
+	bool		found;
+
+	/* Should only be called with keepwalhash initialized */
+	Assert(keepwalhash);
+
+	entry = keepwalhash_insert(keepwalhash, path, &found);
+
+	if (!found)
+		entry->path = pg_strdup(path);
+}
+
+static bool
+keepwalhash_entry_exists(const char *path)
+{
+	return keepwalhash_lookup(keepwalhash, path) != NULL;
+}
+
 /*
  * Callback for processing source file list.
  *
@@ -682,7 +732,16 @@ decide_file_action(file_entry_t *entry)
 	}
 	else if (entry->target_exists && !entry->source_exists)
 	{
-		/* File exists in target, but not source. Remove it. */
+		/* File exists in target, but not source. */
+
+		if (keepwalhash_entry_exists(path))
+		{
+			/* This is a WAL file that should be kept. */
+			pg_log_debug("Not removing %s because it is required for recovery", path);
+			return FILE_ACTION_NONE;
+		}
+
+		/* Otherwise remove an unexpected file. */
 		return FILE_ACTION_REMOVE;
 	}
 	else if (!entry->target_exists && !entry->source_exists)
@@ -818,7 +877,6 @@ decide_file_actions(void)
 	return filemap;
 }
 
-
 /*
  * Helper function for filemap hash table.
  */
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 988d4590e0..f037448a0f 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -109,5 +109,9 @@ extern void process_target_wal_block_change(ForkNumber forknum,
 extern filemap_t *decide_file_actions(void);
 extern void calculate_totals(filemap_t *filemap);
 extern void print_filemap(filemap_t *filemap);
+extern void preserve_file(char *filepath);
+
+extern void keepwalhash_init(void);
+extern void insert_keepwalhash_entry(const char *path);
 
 #endif							/* FILEMAP_H */
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 0233ece88b..b89b65077d 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -176,6 +176,10 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	char	   *errormsg;
 	XLogPageReadPrivate private;
 
+	/* Track WAL segments opened while searching a checkpoint */
+	XLogSegNo	segno = 0;
+	TimeLineID	tli = 0;
+
 	/*
 	 * The given fork pointer points to the end of the last common record,
 	 * which is not necessarily the beginning of the next record, if the
@@ -217,6 +221,24 @@ findLastCheckpoint(const char *datadir, XLogRecPtr
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
Rebased patchset. No changes in actual patches. 0001 is squashed version of patches submitted at https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1csrjcmyyj8pd...@mail.gmail.com. On Tue, Oct 3, 2023 at 6:42 PM Ashutosh Bapat wrote: > > On Fri, Sep 29, 2023 at 8:36 AM Amit Langote wrote: > > IOW, something > > like the following would have sufficed: > > > > @@ -1735,6 +1735,10 @@ build_child_join_sjinfo(PlannerInfo *root, > > SpecialJoinInfo *parent_sjinfo, > > /* > > * free_child_sjinfo_members > > * Free memory consumed by members of a child SpecialJoinInfo. > > + * > > + * Only members that are translated copies of their counterpart in the > > parent > > + * SpecialJoinInfo are freed here. However, members that could be > > referenced > > + * elsewhere are not freed. > > */ > > static void > > free_child_sjinfo_members(SpecialJoinInfo *child_sjinfo) > > @@ -1755,10 +1759,7 @@ free_child_sjinfo_members(SpecialJoinInfo > > *child_sjinfo) > > bms_free(child_sjinfo->syn_lefthand); > > bms_free(child_sjinfo->syn_righthand); > > > > - /* > > -* But the list of operator OIDs and the list of expressions may be > > -* referenced somewhere else. Do not free those. > > -*/ > > + /* semi_rhs_exprs may be referenced, so don't free. */ > > } > > Works for me. PFA patchset with these changes. I have still left the > changes addressing your comments as a separate patch for easier > review. > > -- > Best Wishes, > Ashutosh Bapat -- Best Wishes, Ashutosh Bapat
Remove obsolete check for TTSOpsVirtual from slot_compile_deform
Hi hackers,

While exploring the JIT support for the tuple deforming process, I noticed that one check for TTSOpsVirtual in slot_compile_deform is obsolete. Since virtual tuples never need deforming and there's already an assertion in llvm_compile_expr[1], I simply replaced the check with an assertion in slot_compile_deform. Patch is attached.

[1] https://github.com/postgres/postgres/blob/0c60e8ba80e03491b028204a19a9dca6d216df91/src/backend/jit/llvm/llvmjit_expr.c#L322

Best Regards,
Xing

From 6db602bd7d0964b60134c4b20b4107a1e8717b78 Mon Sep 17 00:00:00 2001
From: Xing Guo
Date: Mon, 30 Oct 2023 22:34:10 +0800
Subject: [PATCH v1] Remove obsolete check for TTSOpsVirtual from
 slot_compile_deform.

Since virtual tuples never need deforming and there's no
EEOP_*_FETCHSOME step generated for them, we don't need to check for
TTSOpsVirtual in slot_compile_deform. Replace the check with an
assertion instead.
---
 src/backend/jit/llvm/llvmjit_deform.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_deform.c b/src/backend/jit/llvm/llvmjit_deform.c
index b7e88e7ca9..e652cda77f 100644
--- a/src/backend/jit/llvm/llvmjit_deform.c
+++ b/src/backend/jit/llvm/llvmjit_deform.c
@@ -90,9 +90,8 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc,
 	int			attnum;
 
-	/* virtual tuples never need deforming, so don't generate code */
-	if (ops == &TTSOpsVirtual)
-		return NULL;
+	/* virtual tuples never need deforming */
+	Assert(ops != &TTSOpsVirtual);
 
 	/* decline to JIT for slot types we don't know to handle */
 	if (ops != &TTSOpsHeapTuple && ops != &TTSOpsBufferHeapTuple &&
-- 
2.42.0
Re: Fwd: Annoying corruption in PostgreSQL.
Greetings, Please don't top-post on these lists. * Kirill Reshke (reshkekir...@gmail.com) wrote: > We have physical backups and we can PITR. But restoring a cluster to some > point in the past is a bit of a different task: we need our client's > approval for these operations, since we are a Managed DBs Cloud Provider. > Will try to ask someone. Do you have page-level checksums enabled for these PG instances? Are you able to see if these clusters which are reporting the corruption have been restored in the past from a backup? What are you using to perform your backups and perform your restores? Are you able to see if these clusters have ever crashed and come back up after by doing WAL replay? Where I'm heading with these questions is essentially: I suspect either your backup/restore procedure is broken or you're running on a system that doesn't properly fsync data. Or possibly both. Oh, and you should probably have checksums enabled. Thanks, Stephen signature.asc Description: PGP signature
Re: trying again to get incremental backup
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas wrote: > Note that whether to remove summaries is a separate question from > whether to generate them in the first place. Right now, I have > wal_summarize_mb controlling whether they get generated in the first > place, but as I noted in another recent email, that isn't an entirely > satisfying solution. I did some more research on this. My conclusion is that I should remove wal_summarize_mb and just have a GUC summarize_wal = on|off that controls whether the summarizer runs at all. There will be one summary file per checkpoint, no matter how far apart checkpoints are or how large the summary gets. Below I'll explain the reasoning; let me know if you disagree. What I describe above would be a bad plan if it were realistically possible for a summary file to get so large that it might run the machine out of memory either when producing it or when trying to make use of it for an incremental backup. This seems to be a somewhat difficult scenario to create. So far, I haven't been able to generate WAL summary files more than a few tens of megabytes in size, even when summarizing 50+ GB of WAL per summary file. One reason why it's hard to produce large summary files is because, for a single relation fork, the WAL summary size converges to 1 bit per modified block when the number of modified blocks is large. This means that, even if you have a terabyte sized relation, you're looking at no more than perhaps 20MB of summary data no matter how much of it gets modified. Now, somebody could have a 30TB relation and then if they modify the whole thing they could have the better part of a gigabyte of summary data for that relation, but if you've got a 30TB table you probably have enough memory that that's no big deal. But, what if you have multiple relations? I initialized pgbench with a scale factor of 3 and also with 3 partitions and did a 1-hour run. 
I got 4 checkpoints during that time and each one produced an approximately 16MB summary file. The efficiency here drops considerably. For example, one of the files is 16495398 bytes and records information on 7498403 modified blocks, which works out to about 2.2 bytes per modified block. That's more than an order of magnitude worse than what I got in the single-relation case, where the summary file didn't even use two *bits* per modified block. But here again, the file just isn't that big in absolute terms. To get a 1GB+ WAL summary file, you'd need to modify millions of relation forks, maybe tens of millions, and most installations aren't even going to have that many relation forks, let alone be modifying them all frequently.

My conclusion here is that it's pretty hard to have a database where WAL summarization is going to use too much memory. I wouldn't be terribly surprised if there are some extreme cases where it happens, but those databases probably aren't great candidates for incremental backup anyway. They're probably databases with millions of relations and frequent, widely-scattered modifications to those relations. And if you have that kind of high turnover rate then incremental backup isn't going to be as helpful anyway, so there's probably no reason to enable WAL summarization in the first place. Maybe if you have that, plus 100TB of completely static data in the same database cluster that is never modified, and if you also do all of this on a pretty small machine, then you can find a case where incremental backup would have worked well but for the memory consumed by WAL summarization. But I think that's sufficiently niche that the current patch shouldn't concern itself with such cases. If we find that they're common enough to worry about, we might eventually want to do something to mitigate them, but whether that thing looks anything like wal_summarize_mb seems pretty unclear.
So I conclude that it's a mistake to include that GUC as currently designed and propose to replace it with a Boolean as described above. Comments? -- Robert Haas EDB: http://www.enterprisedb.com
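The size figures in the message above can be sanity-checked with a bit of back-of-the-envelope arithmetic, assuming the default 8kB PostgreSQL block size (a quick illustrative sketch, not PostgreSQL code):

```python
# Sanity-check the WAL-summary size figures quoted above, assuming the
# default 8kB PostgreSQL block size.

BLOCK_SIZE = 8192

# Best case: summary size converges to ~1 bit per modified block, so a
# fully modified 1TB relation needs roughly:
blocks_1tb = (1 * 1024**4) // BLOCK_SIZE
mib_1tb = blocks_1tb / 8 / 1024**2
assert round(mib_1tb) == 16       # ~16 MiB, i.e. "no more than perhaps 20MB"

# A fully modified 30TB relation:
blocks_30tb = (30 * 1024**4) // BLOCK_SIZE
mib_30tb = blocks_30tb / 8 / 1024**2
assert round(mib_30tb) == 480     # ~0.5 GiB, "the better part of a gigabyte"

# Observed efficiency in the partitioned pgbench run:
bytes_per_block = 16495398 / 7498403
assert round(bytes_per_block, 1) == 2.2
```

The quoted numbers are internally consistent: the multi-relation pgbench case really is more than an order of magnitude worse than the ~1-bit-per-block single-relation case, yet still small in absolute terms.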
Re: Report planning memory in EXPLAIN ANALYZE
>
> David challenged something at the begining, but IIUC he also agree the
> value of patch 01 as the previous statement after discussion. Since the patch
> is mild itself, so I will mark this commitfest entry as "Ready for committer".
> Please correct me if anything is wrong.

Thanks Andy. Here are the rebased patches; a conflict in explain.out has been resolved.

-- 
Best Wishes,
Ashutosh Bapat

From 171ce2bd03f846e7ba3e6972b1f51a432d5f75c5 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Wed, 23 Aug 2023 16:23:12 +0530
Subject: [PATCH 2/3] Report memory allocated along with memory used in EXPLAIN

Memory might be palloc'ed and pfree'ed during planning. The memory used
at the end of planning may therefore miss a large chunk of memory that
was palloc'ed and pfree'ed during planning, even though the corresponding
memory may remain allocated in the memory context. Hence report both
memory used and memory allocated to detect any such activity during
planning.

Ashutosh Bapat, per suggestion by David Rowley
---
 src/backend/commands/explain.c        | 69 +++
 src/backend/commands/prepare.c        | 12 +++--
 src/backend/utils/mmgr/mcxt.c         | 13 ++---
 src/include/commands/explain.h        | 11 -
 src/include/utils/memutils.h          |  3 +-
 src/test/regress/expected/explain.out | 25 +++---
 src/tools/pgindent/typedefs.list      |  1 +
 7 files changed, 103 insertions(+), 31 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dfa4447794..efc4887244 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -122,6 +122,8 @@ static const char *explain_get_index_name(Oid indexId);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 							  bool planning);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
+static void show_planning_memory(ExplainState *es,
+								 const MemUsage *usage);
 static void ExplainIndexScanDetails(Oid indexid, ScanDirection indexorderdir,
 									ExplainState *es);
 static void ExplainScanTarget(Scan *plan, ExplainState *es);
@@ -397,11 +399,13 @@
ExplainOneQuery(Query *query, int cursorOptions,
 					planduration;
 		BufferUsage bufusage_start,
 					bufusage;
-		Size		mem_consumed;
+		MemoryContextCounters mem_counts_start;
+		MemoryContextCounters mem_counts_end;
+		MemUsage	mem_usage;
 
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
-		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);
+		MemoryContextMemConsumed(CurrentMemoryContext, &mem_counts_start);
 		INSTR_TIME_SET_CURRENT(planstart);
 
 		/* plan the query */
@@ -409,8 +413,8 @@ ExplainOneQuery(Query *query, int cursorOptions,
 		INSTR_TIME_SET_CURRENT(planduration);
 		INSTR_TIME_SUBTRACT(planduration, planstart);
 
-		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext)
-			- mem_consumed;
+		MemoryContextMemConsumed(CurrentMemoryContext, &mem_counts_end);
+		calc_mem_usage(&mem_usage, &mem_counts_end, &mem_counts_start);
 
 		/* calc differences of buffer counters. */
 		if (es->buffers)
@@ -422,7 +426,7 @@ ExplainOneQuery(Query *query, int cursorOptions,
 		/* run it (if needed) and produce output */
 		ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
 					   &planduration, (es->buffers ? &bufusage : NULL),
-					   &mem_consumed);
+					   &mem_usage);
 	}
 }
 
@@ -532,7 +536,7 @@
 void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage, const Size *mem_consumed)
+			   const BufferUsage *bufusage, const MemUsage *mem_usage)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -635,9 +639,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		ExplainPropertyFloat("Planning Time", "ms", 1000.0 * plantime, 3, es);
 	}
 
-	if (es->summary && mem_consumed)
-		ExplainPropertyUInteger("Planning Memory", "bytes",
-								(uint64) *mem_consumed, es);
+	if (es->summary && mem_usage)
+	{
+		ExplainOpenGroup("Planning Memory", "Planning Memory", true, es);
+		show_planning_memory(es, mem_usage);
+		ExplainCloseGroup("Planning Memory", "Planning Memory", true, es);
+	}
 
 	/* Print info about runtime of triggers */
 	if (es->analyze)
@@ -3775,6 +3782,50 @@ show_wal_usage(ExplainState *es, const WalUsage *usage)
 	}
 }
 
+/*
+ * Show planner's memory usage details.
+ */
+static void
+show_planning_memory(ExplainState *es, const MemUsage *usage)
+{
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfo(es->str,
+						 "Planning Memory: used=%zu bytes allocated=%zu bytes",
+						 usage->mem_used, usage->mem_allocated);
+		appendStringInfoChar(es->str, '\n');
+	}
+	else
+	{
+		ExplainPropertyInteger("Used", "bytes", usage->mem_used, es);
+		ExplainPropertyInteger("Allocated", "bytes", usage->mem_allocated, es);
+	}
+}
+
+/*
+ * Compute memory usage from the start and end memory counts.
+ */
+void
Re: Not deleted mentions of the cleared path
On Mon, Oct 30, 2023 at 5:01 PM Alena Rybakina wrote: > > Hi, hackers! > > I have already written about the problem of InvalidPath [0] appearing. I > investigated this and found an error in the add_path() function, when we > reject a path, we free up the memory of the path, but do not delete various > mentions of it (for example, in the ancestor of relation, as in the example > below). > > Thus, we may face the problem of accessing the freed memory. > > I demonstrated this below using gdb when I call a query after running a test > in test/regression/sql/create_misc.sql: > > Query: > > :-- That ALTER TABLE should have added TOAST tables. > > SELECT relname, reltoastrelid <> 0 AS has_toast_table >FROM pg_class >WHERE oid::regclass IN ('a_star', 'c_star') >ORDER BY 1; > > --UPDATE b_star* > -- SET a = text 'gazpacho' > -- WHERE aa > 4; > > SELECT class, aa, a FROM a_star*; > > > gdb: > > 0x7ff3f7325fda in epoll_wait (epfd=5, events=0x55bf9ee75c38, maxevents=1, > timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 > 30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory. > (gdb) b /home/alena/postgrespro_3/src/backend/optimizer/util/pathnode.c:620 > Breakpoint 1 at 0x55bf9cfe4c65: file pathnode.c, line 621. > (gdb) c > Continuing. > > Breakpoint 1, add_path (parent_rel=0x55bf9ef7f5c0, new_path=0x55bf9ef7f4e0) > at pathnode.c:621 > 621 if (!IsA(new_path, IndexPath)) > (gdb) n > 622 pfree(new_path); > (gdb) n > 624 } > (gdb) p *new_path > $1 = {type = T_Invalid, pathtype = T_Invalid, parent = 0x7f7f7f7f7f7f7f7f, > pathtarget = 0x7f7f7f7f7f7f7f7f, > param_info = 0x7f7f7f7f7f7f7f7f, parallel_aware = 127, parallel_safe = 127, > parallel_workers = 2139062143, > rows = 1.3824172084878715e+306, startup_cost = 1.3824172084878715e+306, > total_cost = 1.3824172084878715e+306, > pathkeys = 0x7f7f7f7f7f7f7f7f} > (gdb) p new_path > $2 = (Path *) 0x55bf9ef7f4e0 At this point the new_path has not been added to the parent_rel. 
We do not set the cheapest* paths while paths are being added. The stack trace will give you an idea where this is happening. > > (gdb) p ((ProjectionPath *)((SortPath*)parent_rel->pathlist->elements > [0].ptr_value)->subpath)->path->parent->cheapest_startup_path > $20 = (struct Path *) 0x55bf9ef7f4e0 This looks familiar though. There was some nearby thread where Tom Lane, if my memory serves well, provided a case where a path from lower rel was added to an upper rel without copying or changing its parent. This very much looks like that case. -- Best Wishes, Ashutosh Bapat
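For reference, the odd values in the gdb dump above are exactly what CLOBBER_FREED_MEMORY produces: assert-enabled PostgreSQL builds overwrite freed memory with 0x7f bytes, and eight 0x7f bytes reinterpreted as an IEEE-754 double give precisely the rows/startup_cost/total_cost value gdb printed. A standalone check (Python used purely for illustration, not PostgreSQL code):

```python
import struct

# CLOBBER_FREED_MEMORY fills freed chunks with 0x7f bytes; interpret
# eight such bytes as a little-endian IEEE-754 double, as gdb did for
# the Path's rows/startup_cost/total_cost fields.
clobbered = struct.unpack("<d", b"\x7f" * 8)[0]

# Matches "rows = 1.3824172084878715e+306" in the gdb output above,
# confirming the Path was read after being pfree'd.
assert clobbered == 1.3824172084878715e+306
```

The 0x7f7f7f7f7f7f7f7f pointer values for parent/pathtarget/pathkeys in the same dump are the pointer-sized view of the same clobber pattern.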
Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}
On Mon, Oct 30, 2023 at 10:45:05AM +0900, Michael Paquier wrote: > On Fri, Oct 27, 2023 at 04:58:20PM +0300, Nazir Bilal Yavuz wrote: > > I think switching it to 'shared' makes sense. That shouldn't confuse > > existing monitoring queries much as the numbers won't change, right? > > Also, if we keep 'shared/local' there could be similar complaints to > > this thread in the future; so, at least adding comments can be > > helpful. > > The problem is that it may impact existing tools that do explain > output deparsing. One of them is https://explain.depesz.com/ that > Hubert Depesz Lubaczewski has implemented, and it would be sad to > break anything related to it. > > I am adding Hubert in CC for comments about changing this > "shared/local" to "shared" on a released branch. Knowing that > "shared" and "local" will need to be handled as separate terms in 17~ > anyway, perhaps that's not a big deal, but let's be sure. Hi, some things will definitely break, but that's 100% OK. The change seems needed, and I can update my parser to deal with it :) Best regards, depesz
Re: Explicitly skip TAP tests under Meson if disabled
Aleksander Alekseev writes: > Personally I like the change. It makes the output more explicit. In my > use cases not running TAP tests typically is not something I want . So > I would appreciate being warned with a long list of bright yellow > "SKIP" messages. If I really want to skip TAP tests these messages are > just informative and don't bother me. +1 for counting such tests as "skipped" in the summary. -1 for emitting a message per skipped test. If I'm intentionally not running those tests, that would be very annoying noise, and potentially would obscure messages I actually need to see. (And about -100 for emitting such messages in yellow. Doesn't anybody who codes this stuff have a clue about vision problems?) regards, tom lane
Re: POC, WIP: OR-clause support for indexes
On Mon, Oct 30, 2023 at 3:40 PM Robert Haas wrote: > On Thu, Oct 26, 2023 at 5:05 PM Peter Geoghegan wrote: > > On Thu, Oct 26, 2023 at 12:59 PM Robert Haas wrote: > > > Alexander's example seems to show that it's not that simple. If I'm > > > reading his example correctly, with things like aid = 1, the > > > transformation usually wins even if the number of things in the OR > > > expression is large, but with things like aid + 1 * bid = 1, the > > > transformation seems to lose at least with larger numbers of items. So > > > it's not JUST the number of OR elements but also what they contain, > > > unless I'm misunderstanding his point. > > > > Alexander said "Generally, I don't see why ANY could be executed > > slower than the equivalent OR clause". I understood that this was his > > way of expressing the following idea: > > > > "In principle, there is no reason to expect execution of ANY() to be > > slower than execution of an equivalent OR clause (except for > > noise-level differences). While it might not actually look that way > > for every single type of plan you can imagine right now, that doesn't > > argue for making a cost-based decision. It actually argues for fixing > > the underlying issue, which can't possibly be due to some kind of > > fundamental advantage enjoyed by expression evaluation with ORs". > > > > This is also what I think of all this. > > I agree with that, with some caveats, mainly that the reverse is to > some extent also true. Maybe not completely, because arguably the > ANY() formulation should just be straight-up easier to deal with, but > in principle, the two are equivalent and it shouldn't matter which > representation we pick. > > But practically, it may, and we need to be sure that we don't put in > place a translation that is theoretically a win but in practice leads > to large regressions. Avoiding regressions here is more important than > capturing all the possible gains. 
A patch that wins in some scenarios > and does nothing in others can be committed; a patch that wins in even > more scenarios but causes serious regressions in some cases probably > can't. +1 Sure, I've identified two cases where patch shows regression [1]. The first one (quadratic complexity of expression processing) should be already addressed by usage of hash. The second one (planning regression with Bitmap OR) is not yet addressed. Links 1. https://www.postgresql.org/message-id/CAPpHfduJtO0s9E%3DSHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g%40mail.gmail.com -- Regards, Alexander Korotkov
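The quadratic behaviour referred to above arises when each OR arm is matched against the already-collected groups by a linear scan; keying the lookup by the common left-hand expression makes grouping linear, which is the essence of the hash-based fix. A toy sketch of the two approaches (names and data structures are illustrative only, not the patch's actual code):

```python
# Group OR arms of the form "<expr> = <const>" by their left-hand
# expression, so "aid = 1 OR aid = 2 OR ..." can become "aid = ANY(...)".

def group_quadratic(arms):
    """Naive version: rescan the group list for every arm -> O(n^2)."""
    groups = []                      # list of (expr, [consts])
    for expr, const in arms:
        for g in groups:             # linear scan per arm
            if g[0] == expr:
                g[1].append(const)
                break
        else:
            groups.append((expr, [const]))
    return groups

def group_hashed(arms):
    """Hashed version: dict keyed by the expression -> O(n)."""
    groups = {}                      # expr -> [consts]
    for expr, const in arms:
        groups.setdefault(expr, []).append(const)
    return groups

arms = [("aid", i) for i in range(5)] + [("bid", 7)]
assert group_hashed(arms) == {"aid": [0, 1, 2, 3, 4], "bid": [7]}
```

With thousands of OR arms, the per-arm rescan in the naive version dominates planning time, which matches the first regression described in the linked message.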
Re: Why is DEFAULT_FDW_TUPLE_COST so insanely low?
On Mon, Oct 30, 2023 at 02:22:08PM +1300, David Rowley wrote: > If I change STD_FUZZ_FACTOR to something like 1.001 then the plans > no longer change when I do: > > alter server loopback options (add fdw_tuple_cost '0.01'); > > alter server loopback options (drop fdw_tuple_cost); > > > Ordinarily, I'd not care too much about that, but I did test the > performance of one of the plans and the new plan came out slower than > the old one. > > I'm not exactly sure how best to proceed on this in the absence of any > feedback. I think you just go and change it. Your number is better than what we have, and if someone wants to suggest a better number, we can change it later. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
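For context on why STD_FUZZ_FACTOR enters the picture: add_path treats two paths' costs as equal unless they differ by more than a fuzz factor, so an fdw_tuple_cost change that moves an estimate by less than that margin cannot flip the plan choice. A simplified sketch of such a fuzzy comparison (the 1% fuzz mirrors PostgreSQL's STD_FUZZ_FACTOR of 1.01, but the real compare_path_costs_fuzzily also weighs startup versus total cost; this is an illustration, not the actual implementation):

```python
STD_FUZZ_FACTOR = 1.01  # costs within 1% are treated as equal

def compare_path_costs_fuzzily(cost1, cost2, fuzz=STD_FUZZ_FACTOR):
    """Return +1/-1/0 depending on which cost is fuzzily larger."""
    if cost1 > cost2 * fuzz:
        return 1     # path 1 fuzzily more expensive
    if cost2 > cost1 * fuzz:
        return -1    # path 2 fuzzily more expensive
    return 0         # fuzzily the same

# A 0.5% cost difference is invisible to the comparison:
assert compare_path_costs_fuzzily(100.0, 100.5) == 0
# A 2.5% difference is not:
assert compare_path_costs_fuzzily(100.0, 102.5) == -1
```

This is why shrinking the fuzz factor in David's experiment made the plans stop changing when fdw_tuple_cost was toggled: the cost delta fell inside the fuzz band with the stock factor.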
Re: COPY TO (FREEZE)?
On Mon, Oct 30, 2023 at 03:16:58PM +0900, Kyotaro Horiguchi wrote: > At Sun, 29 Oct 2023 15:35:02 -0400, Bruce Momjian wrote in > > You are 100% correct. Updated patch attached. > > - errmsg("COPY force not null only available > using COPY FROM"))); > + errmsg("COPY force not null cannot be used > with COPY TO"))); > > I find the term "force not null" hard to translate, especially into > Japaese, as its literal translation doesn't align with the entire > message. The most recent translation for it is the literal rendition > of "FORCE_NOT_NULL option of COPY can only be used with COPY FROM". > > In short, for translation convenience, I would prefer if "force not > null" were "FORCE_NOT_NULL". That is a good point. I reviewed more of the messages and added capitalization where appropriate, patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index d12ba96497..18ecc69c33 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -224,6 +224,7 @@ COPY { table_name [ ( COPY FREEZE on a partitioned table. + This option is only allowed in COPY FROM. 
Note that all other sessions will immediately be able to see the data diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index c5d7d78645..145315debe 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -671,7 +671,7 @@ ProcessCopyOptions(ParseState *pstate, if (!opts_out->csv_mode && opts_out->quote != NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY quote available only in CSV mode"))); + errmsg("COPY QUOTE available only in CSV mode"))); if (opts_out->csv_mode && strlen(opts_out->quote) != 1) ereport(ERROR, @@ -687,7 +687,7 @@ ProcessCopyOptions(ParseState *pstate, if (!opts_out->csv_mode && opts_out->escape != NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY escape available only in CSV mode"))); + errmsg("COPY ESCAPE available only in CSV mode"))); if (opts_out->csv_mode && strlen(opts_out->escape) != 1) ereport(ERROR, @@ -698,46 +698,52 @@ ProcessCopyOptions(ParseState *pstate, if (!opts_out->csv_mode && (opts_out->force_quote || opts_out->force_quote_all)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force quote available only in CSV mode"))); + errmsg("COPY FORCE_QUOTE available only in CSV mode"))); if ((opts_out->force_quote || opts_out->force_quote_all) && is_from) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force quote only available using COPY TO"))); + errmsg("COPY FORCE_QUOTE cannot be used with COPY FROM"))); /* Check force_notnull */ if (!opts_out->csv_mode && opts_out->force_notnull != NIL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force not null available only in CSV mode"))); + errmsg("COPY FORCE_NOT_NULL available only in CSV mode"))); if (opts_out->force_notnull != NIL && !is_from) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force not null only available using COPY FROM"))); +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("COPY 
FORCE_NOT_NULL cannot be used with COPY TO"))); /* Check force_null */ if (!opts_out->csv_mode && opts_out->force_null != NIL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force null available only in CSV mode"))); + errmsg("COPY FORCE_NULL available only in CSV mode"))); if (opts_out->force_null != NIL && !is_from) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY force null only available using COPY FROM"))); +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("COPY FORCE_NULL cannot be used with COPY TO"))); /* Don't allow the delimiter to appear in the null string. */ if (strchr(opts_out->null_print, opts_out->delim[0]) != NULL) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("COPY delimiter must not appear in the NULL specification"))); +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("COPY delimiter character must not appear in the NULL specification"))); /* Don't allow the CSV quote char to appear in the null string. */ if (opts_out->csv_mode && strchr(opts_out->null_print, opts_out->quote[0]) != NULL) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("CSV quote character must not appear in the NULL specification"))); + /* Check freeze */ + if (opts_out->freeze && !is_from) + ereport(ERROR, +(errcode(ERRCODE_INVALID_PARAMETER_VALUE), +
Including Doxyfile and Meson script for docs into main source tree
Hello, Stefan!

What do you think about this? See the quoted message from John Morris below.

P.S. We would also like you to provide the currently used Doxyfile, as the Postgres Doxygen home page no longer has a link to download that file.

Regards,
Bohdan.

On 14.10.23 21:54, John Morris wrote:

Thank you for trying the patch out and commenting on it. I'm starting to think of it as a project. Here's a quick project statement: the purpose is to generate improved Doxygen output while making maximal use of how Postgres currently does program comments.

Thinking in terms of specific steps, and adding to what you have mentioned:

 * Configure the Doxyfile so it produces output similar to the previous output.
 * Only build Doxygen output if requested.
 * Don't compile the filter or configure the Doxyfile if they aren't needed.
 * Include contrib in the sources to document.
 * Provide options for other (non-html) output. (Which ones?)
 * Update the Postgres documentation to include instructions for creating Doxygen output.
 * Mention it in the developer guidelines and provide sample code showing a "desired" commenting style.

Does that list seem complete? I don't want people to think we're imposing new standards or legislating new commenting styles. It's more a matter of describing what we currently do, maybe with some minor suggestions for improvement.

 * John Morris
Re: Explicitly skip TAP tests under Meson if disabled
Hi, > Under Meson, it is not very easy to see if TAP tests have been enabled > or disabled, if you rely on the default auto setting. You either need > to carefully study the meson setup output, or you notice, what a minute, > didn't there use to be like 250 tests, not only 80? > > I think it would be better if we still registered the TAP tests in Meson > even if the tap_tests option is disabled, but with a dummy command that > registers them as skipped. That way you get a more informative output like > > Ok: 78 > Expected Fail: 0 > Fail: 0 > Unexpected Pass:0 > Skipped:187 > Timeout:0 > > which is really a more accurate representation of what the test run > actually accomplished than "everything Ok". > > See attached patch for a possible implementation. (This uses perl as a > hard build requirement. We are planning to do that anyway, but > obviously other implementations, such as using python, would also be > possible.) I tested the patch and it works as intended. Personally I like the change. It makes the output more explicit. In my use cases not running TAP tests typically is not something I want . So I would appreciate being warned with a long list of bright yellow "SKIP" messages. If I really want to skip TAP tests these messages are just informative and don't bother me. +1 -- Best regards, Aleksander Alekseev
Re: Is this a problem in GenericXLogFinish()?
On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila wrote: > > Hmm. So my question is: do we need the cleanup lock on the write > > buffer even if there are no tuples, and even if primary bucket and the > > write bucket are the same? > > Yes, we need it to exclude any concurrent in-progress scans that could > return incorrect tuples during bucket squeeze operation. Amit, thanks for weighing in, but I'm not convinced. I thought we only ever used a cleanup lock on the main bucket page to guard against concurrent scans. Here you seem to be saying that we need a cleanup lock on some page that may be an overflow page somewhere in the middle of the chain, and that doesn't seem right to me. So ... are you sure? If yes, can you provide any more detailed justification? -- Robert Haas EDB: http://www.enterprisedb.com
Re: POC, WIP: OR-clause support for indexes
On Thu, Oct 26, 2023 at 5:05 PM Peter Geoghegan wrote: > On Thu, Oct 26, 2023 at 12:59 PM Robert Haas wrote: > > Alexander's example seems to show that it's not that simple. If I'm > > reading his example correctly, with things like aid = 1, the > > transformation usually wins even if the number of things in the OR > > expression is large, but with things like aid + 1 * bid = 1, the > > transformation seems to lose at least with larger numbers of items. So > > it's not JUST the number of OR elements but also what they contain, > > unless I'm misunderstanding his point. > > Alexander said "Generally, I don't see why ANY could be executed > slower than the equivalent OR clause". I understood that this was his > way of expressing the following idea: > > "In principle, there is no reason to expect execution of ANY() to be > slower than execution of an equivalent OR clause (except for > noise-level differences). While it might not actually look that way > for every single type of plan you can imagine right now, that doesn't > argue for making a cost-based decision. It actually argues for fixing > the underlying issue, which can't possibly be due to some kind of > fundamental advantage enjoyed by expression evaluation with ORs". > > This is also what I think of all this. I agree with that, with some caveats, mainly that the reverse is to some extent also true. Maybe not completely, because arguably the ANY() formulation should just be straight-up easier to deal with, but in principle, the two are equivalent and it shouldn't matter which representation we pick. But practically, it may, and we need to be sure that we don't put in place a translation that is theoretically a win but in practice leads to large regressions. Avoiding regressions here is more important than capturing all the possible gains. 
A patch that wins in some scenarios and does nothing in others can be committed; a patch that wins in even more scenarios but causes serious regressions in some cases probably can't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: DRAFT GIST support for ORDER BY
> On 30 Oct 2023, at 13:31, Matthias van de Meent wrote:
>
>> The solution is not ideal as it requires registering “<“ and “>” operators
>> as ordering operators in opfamily
>> (which in turn makes it possible to issue somewhat meaningless “ORDER BY a <
>> ‘constant’)
>
> I don't quite understand why we need to register new "<" and ">"
> operators. Can't we update the current ones?

I wasn’t precise: what is needed is adding pg_amop entries with amoppurpose = ‘o’ for the existing “<" and “>" operators.

>> The problem is though that right now handling of ORDER BY column clauses is
>> tightly coupled to BTree.
>> It would be good to refactor the code so that semantics of ORDER BY column
>> could be more flexible.
>
> The existence of a BTREE operator class for the type is the indicator
> that (and how) the type can be ordered - that is where PostgreSQL gets
> its methods for ordering most types. Although I agree that it's a
> quirk, I don't mind it that much as an indicator of how a type is
> ordered.
> I do agree, though, that operator classes by themselves should be able
> to say "hey, we support full ordered retrieval as well". Right now,
> that seems to be limited to btrees, but indeed a GiST index with
> btree_gist columns should be able to support the same.

Right now opfamily and strategy are set in PathKey before creating index scan paths. The patch actually copies existing code from create_indexscan_plan that finds an operator OID for (pk_opfamily, pk_strategy). The operator is supposed to be binary with specific operand types. To create a path:

1) do the operator OID lookup as above
2) look up the sortfamily of the pg_amop entry for (operator OID, index opfamily)

If the sort family is the same as pk_opfamily we can create a path. The side effect is that it is possible to “ORDER BY column < ‘constant’” as we have more ordering operators in pg_amop.
Ideally we could look up a _unary_ operator in pg_amop instead - that would make sense, as we are actually measuring some “absolute distance”. But this would require more changes - createplan.c would need to decide when to look up a unary and when a binary operator.

>> It would be great if someone could take a look at it.
>
> I've not looked in detail at the patch, but here's some comments:
>
>> --- a/contrib/btree_gist/btree_gist--1.6--1.7.sql
>> +++ b/contrib/btree_gist/btree_gist--1.6--1.7.sql
>
> You seem to be modifying an existing migration of a released version
> of the btree_gist extension. I suggest you instead add a migration
> from 1.7 to a new version 1.8, and update the control file's default
> installed version.

Thanks. I didn’t know how to register a new migration so did it that way. Will try to fix that.

>> ORDER BY a == ORDER BY a <-> MIN_VALUE
>> and
>> ORDER BY a DESC == ORDER BY a <-> MAX_VALUE
>>
>> This allows implementing GIST ordered scans for btree_gist datatypes.
>>
>> This in turn makes using GIST with partitioning feasible (I have described
>> issues with such usage in my previous e-mails - see below).
>
> Did you take into account that GiST's internal distance function uses
> floating point, and is thus only an approximation for values that
> require more than 2^54 significant bits in their distance function?
> For example, GiST wouldn't be guaranteed to yield correct ordering of
> int8/bigint when you use `my_column <-> UINT64_MAX` because as far as
> the floating point math is concerned, 0 is about as far away from
> INT64_MAX as (say) 20 and -21.

Hmm… Good point, but it means ORDER BY <-> is broken for these types then? The patch assumes it works correctly and just uses it for ordered scans.

— Michal
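The precision hazard Matthias raises is easy to demonstrate outside PostgreSQL: an IEEE-754 double carries 53 significant bits, so bigint values near INT64_MAX collapse to the same double, and a float-based distance cannot order them (a standalone illustration, not GiST code):

```python
INT64_MAX = 2**63 - 1

# Near 2^63, adjacent IEEE-754 doubles are 2^11 = 2048 apart, so many
# distinct bigint values collapse to the same double:
assert float(INT64_MAX) == float(INT64_MAX - 20)

# Consequently, distances computed in double precision cannot tell such
# values apart when ordering by `col <-> INT64_MAX`:
d0 = float(INT64_MAX) - float(0)
d20 = float(INT64_MAX) - float(20)
dm21 = float(INT64_MAX) - float(-21)
assert d0 == d20 == dm21   # 0, 20 and -21 all look equally far away
```

This reproduces the "0 is about as far away from INT64_MAX as 20 and -21" point exactly, so any ordered scan relying on the float distance near the extremes of int8's range can tie or misorder results.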
Re: Parallel query behaving different with custom GUCs
On Thu, Oct 26, 2023 at 3:10 AM Rushabh Lathia wrote: > -- RESET the custom GUC and rerun the query. > postgres=> reset myapp.blah; > RESET > > -- Query should still run, but with forcing parallel plan, throwing an error. > postgres=> select count(*) from ptest2 where current_setting('myapp.blah') is > null; > ERROR: unrecognized configuration parameter "myapp.blah" > CONTEXT: parallel worker > > -- Disable the parallel plan and query just runs fine. > postgres=#set max_parallel_workers_per_gather to 0; > SET > postgres=#select count(*) from ptest2 where current_setting('myapp.blah') is > null; > count > --- > 0 > (1 row) > > We might need another placeholder for the custom GUCs. Currently, we are > maintaining 3 linked lists in guc.c - guc_nondef_list, guc_stack_list, > guc_report_list and to fix the above issue either we need a 4th list or do > changes in the existing list. I discussed this a bit with Rushabh off-list before he posted, and was hoping someone else would reply, but since no one has: Formally, I think this is a bug. However, the practical impact of it is fairly low, because you have to be using custom GUCs in your query and you have to RESET them instead of using SET to put them back to the default value, which I'm guessing is something that not a lot of people do. I'm a bit concerned that adding the necessary tracking could be expensive, and I'm not sure we want to slow down things in normal cases to cater to this somewhat strange case. On the other hand, maybe we can fix it without too much cost, in which case that would be good to do. I'm also alert to my own possible bias. Perhaps since I designed this mechanism, I'm more prone to viewing its deficiencies as minor than a neutral observer would be. So if anyone is sitting there reading this and thinking "wow, I can't believe Robert doesn't think it's important to fix this," feel free to write back and say so. -- Robert Haas EDB: http://www.enterprisedb.com
Re: "38.10.10. Shared Memory and LWLocks" may require a clarification
Hi, > PFA patch v4. I didn't like the commit message. Here is the corrected patch. Sorry for the noise. -- Best regards, Aleksander Alekseev v5-0001-Clarify-the-38.10.10.-Shared-Memory-and-LWLocks-s.patch Description: Binary data
Re: "38.10.10. Shared Memory and LWLocks" may require a clarification
Michael, On Fri, Oct 27, 2023 at 9:52 AM Michael Paquier wrote: > > On Tue, May 23, 2023 at 01:47:52PM +0300, Aleksander Alekseev wrote: > > That's me still talking to myself :) > > Let's be two then. Many thanks for your feedback. > + > +It is convenient to use shmem_startup_hook which > allows > +placing all the code responsible for initializing shared memory in one > place. > +When using shmem_startup_hook the extension still > needs > +to acquire AddinShmemInitLock in order to work > properly > +on all the supported platforms including Windows. > > Yeah, AddinShmemInitLock is useful because extensions have no base > point outside that, and they may want to update their local variables. > Still, this is not completely exact because EXEC_BACKEND on > non-Windows platform would still need it, so this should be mentioned. > Another thing is that extensions may do like autoprewarm.c, where > the shmem area is not initialized in the startup shmem hook. This is > a bit cheating because this is a scenario where shmem_request_hook is > not requested, so shmem needs overallocation, but you'd also need a > LWLock in this case, even for non-WIN32. Got it. Let's simply remove the "including Windows" part then. > > +on all the supported platforms including Windows. This is also the reason > +why the return value of GetNamedLWLockTranche is > +conventionally stored in shared memory instead of local process memory. > + > > Not sure to follow this sentence, the result of GetNamedLWLockTranche > is the lock, so this sentence does not seem really necessary? To be honest, by now I don't remember what was meant here, so I removed the sentence. > While we're on it, why not improving this part of the documentation more > modern? We don't mention LWLockNewTrancheId() and > LWLockRegisterTranche() at all, so do you think that it would be worth > adding a sample of code with that, mentioning autoprewarm.c as > example? Agree, these functions deserve to be mentioned in this section. 
PFA patch v4. -- Best regards, Aleksander Alekseev v4-0001-Clarify-the-38.10.10.-Shared-Memory-and-LWLocks-s.patch Description: Binary data
Re: DRAFT GIST support for ORDER BY
On Mon, 30 Oct 2023 at 09:04, Michał Kłeczek wrote: > > Hi All, > > Attached is a first attempt to implement GIST index (only) scans for ORDER BY > column clauses. Cool! > The solution is not ideal as it requires registering “<“ and “>” operators as > ordering operators in opfamily > (which in turn makes it possible to issue somewhat meaningless “ORDER BY a < > ‘constant’) I don't quite understand why we need to register new "<" and ">" operators. Can't we update the current ones? > The problem is though that right now handling of ORDER BY column clauses is > tightly coupled to BTree. > It would be good to refactor the code so that semantics of ORDER BY column > could be more flexible. The existence of a BTREE operator class for the type is the indicator that (and how) the type can be ordered - that is where PostgreSQL gets its methods for ordering most types. Although I agree that it's a quirk, I don't mind it that much as an indicator of how a type is ordered. I do agree, though, that operator classes by themselves should be able to say "hey, we support full ordered retrieval as well". Right now, that seems to be limited to btrees, but indeed a GiST index with btree_gist columns should be able to support the same. > It would be great if someone could take a look at it. I've not looked in detail at the patch, but here are some comments: > --- a/contrib/btree_gist/btree_gist--1.6--1.7.sql > +++ b/contrib/btree_gist/btree_gist--1.6--1.7.sql You seem to be modifying an existing migration of a released version of the btree_gist extension. I suggest you instead add a migration from 1.7 to a new version 1.8, and update the control file's default installed version. > ORDER BY a == ORDER BY a <-> MIN_VALUE > and > ORDER BY a DESC == ORDER BY a <-> MAX_VALUE > > This allows implementing GIST ordered scans for btree_gist datatypes. > > This in turn makes using GIST with partitioning feasible (I have described > issues with such usage in my previous e-mails - see below). 
Did you take into account that GiST's internal distance function uses floating point, and is thus only an approximation for values that require more than 53 significant bits in their distance function? For example, GiST wouldn't be guaranteed to yield correct ordering of int8/bigint when you use `my_column <-> UINT64_MAX` because as far as the floating point math is concerned, 0 is about as far away from INT64_MAX as (say) 20 and -21. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: pg_resetwal tests, logging, and docs update
Hi, > Here are updated versions of the remaining patches. I took out the > "FIXME" notes about the multipliers applying to the -c option and > replaced them by gentler comments. I don't intend to resolve those > issues here. The patch LGTM. However, postgresql:pg_resetwal test suite doesn't pass on Windows according to cfbot. Seems to be a matter of picking a more generic regular expression: ``` at C:/cirrus/src/bin/pg_resetwal/t/001_basic.pl line 54. 'pg_resetwal: error: could not change directory to "foo": No such file or directory doesn't match '(?^:error: could not read permissions of directory)' ``` Should we simply use something like: ``` qr/error: could not (read|change).* directory/ ``` ... instead? -- Best regards, Aleksander Alekseev
Re: Open a streamed block for transactional messages during decoding
On Mon, Oct 30, 2023 at 2:17 PM Zhijie Hou (Fujitsu) wrote: > > On Monday, October 30, 2023 12:20 PM Amit Kapila > wrote: > > > > On Thu, Oct 26, 2023 at 2:01 PM Zhijie Hou (Fujitsu) > > > > wrote: > > > > > > On Thursday, October 26, 2023 12:42 PM Amit Kapila > > wrote: > > > > > > > > On Tue, Oct 24, 2023 at 5:27 PM Zhijie Hou (Fujitsu) > > > > > > > > wrote: > > > > > > > > > > While reviewing the test_decoding code, I noticed that when > > > > > skip_empty_xacts option is specified, it doesn't open the > > > > > streaming > > > > block( e.g. > > > > > pg_output_stream_start) before streaming the transactional MESSAGE > > > > > even if it's the first change in a streaming block. > > > > > > > > > > It looks inconsistent with what we do when streaming DML changes(e.g. > > > > > pg_decode_stream_change()). > > > > > > > > > > Here is a small patch to open the stream block in this case. > > > > > > > > > > > > > The change looks good to me though I haven't tested it yet. BTW, can > > > > we change the comment: "Output stream start if we haven't yet, but > > > > only for the transactional case." to "Output stream start if we > > > > haven't yet for transactional messages"? > > > > > > Thanks for the review and I changed this as suggested. 
> > > > > > > --- a/contrib/test_decoding/expected/stream.out > > +++ b/contrib/test_decoding/expected/stream.out > > @@ -29,7 +29,10 @@ COMMIT; > > SELECT data FROM pg_logical_slot_get_changes('regression_slot', > > NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', > > '1'); > > data > > -- > > + opening a streamed block for transaction > > streaming message: transactional: 1 prefix: test, sz: 50 > > + closing a streamed block for transaction aborting streamed > > + (sub)transaction > > > > I was analyzing the reason for the additional message: "aborting streamed > > (sub)transaction" in the above test and it seems to be due to the below > > check in > > the function pg_decode_stream_abort(): > > > > if (data->skip_empty_xacts && !xact_wrote_changes) return; > > > > Before the patch, we won't be setting the 'xact_wrote_changes' flag in > > txndata > > which is fixed now. So, this looks okay to me. However, I have another > > observation in this code which is that for aborts or subtransactions, we > > are not > > checking the flag 'stream_wrote_changes', so we may end up emitting the > > abort message even when no actual change has been streamed. I haven't tried > > to generate a test to verify this observation, so I could be wrong as well > > but it is > > worth analyzing such cases. > > I have confirmed that the mentioned case is possible(steps[1]): the > sub-transaction doesn't output any data, but the stream abort for this > sub-transaction will still be sent. > > But I think this may not be a problemic behavior, as even the pgoutput can > behave similarly, e.g. If all the changes are filtered by row filter or table > filter, then the stream abort will still be sent. The subscriber will skip > handling the STREAM ABORT if the aborted txn was not applied. 
> > And if we want to fix this, in output plugin, we need to record if we have > sent > any changes for each sub-transaction so that we can decide whether to send the > following stream abort or not. We cannot use 'stream_wrote_changes' because > it's a per streamed block flag and there could be serval streamed blocks for > one > sub-txn. It looks a bit complicate to me. > I agree with your analysis. So, pushed the existing patch. BTW, sorry, by mistake I used Peter's name as author. -- With Regards, Amit Kapila.
Not deleted mentions of the cleared path
Hi, hackers! I have already written about the problem of InvalidPath [0] appearing. I investigated this and found an error in the add_path() function: when we reject a path, we free the path's memory, but do not delete the various mentions of it (for example, in the parent relation, as in the example below). Thus, we may face the problem of accessing freed memory. I demonstrate this below using gdb, running a query after the test in test/regression/sql/create_misc.sql:

Query:

-- That ALTER TABLE should have added TOAST tables.
SELECT relname, reltoastrelid <> 0 AS has_toast_table
FROM pg_class
WHERE oid::regclass IN ('a_star', 'c_star')
ORDER BY 1;
--UPDATE b_star*
-- SET a = text 'gazpacho'
-- WHERE aa > 4;
SELECT class, aa, a FROM a_star*;

gdb:

0x7ff3f7325fda in epoll_wait (epfd=5, events=0x55bf9ee75c38, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) b /home/alena/postgrespro_3/src/backend/optimizer/util/pathnode.c:620
Breakpoint 1 at 0x55bf9cfe4c65: file pathnode.c, line 621.
(gdb) c
Continuing. 
Breakpoint 1, add_path (parent_rel=0x55bf9ef7f5c0, new_path=0x55bf9ef7f4e0) at pathnode.c:621 621 if (!IsA(new_path, IndexPath)) (gdb) n 622 pfree(new_path); (gdb) n 624 } (gdb) p *new_path $1 = {type = T_Invalid, pathtype = T_Invalid, parent = 0x7f7f7f7f7f7f7f7f, pathtarget = 0x7f7f7f7f7f7f7f7f, param_info = 0x7f7f7f7f7f7f7f7f, parallel_aware = 127, parallel_safe = 127, parallel_workers = 2139062143, rows = 1.3824172084878715e+306, startup_cost = 1.3824172084878715e+306, total_cost = 1.3824172084878715e+306, pathkeys = 0x7f7f7f7f7f7f7f7f} *(gdb) p new_path $2 = (Path *) 0x55bf9ef7f4e0* (gdb) p ((ProjectionPath *)((SortPath*)parent_rel->pathlist->elements [0].ptr_value)->subpath)->path->parent->cheapest_startup_path *$20 = (struct Path *) 0x55bf9ef7f4e0* (gdb) p *((ProjectionPath *)((SortPath*)parent_rel->pathlist->elements [0].ptr_value)->subpath)->path->parent->cheapest_startup_path $17 = {type = T_Invalid, pathtype = T_Invalid, parent = 0x7f7f7f7f7f7f7f7f, pathtarget = 0x7f7f7f7f7f7f7f7f, param_info = 0x7f7f7f7f7f7f7f7f, parallel_aware = 127, parallel_safe = 127, parallel_workers = 2139062143, rows = 1.3824172084878715e+306, startup_cost = 1.3824172084878715e+306, total_cost = 1.3824172084878715e+306, pathkeys = 0x7f7f7f7f7f7f7f7f} (gdb) p (Path*)(((ProjectionPath *)((SortPath*)parent_rel->pathlist->elements [0].ptr_value)->subpath)->path->parent->pathlist->elements[1].ptr_value) *$21 = (Path *) 0x55bf9ef7f4e0* (gdb) p *(Path*)(((ProjectionPath *)((SortPath*)parent_rel->pathlist->elements [0].ptr_value)->subpath)->path->parent->pathlist->elements[1].ptr_value) $19 = {type = T_Invalid, pathtype = T_Invalid, parent = 0x7f7f7f7f7f7f7f7f, pathtarget = 0x7f7f7f7f7f7f7f7f, param_info = 0x7f7f7f7f7f7f7f7f, parallel_aware = 127, parallel_safe = 127, parallel_workers = 2139062143, rows = 1.3824172084878715e+306, startup_cost = 1.3824172084878715e+306, total_cost = 1.3824172084878715e+306, pathkeys = 0x7f7f7f7f7f7f7f7f} (gdb) The same problem may be in the 
add_partial_path() function. Unfortunately, I have not yet been able to find a problematic query for the described case, but I have prepared a patch to fix this problem. What do you think?

0. https://www.postgresql.org/message-id/flat/01aa76cc-93e1-4de9-ab34-b3a890bf7...@postgrespro.ru

-- Regards, Alena Rybakina Postgres Professional

From b9483800eb9d8dc9bafb114c602c63d5cf596ee7 Mon Sep 17 00:00:00 2001
From: Alena Rybakina
Date: Mon, 30 Oct 2023 14:22:21 +0300
Subject: [PATCH] Fix rejection of a released path: we shouldn't only free the
 path, we should also delete all mentions of it, otherwise we will face
 accesses to released memory.

---
 src/backend/optimizer/util/pathnode.c | 26 ++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0b1d17b9d33..4df9d34c9f1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -619,7 +619,19 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 	{
 		/* Reject and recycle the new path */
 		if (!IsA(new_path, IndexPath))
-			pfree(new_path);
+		{
+			if (new_path == new_path->parent->cheapest_startup_path)
+				new_path->parent->cheapest_startup_path = NULL;
+			if (new_path == new_path->parent->cheapest_total_path)
+				new_path->parent->cheapest_total_path = NULL;
+			foreach(p1, new_path->parent->pathlist)
+			{
+				Path *path = (Path *) lfirst(p1);
+
+				if (path == new_path)
+					new_path->parent->pathlist = foreach_delete_current(new_path->parent->pathlist, p1);
+			}
+		}
 	}
 }
@@ -849,7 +861,17 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	else
 	{
 		/* Reject and recycle the new path */
-		pfree(new_path);
+
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
On Fri, Oct 27, 2023 at 3:23 AM David Rowley wrote: > > On Thu, 26 Oct 2023 at 17:00, David Rowley wrote: > > Thanks for looking at this again. I fixed up each of those and pushed > > the result, mentioning the incompatibility in the commit message. > > > > Now that that's done, I've attached a patch which makes use of the new > > initReadOnlyStringInfo initializer function for the original case > > mentioned when I opened this thread. I don't think there are any > > remaining objections to this, but I'll let it sit for a bit to see. > > I've just pushed the deserial function optimisation patch. > > I was just looking at a few other places where we might want to make > use of initReadOnlyStringInfo. > > * parallel.c in HandleParallelMessages(): > > Drilling into HandleParallelMessage(), I see the PqMsg_BackendKeyData > case just reads a fixed number of bytes. In some of the other > "switch" cases, I see calls pq_getmsgrawstring() either directly or > indirectly. I see the counterpart to pq_getmsgrawstring() is > pq_sendstring() which always appends the NUL char to the StringInfo, > so I don't think not NUL terminating the received bytes is a problem > as cstrings seem to be sent with the NUL terminator. > > This case just seems to handle ERROR/NOTICE messages coming from > parallel workers. Not tuples themselves. It may not be that > interesting a case to speed up. > > * applyparallelworker.c in HandleParallelApplyMessages(): > > Drilling into HandleParallelApplyMessage(), I don't see anything there > that needs the input StringInfo to be NUL terminated. > Both the above calls are used to handle ERROR/NOTICE messages from parallel workers as you have also noticed. The comment atop initReadOnlyStringInfo() clearly states that it is used in the performance-critical path. So, is it worth changing these places? In the future, this may pose the risk of this API being used inconsistently. -- With Regards, Amit Kapila.
Re: Why is DEFAULT_FDW_TUPLE_COST so insanely low?
Looks like the value goes back a long way, to d0d75c402217421b691050857eb3d7af82d0c770. The comment there adds "above and beyond cpu_tuple_cost". So certainly it's expected to be higher than cpu_tuple_cost. I have no memory of this. But looking at the surrounding code, I think DEFAULT_FDW_STARTUP_COST takes care of network costs and bandwidth. So DEFAULT_FDW_TUPLE_COST is just the cost of assembling a row from bytes off the network. That might have been equated to assembling a row from a heap buffer. But I think you are right, it should be comparable to the parallel tuple cost, which at least involves IPC, like a socket. This will also mean that operations which reduce the number of rows will be favoured and pushed down. That's what is desired. -- Best Wishes, Ashutosh Bapat On Mon, Oct 30, 2023 at 6:52 AM David Rowley wrote: > > On Sun, 29 Oct 2023 at 12:45, Bruce Momjian wrote: > > Has anything been done about this issue? > > Nothing has been done. I was hoping to get the attention of a few > people who have dealt more with postgres_fdw in the past. > > I've attached a patch with adjusts DEFAULT_FDW_TUPLE_COST and sets it > to 0.2. I set it to this because my experiment in [1] showed that it > was about 21x lower than the actual costs (on my machine with a > loopback fdw connecting to the same instance and database using my > example query). Given that we have parallel_tuple_cost set to 0.1 by > default, the network cost of a tuple from an FDW of 0.2 seems > reasonable to me. Slightly higher is probably also reasonable, but > given the seeming lack of complaints, I think I'd rather err on the > low side. > > Changing it to 0.2, I see 4 plans change in postgres_fdw's regression > tests. All of these changes are due to STD_FUZZ_FACTOR causing some > other plan to win in add_path(). > > For example the query EXPLAIN (VERBOSE, ANALYZE) SELECT a, sum(b), > min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY > 1; the plan switches from a HashAggregate to a GroupAggregate. 
This is > because after increasing the DEFAULT_FDW_TUPLE_COST to 0.2 the sorted > append child (fuzzily) costs the same as the unsorted seq scan path > and the sorted path wins in add_path due to having better pathkeys. > The seq scan path is then thrown away and we end up doing the Group > Aggregate using the sorted append children. > > If I change STD_FUZZ_FACTOR to something like 1.001 then the plans > no longer change when I do: > > alter server loopback options (add fdw_tuple_cost '0.01'); > > alter server loopback options (drop fdw_tuple_cost); > > > Ordinarily, I'd not care too much about that, but I did test the > performance of one of the plans and the new plan came out slower than > the old one. > > I'm not exactly sure how best to proceed on this in the absence of any > feedback. > > David > > [1] > https://postgr.es/m/caaphdvopvjjfh5c1ed2hrvddfom2depmwwiu5-f1anmyprj...@mail.gmail.com
Re: Is this a problem in GenericXLogFinish()?
On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier wrote: > > On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote: > > Yes, we need it to exclude any concurrent in-progress scans that could > > return incorrect tuples during bucket squeeze operation. > > Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the > primary and write buffers are the same and there are no tuples to > move. Say with something like the attached? > What if the primary and write buffer are not the same but ntups is zero? Won't we hit the assertion again in that case? -- With Regards, Amit Kapila.
Explicitly skip TAP tests under Meson if disabled
Under Meson, it is not very easy to see whether TAP tests have been enabled or disabled when you rely on the default auto setting. You either need to carefully study the meson setup output, or you notice: wait a minute, didn't there use to be like 250 tests, not only 80? I think it would be better if we still registered the TAP tests in Meson even if the tap_tests option is disabled, but with a dummy command that registers them as skipped. That way you get a more informative output like

Ok:                 78
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            187
Timeout:            0

which is really a more accurate representation of what the test run actually accomplished than "everything Ok". See attached patch for a possible implementation. (This uses perl as a hard build requirement. We are planning to do that anyway, but obviously other implementations, such as using python, would also be possible.)

From 19e78e4c5a16337c0ac4e661beb4729505736016 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 30 Oct 2023 05:32:45 -0400
Subject: [PATCH] Explicitly skip TAP tests under Meson if disabled

If the tap_tests option is disabled under Meson, the TAP tests are
currently not registered at all. But this makes it harder to see what
is going on, why suddenly there are fewer tests than before. Instead,
register the tests anyway but with a dummy command that marks them as
skipped. That way, the total list and count of tests is constant
whether the option is enabled or not. 
---
 meson.build | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/meson.build b/meson.build
index 2d516c8f372..1795dd8159e 100644
--- a/meson.build
+++ b/meson.build
@@ -3248,10 +3248,6 @@ foreach test_dir : tests
       testport += 1
     elif kind == 'tap'
-      if not tap_tests_enabled
-        continue
-      endif
-
       test_command = [
         perl.path(),
         '-I', meson.source_root() / 'src/test/perl',
@@ -3286,16 +3282,23 @@ foreach test_dir : tests
           onetap_p = fs.stem(onetap_p)
         endif
 
-        test(test_dir['name'] / onetap_p,
-          python,
-          kwargs: test_kwargs,
-          args: testwrap_base + [
-            '--testgroup', test_dir['name'],
-            '--testname', onetap_p,
-            '--', test_command,
-            test_dir['sd'] / onetap,
-          ],
-        )
+        if tap_tests_enabled
+          test(test_dir['name'] / onetap_p,
+            python,
+            kwargs: test_kwargs,
+            args: testwrap_base + [
+              '--testgroup', test_dir['name'],
+              '--testname', onetap_p,
+              '--', test_command,
+              test_dir['sd'] / onetap,
+            ],
+          )
+        else
+          test(test_dir['name'] / onetap_p,
+            perl,
+            args: ['-e', 'print "1..0 # Skipped: TAP tests not enabled"'],
+            kwargs: test_kwargs)
+        endif
       endforeach
       install_suites += test_group
     else
-- 
2.42.0
Re: pg_upgrade and logical replication
On Fri, 27 Oct 2023 at 12:09, vignesh C wrote: > > On Thu, 21 Sept 2023 at 11:27, Michael Paquier wrote: > > > > On Fri, Sep 15, 2023 at 03:08:21PM +0530, vignesh C wrote: > > > On Tue, 12 Sept 2023 at 14:25, Hayato Kuroda (Fujitsu) > > > wrote: > > >> Is there a possibility that apply worker on old cluster connects to the > > >> publisher during the upgrade? Regarding the pg_upgrade on publisher, the > > >> we > > >> refuse TCP/IP connections from remotes and port number is also changed, > > >> so we can > > >> assume that subscriber does not connect to. But IIUC such settings may > > >> not affect > > >> to the connection source, so that the apply worker may try to connect to > > >> the > > >> publisher. Also, is there any hazards if it happens? > > > > > > Yes, there is a possibility that the apply worker gets started and new > > > transaction data is being synced from the publisher. I have made a fix > > > not to start the launcher process in binary ugprade mode as we don't > > > want the launcher to start apply worker during upgrade. > > > > Hmm. I was wondering if 0001 is the right way to handle this case, > > but at the end I'm OK to paint one extra isBinaryUpgrade in the code > > path where apply launchers are registered. I don't think that the > > patch is complete, though. A comment should be added in pg_upgrade's > > server.c, exactly start_postmaster(), to tell that -b also stops apply > > workers. I am attaching a version updated as of the attached, that > > I'd be OK to apply. > > I have added comments > > > I don't really think that we need to worry about a subscriber > > connecting back to a publisher in this case, though? I mean, each > > postmaster instance started by pg_upgrade restricts the access to the > > instance with unix_socket_directories set to a custom path and > > permissions at 0700, and a subscription's connection string does not > > know the unix path used by pg_upgrade. 
I certainly agree that > > stopping these processes could lead to inconsistencies in the data the > > subscribers have been holding though, if we are not careful, so > > preventing them from running is a good practice anyway. > > I have made the fix similar to how upgrade publisher has done to keep > it consistent. > > > I have also reviewed 0002. As a whole, I think that I'm OK with the > > main approach of the patch in pg_dump to use a new type of dumpable > > object for subscription relations that are dumped with their upgrade > > functions after. This still needs more work, and more documentation. > > Added documentation > > > Also, perhaps we should really have an option to control if this part > > of the copy happens or not. With a --no-subscription-relations for > > pg_dump at least? > > Currently this is done by default in binary upgrade mode, I will add a > separate patch to skip dump of subscription relations from upgrade and > dump a little later. > > > > > +{ oid => '4551', descr => 'add a relation with the specified relation > > state to pg_subscription_rel table', > > > > During a development cycle, any new function added needs to use an OID > > in range 8000-. Running unused_oids will suggest new random OIDs. > > Modified > > > FWIW, I am not convinced that there is a need for two functions to add > > an entry to pg_subscription_rel, with sole difference between both the > > handling of a valid or invalid LSN. We should have only one function > > that's able to handle NULL for the LSN. So let's remove rel_state_a > > and rel_state_b, and have a single rel_state(). The description of > > the SQL functions is inconsistent with the other binary upgrade ones, > > I would suggest for the two functions > > "for use by pg_upgrade (relation for pg_subscription_rel)" > > "for use by pg_upgrade (remote_lsn for origin)" > > Removed rel_state_a and rel_state_b and updated the description accordingly > > > + i_srsublsn = PQfnumber(res, "srsublsn"); > > [...] 
> > + subrinfo[cur_rel].srsublsn = pg_strdup(PQgetvalue(res, i, > > i_srsublsn)); > > > > In getSubscriptionTables(), this should check for PQgetisnull() > > because we would have a NULL value for InvalidXLogRecPtr in the > > catalog. Using a char* for srsublsn is OK, but just assign NULL to > > it, then just pass a hardcoded NULL value to the function as we do in > > other places. So I don't quite get why this is not the same handling > > as suboriginremotelsn. > > Modified > > > > > getSubscriptionTables() is entirely skipped if we don't want any > > subscriptions, if we deal with a server of 9.6 or older or if we don't > > do binary upgrades, which is OK. > > > > +/* > > + * getSubscriptionTables > > + * get information about subscription membership for dumpable tables. > > + */ > > This commit is slightly misleading and should mention that this is an > > upgrade-only path? > > Modified > > > > > The code for dumpSubscriptionTable() is a copy-paste of > > dumpPublicationTable(), but a lot of what you are
Re: A recent message added to pg_upgade
On Mon, Oct 30, 2023 at 2:31 PM Bharath Rupireddy wrote: > Never mind. Previous message was accidentally sent before I finished writing my comments. > Yeah. The check_hook is called even after the param is specified in > postgresql.conf during the upgrade, so I see no problem there. > > > The error message, which is deemed impossible, adds an additional > > message translation. In another thread, we are discussing the > > reduction of translatable messages. Therefore, I suggest using elog() > > for the condition at the very least. Whether it should be elog() or > > Assert() remains open for discussion, as I don't have a firm stance on > > it. > > I get it. I agree to go with just the assert because the GUC > check_hook kinda tightens the screws against setting > max_slot_wal_keep_size to a value other than -1 during the binary > upgrade, A few comments on your inhibit_m_s_w_k_s_during_upgrade_2.txt: 1. + * All WAL files on the publisher node must be retained during an upgrade to + * maintain connections from downstreams. While pg_upgrade explicitly sets It's not just the publisher, anyone using logical slots. Also, no downstream please. If you want, you can repurpose the comment that's added by 29d0a77f. /* * Use max_slot_wal_keep_size as -1 to prevent the WAL removal by the * checkpointer process. If WALs required by logical replication slots * are removed, the slots are unusable. This setting prevents the * invalidation of slots during the upgrade. We set this option when * cluster is PG17 or later because logical replication slots can only be * migrated since then. Besides, max_slot_wal_keep_size is added in PG13. */ 2. At present, only logical slots really require + * this. Can we remove the above comment as the code with SlotIsLogical(s) explains it all? 3. 
+GUC_check_errdetail("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade."); +return false;

How about we be explicit like the following, which helps users reason about this restriction instead of having to look at the comments/docs?

GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
GUC_check_errmsg("\"max_slot_wal_keep_size\" must be set to -1 when in binary upgrade mode");
GUC_check_errdetail("A value of -1 prevents the removal of WAL required for logical slots upgrade.");
return false;

4. I think a test case to hit the error in the check hook in 003_logical_slots.pl will help a lot here - not only covers the code but also helps demonstrate how one can reach the error.

5. I think the check_hook is better defined in xlog.c, the place where it's actually being declared and in action. IMV, there's no reason for it to be in slot.c as it doesn't deal with any slot related variables/structs. This also avoids an unnecessary "utils/guc_hooks.h" inclusion in slot.c. +bool +check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source) +{

6. A possible problem with this check_hook approach is that it doesn't let anyone set max_slot_wal_keep_size to a value other than -1 during pg_upgrade, even if someone doesn't have logical slots or doesn't want to upgrade logical slots, in which case the WAL file growth during pg_upgrade may be huge (transiently) unless the pg_resetwal step of pg_upgrade removes it at the end.

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PGDOCS - add more links in the pub/sub reference pages
On Mon, Oct 16, 2023 at 6:15 AM Peter Smith wrote: > > Thanks for pushing the 0001 patch! I am unsure what the decision was > for the remaining patches, but anyway, here they are again (rebased). > To keep the link names the same in logical replication-related commands, I have pushed your patch. -- With Regards, Amit Kapila.
RE: PATCH: document for regression test forgets libpq test
Hi, > Wednesday, September 6, 2023 2:06 AM Bruce Momjian wrote: > Yes, good point. I modifed the patch, attached, and applied it to all > supported versions. Thank you. # I forgot to send mail. Best Regards Ryo Matsumura > -Original Message- > From: Bruce Momjian > Sent: Wednesday, September 6, 2023 2:06 AM > To: Matsumura, Ryo/松村 量 > Cc: pgsql-hack...@postgresql.org > Subject: Re: PATCH: document for regression test forgets libpq test > > On Fri, Sep 1, 2023 at 08:01:47AM +, Ryo Matsumura (Fujitsu) wrote: > > Hi, > > > > I found a small mistake in document in 33.1.3. Additional Test Suites. > > > > > The additional tests that can be invoked this way include: > > The list doesn't include interface/libpq/test. > > > > I attached patch. > > Yes, good point. I modifed the patch, attached, and applied it to all > supported versions. > > -- > Bruce Momjian https://momjian.us > EDB https://enterprisedb.com > > Only you can decide what is important to you.
RE: A recent message added to pg_upgrade
Dear Bharath, > Will the check_hook approach work correctly? I tested using the first version and it worked well (the setting was rejected). Please see the attached log, which records the output. The lines below were copied from the server log and show that max_slot_wal_keep_size must not be >= 0. ``` waiting for server to start2023-10-30 08:53:32.529 GMT [6903] FATAL: invalid value for parameter "max_slot_wal_keep_size": 1 stopped waiting pg_ctl: could not start serve ``` > I haven't checked that by > myself, but I see InitializeGUCOptions() getting called before > IsBinaryUpgrade is set to true and the passed-in config options ('c') > are parsed. I think the key point is that user-defined options are placed after the "-b": since they are parsed after the '-b' option has already set IsBinaryUpgrade, the check_hook works as we expected. Thoughts? Best Regards, Hayato Kuroda FUJITSU LIMITED $ pg_upgrade -b /usr/local/pgsql/bin/ -B /usr/local/pgsql/bin/ -d data_N1/ -D data_N2/ -r -U postgres -o "-c max_slot_wal_keep_size=1MB" Performing Consistency Checks - Checking cluster versions ok *failure* Consult the last few lines of "data_N2/pg_upgrade_output.d/20231030T085332.388/log/pg_upgrade_server.log" for the probable cause of the failure. connection to server on socket "/home/hayato/test/pgdump/.s.PGSQL.50432" failed: No such file or directory Is the server running locally and accepting connections on that socket? could not connect to source postmaster started with the command: "/usr/local/pgsql/bin/pg_ctl" -w -l "data_N2/pg_upgrade_output.d/20231030T085332.388/log/pg_upgrade_server.log" -D "data_N1" -o "-p 50432 -b -c max_slot_wal_keep_size=-1 -c max_slot_wal_keep_size=1MB -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/home/hayato/test/pgdump'" start Failure, exiting pg_upgrade_server.log Description: pg_upgrade_server.log
Re: A recent message added to pg_upgrade
On Mon, Oct 30, 2023 at 1:42 PM Kyotaro Horiguchi wrote: > > At Mon, 30 Oct 2023 12:38:47 +0530, Bharath Rupireddy > wrote in > > Will the check_hook approach work correctly? I haven't checked that by > > myself, but I see InitializeGUCOptions() getting called before > > IsBinaryUpgrade is set to true and the passed-in config options ('c') > > are parsed. > > I'm not sure about the wanted behavior exactly, but the fact you > pointed out doesn't matter because the check is required after parsing the > command line options. On the other hand, I'm not sure about the > behavior that a setting in postgresql.conf is rejected. Yeah. The check_hook is called even after the param is specified in postgresql.conf during the upgrade, so I see no problem there. > > If the check_hook approach works correctly, I think we must add a test > > hitting the error in check_max_slot_wal_keep_size for the > > IsBinaryUpgrade case. And, I agree with Amit to have a detailed > > messaging with GUC_check_errmsg/GUC_check_errdetail. Also, IMV, > > leaving the error message in InvalidatePossiblyObsoleteSlot() there > > (if required with a better wording as discussed initially in this > > thread) does no harm. Actually, it acts as another safety net given > > that max_slot_wal_keep_size GUC is reloadable via SIGHUP. > > The error message, which is deemed impossible, adds an additional > message translation. In another thread, we are discussing the > reduction of translatable messages. Therefore, I suggest using elog() > for the condition at the very least. Whether it should be elog() or > Assert() remains open for discussion, as I don't have a firm stance on > it. I get it. I agree to go with just the assert because the GUC check_hook kinda tightens the screws against setting max_slot_wal_keep_size to a value other than -1 during the binary upgrade. A few comments on your inhibit_m_s_w_k_s_during_upgrade_2.txt: 1.
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: Open a streamed block for transactional messages during decoding
On Monday, October 30, 2023 12:20 PM Amit Kapila wrote: > > On Thu, Oct 26, 2023 at 2:01 PM Zhijie Hou (Fujitsu) > wrote: > > > > On Thursday, October 26, 2023 12:42 PM Amit Kapila > wrote: > > > > > > On Tue, Oct 24, 2023 at 5:27 PM Zhijie Hou (Fujitsu) > > > > > > wrote: > > > > > > > > While reviewing the test_decoding code, I noticed that when the > > > > skip_empty_xacts option is specified, it doesn't open the > > > > streaming > > > block (e.g. > > > > pg_output_stream_start) before streaming the transactional MESSAGE > > > > even if it's the first change in a streaming block. > > > > > > > > It looks inconsistent with what we do when streaming DML changes (e.g. > > > > pg_decode_stream_change()). > > > > > > > > Here is a small patch to open the stream block in this case. > > > > > > > > > > The change looks good to me though I haven't tested it yet. BTW, can > > > we change the comment: "Output stream start if we haven't yet, but > > > only for the transactional case." to "Output stream start if we > > > haven't yet for transactional messages"? > > > > Thanks for the review and I changed this as suggested. > > > > --- a/contrib/test_decoding/expected/stream.out > +++ b/contrib/test_decoding/expected/stream.out > @@ -29,7 +29,10 @@ COMMIT; > SELECT data FROM pg_logical_slot_get_changes('regression_slot', > NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', > '1'); > data > -- > + opening a streamed block for transaction > streaming message: transactional: 1 prefix: test, sz: 50 > + closing a streamed block for transaction aborting streamed > + (sub)transaction > > I was analyzing the reason for the additional message: "aborting streamed > (sub)transaction" in the above test and it seems to be due to the below check > in > the function pg_decode_stream_abort(): > > if (data->skip_empty_xacts && !xact_wrote_changes) return; > > Before the patch, we won't be setting the 'xact_wrote_changes' flag in txndata > which is fixed now.
So, this looks okay to me. However, I have another > observation in this code which is that for aborts or subtransactions, we are > not checking the flag 'stream_wrote_changes', so we may end up emitting the > abort message even when no actual change has been streamed. I haven't tried > to generate a test to verify this observation, so I could be wrong as well > but it is > worth analyzing such cases. I have confirmed that the mentioned case is possible (steps at [1]): the sub-transaction doesn't output any data, but the stream abort for this sub-transaction will still be sent. But I think this may not be problematic behavior, as even pgoutput can behave similarly, e.g. if all the changes are filtered by a row filter or table filter, then the stream abort will still be sent. The subscriber will skip handling the STREAM ABORT if the aborted txn was not applied. And if we want to fix this, in the output plugin, we need to record whether we have sent any changes for each sub-transaction so that we can decide whether to send the following stream abort or not. We cannot use 'stream_wrote_changes' because it's a per-streamed-block flag and there could be several streamed blocks for one sub-txn. It looks a bit complicated to me. [1] SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding'); BEGIN; savepoint p1; CREATE TABLE test(a int); INSERT INTO test VALUES(1); savepoint p2; CREATE TABLE test2(a int); ROLLBACK TO SAVEPOINT p2; COMMIT; SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '1', 'stream-changes', '1'); data -- opening a streamed block for transaction TXN 734 streaming change for TXN 734 closing a streamed block for transaction TXN 734 aborting streamed (sub)transaction TXN 736 committing streamed transaction TXN 734 Best Regards, Hou zj
Re: Add new option 'all' to pg_stat_reset_shared()
On Mon, Oct 30, 2023 at 1:47 PM Kyotaro Horiguchi wrote: > > At Mon, 30 Oct 2023 16:35:19 +0900, torikoshia > wrote in > > Hi, > > > > After 96f052613f3, we have below 6 types of parameter for > > pg_stat_reset_shared(). > > > > "archiver", "bgwriter", "checkpointer", "io", "recovery_prefetch", > > "wal" > > > > How about adding a new option 'all' to delete all targets above? > > > > I imagine there are cases where people want to initialize all of them > > at the same time in addition to initializing one at a time. > > FWIW, I fairly often wanted it, but forgot about that:p Isn't calling pg_stat_reset_shared() for all stats types helping you out? Is there any problem with it? Can you be more specific about the use-case? IMV, I don't see any point in adding another pseudo (rather non-existent) shared stats target, which might confuse users - it's easy to type pg_stat_reset_shared('all'); and clear things out when someone actually doesn't want to reset everything - an accidental usage of the 'all' option will reset all shared memory stats. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Add new option 'all' to pg_stat_reset_shared()
At Mon, 30 Oct 2023 16:35:19 +0900, torikoshia wrote in > Hi, > > After 96f052613f3, we have below 6 types of parameter for > pg_stat_reset_shared(). > > "archiver", "bgwriter", "checkpointer", "io", "recovery_prefetch", > "wal" > > How about adding a new option 'all' to delete all targets above? > > I imagine there are cases where people want to initialize all of them > at the same time in addition to initializing one at a time. FWIW, I fairly often wanted it, but forgot about that:p regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: A recent message added to pg_upgrade
At Mon, 30 Oct 2023 12:38:47 +0530, Bharath Rupireddy wrote in > Will the check_hook approach work correctly? I haven't checked that by > myself, but I see InitializeGUCOptions() getting called before > IsBinaryUpgrade is set to true and the passed-in config options ('c') > are parsed. I'm not sure about the wanted behavior exactly, but the fact you pointed out doesn't matter because the check is required after parsing the command line options. On the other hand, I'm not sure about the behavior that a setting in postgresql.conf is rejected. > If the check_hook approach works correctly, I think we must add a test > hitting the error in check_max_slot_wal_keep_size for the > IsBinaryUpgrade case. And, I agree with Amit to have a detailed > messaging with GUC_check_errmsg/GUC_check_errdetail. Also, IMV, > leaving the error message in InvalidatePossiblyObsoleteSlot() there > (if required with a better wording as discussed initially in this > thread) does no harm. Actually, it acts as another safety net given > that max_slot_wal_keep_size GUC is reloadable via SIGHUP. The error message, which is deemed impossible, adds an additional message translation. In another thread, we are discussing the reduction of translatable messages. Therefore, I suggest using elog() for the condition at the very least. Whether it should be elog() or Assert() remains open for discussion, as I don't have a firm stance on it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
DRAFT GIST support for ORDER BY
Hi All,

Attached is a first attempt to implement GIST index (only) scans for ORDER BY column clauses. The idea is that ordering by a column is, for some datatypes, a special case of ordering by distance:

ORDER BY a == ORDER BY a <-> MIN_VALUE
and
ORDER BY a DESC == ORDER BY a <-> MAX_VALUE

This allows implementing GIST ordered scans for btree_gist datatypes. This in turn makes using GIST with partitioning feasible (I have described issues with such usage in my previous e-mails - see below). The solution is not ideal as it requires registering "<" and ">" operators as ordering operators in the opfamily (which in turn makes it possible to issue the somewhat meaningless "ORDER BY a < 'constant'"). The problem, though, is that right now the handling of ORDER BY column clauses is tightly coupled to BTree. It would be good to refactor the code so that the semantics of ORDER BY column could be more flexible. It would be great if someone could take a look at it.

Thanks, Michal

DRAFT-gist-orderby.patch Description: Binary data

On 24 Oct 2023, at 13:22, Michał Kłeczek wrote: Hi, Some time ago I've provided some details on the issues we face when trying to use GIST and partitioning at the same time in the postgresql-general mailing list: GIST index and ORDER BY postgresql.org We decided to go with the solution to partition our table by: RANGE ('2100-01-01' <-> operation_date). While it (somewhat) solves the partition pruning issues described above, there is another problem: it is impossible to create a unique constraint on the partitioned table. So now we cannot use INSERT … ON CONFLICT (…) DO UPDATE. My question to hackers: would it be feasible to implement an ORDER BY column GIST index (only) scan for types with total order and sensible greatest and least values? Thanks, Michal
Re: A performance issue with Memoize
On Thu, Oct 26, 2023 at 12:07 PM Andrei Lepikhov wrote: > Do you've thought about the case, fixed with the commit 1db5667? As I > see, that bugfix still isn't covered by regression tests. Could your > approach of a PARAM_EXEC slot reusing break that case? Hm, I don't think so. The issue fixed by commit 1db5667 was caused by sharing PARAM_EXEC slots between different levels of NestLoop. AFAICS it's safe to share PARAM_EXEC slots within the same level of NestLoop. The change here is about sharing PARAM_EXEC slots between subquery's subplan_params and outer-relation variables, which happens within the same level of NestLoop. Actually, even without this change, we'd still share PARAM_EXEC slots between subquery's subplan_params and outer-relation variables in some cases. As an example, consider explain (costs off) select * from t t1 left join (t t2 left join lateral (select t1.a as t1a, t2.a as t2a, * from t t3) s on t2.b = s.b) on t1.b = s.b and t1.a = t2.a; QUERY PLAN --- Nested Loop Left Join -> Seq Scan on t t1 -> Nested Loop Join Filter: (t1.a = t2.a) -> Seq Scan on t t2 -> Subquery Scan on s Filter: ((t1.b = s.b) AND (t2.b = s.b)) -> Seq Scan on t t3 (8 rows) For outer-relation Var 't1.a' from qual 't1.a = t2.a', it shares PARAM_EXEC slot 0 with the PlannerParamItem for 't1.a' within the subquery (from its targetlist). Did you notice a case that the change here breaks? Hi Tom, could you share your insights on this issue and the proposed fix? Thanks Richard
Re: A recent message added to pg_upgrade
At Mon, 30 Oct 2023 03:36:41 +, "Zhijie Hou (Fujitsu)" wrote in > Thanks for the diff and I think the approach basically works. > > One notable behavior of this approach it will reject the GUC setting even if > there > are no slots on old cluster or user set the value to a big enough value which > doesn't cause invalidation. The behavior doesn't look bad to me but just > mention it > for reference. Indeed. pg_upgrade anyway sets the variable to -1 irrespective of the slot's existence, and I see no justification for allowing users to forcibly change it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: A recent message added to pg_upgrade
At Mon, 30 Oct 2023 08:51:18 +0530, Amit Kapila wrote in > I think we can simply change that error message to assert if we want > to go with the check hook idea of yours. BTW, can we add > GUC_check_errdetail() with a better message as some of the other check > function uses? Also, I guess we can add some comments or at least > refer to the existing comments to explain the reason of such a check. Definitely. I've attached the following changes. 1. Added GUC_check_errdetail() to the check function. 2. Added a comment to the check function (based on my knowledge about the feature). 3. Altered the ereport() into Assert() in InvalidatePossiblyObsoleteSlot(). I considered removing the !SlotIsLogical() condition since pg_upgrade always sets max_slot_wal_keep_size to -1, but I left it unchanged. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 99823df3c7..f7dc966600 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -52,6 +52,7 @@ #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" +#include "utils/guc_hooks.h" /* * Replication slot on-disk data structure. @@ -111,6 +112,26 @@ static void RestoreSlotFromDisk(const char *name); static void CreateSlotOnDisk(ReplicationSlot *slot); static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel); +/* + * GUC check_hook for max_slot_wal_keep_size + * + * All WAL files on the publisher node must be retained during an upgrade to + * maintain connections from downstreams. While pg_upgrade explicitly sets + * this variable to -1, there are ways for users to modify it. Ensure it + * remains unchanged for safety. See InvalidatePossiblyObsoleteSlot() and + * start_postmaster() in pg_upgrade for more details. 
+ */ +bool +check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source) +{ + if (IsBinaryUpgrade && *newval != -1) + { + GUC_check_errdetail("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade."); + return false; + } + return true; +} + /* * Report shared-memory space needed by ReplicationSlotsShmemInit. */ @@ -1424,18 +1445,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, SpinLockRelease(&s->mutex); /* -* The logical replication slots shouldn't be invalidated as -* max_slot_wal_keep_size GUC is set to -1 during the upgrade. -* -* The following is just a sanity check. +* check_max_slot_wal_keep_size() ensures max_slot_wal_keep_size is set +* to -1, so, slot invalidation for logical slots shouldn't happen +* during an upgrade. At present, only logical slots really require +* this. */ - if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade) - { - ereport(ERROR, - errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("replication slots must not be invalidated during the upgrade"), - errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade")); - } + Assert(!*invalidated || !SlotIsLogical(s) || !IsBinaryUpgrade); if (active_pid != 0) { diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 7605eff9b9..818ffdb2ae 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2845,7 +2845,7 @@ struct config_int ConfigureNamesInt[] = }, &max_slot_wal_keep_size_mb, -1, -1, MAX_KILOBYTES, - NULL, NULL, NULL + check_max_slot_wal_keep_size, NULL, NULL }, { diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h index 2a191830a8..3d74483f44 100644 --- a/src/include/utils/guc_hooks.h +++ b/src/include/utils/guc_hooks.h @@ -84,6 +84,8 @@ extern bool check_maintenance_io_concurrency(int *newval, void **extra, extern void assign_maintenance_io_concurrency(int newval, void *extra); extern bool check_max_connections(int *newval, void
**extra, GucSource source); extern bool check_max_wal_senders(int *newval, void **extra, GucSource source); +extern bool check_max_slot_wal_keep_size(int *newval, void **extra, + GucSource source); extern void assign_max_wal_size(int newval, void *extra); extern bool check_max_worker_processes(int *newval, void **extra, GucSource source);
Add new option 'all' to pg_stat_reset_shared()
Hi, After 96f052613f3, we have below 6 types of parameter for pg_stat_reset_shared(). "archiver", "bgwriter", "checkpointer", "io", "recovery_prefetch", "wal" How about adding a new option 'all' to delete all targets above? I imagine there are cases where people want to initialize all of them at the same time in addition to initializing one at a time. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.
Hi, Alexander! On Wed, 25 Oct 2023 at 00:13, Alexander Korotkov wrote: > > On Wed, Sep 28, 2022 at 11:44 AM Aleksander Alekseev > wrote: > > > I think, this patch was marked as "Waiting on Author", probably, by > > > mistake. Since recent changes were done without any significant code > > > changes and the CF bot is now happy again. > > > > > > I'm going to move it to RfC, could I? If not, please tell why. > > > > I restored the "Ready for Committer" state. I don't think it's a good > > practice to change the state every time the patch has a slight > > conflict or something. This is not helpful at all. Such things happen > > quite regularly and typically are fixed in a couple of days. > > This patch seems useful to me. I went through the thread, it seems > that all the critics are addressed. > > I've rebased this patch. Also, I've run perltidy for tests, split > long errmsg() into errmsg(), errdetail() and errhint(), and done other > minor enhancements. > > I think this patch is ready to go. I'm going to push it if there are > no objections. > > -- > Regards, > Alexander Korotkov It's very good that this long-standing patch is finally committed. Thanks a lot! Regards, Pavel Borisov
Add BF member koel-like indentation checks to SanityCheck CI
Hi, How about adding code indent checks (like what BF member koel has) to the SanityCheck CI task? This helps catch indentation problems well before things are committed, so that developers can find them in their own CI runs, and it helps developers learn the postgres code indentation conventions. It also saves committers time - git log | grep 'koel' | wc -l gives me 11 commits and git log | grep 'indentation' | wc -l gives me 97. Most, if not all, of these commits went into fixing code indentation problems that could have been reported a bit earlier and fixed by the developers/patch authors themselves. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
On Fri, Oct 27, 2023 at 09:31:10AM -0400, David Steele wrote: > That sounds like the right plan to me. Nice and simple. I'll tackle that in a separate thread with a patch registered for the upcoming CF of November. > I'm still +1 for the patch as it stands. I have been reviewing the patch, and applied portions of it as of dc5bd388 and 1ffdc03c and they're quite independent pieces. After that, the remaining bits of the patch to change the behavior are now straightforward. I have written a commit message for it, while on it, as per the attached. -- Michael From 26a8432fe3ab8426e7797d85d19b0fe69d3384c9 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 30 Oct 2023 16:02:52 +0900 Subject: [PATCH v4] Require recovery.signal or standby.signal when reading a backup_label file Historically, the startup process uses two static variables to control if archive recovery should happen, when either recovery.signal or standby.signal are defined in the data folder at the beginning of recovery: - ArchiveRecoveryRequested, set to "true" if either .signal file exists. This controls if archive recovery has been requested. - InArchiveRecovery, that controls if recovery should switch to archive mode during recovery. It is initially "false" but switched to "true" during recovery for two cases: -- Just after finding and reading a valid backup_label file, close to the beginning of recovery. -- When no backup_label file is found, it would be set to "false" to force crash recovery, where all the WAL located in pg_wal/ is read first. When all the local WAL records have been consumed, it is changed to "true" to switch to archive recovery mode, equivalent to a base backup. Now comes the fact that recovery has an old logical problem: when a backup_label is found but no .signal file is defined, the startup process finishes in a state where ArchiveRecoveryRequested is "false" but InArchiveRecovery is "true".
This is proving to cause various issues, that got worse since restart points can run during crash recovery for the benefit of being able to start and stop instances without doing crash recovery from its initial point (7ff23c6d277d): - Some standby states needed by archive recovery would not be set, causing correctness issues, like assertion failures during recovery. - recoveryWakeupLatch would not be set. - Some GUCs are messed up, hot_standby that depends on ArchiveRecoveryRequested. This configuration was possible when recovering from a base backup taken by pg_basebackup without -R. Note that the documentation requires at least to set recovery.signal to restore from a backup, but the startup process was not making this policy explicit. In most cases, one would have been able to complete recovery, but that's a matter of luck, really, as it depends on the workload of the origin server. This has the advantage of simplifying the logic for the archive recovery case, as InArchiveRecovery now requires ArchiveRecoveryRequested to be set. Recovering a self-contained backup now requires a recovery.signal, with a restore_command set in postgresql.conf (or related conf file). One test of pg_basebackup and one test of pg_rewind need to be tweaked to avoid the FATAL introduced by this patch when a base_backup is found without a .signal file, even if the intention of both is to have a recovery.signal set. 
Reviewed-by: David Steele, Bowen Shi Discussion: https://postgr.es/m/zarvomifjze7f...@paquier.xyz Discussion: https://postgr.es/m/Y/Q/17rpys7yg...@paquier.xyz Discussion: https://postgr.es/m/Y/v0c+3W89NBT/i...@paquier.xyz --- src/backend/access/transam/xlogrecovery.c | 22 --- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 3 ++- src/bin/pg_rewind/t/008_min_recovery_point.pl | 1 + 3 files changed, 22 insertions(+), 4 deletions(-) diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index c6156a..0088dfec2e 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -125,14 +125,18 @@ static TimeLineID curFileTLI; /* * When ArchiveRecoveryRequested is set, archive recovery was requested, - * ie. signal files were present. When InArchiveRecovery is set, we are - * currently recovering using offline XLOG archives. These variables are only - * valid in the startup process. + * ie. signal files or backup_label were present. When InArchiveRecovery is + * set, we are currently recovering using offline XLOG archives. These + * variables are only valid in the startup process. Note that the presence of + * a backup_label file requires the presence of one of the two .signal files + * to enforce archive recovery. * * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're * currently performing crash recovery using only XLOG files in pg_wal, but * will switch to using offline XLOG archives as soon as we reach the end of * WAL in pg_wal. + * + * InArchiveRecovery should never be
Re: A recent message added to pg_upgrade
On Mon, Oct 30, 2023 at 8:51 AM Amit Kapila wrote: > > > This discussion seems like a bit off from my point. I suggested adding > > a check for that setting when IsBinaryUpgraded is true at the GUC > > level as shown in the attached patch. I believe Álvaro made a similar > > suggestion. While the error message is somewhat succinct, I think it > > is sufficient given the low possilibility of the scenario and the fact > > that it cannot occur inadvertently. > > > > I think we can simply change that error message to assert if we want > to go with the check hook idea of yours. BTW, can we add > GUC_check_errdetail() with a better message as some of the other check > function uses? Also, I guess we can add some comments or at least > refer to the existing comments to explain the reason of such a check. Will the check_hook approach work correctly? I haven't checked that by myself, but I see InitializeGUCOptions() getting called before IsBinaryUpgrade is set to true and the passed-in config options ('c') are parsed. If the check_hook approach works correctly, I think we must add a test hitting the error in check_max_slot_wal_keep_size for the IsBinaryUpgrade case. And, I agree with Amit to have a detailed messaging with GUC_check_errmsg/GUC_check_errdetail. Also, IMV, leaving the error message in InvalidatePossiblyObsoleteSlot() there (if required with a better wording as discussed initially in this thread) does no harm. Actually, it acts as another safety net given that max_slot_wal_keep_size GUC is reloadable via SIGHUP. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Introduce a new view for checkpointer related stats
On Mon, Oct 30, 2023 at 6:19 AM Michael Paquier wrote: > > On Fri, Oct 27, 2023 at 10:23:34AM +0530, Bharath Rupireddy wrote: > > +1. Changed in the attached v10-001. FWIW, having a test case in > > stats.sql emitting this error message and hint would have helped here. > > If okay, I can add one. > > That may be something to do. At least it was missed on this thread, > even if we don't add a new pgstat shared type every day. Right. Adding test coverage for the error case is not bad IMV (https://coverage.postgresql.org/src/backend/utils/adt/pgstatfuncs.c.gcov.html). Here's the attached 0001 patch for that. > > PS: I'll park the SLRU flush related patch aside until the approach is > > finalized. I'm posting the pg_stat_checkpointer patch as v10-0001. > > +-- Test that reset_shared with checkpointer specified as the stats type works > +SELECT stats_reset AS checkpointer_reset_ts FROM pg_stat_checkpointer \gset > +SELECT pg_stat_reset_shared('checkpointer'); > +SELECT stats_reset > :'checkpointer_reset_ts'::timestamptz FROM > pg_stat_checkpointer; > +SELECT stats_reset AS checkpointer_reset_ts FROM pg_stat_checkpointer \gset > > Note that you have forgotten to update the test of > pg_stat_reset_shared(NULL) to check for the value of > checkpointer_reset_ts. I've added an extra SELECT to check that for > pg_stat_checkpointer, and applied v8. Oh, thanks for taking care of this. While at it, I noticed that there's no coverage for pg_stat_reset_shared('recovery_prefetch') and XLogPrefetchResetStats() https://coverage.postgresql.org/src/backend/access/transam/xlogprefetcher.c.gcov.html. Most of the recovery_prefetch code is covered with recovery/streaming related tests, but the reset stats part is missing. So, I've added coverage for it in the attached 0002.
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v1-0001-Add-error-case-test-for-pg_stat_reset_shared-with.patch Description: Binary data v1-0002-Add-test-for-resetting-recovery_prefetch-shared-s.patch Description: Binary data
Re: COPY TO (FREEZE)?
At Sun, 29 Oct 2023 15:35:02 -0400, Bruce Momjian wrote in > You are 100% correct. Updated patch attached. -errmsg("COPY force not null only available using COPY FROM"))); +errmsg("COPY force not null cannot be used with COPY TO"))); I find the term "force not null" hard to translate, especially into Japanese, as its literal translation doesn't align with the entire message. The most recent translation for it is the literal rendition of "FORCE_NOT_NULL option of COPY can only be used with COPY FROM". In short, for translation convenience, I would prefer it if "force not null" were "FORCE_NOT_NULL". regards. -- Kyotaro Horiguchi NTT Open Source Software Center