Re: Synchronizing slots from primary to standby

2023-10-30 Thread Amit Kapila
On Tue, Oct 31, 2023 at 7:16 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > 2.
> > > Currently ctx->failover is set only in the pgoutput_startup(), but not 
> > > sure it is
> > OK.
> > > Can we change the parameter in CreateDecodingContext() or similar 
> > > functions?
> > >
> > > Because IIUC it means that only slots which have pgoutput can wait. Other
> > > output plugins must understand the change and set the failover flag as well -
> > > I felt it is not good. E.g., you might miss to enable the parameter in
> > test_decoding.
> > >
> > > Regarding the two_phase parameter, setting on plugin layer is good 
> > > because it
> > > quite affects the output. As for the failover, it is not related with the
> > > content so that all of slots should be enabled.
> > >
> > > I think CreateDecodingContext or StartupDecodingContext() is the common
> > path.
> > > Or, is it the out-of-scope for now?
> >
> > Currently, the failover field is part of the options list in the
> > StartReplicationCmd. This gives some
> > level of flexibility such that only plugins that are interested in
> > this need to handle it. The options list
> > is only deparsed by plugins.  If we move it to outside of the options list,
> > this sort of changes the protocol for START_REPLICATION and will
> > impact all plugins.
> >  But I agree to your larger point that, we need to do it in such a way that
> > other plugins do not unintentionally change the 'failover' behaviour
> > of the originally created slot.
> > Maybe I can code it in such a way that, only if the failover option is
> > specified in the list of options
> > passed as part of START_REPLICATION  will it change the original slot
> > created 'failover' flag by adding
> > another flag "failover_opt_given". Plugins that set this, will be able
> > to change the failover flag of the slot,
> > while plugins that do not support this will not set this and the
> > failover flag of the created slot will remain.
> > What do you think?
>
> That may be OK, but I came up with a corner case: an external plugin may
> already have a streaming option named 'failover'. What should happen then?
> Has the option name been reserved?
>

Sorry, your question is not clear to me. Did you intend to say that
the value of the existing streaming option could be 'failover'?

-- 
With Regards,
Amit Kapila.




Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2023-10-30 Thread Ashutosh Bapat
On Fri, Sep 8, 2023 at 3:22 PM Ashutosh Bapat
 wrote:
>
> On Fri, Jul 28, 2023 at 3:16 PM Ashutosh Bapat
>  wrote:
> >
> > Hi Tom, Richard,
> >
> > On Mon, Jul 24, 2023 at 8:17 AM Richard Guo  wrote:
> > >
> > > Thanks for pushing it!
> >
> > With this fix, I saw a noticeable increase in the planner's memory
> > consumption. I was running the experiments mentioned in [1]. The reason is
> > that the Bitmapsets created by bms_union() are not freed during planning;
> > when thousands of child joins are involved, there are a large number of
> > these bitmapsets and they take up a lot of memory.
> >
> > The attached 0002 patch fixes the memory consumption by calculating the
> > appinfos only once and using them twice. The numbers look like below:
> >
> > Number of tables joined | without patch | with patch |
> > ------------------------+---------------+------------+
> >                       2 |      40.8 MiB |   40.3 MiB |
> >                       3 |     151.6 MiB |  146.9 MiB |
> >                       4 |     463.9 MiB |  445.5 MiB |
> >                       5 |    1663.9 MiB | 1563.3 MiB |
> >
> > The memory consumption is more prominent at a higher number of joins, as
> > that exponentially increases the number of child joins.
> >
> > Attached setup.sql and queries.sql and patch 0001 were used to measure
> > memory similar to [1].
> >
> > I don't think it's a problem with the patch itself. We should be able
> > to use the Bitmapset APIs similar to what the patch is doing. But there's
> > a problem with our Bitmapset implementation: it's not space efficient for
> > thousands of partitions. A quick calculation reveals this.
> >
> > If the number of partitions is 1000, the matching partitions will
> > usually be 1000, 2000, 3000 and so on. Thus the bitmapset representing
> > the relids will be {b 1000, 2000, 3000, ...}. To represent a single bit
> > near position 1000, the current Bitmapset implementation will allocate
> > 1000/8 = 125 bytes of memory. A 5-way join will require 125 * 5 = 625
> > bytes of memory. This is true even for lower relid numbers, since they
> > will be 1000 bits apart, e.g. (1, 1001, 2001, 3001, ...). So 1000 such
> > child joins require 625 KiB of memory. Doing this as many times as the
> > number of joins, we get 120 * 625 KiB = 75 MiB, which is close to the
> > memory difference we see above.
> >
> > Even if we allocate a list to hold 5 integers it will not take 625
> > bytes. I think we need to improve our Bitmapset representation to be
> > efficient in such cases. Of course whatever representation we choose
> > has to be space efficient for a small number of tables as well and
> > should gel well with our planner logic. So I guess some kind of
> > dynamic representation which changes the underlying layout based on
> > the contents is required. I have looked up past hacker threads to see
> > if this has been discussed previously.
> >
> > I don't think this is the thread to discuss it and also I am not
> > planning to work on it in the immediate future. But I thought I would
> > mention it where I found it.
> >
> > [1] 
> > https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com
> >
>
> Adding this small patch to the commitfest in case somebody finds this
> specific memory consumption worth fixing. Posted with a new subject.

Rebased patches.
0001 is the same as the squashed version of the patches at
https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1csrjcmyyj8pd...@mail.gmail.com.
0002 is the actual fix described earlier.

-- 
Best Wishes,
Ashutosh Bapat
From e46e9cfe528ecbb04b17c21bb79c55cb8e23289b Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 26 Jul 2023 12:08:55 +0530
Subject: [PATCH 2/2] Reuse child_relids bitmapset in partitionwise joinrels

Bitmapsets containing child relids consume large memory when thousands of
partitions are involved. Create them once, use multiple times and free them up
once their use is over.

Ashutosh Bapat
---
 src/backend/optimizer/path/joinrels.c | 10 ++
 src/backend/optimizer/util/relnode.c  | 18 +++---
 src/include/optimizer/pathnode.h  |  3 ++-
 3 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 015a0b3cbe..e4835c10fc 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1545,6 +1545,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		RelOptInfo *child_joinrel;
 		AppendRelInfo **appinfos;
 		int			nappinfos;
+		Bitmapset  *child_relids = NULL;
 
 		if (joinrel->partbounds_merged)
 		{
@@ -1640,9 +1641,8 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 			   child_rel2->relids);
 
 		/* Find the AppendRelInfo structures */
-		appinfos = find_appinfos_by_relids(root,
-		   bms_union(child_rel1->relids,
-	 child_rel2->relids),
+		child_relids = 

Re: A performance issue with Memoize

2023-10-30 Thread Andrei Lepikhov

On 30/10/2023 14:55, Richard Guo wrote:


On Thu, Oct 26, 2023 at 12:07 PM Andrei Lepikhov <a.lepik...@postgrespro.ru> wrote:


Have you thought about the case fixed by commit 1db5667? As I see,
that bugfix still isn't covered by regression tests. Could your
approach of reusing PARAM_EXEC slots break that case?


Hm, I don't think so.  The issue fixed by commit 1db5667 was caused by
sharing PARAM_EXEC slots between different levels of NestLoop.  AFAICS
it's safe to share PARAM_EXEC slots within the same level of NestLoop.

The change here is about sharing PARAM_EXEC slots between subquery's
subplan_params and outer-relation variables, which happens within the
same level of NestLoop.
...
Did you notice a case that the change here breaks?

Hi Tom, could you share your insights on this issue and the proposed
fix?


I think your patch works correctly so far. I mentioned commit 1db5667
because, as I see it, the origin of the problem was parallel workers. I
have thought about pushing Memoize down to a parallel worker and could
not work out whether such a solution would be correct.

Sorry if I disturbed you in vain.

--
regards,
Andrei Lepikhov
Postgres Professional





Re: Reducing memory consumed by RestrictInfo list translations in partitionwise join planning

2023-10-30 Thread Ashutosh Bapat
On Thu, Sep 14, 2023 at 4:39 PM Ashutosh Bapat
 wrote:
>
> The patch set is thus
> 0001 - patch used to measure memory used during planning
>
> 0002 - Patch to free intermediate Relids computed by
> adjust_child_relids_multilevel(). I didn't test memory consumption for
> multi-level partitioning, but this is a clear improvement. In that
> function we free the AppendRelInfo array, which is as many pointers long
> as the number of relids. So it doesn't make sense not to free the Relids,
> which can be at least {largest relid}/8 bytes long.
>
> 0003 - patch to save and reuse commuted RestrictInfo. This patch by
> itself shows a small memory saving (3%) in the query below where the
> same clause is commuted twice. The query does not contain any
> partitioned tables.
> create table t2 (a int primary key, b int, c int);
> create index t2_a_b on t2(a, b);
> select * from t2 where 10 = a
> Memory used without patch: 13696 bytes
> Memory used with patch: 13264 bytes
>
> 0004 - Patch which implements the hash table of hash table described
> above and also code to avoid repeated RestrictInfo list translations.
>
> I will add this patchset to next commitfest.
>
> --
> Best Wishes,
> Ashutosh Bapat

PFA rebased patches. Nothing changed in 0002, 0003 and 0004. 0001 is
the squashed version of the latest patch set at
https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1csrjcmyyj8pd...@mail.gmail.com.

-- 
Best Wishes,
Ashutosh Bapat
From de823e9cf0b03f21f8de51b49347df58ecb27c5a Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Thu, 31 Aug 2023 19:02:23 +0530
Subject: [PATCH 2/4] Free intermediate Relids created in
 adjust_child_relids_multilevel()

adjust_child_relids_multilevel() creates Relids for all the intermediate stages
in a partition hierarchy. These relid sets are not needed in the end. Relids
Bitmapsets take up a reasonably large amount of space when thousands of
partitions are involved. Hence, free these intermediate relid sets.

Ashutosh Bapat
---
 src/backend/optimizer/util/appendinfo.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index f456b3b0a4..4008e52de2 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -591,6 +591,8 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
 {
 	AppendRelInfo **appinfos;
 	int			nappinfos;
+	Relids		tmp_relids = NULL;
+	Relids		child_relids;
 
 	/*
 	 * If the given relids set doesn't contain any of the parent relids, it
@@ -599,13 +601,14 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
 	if (!bms_overlap(relids, parentrel->relids))
 		return relids;
 
+	tmp_relids = relids;
 	/* Recurse if immediate parent is not the top parent. */
 	if (childrel->parent != parentrel)
 	{
 		if (childrel->parent)
-			relids = adjust_child_relids_multilevel(root, relids,
-	childrel->parent,
-	parentrel);
+			tmp_relids = adjust_child_relids_multilevel(root, relids,
+		childrel->parent,
+		parentrel);
 		else
 			elog(ERROR, "childrel is not a child of parentrel");
 	}
@@ -613,11 +616,15 @@ adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
 	/* Now translate for this child. */
 	appinfos = find_appinfos_by_relids(root, childrel->relids, &nappinfos);
 
-	relids = adjust_child_relids(relids, nappinfos, appinfos);
+	child_relids = adjust_child_relids(tmp_relids, nappinfos, appinfos);
+
+	/* Free any intermediate Relids created. */
+	if (tmp_relids != relids)
+		bms_free(tmp_relids);
 
 	pfree(appinfos);
 
-	return relids;
+	return child_relids;
 }
 
 /*
-- 
2.25.1

From 69cfd8bd6ea38a99a715268423d92fa4179f7cc7 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Mon, 4 Sep 2023 14:56:21 +0530
Subject: [PATCH 3/4] Save commuted RestrictInfo for later use

A commuted RestrictInfo may be produced as many times as the number of indexes
it is used for. It's the same RestrictInfo always. Save some CPU and memory by
saving the result in the original RestrictInfo.

Ashutosh Bapat
---
 src/backend/optimizer/util/appendinfo.c   |  6 ++
 src/backend/optimizer/util/restrictinfo.c | 22 +-
 src/include/nodes/pathnodes.h |  9 +
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 4008e52de2..53e1233dce 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -492,6 +492,12 @@ adjust_appendrel_attrs_mutator(Node *node,
 		newinfo->left_mcvfreq = -1;
 		newinfo->right_mcvfreq = -1;
 
+		/*
+		 * Wipe out commuted parent RestrictInfo. The caller will compute
+		 * commuted clause if required.
+		 */
+		newinfo->comm_rinfo = NULL;
+
 		return (Node *) newinfo;
 	}
 
diff --git a/src/backend/optimizer/util/restrictinfo.c b/src/backend/optimizer/util/restrictinfo.c
index 

Re: small erreport bug over partitioned table pgrowlocks module

2023-10-30 Thread David Rowley
On Tue, 31 Oct 2023 at 13:18, David Rowley  wrote:
> Here's a patch that puts the relam check last.

I've pushed that patch.

David




Re: always use runtime checks for CRC-32C instructions

2023-10-30 Thread Nathan Bossart
On Mon, Oct 30, 2023 at 01:48:29PM -0700, Jeff Davis wrote:
> I assume you are concerned about the call going through a function
> pointer? If so, is it possible that setting a flag and then branching
> would be better?
> 
> Also, if it's a concern, should we also consider making an inlineable
> version of pg_comp_crc32c_sse42()?

I tested pg_waldump -z with 50M 65-byte records for the following
implementations on an ARM system:

 * slicing-by-8: ~3.08s
 * proposed patches applied (runtime check): ~2.44s
 * only CRC intrinsics implementation compiled: ~2.42s
 * forced inlining: ~2.38s

Avoiding the runtime check produced a 0.8% improvement, and forced inlining
produced another 1.7% improvement.  In comparison, even the runtime check
implementation produced a 20.8% improvement over the slicing-by-8 one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Not deleted mentions of the cleared path

2023-10-30 Thread Richard Guo
On Mon, Oct 30, 2023 at 7:31 PM Alena Rybakina 
wrote:

> I have already written about the problem of InvalidPath [0] appearing. I
> investigated this and found an error in the add_path() function: when we
> reject a path, we free the path's memory but do not delete various
> references to it (for example, in the relation's ancestor, as in the
> example below).
>

I agree that what you observed is true - add_path() may free a path
while it's still referenced from some lower rels.  For instance, when
creating ordered paths, we may use the input path unchanged without
copying if it's already well ordered, and it might be freed afterwards
if it fails when competing in add_path().

But this doesn't seem to be a problem in practice.  We will not access
these references from the lower rels.

I'm not sure if this is an issue that we need to fix, or one we need to
live with.  But I do think it deserves some explanation in the comment of
add_path().

Thanks
Richard


Re: Is this a problem in GenericXLogFinish()?

2023-10-30 Thread Amit Kapila
On Mon, Oct 30, 2023 at 7:13 PM Robert Haas  wrote:
>
> On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila  wrote:
> > > Hmm.  So my question is: do we need the cleanup lock on the write
> > > buffer even if there are no tuples, and even if primary bucket and the
> > > write bucket are the same?
> >
> > Yes, we need it to exclude any concurrent in-progress scans that could
> > return incorrect tuples during bucket squeeze operation.
>
> Amit, thanks for weighing in, but I'm not convinced. I thought we only
> ever used a cleanup lock on the main bucket page to guard against
> concurrent scans.
>

My understanding is the same. It is possible that my wording was not
clear. Let me try to clarify again. Michael asked: "do we need the
cleanup lock on the write buffer even if there are no tuples, and even
if primary bucket and the write bucket are the same?" My reading of
his question was: do we need a cleanup lock even if the primary bucket
and write bucket are the same, which means the write bucket will be the
first page in the chain and we need a cleanup lock on it. I think the
second condition (no tuples) is irrelevant here, as we need a cleanup
lock whether it is true or false.

>
 Here you seem to be saying that we need a cleanup
> lock on some page that may be an overflow page somewhere in the middle
> of the chain, and that doesn't seem right to me.
>

Sorry, I don't intend to say this.


--
With Regards,
Amit Kapila.




Re: Add new option 'all' to pg_stat_reset_shared()

2023-10-30 Thread Kyotaro Horiguchi
At Mon, 30 Oct 2023 14:15:53 +0530, Bharath Rupireddy 
 wrote in 
> > > I imagine there are cases where people want to initialize all of them
> > > at the same time in addition to initializing one at a time.
> >
> > FWIW, I fairly often wanted it, but forgot about that:p
> 
> Isn't calling pg_stat_reset_shared() for all stats types helping you
> out? Is there any problem with it? Can you be more specific about the
> use-case?

Oh. Sorry, it's my bad. pg_stat_reset_shared() is sufficient.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: COPY TO (FREEZE)?

2023-10-30 Thread Zhang Mingli
HI,


Zhang Mingli
www.hashdata.xyz
On Oct 31, 2023 at 03:55 +0800, Bruce Momjian wrote:
>
> Sure, updated patch attached.

LGTM.


RE: Synchronizing slots from primary to standby

2023-10-30 Thread Hayato Kuroda (Fujitsu)
Dear Ajin,

Thanks for your reply!

> On Thu, Oct 26, 2023 at 6:08 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Shveta,
> >
> > > PFA v25 patch set. The changes are:
> >
> > Thanks for making the patch! It seems that there are lots of comments, so
> > I can put some high-level comments for 0001.
> > Sorry if there are duplicated comments.
> >
> > 1.
> > The patch seemed not to consider the case that failover option between
> replication
> > slot and subscription were different. Currently slot option will be 
> > overwritten
> > by subscription one.
> >
> > Actually, I'm not sure what specification is better. Regarding the 
> > two_phase,
> > 2PC will be decoded only when the both of settings are true. Should we 
> > follow?
> >
> 
> But this is the intention, we want the Alter subscription to be able
> to change the failover behaviour
> of the slot.

I had not understood how two_phase is enabled. I found that slot->data.two_phase
is overwritten in CreateDecodingContext(), so the failover option now follows
two_phase, right?
(I think the overwriting of data.failover should also be done in
CreateDecodingContext().)

> > 2.
> > Currently ctx->failover is set only in the pgoutput_startup(), but not sure 
> > it is
> OK.
> > Can we change the parameter in CreateDecodingContext() or similar functions?
> >
> > Because IIUC it means that only slots which have pgoutput can wait. Other
> > output plugins must understand the change and set the failover flag as well -
> > I felt it is not good. E.g., you might miss to enable the parameter in
> test_decoding.
> >
> > Regarding the two_phase parameter, setting on plugin layer is good because 
> > it
> > quite affects the output. As for the failover, it is not related with the
> > content so that all of slots should be enabled.
> >
> > I think CreateDecodingContext or StartupDecodingContext() is the common
> path.
> > Or, is it the out-of-scope for now?
> 
> Currently, the failover field is part of the options list in the
> StartReplicationCmd. This gives some
> level of flexibility such that only plugins that are interested in
> this need to handle it. The options list
> is only deparsed by plugins.  If we move it to outside of the options list,
> this sort of changes the protocol for START_REPLICATION and will
> impact all plugins.
>  But I agree to your larger point that, we need to do it in such a way that
> other plugins do not unintentionally change the 'failover' behaviour
> of the originally created slot.
> Maybe I can code it in such a way that, only if the failover option is
> specified in the list of options
> passed as part of START_REPLICATION  will it change the original slot
> created 'failover' flag by adding
> another flag "failover_opt_given". Plugins that set this, will be able
> to change the failover flag of the slot,
> while plugins that do not support this will not set this and the
> failover flag of the created slot will remain.
> What do you think?

That may be OK, but I came up with a corner case: an external plugin may
already have a streaming option named 'failover'. What should happen then?
Has the option name been reserved?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: "38.10.10. Shared Memory and LWLocks" may require a clarification

2023-10-30 Thread Michael Paquier
On Mon, Oct 30, 2023 at 04:10:17PM +0300, Aleksander Alekseev wrote:
> I didn't like the commit message. Here is the corrected patch. Sorry
> for the noise.

Sounds pretty much OK to me.  Thanks!
--
Michael




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-30 Thread Michael Paquier
On Mon, Oct 30, 2023 at 12:47:41PM -0700, Andres Freund wrote:
> I think the problem with these variables is that they're a really messy state
> machine - something this patch doesn't meaningfully improve IMO.

Okay.  Yes, this is my root issue as well.  We're at the stage where
we should reduce the possible set of combinations and assumptions
we're inventing because people can do undocumented stuff, then perhaps
refactor the code on top of that (say, if one combination with too
booleans is not possible, switch to a three-state enum rather than 2
bools, etc).

>> This configuration was possible when recovering from a base backup taken
>> by pg_basebackup without -R.  Note that the documentation requires at
>> least to set recovery.signal to restore from a backup, but the startup
>> process was not making this policy explicit.
> 
> Maybe I just didn't check the right place, but from I saw, this, at most, is
> implied, rather than explicitly stated.

See the doc reference here:
https://www.postgresql.org/message-id/zubm6bnqneh7l...@paquier.xyz

So it kind of implies it, still also mentions restore_command.  It's
like Schrödinger's cat, yes and no at the same time.

> With -X ... we have all the necessary WAL locally, how does the workload on
> the primary matter? If you pass --no-slot, pg_basebackup might fail to fetch
> the necessary wal, but then you'd also have gotten an error.
>
> [...]
> 
> Right now running pg_basebackup with -X stream, without --write-recovery-conf,
> gives you a copy of a cluster that will come up correctly as a distinct
> instance.
>
> [...]
> 
> I also just don't think that it's always desirable to create a new timeline.

Yeah.  Another argument I was mentioning to Robert is that we may want
to treat the case where you have a backup_label without any signal
files just the same as crash recovery, replaying all the local
pg_wal/, and nothing else.  For example, something like the attached
should make sure that InArchiveRecovery is never set to true if
ArchiveRecoveryRequested is not set.

The attached would still cause redo to complain on a "WAL ends before
end of online backup" if not all the WAL is here (reason behind the
tweak of 010_pg_basebackup.pl, but the previous tweak to pg_rewind's
008_min_recovery_point.pl is not required here).

Attached is the idea I had in mind, in terms of code, FWIW.
--
Michael
From fbd9fa407c04799ad4401d3ba5c6b67a0d922631 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 31 Oct 2023 10:14:01 +0900
Subject: [PATCH] Force crash recovery with backup_label and no .signal files

---
 src/backend/access/transam/xlogrecovery.c| 29 
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |  3 +-
 doc/src/sgml/backup.sgml |  4 ++-
 3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156a..de5787d7e8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -133,6 +133,8 @@ static TimeLineID curFileTLI;
  * currently performing crash recovery using only XLOG files in pg_wal, but
  * will switch to using offline XLOG archives as soon as we reach the end of
  * WAL in pg_wal.
+ *
+ * InArchiveRecovery should never be set without ArchiveRecoveryRequested.
  */
 bool		ArchiveRecoveryRequested = false;
 bool		InArchiveRecovery = false;
@@ -595,13 +597,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		List	   *tablespaces = NIL;
 
 		/*
-		 * Archive recovery was requested, and thanks to the backup label
-		 * file, we know how far we need to replay to reach consistency. Enter
-		 * archive recovery directly.
+		 * If archive recovery was requested, and we know how far we need to
+		 * replay to reach consistency thanks to the backup label file, then
+		 * enter archive recovery directly in this case.
+		 *
+		 * If archive recovery was not requested, then do crash recovery and
+		 * replay all the local WAL.  This still checks that all the WAL up
+		 * to backupEndRequired has been replayed.  This case is useful when
+		 * restoring from a standalone base backup, taken with pg_basebackup
+		 * --wal-method=stream, for example.
 		 */
-		InArchiveRecovery = true;
-		if (StandbyModeRequested)
-			EnableStandbyMode();
+		if (ArchiveRecoveryRequested)
+		{
+			InArchiveRecovery = true;
+			if (StandbyModeRequested)
+				EnableStandbyMode();
+		}
 
 		/*
 		 * When a backup_label file is present, we want to roll forward from
@@ -1591,6 +1602,12 @@ ShutdownWalRecovery(void)
 	 */
 	if (ArchiveRecoveryRequested)
 		DisownLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
+
+	/*
+	 * InArchiveRecovery should never have been set without
+	 * ArchiveRecoveryRequested.
+	 */
+	Assert(ArchiveRecoveryRequested || !InArchiveRecovery);
 }
 
 /*
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 

Re: Synchronizing slots from primary to standby

2023-10-30 Thread Ajin Cherian
On Thu, Oct 26, 2023 at 6:08 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shveta,
>
> > PFA v25 patch set. The changes are:
>
> Thanks for making the patch! It seems that there are lots of comments, so
> I can put some high-level comments for 0001.
> Sorry if there are duplicated comments.
>
> 1.
> The patch seemed not to consider the case where the failover option of the
> replication slot and of the subscription differ. Currently the slot option
> will be overwritten by the subscription's.
>
> Actually, I'm not sure which specification is better. Regarding two_phase,
> 2PC will be decoded only when both of the settings are true. Should we
> follow that?
>

But this is the intention, we want the Alter subscription to be able
to change the failover behaviour
of the slot.

> 2.
> Currently ctx->failover is set only in the pgoutput_startup(), but not sure 
> it is OK.
> Can we change the parameter in CreateDecodingContext() or similar functions?
>
> Because IIUC it means that only slots which use pgoutput can wait. Other
> output plugins must understand the change and set the failover flag as
> well - I don't think that is good. E.g., one might forget to enable the
> parameter in test_decoding.
>
> Regarding the two_phase parameter, setting it at the plugin layer is good
> because it very much affects the output. As for failover, it is not related
> to the content, so all slots should be able to enable it.
>
> I think CreateDecodingContext or StartupDecodingContext() is the common path.
> Or, is it the out-of-scope for now?

Currently, the failover field is part of the options list in
StartReplicationCmd. This gives some level of flexibility, such that only
plugins that are interested in it need to handle it; the options list is
only parsed by the plugins. If we move it outside of the options list,
that effectively changes the protocol for START_REPLICATION and will
impact all plugins.

But I agree with your larger point that we need to do it in such a way
that other plugins do not unintentionally change the 'failover' behaviour
of the originally created slot. Maybe I can code it so that it changes the
'failover' flag of the originally created slot only if the failover option
is specified in the list of options passed as part of START_REPLICATION,
by adding another flag "failover_opt_given". Plugins that set this will be
able to change the failover flag of the slot, while plugins that do not
support this will not set it, and the failover flag of the created slot
will remain unchanged.
What do you think?

regards,
Ajin Cherian
Fujitsu Australia




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-30 Thread Michael Paquier
On Mon, Oct 30, 2023 at 10:32:28AM -0600, Roberto Mello wrote:
> I realize the original use of "touch" is a valid shortcut for what I
> suggest above, however that will be less clear for the not-so-un*x-inclined
> users of Postgres, while for some it'll be downright confusing, IMHO. It
> also provides the advantage of being crystal clear on what needs to be done
> to fix the problem.

Indeed, "touch" may be better in this path if we'd throw an ERROR to
enforce a given policy, and that's more consistent with the rest of
the area.
--
Michael




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-30 Thread Michael Paquier
On Mon, Oct 30, 2023 at 01:55:13PM -0400, Robert Haas wrote:
> I would encourage some caution here.

Thanks for chiming here.

> In a vacuum, I'm in favor of this, and for the same reasons as you,
> namely, that the huge pile of Booleans that we use to control recovery
> is confusing, and it's difficult to make sure that all the code paths
> are adequately tested, and I think some of the things that actually
> work here are not documented.

Yep, same feeling here.

> But in practice, I think there is a possibility of something like this
> backfiring very hard. Notice that the first two people who commented
> on the thread saw the error and immediately removed backup_label even
> though that's 100% wrong. It shows how utterly willing users are to
> remove backup_label for any reason or no reason at all. If we convert
> cases where things would have worked into cases where people nuke
> backup_label and then it appears to work, we're going to be worse off
> in the long run, no matter how crazy the idea of removing backup_label
> may seem to us.

As far as I know, there's one paragraph in the docs that implies this
mode without giving an actual hint that this may be OK or not, so
shrug:
https://www.postgresql.org/docs/devel/continuous-archiving.html#BACKUP-TIPS
"As with base backups, the easiest way to produce a standalone hot
backup is to use the pg_basebackup tool. If you include the -X
parameter when calling it, all the write-ahead log required to use the
backup will be included in the backup automatically, and no special
action is required to restore the backup."

And a few lines down we imply the use of restore_command, something
that we check is set only if recovery.signal is set.  See additionally
validateRecoveryParameters(), where the comments imply that
InArchiveRecovery would be set only when there's a restore command.

As you're telling me, and I've considered that as an option as well,
perhaps we should just consider the presence of a backup_label file
with no .signal files as a synonym of crash recovery?  In the recovery
path, currently the essence of the problem is when we do
InArchiveRecovery=true, but ArchiveRecoveryRequested=false, meaning
that it should do archive recovery but we don't want it, and that does
not really make sense.  The rest of the code sort of implies that this
is not a supported combination.  So basically, my suggestion here is
to just replay WAL up to the end of what's in your local pg_wal/ and
hope for the best, without TLI jumps, except that we'd do nothing.
Doing a pg_basebackup -X stream followed by a restart would work fine
with that, because all the WAL is here.

A point of contention is if we'd better be stricter about satisfying
backupEndPoint in such a case, but the redo code only wants to do
something here when ArchiveRecoveryRequested is set (aka there's a
.signal file set), and we would not want a TLI jump at the end of
recovery, so I don't see an argument for caring about backupEndPoint
in this case.

At the end, I'm OK as long as ArchiveRecoveryRequested=false
InArchiveRecovery=true does not exist anymore, because it's much
easier to get what's going on with the redo path, IMHO.

(I have a patch at hand to show the idea, will post it with a reply to
Andres' message.)

> Also, Andres just recently mentioned to me that he uses this procedure
> of starting a server with a backup_label but no recovery.signal or
> standby.signal file regularly, and thinks other people do too. I was
> surprised, since I've never done that, except maybe when I was a noob
> and didn't have a clue. But Andres is far from a noob.

At this stage, that's basically at your own risk, as the code thinks
it's OK to force what is essentially archive-recovery-without-being-it.
So it basically works, but it can also easily backfire.
--
Michael




Re: small erreport bug over partitioned table pgrowlocks module

2023-10-30 Thread David Rowley
On Tue, 31 Oct 2023 at 13:00, jian he  wrote:
> BEGIN;
> CREATE TABLE fk_parted_pk (a int PRIMARY KEY) PARTITION BY LIST (a);
> SELECT * FROM pgrowlocks('fk_parted_pk');
> ERROR:  only heap AM is supported
>
> error should be the following part:
> if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> ereport(ERROR,
> (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> errmsg("\"%s\" is a partitioned table",
> RelationGetRelationName(rel)),
> errdetail("Partitioned tables do not contain rows.")));

Yeah.  Seems that 4b8266415 didn't look closely enough at the other
error messages and mistakenly put the relam check first instead of
last.

Here's a patch that puts the relam check last.

David
diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index a04e187ec4..adbc8279c3 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -81,10 +81,6 @@ pgrowlocks(PG_FUNCTION_ARGS)
relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
rel = relation_openrv(relrv, AccessShareLock);
 
-   if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
-   ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-   errmsg("only heap AM is supported")));
-
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
@@ -96,6 +92,10 @@ pgrowlocks(PG_FUNCTION_ARGS)
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 errmsg("\"%s\" is not a table",
 RelationGetRelationName(rel))));
+   else if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("only heap AM is supported")));
 
/*
 * check permissions: must have SELECT on table or be in


Re: Infinite Interval

2023-10-30 Thread jian he
On Mon, Oct 30, 2023 at 6:01 PM Ashutosh Bapat
 wrote:
>
>
> Here's my version of commit message
>
> ```
> Support Infinite interval values
>
> Interval datatype uses the same input and output representation for
> infinite intervals as other datatypes representing time that support
> infinity. An interval larger than any other interval is represented by
> the string literal 'infinity' or '+infinity'. An interval which is smaller
> than any other interval is represented as '-infinity'. Internally
> positive infinity is represented by setting each member of the
> Interval data structure to the maximum value its type supports, and
> negative infinity by setting each member to its minimum. INTERVAL_NOBEGIN
> and INTERVAL_NOEND macros can be used to set an Interval structure to
> negative and positive infinity respectively. INTERVAL_IS_NOBEGIN and
> INTERVAL_IS_NOEND macros are used to test respective values.
> INTERVAL_NOT_FINITE macro is used to test whether a given Interval
> value is infinite.
>
> Implementation of all known operators now handles infinite interval
> values along with operations related to BRIN index, windowing and
> selectivity. Regression tests are added to test these implementations.
>
> If a user has stored interval values '-2147483648 months -2147483648
> days -9223372036854775807 us' and '2147483647 months 2147483647 days
> 9223372036854775806 us' in PostgreSQL versions 16 or earlier, those
> values will turn into '-infinity' and 'infinity' respectively after
> upgrading to v17. These values are outside the documented range
> supported by the interval datatype, so there's almost no possibility
> of this occurring. But it will be good to watch for these values
> during an upgrade.
> ```
>
the message is plain enough. I can understand it. thanks!




small erreport bug over partitioned table pgrowlocks module

2023-10-30 Thread jian he
hi.
ereport bug over a partitioned table in pgrowlocks.

BEGIN;
CREATE TABLE fk_parted_pk (a int PRIMARY KEY) PARTITION BY LIST (a);
SELECT * FROM pgrowlocks('fk_parted_pk');
ERROR:  only heap AM is supported

error should be the following part:
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("\"%s\" is a partitioned table",
RelationGetRelationName(rel)),
errdetail("Partitioned tables do not contain rows.")));




Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}

2023-10-30 Thread Michael Paquier
On Mon, Oct 30, 2023 at 03:14:16PM +0100, hubert depesz lubaczewski wrote:
> some things will definitely break, but that's 100% OK. The change seems
> needed, and I can update my parser to deal with it :)

Thanks for the input.  I was looking yesterday if this code was
available somewhere, but couldn't find it..  Until this morning:
https://gitlab.com/depesz/explain.depesz.com.git

And..  It looks like things would become better if we change
"shared/local" to "shared", because the parsing code seems to have an
issue once you add a '/'.  All the fields in I/O Timings are
considered as part of a Node, and they're just included in the output.
Now, pasting a plan that includes "shared/local" drops entirely the
string from the output result, so some information is lost.  In short,
imagine that we have the following string in a node:
I/O Timings: shared/local write=23.77

This would show up like that, meaning that the context where the
write/read timings happened is lost:
I/O Timings: write=23.77

If we switch back to "shared", the context would be kept around.  Of
course, this does not account for all the parsers that may be out
there, but at least that's something.
--
Michael




Re: Introduce a new view for checkpointer related stats

2023-10-30 Thread Michael Paquier
On Mon, Oct 30, 2023 at 11:59:10AM +0530, Bharath Rupireddy wrote:
> Oh, thanks for taking care of this. While at it, I noticed that
> there's no coverage for pg_stat_reset_shared('recovery_prefetch') and
> XLogPrefetchResetStats()
> https://coverage.postgresql.org/src/backend/access/transam/xlogprefetcher.c.gcov.html.
> Most of the recovery_prefetch code is covered with recovery/streaming
> related tests, but the reset stats part is missing. So, I've added
> coverage for it in the attached 0002.

Indeed, good catch.  I didn't notice the hole in the coverage reports.
I have merged 0001 and 0002 together, and applied them as of
5b2147d9fcc1.
--
Michael




Re: Why is DEFAULT_FDW_TUPLE_COST so insanely low?

2023-10-30 Thread David Rowley
On Tue, 31 Oct 2023 at 03:01, Bruce Momjian  wrote:
> I think you just go and change it.  Your number is better than what we
> have, and if someone wants to suggest a better number, we can change it
> later.

I did some more experimentation on the actual costs of getting a tuple
from a foreign server.

Using the attached setup, I did:

postgres=# explain (analyze, timing off) SELECT * FROM t;
 QUERY PLAN

 Seq Scan on t  (cost=0.00..144248.48 rows=1048 width=4) (actual
rows=1000 loops=1)
 Planning Time: 0.077 ms
 Execution Time: 385.978 ms

postgres=# explain (analyze, timing off) SELECT * FROM ft;
QUERY PLAN
---
 Foreign Scan on ft  (cost=100.00..244348.96 rows=1048 width=4)
(actual rows=1000 loops=1)
 Planning Time: 0.126 ms
 Execution Time: 8335.392 ms

So, let's take the first query and figure out the total cost per
millisecond of execution time. We can then multiply that by the
execution time of the 2nd query to calculate what we might expect the
costs to be for the foreign table scan based on how long it took
compared to the local table scan.

postgres=# select 144248.48/385.978*8335.392;
  ?column?
-
 3115119.5824740270171341280

So, the above number is what we expect the foreign table scan to cost
with the assumption that the cost per millisecond is about right for
the local scan.  We can then work out how much we'll need to charge
for a foreign tuple by subtracting the total cost of that query from
our calculated value, giving the extra amount we need to charge in
total, and then dividing that by the number of tuples to get the
actual foreign tuple cost for this query.

postgres=# select (3115119.58-244348.96)/1000;
?column?

 0.287077062000

This is on an AMD 3990x running Linux 6.5 kernel.  I tried the same on
an Apple M2 mini and got:

postgres=# select 144247.77/257.763*3052.084;
  ?column?
-
 1707988.7759402241595200680

postgres=# select (1707988.78-244347.54)/1000;
?column?

 0.146364124000

So the actual foreign tuple cost on the M2 seems about half of what it
is on the Zen 2 machine.

Based on this, I agree with my original analysis that setting
DEFAULT_FDW_TUPLE_COST to 0.2 is about right. Of course, this is a
loopback onto localhost so remote networks likely would benefit from
higher values, but based on this 0.01 is far too low and we should
change it to at least 0.2.

I'd be happy if anyone else would like to try the same experiment to
see if there's some other value of DEFAULT_FDW_TUPLE_COST that might
suit better.

David


setup_fdw.sql
Description: Binary data


Re: Trigger violates foreign key constraint

2023-10-30 Thread David G. Johnston
On Mon, Oct 30, 2023 at 2:50 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Tue, Oct 3, 2023 at 12:52 AM Laurenz Albe 
> wrote:
>
>> On Mon, 2023-10-02 at 09:49 -0400, Tom Lane wrote:
>> > This is by design: triggers operate at a lower level than
>> > foreign keys, so an ill-conceived trigger can break an FK constraint.
>> > That's documented somewhere, though maybe not visibly enough.
>>
>> Not having found any documentation, I propose the attached caution.
>>
>>
> I dislike scaring the user like this without providing any context on what
> conditions or actions are problematic.
>
> The ON DELETE and ON UPDATE clauses of foreign keys are implemented as
> system triggers on the referenced table that invoke additional delete or
> update commands on the referencing table.  The final outcome of these
> additional commands is not checked - it is the responsibility of the DBA
> to ensure that the user triggers on the referencing table actually remove
> the rows they are requested to remove, or update to NULL any referencing
> foreign key columns.  In particular, before row triggers that return NULL
> will prevent the delete/update from occurring and thus result in a violated
> foreign key constraint.
>
> Add sgml as needed, note the original patch missed adding ""
> to PostgreSQL.
>
>
Additionally, the existing place this is covered is here:

"""
Trigger functions invoked by per-statement triggers should always return
NULL. Trigger functions invoked by per-row triggers can return a table row
(a value of type HeapTuple) to the calling executor, if they choose. A
row-level trigger fired before an operation has the following choices:

It can return NULL to skip the operation for the current row. This
instructs the executor to not perform the row-level operation that invoked
the trigger (the insertion, modification, or deletion of a particular table
row).

For row-level INSERT and UPDATE triggers only, the returned row becomes the
row that will be inserted or will replace the row being updated. This
allows the trigger function to modify the row being inserted or updated.

A row-level BEFORE trigger that does not intend to cause either of these
behaviors must be careful to return as its result the same row that was
passed in (that is, the NEW row for INSERT and UPDATE triggers, the OLD row
for DELETE triggers).
"""

We should probably add a note pointing back to the DDL chapter that
more concisely says:

"Note: If this table also contains any foreign key constraints with on
update or on delete clauses, then a failure to return the same row that was
passed in for update and delete triggers is going to result in broken
referential integrity for the affected row."

I do like "broken referential integrity" from the original patch over
"violated foreign key constraint" - so that needs to be substituted in for
the final part of my earlier proposal if we go with its more detailed
wording.  My issue with "violated" is that it sounds like the system is
going to catch it at the end - broken doesn't have the same implication.
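For the record, the failure mode the proposed note warns about can be
reproduced in a few lines of SQL (all object names invented; a minimal
sketch of the hazard, not taken from the patch):

```sql
CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child  (parent_id int REFERENCES parent ON DELETE CASCADE);

-- Returning NULL from a BEFORE ROW trigger skips the row operation.
CREATE FUNCTION skip_delete() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RETURN NULL; END $$;

CREATE TRIGGER child_before_delete BEFORE DELETE ON child
    FOR EACH ROW EXECUTE FUNCTION skip_delete();

INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (1);

DELETE FROM parent WHERE id = 1;
-- The cascaded DELETE on child is silently skipped by the trigger:
-- child still holds parent_id = 1 while parent is now empty, so
-- referential integrity is broken without any error being raised.
```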

David J.


Re: PGDOCS - add more links in the pub/sub reference pages

2023-10-30 Thread Peter Smith
Thanks for pushing!

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Trigger violates foreign key constraint

2023-10-30 Thread David G. Johnston
On Tue, Oct 3, 2023 at 12:52 AM Laurenz Albe 
wrote:

> On Mon, 2023-10-02 at 09:49 -0400, Tom Lane wrote:
> > This is by design: triggers operate at a lower level than
> > foreign keys, so an ill-conceived trigger can break an FK constraint.
> > That's documented somewhere, though maybe not visibly enough.
>
> Not having found any documentation, I propose the attached caution.
>
>
I dislike scaring the user like this without providing any context on what
conditions or actions are problematic.

The ON DELETE and ON UPDATE clauses of foreign keys are implemented as
system triggers on the referenced table that invoke additional delete or
update commands on the referencing table.  The final outcome of these
additional commands is not checked - it is the responsibility of the DBA
to ensure that the user triggers on the referencing table actually remove
the rows they are requested to remove, or update to NULL any referencing
foreign key columns.  In particular, before row triggers that return NULL
will prevent the delete/update from occurring and thus result in a violated
foreign key constraint.

Add sgml as needed, note the original patch missed adding ""
to PostgreSQL.

David J.


Re: Trigger violates foreign key constraint

2023-10-30 Thread shihao zhong
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:tested, passed

It seems like people have been talking about this problem since 2010 
(https://stackoverflow.com/questions/3961825/foreign-keys-in-postgresql-can-be-violated-by-trigger),
 and we finally got our act together and put it in the official document today, 
only 13 years later.

The new status of this patch is: Ready for Committer


Re: always use runtime checks for CRC-32C instructions

2023-10-30 Thread Nathan Bossart
On Mon, Oct 30, 2023 at 12:39:23PM -0400, Tom Lane wrote:
> On the one hand, I agree that we need to keep the complexity from
> getting out of hand.  On the other hand, I wonder if this approach
> isn't optimizing for the wrong case.  How many machines that PG 17
> will ever be run on in production will lack SSE 4.2 (for Intel)
> or ARMv8 instructions (on the ARM side)?

For the CRC instructions in use today, I wouldn't be surprised if that
number is pretty small, but for newer or optional instructions (like ARM's
PMULL), I don't think we'll be so lucky.  Even if we do feel comfortable
assuming the presence of SSE 4.2, etc., we'll likely still need to add
runtime checks for future optimizations.

> It seems like a shame
> to be burdening these instructions with a subroutine call for the
> benefit of long-obsolete hardware versions.  Maybe that overhead
> is negligible, but it doesn't sound like you tried to measure it.

When I went to measure this, I noticed that my relatively new x86 machine
with a relatively new version of gcc uses the runtime check.  I then
skimmed through a few dozen buildfarm machines and found that, of all x86
and ARM machines that supported the specialized CRC instructions, only one
ARM machine did not use the runtime check.  Of course, this is far from a
scientific data point, but it seems to indicate that the runtime check is
the norm.

(I still need to measure it.)

> Anyway, I agree that the cost of a one-time-per-process probe should
> be negligible; it's the per-use cost that I worry about.  If you can
> do some measurements proving that that worry is ill-founded, then
> I'm good with test-first.

Will do.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-10-30 Thread David Rowley
On Mon, 30 Oct 2023 at 23:48, Amit Kapila  wrote:
>
> On Fri, Oct 27, 2023 at 3:23 AM David Rowley  wrote:
> > * parallel.c in HandleParallelMessages():
> > * applyparallelworker.c in HandleParallelApplyMessages():
>
> Both the above calls are used to handle ERROR/NOTICE messages from
> parallel workers as you have also noticed. The comment atop
> initReadOnlyStringInfo() clearly states that it is used in the
> performance-critical path. So, is it worth changing these places? In
> the future, this may pose the risk of this API being used
> inconsistently.

I'm ok to leave those ones out.  But just a note on the performance
side, if we go around needlessly doing palloc/memcpy then we'll be
flushing possibly useful cachelines out and cause slowdowns elsewhere.
That's a pretty hard thing to quantify, but something to keep in mind.

David




Re: always use runtime checks for CRC-32C instructions

2023-10-30 Thread Jeff Davis
On Mon, 2023-10-30 at 12:39 -0400, Tom Lane wrote:
> It seems like a shame
> to be burdening these instructions with a subroutine call for the
> benefit of long-obsolete hardware versions.

It's already doing a call to pg_comp_crc32c_sse42() regardless, right?

I assume you are concerned about the call going through a function
pointer? If so, is it possible that setting a flag and then branching
would be better?

Also, if it's a concern, should we also consider making an inlineable
version of pg_comp_crc32c_sse42()?

Regards,
Jeff Davis





Re: MERGE ... RETURNING

2023-10-30 Thread Jeff Davis
On Fri, 2023-10-27 at 15:46 +0100, Dean Rasheed wrote:
> 
> One fiddly part is resolving the shift/reduce conflicts in the
> grammar. Specifically, on seeing "RETURNING expr when ...", there is
> ambiguity over whether the "when" is a column alias or the start of
> the next merge action. I've resolved that by assigning a slightly
> higher precedence to an expression without an alias, so WHEN is
> assumed to not be an alias. It seems pretty ugly though (in terms of
> having to duplicate so much code), and I'd be interested to know if
> there's a neater way to do it.

Can someone else comment on whether this is a reasonable solution to
the grammar problem?

> From a usability perspective, I'm still somewhat sceptical about this
> approach. It's a much more verbose syntax, and it gets quite tedious
> having to repeat the RETURNING list for every action and keeping them
> in sync.

If we go with the single RETURNING-clause-at-the-end approach, how
important is it that the action can be a part of an arbitrary
expression?

Perhaps something closer to your original proposal would be a good
compromise (sorry to backtrack yet again...)? It couldn't be used in an
arbitrary expression, but that also means that it couldn't end up in
the wrong kind of expression.

Regards,
Jeff Davis





Adding ordering to list of available extensions

2023-10-30 Thread Greg Sabino Mullane
Please find attached a patch to provide some basic ordering to the system
views pg_available_extensions and pg_available_extension_versions. It is
sorely tempting to add ORDER BYs to many of the other views in that file,
but I understand that would be contentious as there are reasons for not
adding an ORDER BY. However, in the case of pg_available_extensions, it's a
very, very small resultset, with an obvious default ordering, and extremely
unlikely to be a part of a larger complex query. It's much more likely
people like myself are just doing a "SELECT * FROM pg_available_extensions"
and then getting annoyed at the random ordering.
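In the meantime, the stable ordering can of course be requested
explicitly (a sketch using the view's existing columns):

```sql
SELECT name, default_version, installed_version
FROM pg_available_extensions
ORDER BY name;
```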

Cheers,
Greg


show_extensions_in_natural_order.pg.patch
Description: Binary data


Re: COPY TO (FREEZE)?

2023-10-30 Thread Bruce Momjian
On Mon, Oct 30, 2023 at 02:29:05PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > That is a good point.  I reviewed more of the messages and added
> > capitalization where appropriate, patch attached.
> 
> This is starting to look pretty good.  I have one more thought,
> as long as we're touching all these messages anyway: how about
> s/FOO available only in CSV mode/FOO requires CSV mode/ ?
> That's both shorter and less telegraphic, as it's not omitting the verb.

Sure, updated patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index d12ba96497..18ecc69c33 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -224,6 +224,7 @@ COPY { table_name [ ( COPY FREEZE on
   a partitioned table.
+  This option is only allowed in COPY FROM.
  
  
   Note that all other sessions will immediately be able to see the data
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c5d7d78645..cfad47b562 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -671,7 +671,7 @@ ProcessCopyOptions(ParseState *pstate,
 	if (!opts_out->csv_mode && opts_out->quote != NULL)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY quote available only in CSV mode")));
+ errmsg("COPY QUOTE requires CSV mode")));
 
 	if (opts_out->csv_mode && strlen(opts_out->quote) != 1)
 		ereport(ERROR,
@@ -687,7 +687,7 @@ ProcessCopyOptions(ParseState *pstate,
 	if (!opts_out->csv_mode && opts_out->escape != NULL)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY escape available only in CSV mode")));
+ errmsg("COPY ESCAPE requires CSV mode")));
 
 	if (opts_out->csv_mode && strlen(opts_out->escape) != 1)
 		ereport(ERROR,
@@ -698,46 +698,52 @@ ProcessCopyOptions(ParseState *pstate,
 	if (!opts_out->csv_mode && (opts_out->force_quote || opts_out->force_quote_all))
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force quote available only in CSV mode")));
+ errmsg("COPY FORCE_QUOTE requires CSV mode")));
 	if ((opts_out->force_quote || opts_out->force_quote_all) && is_from)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force quote only available using COPY TO")));
+ errmsg("COPY FORCE_QUOTE cannot be used with COPY FROM")));
 
 	/* Check force_notnull */
 	if (!opts_out->csv_mode && opts_out->force_notnull != NIL)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force not null available only in CSV mode")));
+ errmsg("COPY FORCE_NOT_NULL requires CSV mode")));
 	if (opts_out->force_notnull != NIL && !is_from)
 		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force not null only available using COPY FROM")));
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY FORCE_NOT_NULL cannot be used with COPY TO")));
 
 	/* Check force_null */
 	if (!opts_out->csv_mode && opts_out->force_null != NIL)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force null available only in CSV mode")));
+ errmsg("COPY FORCE_NULL requires CSV mode")));
 
 	if (opts_out->force_null != NIL && !is_from)
 		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force null only available using COPY FROM")));
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY FORCE_NULL cannot be used with COPY TO")));
 
 	/* Don't allow the delimiter to appear in the null string. */
 	if (strchr(opts_out->null_print, opts_out->delim[0]) != NULL)
 		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY delimiter must not appear in the NULL specification")));
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY delimiter character must not appear in the NULL specification")));
 
 	/* Don't allow the CSV quote char to appear in the null string. */
 	if (opts_out->csv_mode &&
 		strchr(opts_out->null_print, opts_out->quote[0]) != NULL)
 		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("CSV quote character must not appear in the NULL specification")));
 
+	/* Check freeze */
+	if (opts_out->freeze && !is_from)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY FREEZE cannot be used with COPY TO")));
+
 	if (opts_out->default_print)
 	{
 		if (!is_from)
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 95ec7363af..c4178b9c07 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -83,17 +83,17 @@ ERROR:  cannot specify DELIMITER in BINARY mode
 COPY x to stdin (format BINARY, null 

Re: Some performance degradation in REL_16 vs REL_15

2023-10-30 Thread Andres Freund
Hi,

On 2023-10-30 15:28:53 +0300, Anton A. Melnikov wrote:
> For REL_16_STABLE at 7cc2f59dd the average TPS was: 2020+-70,
> for REL_10_STABLE at c18c12c98 - 2260+-70
> 
> The percentage difference was approximately 11%.
> Please see the 16vs10.png picture with the graphical representation of the 
> data obtained.
> Also there are the raw data in the raw_data_s21.txt.
> 
> In a few days I hope to perform the additional measurements that were
> mentioned above in this letter.
> It would be interesting to establish the reason for this difference, and I
> would be very grateful if you could advise me on what other settings can
> be tweaked.

There's really no point in comparing performance with assertions enabled
(leaving aside assertions that cause extreme performance difference, making
development harder). We very well might have added assertions making things
more expensive, without affecting performance in optimized/non-assert builds.

Greetings,

Andres Freund




Re: Row pattern recognition

2023-10-30 Thread Jacob Champion
On Tue, Oct 24, 2023 at 7:49 PM Tatsuo Ishii  wrote:
> I am impressed by the simple NFA implementation.

Thanks!

> It would be nicer if it
> could be implemented without using recursion.

Yeah. If for some reason we end up going with a bespoke
implementation, I assume we'd just convert the algorithm to an
iterative one and optimize it heavily. But I didn't want to do that
too early, since it'd probably make it harder to add new features...
and anyway my goal is still to try to reuse src/backend/regex
eventually.

> > SELECT aid, first_value(aid) OVER w,
> > count(*) OVER w
> > FROM pgbench_accounts
> > WINDOW w AS (
> > PARTITION BY bid
> > ORDER BY aid
> > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> > AFTER MATCH SKIP PAST LAST ROW
> > INITIAL
> > PATTERN (START UP+)
> > DEFINE
> > START AS TRUE,
> > UP AS aid > PREV(aid)
> > );
>
> I ran this against your patch. It failed around > 60k rows.

Nice, that's actually more frames than I expected. Looks like I have
similar results here with my second test query (segfault at ~58k
rows).

Thanks,
--Jacob




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-30 Thread Andres Freund
Hi,

On 2023-10-30 16:08:50 +0900, Michael Paquier wrote:
> From 26a8432fe3ab8426e7797d85d19b0fe69d3384c9 Mon Sep 17 00:00:00 2001
> From: Michael Paquier 
> Date: Mon, 30 Oct 2023 16:02:52 +0900
> Subject: [PATCH v4] Require recovery.signal or standby.signal when reading a
>  backup_file
>
> Historically, the startup process uses two static variables to control
> if archive recovery should happen, when either recovery.signal or
> standby.signal are defined in the data folder at the beginning of
> recovery:

I think the problem with these variables is that they're a really messy state
machine - something this patch doesn't meaningfully improve IMO.


> This configuration was possible when recovering from a base backup taken
> by pg_basebackup without -R.  Note that the documentation requires at
> least to set recovery.signal to restore from a backup, but the startup
> process was not making this policy explicit.

Maybe I just didn't check the right place, but from what I saw, this is,
at most, implied rather than explicitly stated.


> In most cases, one would have been able to complete recovery, but that's a
> matter of luck, really, as it depends on the workload of the origin server.

With -X ... we have all the necessary WAL locally, how does the workload on
the primary matter? If you pass --no-slot, pg_basebackup might fail to fetch
the necessary wal, but then you'd also have gotten an error.


I agree with Robert that this would be a good error check on a green field,
but that I am less convinced it's going to help more than hurt now.


Right now running pg_basebackup with -X stream, without --write-recovery-conf,
gives you a copy of a cluster that will come up correctly as a distinct
instance.

With this change applied, you need to know that the way to avoid the existing
FATAL about restore_command at startup (when recovery.signal exists but
restore_command isn't set) is to set "restore_command = false",
something we don't explain anywhere afaict. We should lessen the need to ever
use restore_command, not increase it.

It also seems risky to have people get used to restore_command = false,
because that effectively disables detection of other timelines etc. But, this
method does force a new timeline - which will be the same on each clone of the
database...

I also just don't think that it's always desirable to create a new timeline.

Greetings,

Andres Freund




Re: trying again to get incremental backup

2023-10-30 Thread Robert Haas
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund  wrote:
> After playing with this for a while, I don't see a reason for wal_summarize_mb
> from a memory usage POV at least.

Cool! Thanks for testing.

> I wonder if there are use cases that might like to consume WAL summaries
> before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
> good control, but they might want to request a summary file to be created at
> some point?

It's possible. I actually think it's even more likely that there are
use cases that will also want the WAL summarized, but in some
different way. For example, you might want a summary that would give
you the LSN or approximate LSN where changes to a certain block
occurred. Such a summary would be way bigger than these summaries and
therefore, at least IMHO, a lot less useful for incremental backup,
but it could be really useful for something else. Or you might want
summaries that focus on something other than which blocks got changed,
like what relations were created or destroyed, or only changes to
certain kinds of relations or relation forks, or whatever. In a way,
you can even think of logical decoding as a kind of WAL summarization,
just with a very different set of goals from this one. I won't be too
surprised if the next hacker wants something that is different enough
from what this does that it doesn't make sense to share mechanism, but
if by chance they want the same thing but dumped a bit more
frequently, well, that can be done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-30 Thread Andres Freund
Hi,

On 2023-10-30 10:45:03 -0400, Robert Haas wrote:
> On Tue, Oct 24, 2023 at 12:08 PM Robert Haas  wrote:
> > Note that whether to remove summaries is a separate question from
> > whether to generate them in the first place. Right now, I have
> > wal_summarize_mb controlling whether they get generated in the first
> > place, but as I noted in another recent email, that isn't an entirely
> > satisfying solution.
> 
> I did some more research on this. My conclusion is that I should
> remove wal_summarize_mb and just have a GUC summarize_wal = on|off
> that controls whether the summarizer runs at all. There will be one
> summary file per checkpoint, no matter how far apart checkpoints are
> or how large the summary gets. Below I'll explain the reasoning; let
> me know if you disagree.

> What I describe above would be a bad plan if it were realistically
> possible for a summary file to get so large that it might run the
> machine out of memory either when producing it or when trying to make
> use of it for an incremental backup. This seems to be a somewhat
> difficult scenario to create. So far, I haven't been able to generate
> WAL summary files more than a few tens of megabytes in size, even when
> summarizing 50+ GB of WAL per summary file. One reason why it's hard
> to produce large summary files is because, for a single relation fork,
> the WAL summary size converges to 1 bit per modified block when the
> number of modified blocks is large. This means that, even if you have
> a terabyte sized relation, you're looking at no more than perhaps 20MB
> of summary data no matter how much of it gets modified. Now, somebody
> could have a 30TB relation and then if they modify the whole thing
> they could have the better part of a gigabyte of summary data for that
> relation, but if you've got a 30TB table you probably have enough
> memory that that's no big deal.

I'm not particularly worried about the rewriting-30TB-table case - that'd also
generate >= 30TB of WAL most of the time. Which realistically is going to
trigger a few checkpoints, even on very big instances.


> But, what if you have multiple relations? I initialized pgbench with a
> scale factor of 3 and also with 3 partitions and did a 1-hour
> run. I got 4 checkpoints during that time and each one produced an
> approximately 16MB summary file.

Hm, I assume the pgbench run will be fairly massively bottlenecked on IO, due
to having to read data from disk, lots of full-page writes and having to write
out lots of data?  I.e. we won't do all that many transactions during the 1h?


> To get a 1GB+ WAL summary file, you'd need to modify millions of relation
> forks, maybe tens of millions, and most installations aren't even going to
> have that many relation forks, let alone be modifying them all frequently.

I tried to find bad cases for a bit - and I am not worried. I wrote a pgbench
script to create 10k single-row relations in each script, ran that with 96
clients, checkpointed, and ran a pgbench script that updated the single row in
each table.

After creation of the relation WAL summarizer uses
LOG:  level: 1; Wal Summarizer: 378433680 total in 43 blocks; 5628936 free (66 
chunks); 372804744 used
and creates a 26MB summary file.

After checkpoint & updates WAL summarizer uses:
LOG:  level: 1; Wal Summarizer: 369205392 total in 43 blocks; 5864536 free (26 
chunks); 363340856 used
and creates a 26MB summary file.

Sure, 350MB ain't nothing, but simply just executing \dt in the database
created by this makes the backend use 260MB after. Which isn't going away,
whereas WAL summarizer drops its memory usage soon after.


> But I think that's sufficiently niche that the current patch shouldn't
> concern itself with such cases. If we find that they're common enough
> to worry about, we might eventually want to do something to mitigate
> them, but whether that thing looks anything like wal_summarize_mb
> seems pretty unclear. So I conclude that it's a mistake to include
> that GUC as currently designed and propose to replace it with a
> Boolean as described above.

After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.

I wonder if there are use cases that might like to consume WAL summaries
before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
good control, but they might want to request a summary file to be created at
some point?

Greetings,

Andres Freund




Re: COPY TO (FREEZE)?

2023-10-30 Thread Tom Lane
Bruce Momjian  writes:
> That is a good point.  I reviewed more of the messages and added
> capitalization where appropriate, patch attached.

This is starting to look pretty good.  I have one more thought,
as long as we're touching all these messages anyway: how about
s/FOO available only in CSV mode/FOO requires CSV mode/ ?
That's both shorter and less telegraphic, as it's not omitting the verb.

regards, tom lane




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-30 Thread Robert Haas
On Mon, Oct 30, 2023 at 3:09 AM Michael Paquier  wrote:
> I have been reviewing the patch, and applied portions of it as of
> dc5bd388 and 1ffdc03c and they're quite independent pieces.  After
> that, the remaining bits of the patch to change the behavior is now
> straight-forward.  I have written a commit message for it, while on
> it, as per the attached.

I would encourage some caution here.

In a vacuum, I'm in favor of this, and for the same reasons as you,
namely, that the huge pile of Booleans that we use to control recovery
is confusing, and it's difficult to make sure that all the code paths
are adequately tested, and I think some of the things that actually
work here are not documented.

But in practice, I think there is a possibility of something like this
backfiring very hard. Notice that the first two people who commented
on the thread saw the error and immediately removed backup_label even
though that's 100% wrong. It shows how utterly willing users are to
remove backup_label for any reason or no reason at all. If we convert
cases where things would have worked into cases where people nuke
backup_label and then it appears to work, we're going to be worse off
in the long run, no matter how crazy the idea of removing backup_label
may seem to us.

Also, Andres just recently mentioned to me that he uses this procedure
of starting a server with a backup_label but no recovery.signal or
standby.signal file regularly, and thinks other people do too. I was
surprised, since I've never done that, except maybe when I was a noob
and didn't have a clue. But Andres is far from a noob.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-10-30 Thread Robert Haas
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
 wrote:
> If that is still an area open for discussion: wouldn't it be better to
> just specify LSN as it would allow resyncing standby across major lag
> where the WAL to replay would be enormous? Given that we had
> primary->standby where standby would be stuck on some LSN, right now
> it would be:
> 1) calculate backup manifest of desynced 10TB standby (how? using
> which tool?)  - even if possible, that means reading 10TB of data
> instead of just putting a number, isn't it?
> 2) backup primary with such incremental backup >= LSN
> 3) copy the incremental backup to standby
> 4) apply it to the impaired standby
> 5) restart the WAL replay

As you may be able to tell from the flurry of posts and new patch
sets, I'm trying hard to sort out the remaining open items that
pertain to this patch set, and I'm now back to thinking about this
one.

TL;DR: I think the idea has some potential, but there are some
pitfalls that I'm not sure how to address.

I spent some time looking at how we currently use the data from the
backup manifest. Currently, we do two things with it. First, when
we're backing up each file, we check whether it's present in the
backup manifest and, if not, we back it up in full. This actually
feels fairly poor. If it makes any difference at all, then presumably
the underlying algorithm is buggy and needs to be fixed. Maybe that
should be ripped out altogether or turned into some kind of sanity
check that causes a big explosion if it fails. Second, we check
whether the WAL ranges reported by the client match up with the
timeline history of the server (see PrepareForIncrementalBackup). This
set of sanity checks seems fairly important to me, and I'd regret
discarding them. I think there's some possibility that they might
catch user error, like where somebody promotes multiple standbys and
maybe they even get the same timeline on more than one of them, and
then confusion might ensue. I also think that there's a real
possibility that they might make it easier to track down bugs in my
code, even if those bugs aren't necessarily timeline-related. If (or
more realistically when) somebody ends up with a corrupted cluster
after running pg_combinebackup, we're going to need to figure out
whether that corruption is the result of bugs (and if so where they
are) or user error (and if so what it was). The most obvious ways of
ending up with a corrupted cluster are (1) taking an incremental
backup against a prior backup that is not in the history of the server
from which the backup is taken or (2) combining an incremental backup
with the wrong prior backup, so whatever sanity checks we can have that
will tend to prevent those kinds of mistakes seem like a really good
idea.

And those kinds of checks seem relevant here, too. Consider that it
wouldn't be valid to use pg_combinebackup to fast-forward a standby
server if the incremental backup's backup-end-LSN preceded the standby
server's minimum recovery point. Imagine that you have a standby whose
last checkpoint's redo location was at LSN 2/48. Being the
enterprising DBA that you are, you make a note of that LSN and go take
an incremental backup based on it. You then stop the standby server
and try to apply the incremental backup to fast-forward the standby.
Well, it's possible that in the meanwhile the standby actually caught
up, and now has a minimum recovery point that follows the
backup-end-LSN of your incremental backup. In that case, you can't
legally use that incremental backup to fast-forward that standby, but
no code I've yet written would be smart enough to figure that out. Or,
maybe you (or some other DBA on your team) got really excited and
actually promoted that standby meanwhile, and now it's not even on the
same timeline any more. In the "normal" case where you take an
incremental backup based on an earlier base backup, these kinds of
problems are detectable, and it seems to me that if we want to enable
this kind of use case, it would be pretty smart to have a plan to
detect similar mistakes here. I don't, currently, but maybe there is
one.

Another practical problem here is that, right now, pg_combinebackup
doesn't have an in-place mode. It knows how to take a bunch of input
backups and write out an output backup, but that output backup needs
to go into a new, fresh directory (or directories plural, if there are
user-defined tablespaces). I had previously considered adding such a
mode, but the idea I had at the time wouldn't have worked for this
case. I imagined that someone might want to run "pg_combinebackup
--in-place full incr" and clobber the contents of the incr directory
with the output, basically discarding the incremental backup you took
in favor of a full backup that could have been taken at the same point
in time. But here, you'd want to clobber the *first* input to
pg_combinebackup, not the last one, so if we want to add something
like this, the UI needs some thought.

One thing that I find 

Re: always use runtime checks for CRC-32C instructions

2023-10-30 Thread Tom Lane
Nathan Bossart  writes:
> The aforementioned other thread [0] aims to further optimize this code by
> using another instruction that requires additional configure and/or runtime
> checks.  $SUBJECT has been in the back of my mind for a while, but given
> proposals to add further complexity to this code, I figured it might be a
> good time to propose this simplification.  Specifically, I think we
shouldn't worry about trying to compile only the special intrinsics
> versions, and instead always try to build both and choose the appropriate
> one at runtime.

On the one hand, I agree that we need to keep the complexity from
getting out of hand.  On the other hand, I wonder if this approach
isn't optimizing for the wrong case.  How many machines that PG 17
will ever be run on in production will lack SSE 4.2 (for Intel)
or ARMv8 instructions (on the ARM side)?  It seems like a shame
to be burdening these instructions with a subroutine call for the
benefit of long-obsolete hardware versions.  Maybe that overhead
is negligible, but it doesn't sound like you tried to measure it.

I'm not quite sure what to propose instead, though.  I thought for
a little bit about a configure switch to select "test first" or
"pedal to the metal".  But in practice, package builders would
probably have to select the conservative "test first" option; and
we know that the vast majority of modern installations use prebuilt
packages, so it's not clear that this answer would help many people.

Anyway, I agree that the cost of a one-time-per-process probe should
be negligible; it's the per-use cost that I worry about.  If you can
do some measurements proving that that worry is ill-founded, then
I'm good with test-first.

regards, tom lane




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-30 Thread Roberto Mello
On Mon, Oct 30, 2023 at 1:09 AM Michael Paquier  wrote:

>
> I have been reviewing the patch, and applied portions of it as of
> dc5bd388 and 1ffdc03c and they're quite independent pieces.  After
> that, the remaining bits of the patch to change the behavior is now
> straight-forward.  I have written a commit message for it, while on
> it, as per the attached.
>

A suggestion for the hint message in an effort to improve readability:

"If you are restoring from a backup, ensure \"%s/recovery.signal\" or
\"%s/standby.signal\" is present and add required recovery options."

I realize the original use of "touch" is a valid shortcut for what I
suggest above, however that will be less clear for the not-so-un*x-inclined
users of Postgres, while for some it'll be downright confusing, IMHO. It
also provides the advantage of being crystal clear on what needs to be done
to fix the problem.

Roberto


Re: CRC32C Parallel Computation Optimization on ARM

2023-10-30 Thread Nathan Bossart
On Fri, Oct 27, 2023 at 07:01:10AM +, Xiang Gao wrote:
> On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>>> We consider that a runtime check needs to be done in any scenario.
>>> Here we only confirm that the compilation can be successful.
>>> A runtime check will be done when choosing which algorithm.
>>> You can think of us as merging USE_ARMV8_VMULL and
>>> USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
> 
>> Oh.  Looking again, I see that we are using a runtime check for ARM in all
>> cases with this patch.  If so, maybe we should just remove
>> USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
>> USE_ARMV8_CRC32C always do the runtime check).  I suspect there are other
>> opportunities to simplify things, too.
> 
> Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.

Thanks.  I went ahead and split this prerequisite part out to a separate
thread [0] since it's sort-of unrelated to your proposal here.  It's not
really a prerequisite, but I do think it will simplify things a bit.

[0] https://postgr.es/m/20231030161706.GA3011%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




always use runtime checks for CRC-32C instructions

2023-10-30 Thread Nathan Bossart
This is an offshoot of the "CRC32C Parallel Computation Optimization on
ARM" thread [0].  I intend for this to be a prerequisite patch set.

Presently, for the SSE 4.2 and ARMv8 CRC instructions used in the CRC32C
code for WAL records, etc., we first check if the intrinsics are available
with the default compiler flags.  If so, we only bother compiling the
implementation that uses those intrinsics.  If not, we also check whether
the intrinsics are available with some extra CFLAGS, and if they are, we
compile both the implementation that uses the intrinsics as well as a
fallback implementation that doesn't require any special instructions.
Then, at runtime, we check what's available in the hardware and choose the
appropriate CRC32C implementation.

The aforementioned other thread [0] aims to further optimize this code by
using another instruction that requires additional configure and/or runtime
checks.  $SUBJECT has been in the back of my mind for a while, but given
proposals to add further complexity to this code, I figured it might be a
good time to propose this simplification.  Specifically, I think we
shouldn't worry about trying to compile only the special intrinsics
versions, and instead always try to build both and choose the appropriate
one at runtime.

AFAICT the trade-offs aren't too bad.  With some simple testing, I see that
the runtime check occurs once at startup, so I don't anticipate any
noticeable performance impact.  I suppose each process might need to do the
check in EXEC_BACKEND builds, but even so, I suspect the difference is
negligible.

I also see that the SSE 4.2 runtime check requires the CPUID instruction,
so we wouldn't use the intrinsics for hardware that supports SSE 4.2 but
not CPUID.  However, I'm not sure such hardware even exists.  Wikipedia
says that CPUID was introduced in 1993 [1], and meson.build appears to omit
the CPUID check when determining which CRC32C implementation to use.
Furthermore, meson.build alludes to problems with some of the CPUID-related
checks:

# XXX: The configure.ac check for __cpuid() is broken, we don't copy 
that
# here. To prevent problems due to two detection methods working, stop
# checking after one.

Are there any other reasons that we should try to avoid the runtime check
when possible?

I've attached two patches.  0001 adds a debug message to the SSE 4.2
runtime check that matches the one already present for the ARMv8 check.
This message just notes whether the runtime check found that the special
CRC instructions are available.  0002 is a first attempt at $SUBJECT.  I've
tested it on both x86 and ARM, and it seems to work as intended.  You'll
notice that I'm still checking for the intrinsics with the default compiler
flags first.  I didn't see any strong reason to change this, and doing so
allows us to avoid sending extra CFLAGS when possible.

Thoughts?

[0] 
https://postgr.es/m/DB9PR08MB6991329A73923BF8ED4B3422F5DBA%40DB9PR08MB6991.eurprd08.prod.outlook.com
[1] https://en.wikipedia.org/wiki/CPUID

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 964afc75976cce8d712d97f346d2b8eea9c1f1ee Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 28 Oct 2023 22:12:45 -0500
Subject: [PATCH v1 1/2] add debug message

---
 src/port/pg_crc32c_sse42_choose.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 41ff4a35ad..3689c38e92 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -30,10 +30,15 @@
 
 #include "port/pg_crc32c.h"
 
+#ifndef FRONTEND
+#include "utils/elog.h"
+#endif
+
 static bool
 pg_crc32c_sse42_available(void)
 {
 	unsigned int exx[4] = {0, 0, 0, 0};
+	bool		result;
 
 #if defined(HAVE__GET_CPUID)
	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
@@ -43,7 +48,13 @@ pg_crc32c_sse42_available(void)
 #error cpuid instruction not available
 #endif
 
-	return (exx[2] & (1 << 20)) != 0;	/* SSE 4.2 */
+	result = ((exx[2] & (1 << 20)) != 0);	/* SSE 4.2 */
+
+#ifndef FRONTEND
+	elog(DEBUG1, "using sse42 crc32 hardware = %d", result);
+#endif
+
+	return result;
 }
 
 /*
-- 
2.37.1 (Apple Git-137.1)

>From 7b8efae00327728000f7650a513ab0fd4fb15cd5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 28 Oct 2023 21:16:19 -0500
Subject: [PATCH v1 2/2] always use runtime checks for sse4.2/armv8 crc32c code

---
 configure| 122 ++-
 configure.ac | 104 ++---
 meson.build  |  31 +++--
 src/include/pg_config.h.in   |   6 --
 src/include/port/pg_crc32c.h |  19 +-
 src/port/meson.build |   2 -
 src/tools/msvc/Solution.pm   |   2 -
 7 files changed, 80 insertions(+), 206 deletions(-)

diff --git a/configure b/configure
index cfd968235f..47372dcd18 100755
--- a/configure
+++ b/configure
@@ -17885,28 +17885,6 @@ fi
 
 fi
 
-# Are we targeting a 

Re: Add semi-join pushdown to postgres_fdw

2023-10-30 Thread Alexander Korotkov
Hi, Alexander!

Thank you for working on this.  I believe this is a very interesting patch,
which significantly improves our FDW-based distributed facilities.  This is
why I decided to review this.

On Fri, Jan 20, 2023 at 11:00 AM Alexander Pyhalov 
wrote:
> Tomas Vondra писал 2023-01-19 20:49:
> > I took a quick look at the patch. It needs a rebase, although it
> > applies
> > fine using patch.
> >
> > A couple minor comments:
> >
> > 1) addl_conds seems a bit hard to understand, I'd use either the full
> > wording (additional_conds) or maybe extra_conds
>
> Renamed to additional_conds.
>
> >
> > 2) some of the lines got quite long, and need a wrap
> Split some of them. Not sure if it's enough.
>
> >
> > 3) unknown_subquery_rels name is a bit misleading - AFAIK it's the rels
> > that can't be referenced from upper rels (per what the .h says). So
> > they
> > are known, but hidden. Is there a better name?
>
> Renamed to hidden_subquery_rels. These are rels, which can't be referred
> to from upper join levels.
>
> >
> > 4) joinrel_target_ok() needs a better comment, explaining *when* the
> > reltarget is safe for pushdown. The conditions are on the same row, but
> > the project style is to break after '&&'.
>
> Added comment. It seems to be a rephrasing of lower comment in
> joinrel_target_ok().

+   /*
+* We can't push down join if its reltarget is not safe
+*/
+   if (!joinrel_target_ok(root, joinrel, jointype, outerrel, innerrel))
return false;

As I understand it, the joinrel_target_ok() function does meaningful checks
only for semi-joins and always returns false for all other kinds of joins.  I
think we should call it only for semi-joins and name the function accordingly.

+   fpinfo->unknown_subquery_rels = bms_union(fpinfo_o->unknown_subquery_rels,
+                                             fpinfo_i->unknown_subquery_rels);

Should the comment before this code block be revised?

+   case JOIN_SEMI:
+   fpinfo->joinclauses = list_concat(fpinfo->joinclauses,
+ fpinfo_i->remote_conds);
+   fpinfo->joinclauses = list_concat(fpinfo->joinclauses,
+  fpinfo->remote_conds);
+   fpinfo->remote_conds = list_copy(fpinfo_o->remote_conds);
+   fpinfo->unknown_subquery_rels = bms_union(fpinfo->unknown_subquery_rels,
+                                             innerrel->relids);
+   break;

I think that comment before switch() should be definitely revised.

+ Relids hidden_subquery_rels; /* relids, which can't be referred to
+ * from upper relations */

Could this definition contain the positive part?  Can't be referred to from
upper relations, but used internally for semi joins (or something like
that)?

Also, I think the machinery around the append_conds could be somewhat
simpler if we turn them into a list (list of strings).  I think that should
make code clearer and also save us some memory allocations.

In [1] you've referenced the cases, when your patch can't push down
semi-joins.  It doesn't seem impossible to handle these cases, but that
would make the patch much more complicated.  I'm OK to continue with a
simpler patch to handle the majority of cases.  Could you please add the
cases, which can't be pushed down with the current patch, to the test suite?

Links
1.
https://www.postgresql.org/message-id/816fa8b1bc2da09a87484d1ef239a332%40postgrespro.ru

--
Regards,
Alexander Korotkov


Re: POC, WIP: OR-clause support for indexes

2023-10-30 Thread Peter Geoghegan
On Mon, Oct 30, 2023 at 6:40 AM Robert Haas  wrote:
> I agree with that, with some caveats, mainly that the reverse is to
> some extent also true. Maybe not completely, because arguably the
> ANY() formulation should just be straight-up easier to deal with, but
> in principle, the two are equivalent and it shouldn't matter which
> representation we pick.

I recently looked into MySQL's handling of these issues, which is more
mature and better documented than what we can do. EXPLAIN ANALYZE will
show an IN() list as if the query had been written as a list of ORs,
even though it can efficiently execute an index scan that uses
IN()/"OR var = constant" lists. So I agree with what you said here. It
is perhaps just as accident of history that we're talking about
converting to a ScalarArrayOpExpr, rather than talking about
converting to some other clause type that we associate with OR lists.

The essential point is that there ought to be one clause type that is
easier to deal with.

> But practically, it may, and we need to be sure that we don't put in
> place a translation that is theoretically a win but in practice leads
> to large regressions. Avoiding regressions here is more important than
> capturing all the possible gains. A patch that wins in some scenarios
> and does nothing in others can be committed; a patch that wins in even
> more scenarios but causes serious regressions in some cases probably
> can't.

I agree. Most of the really big wins here will come from simple
transformations. I see no reason why we can't take an incremental
approach. In fact I think we almost have to do so, since as I
understand it the transformations are just infeasible in certain
extreme cases.

-- 
Peter Geoghegan




Re: trying again to get incremental backup

2023-10-30 Thread Robert Haas
While reviewing this thread today, I realized that I never responded
to this email. That was inadvertent; my apologies.

On Wed, Jun 14, 2023 at 4:34 PM Matthias van de Meent
 wrote:
> Nice, I like this idea.

Cool.

> Skimming through the 7th patch, I see claims that FSM is not fully
> WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
> track those changes.
> I disagree with that decision: we now have support for custom resource
> managers, which may use the various forks for other purposes than
> those used in PostgreSQL right now. It would be a shame if data is
> lost because of the backup tool ignoring forks because the PostgreSQL
> project itself doesn't have post-recovery consistency guarantees in
> that fork. So, unless we document that WAL-logged changes in the FSM
> fork are actually not recoverable from backup, regardless of the type
> of contents, we should still keep track of the changes in the FSM fork
> and include the fork in our backups or only exclude those FSM updates
> that we know are safe to ignore.

I'm not sure what to do about this problem. I don't think any data
would be *lost* in the scenario that you mention; what I think would
happen is that the FSM forks would be backed up in their entirety even
if they were owned by some other table AM or index AM that was
WAL-logging all changes to whatever it was storing in that fork. So I
think that there is not a correctness issue here but rather an
efficiency issue.

It would still be nice to fix that somehow, but I don't see how to do
it. It would be easy to make the WAL summarizer stop treating the FSM
as a special case, but there's no way for basebackup_incremental.c to
know whether a particular relation fork is for the heap AM or some
other AM that handles WAL-logging differently. It can't for example
examine pg_class; it's not connected to any database, let alone every
database. So we have to either trust that the WAL for the FSM is
correct and complete in all cases, or assume that it isn't in any
case. And the former doesn't seem like a safe or wise assumption given
how the heap AM works.

I think the reality here is unfortunately that we're missing a lot of
important infrastructure to really enable a multi-table-AM world. The
heap AM, and every other table AM, should include a metapage so we can
tell what we're looking at just by examining the disk files. Relation
forks don't scale and should be replaced with some better system that
does. We should have at least two table AMs in core that are fully
supported and do truly useful things. Until some of that stuff (and
probably a bunch of other things) get sorted out, out-of-core AMs are
going to have to remain second-class citizens to some degree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_rewind WAL segments deletion pitfall

2023-10-30 Thread Alexander Kukushkin
Hi,

On Wed, 18 Oct 2023 at 08:50, torikoshia  wrote:

>
> I have very minor questions on the regression tests mainly regarding the
> consistency with other tests for pg_rewind:
>

Please find attached a new version of the patch. It addresses all your
comments.

Regards,
--
Alexander Kukushkin
From 08d2cac946b92f05d410d75ed026ebf6587b6671 Mon Sep 17 00:00:00 2001
From: Alexander Kukushkin 
Date: Mon, 30 Oct 2023 16:23:15 +0200
Subject: [PATCH] Be more picky with WAL segment deletion in pg_rewind

Make pg_rewind to be a bit wiser in terms of creating filemap:
preserve on the target all WAL segments that contain records between the
last common checkpoint and the point of divergence.

Co-authored-by: Polina Bungina 
---
 src/bin/pg_rewind/filemap.c   | 62 +-
 src/bin/pg_rewind/filemap.h   |  4 ++
 src/bin/pg_rewind/parsexlog.c | 22 +++
 src/bin/pg_rewind/pg_rewind.c |  3 +
 src/bin/pg_rewind/t/010_keep_recycled_wals.pl | 64 +++
 5 files changed, 153 insertions(+), 2 deletions(-)
 create mode 100644 src/bin/pg_rewind/t/010_keep_recycled_wals.pl

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ecadd69dc5..ebaebfb149 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -64,6 +64,27 @@ static file_entry_t *lookup_filehash_entry(const char *path);
 static int	final_filemap_cmp(const void *a, const void *b);
 static bool check_file_excluded(const char *path, bool is_source);
 
+typedef struct skipwal_t {
+	const char *path;
+	uint32		status;
+} skipwal_t;
+
+#define SH_PREFIX		keepwalhash
+#define SH_ELEMENT_TYPE	skipwal_t
+#define SH_KEY_TYPE		const char *
+#define	SH_KEY			path
+#define SH_HASH_KEY(tb, key)	hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b)		(strcmp(a, b) == 0)
+#define	SH_SCOPE		static inline
+#define SH_RAW_ALLOCATOR	pg_malloc0
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static keepwalhash_hash *keepwalhash = NULL;
+
+static bool keepwalhash_entry_exists(const char *path);
+
 /*
  * Definition of one element part of an exclusion list, used to exclude
  * contents when rewinding.  "name" is the name of the file or path to
@@ -207,6 +228,35 @@ lookup_filehash_entry(const char *path)
 	return filehash_lookup(filehash, path);
 }
 
+/* Initialize a hash table to store WAL file names that must be kept */
+void
+keepwalhash_init(void)
+{
+	keepwalhash = keepwalhash_create(FILEHASH_INITIAL_SIZE, NULL);
+}
+
+/* Prevent deletion of a given file during rewind */
+void
+insert_keepwalhash_entry(const char *path)
+{
+	skipwal_t   *entry;
+	bool		found;
+
+	/* Should only be called with keepwalhash initialized */
+	Assert(keepwalhash);
+
+	entry = keepwalhash_insert(keepwalhash, path, &found);
+
+	if (!found)
+		entry->path = pg_strdup(path);
+}
+
+static bool
+keepwalhash_entry_exists(const char *path)
+{
+	return keepwalhash_lookup(keepwalhash, path) != NULL;
+}
+
 /*
  * Callback for processing source file list.
  *
@@ -682,7 +732,16 @@ decide_file_action(file_entry_t *entry)
 	}
 	else if (entry->target_exists && !entry->source_exists)
 	{
-		/* File exists in target, but not source. Remove it. */
+		/* File exists in target, but not source. */
+
+		if (keepwalhash_entry_exists(path))
+		{
+			/* This is a WAL file that should be kept. */
+			pg_log_debug("Not removing %s because it is required for recovery", path);
+			return FILE_ACTION_NONE;
+		}
+
+		/* Otherwise remove an unexpected file. */
 		return FILE_ACTION_REMOVE;
 	}
 	else if (!entry->target_exists && !entry->source_exists)
@@ -818,7 +877,6 @@ decide_file_actions(void)
 	return filemap;
 }
 
-
 /*
  * Helper function for filemap hash table.
  */
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 988d4590e0..f037448a0f 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -109,5 +109,9 @@ extern void process_target_wal_block_change(ForkNumber forknum,
 extern filemap_t *decide_file_actions(void);
 extern void calculate_totals(filemap_t *filemap);
 extern void print_filemap(filemap_t *filemap);
+extern void preserve_file(char *filepath);
+
+extern void keepwalhash_init(void);
+extern void insert_keepwalhash_entry(const char *path);
 
 #endif			/* FILEMAP_H */
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 0233ece88b..b89b65077d 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -176,6 +176,10 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	char	   *errormsg;
 	XLogPageReadPrivate private;
 
+	/* Track WAL segments opened while searching a checkpoint */
+	XLogSegNo	segno = 0;
+	TimeLineID	tli = 0;
+
 	/*
 	 * The given fork pointer points to the end of the last common record,
 	 * which is not necessarily the beginning of the next record, if the
@@ -217,6 +221,24 @@ findLastCheckpoint(const char *datadir, XLogRecPtr 

Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2023-10-30 Thread Ashutosh Bapat
Rebased patchset. No changes in the actual patches. 0001 is a squashed
version of the patches submitted at
https://www.postgresql.org/message-id/CAExHW5sCJX7696sF-OnugAiaXS=Ag95=-m1csrjcmyyj8pd...@mail.gmail.com.

On Tue, Oct 3, 2023 at 6:42 PM Ashutosh Bapat
 wrote:
>
> On Fri, Sep 29, 2023 at 8:36 AM Amit Langote  wrote:
> > IOW, something
> > like the following would have sufficed:
> >
> > @@ -1735,6 +1735,10 @@ build_child_join_sjinfo(PlannerInfo *root,
> > SpecialJoinInfo *parent_sjinfo,
> >  /*
> >   * free_child_sjinfo_members
> >   * Free memory consumed by members of a child SpecialJoinInfo.
> > + *
> > + * Only members that are translated copies of their counterpart in the 
> > parent
> > + * SpecialJoinInfo are freed here.  However, members that could be 
> > referenced
> > + * elsewhere are not freed.
> >   */
> >  static void
> >  free_child_sjinfo_members(SpecialJoinInfo *child_sjinfo)
> > @@ -1755,10 +1759,7 @@ free_child_sjinfo_members(SpecialJoinInfo 
> > *child_sjinfo)
> > bms_free(child_sjinfo->syn_lefthand);
> > bms_free(child_sjinfo->syn_righthand);
> >
> > -   /*
> > -* But the list of operator OIDs and the list of expressions may be
> > -* referenced somewhere else. Do not free those.
> > -*/
> > +   /* semi_rhs_exprs may be referenced, so don't free. */
> >  }
>
> Works for me. PFA patchset with these changes. I have still left the
> changes addressing your comments as a separate patch for easier
> review.
>
> --
> Best Wishes,
> Ashutosh Bapat



-- 
Best Wishes,
Ashutosh Bapat




Remove obsolete check for TTSOpsVirtual from slot_compile_deform

2023-10-30 Thread Xing Guo
Hi hackers,

While exploring the JIT support for the tuple deforming process, I noticed that
one check for TTSOpsVirtual in slot_compile_deform is obsolete: virtual
tuples never need deforming, and there's already an assertion covering this in
llvm_compile_expr[1]. I simply replaced the check with an assertion in
slot_compile_deform. Patch is attached.

[1]
https://github.com/postgres/postgres/blob/0c60e8ba80e03491b028204a19a9dca6d216df91/src/backend/jit/llvm/llvmjit_expr.c#L322

Best Regards,
Xing
From 6db602bd7d0964b60134c4b20b4107a1e8717b78 Mon Sep 17 00:00:00 2001
From: Xing Guo 
Date: Mon, 30 Oct 2023 22:34:10 +0800
Subject: [PATCH v1] Remove obsolete check for TTSOpsVirtual from
 slot_compile_deform.

Since virtual tuples never need deforming, no EEOP_*_FETCHSOME step is
generated for them, so there is no need to check for this case in
slot_compile_deform. Replace the check with an assertion instead.
---
 src/backend/jit/llvm/llvmjit_deform.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_deform.c b/src/backend/jit/llvm/llvmjit_deform.c
index b7e88e7ca9..e652cda77f 100644
--- a/src/backend/jit/llvm/llvmjit_deform.c
+++ b/src/backend/jit/llvm/llvmjit_deform.c
@@ -90,9 +90,8 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc,
 
 	int			attnum;
 
-	/* virtual tuples never need deforming, so don't generate code */
-	if (ops == &TTSOpsVirtual)
-		return NULL;
+	/* virtual tuples never need deforming */
+	Assert(ops != &TTSOpsVirtual);
 
 	/* decline to JIT for slot types we don't know to handle */
	if (ops != &TTSOpsHeapTuple && ops != &TTSOpsBufferHeapTuple &&
-- 
2.42.0



Re: Fwd: Annoying corruption in PostgreSQL.

2023-10-30 Thread Stephen Frost
Greetings,

Please don't top-post on these lists.

* Kirill Reshke (reshkekir...@gmail.com) wrote:
> We have physical backups and we can PITR. But restoring a cluster to some
> point in the past is a bit of a different task: we need our client's
> approval for these operations, since we are a Managed DBs Cloud Provider.
> Will try to ask someone.

Do you have page-level checksums enabled for these PG instances?

Are you able to see if these clusters which are reporting the corruption
have been restored in the past from a backup?  What are you using to
perform your backups and perform your restores?

Are you able to see if these clusters have ever crashed and come back up
after by doing WAL replay?

Where I'm heading with these questions is essentially: I suspect either
your backup/restore procedure is broken or you're running on a system
that doesn't properly fsync data.  Or possibly both.

Oh, and you should probably have checksums enabled.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: trying again to get incremental backup

2023-10-30 Thread Robert Haas
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas  wrote:
> Note that whether to remove summaries is a separate question from
> whether to generate them in the first place. Right now, I have
> wal_summarize_mb controlling whether they get generated in the first
> place, but as I noted in another recent email, that isn't an entirely
> satisfying solution.

I did some more research on this. My conclusion is that I should
remove wal_summarize_mb and just have a GUC summarize_wal = on|off
that controls whether the summarizer runs at all. There will be one
summary file per checkpoint, no matter how far apart checkpoints are
or how large the summary gets. Below I'll explain the reasoning; let
me know if you disagree.

What I describe above would be a bad plan if it were realistically
possible for a summary file to get so large that it might run the
machine out of memory either when producing it or when trying to make
use of it for an incremental backup. This seems to be a somewhat
difficult scenario to create. So far, I haven't been able to generate
WAL summary files more than a few tens of megabytes in size, even when
summarizing 50+ GB of WAL per summary file. One reason why it's hard
to produce large summary files is because, for a single relation fork,
the WAL summary size converges to 1 bit per modified block when the
number of modified blocks is large. This means that, even if you have
a terabyte sized relation, you're looking at no more than perhaps 20MB
of summary data no matter how much of it gets modified. Now, somebody
could have a 30TB relation and then if they modify the whole thing
they could have the better part of a gigabyte of summary data for that
relation, but if you've got a 30TB table you probably have enough
memory that that's no big deal.

But, what if you have multiple relations? I initialized pgbench with a
scale factor of 3 and also with 3 partitions and did a 1-hour
run. I got 4 checkpoints during that time and each one produced an
approximately 16MB summary file. The efficiency here drops
considerably. For example, one of the files is 16495398 bytes and
records information on 7498403 modified blocks, which works out to
about 2.2 bytes per modified block. That's more than an order of
magnitude worse than what I got in the single-relation case, where the
summary file didn't even use two *bits* per modified block. But here
again, the file just isn't that big in absolute terms. To get a 1GB+
WAL summary file, you'd need to modify millions of relation forks,
maybe tens of millions, and most installations aren't even going to
have that many relation forks, let alone be modifying them all
frequently.
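As a rough cross-check on these numbers, here is an illustrative
back-of-the-envelope calculation (my arithmetic, assuming the default 8kB
block size; not something taken from the patch itself):

```python
# Back-of-the-envelope estimates for WAL summary file sizes, assuming
# the default 8kB block size (BLCKSZ) and the ~1 bit per modified block
# best case described above for a single relation fork.

BLOCK_SIZE = 8 * 1024          # bytes per block

# A 1TB relation with every block modified: ~1 bit of summary data per block.
relation_bytes = 1 << 40
modified_blocks = relation_bytes // BLOCK_SIZE
summary_bytes = modified_blocks // 8          # 8 blocks per summary byte
print(summary_bytes // (1 << 20))             # 16 (MiB): "no more than perhaps 20MB"

# The multi-relation pgbench case above: 16495398 bytes of summary for
# 7498403 modified blocks.
print(round(16495398 / 7498403, 1))           # 2.2 bytes per modified block
```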

My conclusion here is that it's pretty hard to have a database where
WAL summarization is going to use too much memory. I wouldn't be
terribly surprised if there are some extreme cases where it happens,
but those databases probably aren't great candidates for incremental
backup anyway. They're probably databases with millions of relations
and frequent, widely-scattered modifications to those relations. And
if you have that kind of high turnover rate then incremental backup
isn't going to be as helpful anyway, so there's probably no reason to
enable WAL summarization in the first place. Maybe if you have that
plus in the same database cluster you have a 100TB of completely
static data that is never modified, and if you also do all of this on
a pretty small machine, then you can find a case where incremental
backup would have worked well but for the memory consumed by WAL
summarization.

But I think that's sufficiently niche that the current patch shouldn't
concern itself with such cases. If we find that they're common enough
to worry about, we might eventually want to do something to mitigate
them, but whether that thing looks anything like wal_summarize_mb
seems pretty unclear. So I conclude that it's a mistake to include
that GUC as currently designed and propose to replace it with a
Boolean as described above.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Report planning memory in EXPLAIN ANALYZE

2023-10-30 Thread Ashutosh Bapat
>
>
> David challenged something at the beginning, but IIUC he also agreed on the
> value of patch 01 in his later statements after the discussion. Since the patch
> is mild itself, I will mark this commitfest entry as "Ready for committer".
> Please correct me if anything is wrong.
>
Thanks Andy.

Here are the rebased patches. A conflict in explain.out was resolved.


-- 
Best Wishes,
Ashutosh Bapat
From 171ce2bd03f846e7ba3e6972b1f51a432d5f75c5 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Wed, 23 Aug 2023 16:23:12 +0530
Subject: [PATCH 2/3] Report memory allocated along with memory used in EXPLAIN

Memory might be palloc'ed and pfree'ed during planning. The memory used at the
end of planning may therefore miss a large chunk of memory palloc'ed and
pfree'ed during planning, but the corresponding memory may remain allocated in
the memory context. Hence report both memory used and memory allocated, to
detect any such activity during planning.

Ashutosh Bapat, per suggestion by David Rowley
---
 src/backend/commands/explain.c| 69 +++
 src/backend/commands/prepare.c| 12 +++--
 src/backend/utils/mmgr/mcxt.c | 13 ++---
 src/include/commands/explain.h| 11 -
 src/include/utils/memutils.h  |  3 +-
 src/test/regress/expected/explain.out | 25 +++---
 src/tools/pgindent/typedefs.list  |  1 +
 7 files changed, 103 insertions(+), 31 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dfa4447794..efc4887244 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -122,6 +122,8 @@ static const char *explain_get_index_name(Oid indexId);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 			  bool planning);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
+static void show_planning_memory(ExplainState *es,
+ const MemUsage *usage);
 static void ExplainIndexScanDetails(Oid indexid, ScanDirection indexorderdir,
 	ExplainState *es);
 static void ExplainScanTarget(Scan *plan, ExplainState *es);
@@ -397,11 +399,13 @@ ExplainOneQuery(Query *query, int cursorOptions,
 	planduration;
 		BufferUsage bufusage_start,
 	bufusage;
-		Size		mem_consumed;
+		MemoryContextCounters mem_counts_start;
+		MemoryContextCounters mem_counts_end;
+		MemUsage	mem_usage;
 
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
-		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);
+		MemoryContextMemConsumed(CurrentMemoryContext, &mem_counts_start);
 		INSTR_TIME_SET_CURRENT(planstart);
 
 		/* plan the query */
@@ -409,8 +413,8 @@ ExplainOneQuery(Query *query, int cursorOptions,
 
 		INSTR_TIME_SET_CURRENT(planduration);
 		INSTR_TIME_SUBTRACT(planduration, planstart);
-		mem_consumed = MemoryContextMemUsed(CurrentMemoryContext)
-			- mem_consumed;
+		MemoryContextMemConsumed(CurrentMemoryContext, &mem_counts_end);
+		calc_mem_usage(&mem_usage, &mem_counts_end, &mem_counts_start);
 
 		/* calc differences of buffer counters. */
 		if (es->buffers)
@@ -422,7 +426,7 @@ ExplainOneQuery(Query *query, int cursorOptions,
 		/* run it (if needed) and produce output */
 		ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
	   &planduration, (es->buffers ? &bufusage : NULL),
-	   &mem_consumed);
+	   &mem_usage);
 	}
 }
 
@@ -532,7 +536,7 @@ void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage, const Size *mem_consumed)
+			   const BufferUsage *bufusage, const MemUsage *mem_usage)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -635,9 +639,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		ExplainPropertyFloat("Planning Time", "ms", 1000.0 * plantime, 3, es);
 	}
 
-	if (es->summary && mem_consumed)
-		ExplainPropertyUInteger("Planning Memory", "bytes",
-(uint64) *mem_consumed, es);
+	if (es->summary && mem_usage)
+	{
+		ExplainOpenGroup("Planning Memory", "Planning Memory", true, es);
+		show_planning_memory(es, mem_usage);
+		ExplainCloseGroup("Planning Memory", "Planning Memory", true, es);
+	}
 
 	/* Print info about runtime of triggers */
 	if (es->analyze)
@@ -3775,6 +3782,50 @@ show_wal_usage(ExplainState *es, const WalUsage *usage)
 	}
 }
 
+/*
+ * Show planner's memory usage details.
+ */
+static void
+show_planning_memory(ExplainState *es, const MemUsage *usage)
+{
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		appendStringInfo(es->str,
+		 "Planning Memory: used=%zu bytes allocated=%zu bytes",
+		 usage->mem_used, usage->mem_allocated);
+		appendStringInfoChar(es->str, '\n');
+	}
+	else
+	{
+		ExplainPropertyInteger("Used", "bytes", usage->mem_used, es);
+		ExplainPropertyInteger("Allocated", "bytes", usage->mem_allocated, es);
+	}
+}
+
+/*
+ * Compute memory usage from the start and end memory counts.
+ */
+void

Re: Not deleted mentions of the cleared path

2023-10-30 Thread Ashutosh Bapat
On Mon, Oct 30, 2023 at 5:01 PM Alena Rybakina
 wrote:
>
> Hi, hackers!
>
> I have already written about the problem of InvalidPath [0] appearing. I
> investigated this and found an error in the add_path() function: when we
> reject a path, we free the path's memory but do not remove various
> references to it (for example, in the ancestor of the relation, as in the
> example below).
>
> Thus, we may face the problem of accessing the freed memory.
>
> I demonstrated this below using gdb, calling a query after running a test
> in test/regression/sql/create_misc.sql:
>
> Query:
>
> -- That ALTER TABLE should have added TOAST tables.
>
> SELECT relname, reltoastrelid <> 0 AS has_toast_table
>FROM pg_class
>WHERE oid::regclass IN ('a_star', 'c_star')
>ORDER BY 1;
>
> --UPDATE b_star*
> --   SET a = text 'gazpacho'
> --   WHERE aa > 4;
>
> SELECT class, aa, a FROM a_star*;
>
>
> gdb:
>
> 0x7ff3f7325fda in epoll_wait (epfd=5, events=0x55bf9ee75c38, maxevents=1, 
> timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> 30  ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
> (gdb) b /home/alena/postgrespro_3/src/backend/optimizer/util/pathnode.c:620
> Breakpoint 1 at 0x55bf9cfe4c65: file pathnode.c, line 621.
> (gdb) c
> Continuing.
>
> Breakpoint 1, add_path (parent_rel=0x55bf9ef7f5c0, new_path=0x55bf9ef7f4e0) 
> at pathnode.c:621
> 621 if (!IsA(new_path, IndexPath))
> (gdb) n
> 622   pfree(new_path);
> (gdb) n
> 624 }
> (gdb) p *new_path
> $1 = {type = T_Invalid, pathtype = T_Invalid, parent = 0x7f7f7f7f7f7f7f7f, 
> pathtarget = 0x7f7f7f7f7f7f7f7f,
>   param_info = 0x7f7f7f7f7f7f7f7f, parallel_aware = 127, parallel_safe = 127, 
> parallel_workers = 2139062143,
>   rows = 1.3824172084878715e+306, startup_cost = 1.3824172084878715e+306, 
> total_cost = 1.3824172084878715e+306,
>   pathkeys = 0x7f7f7f7f7f7f7f7f}
> (gdb) p new_path
> $2 = (Path *) 0x55bf9ef7f4e0

At this point the new_path has not been added to the parent_rel. We do
not set the cheapest* paths while paths are being added. The stack
trace will give you an idea where this is happening.

>
> (gdb) p ((ProjectionPath *)((SortPath*)parent_rel->pathlist->elements 
> [0].ptr_value)->subpath)->path->parent->cheapest_startup_path
> $20 = (struct Path *) 0x55bf9ef7f4e0

This looks familiar though. There was some nearby thread where Tom
Lane, if my memory serves well, provided a case where a path from a
lower rel was added to an upper rel without copying it or changing its
parent. This very much looks like that case.

-- 
Best Wishes,
Ashutosh Bapat




Re: pgBufferUsage.blk_{read|write}_time are zero although there are pgBufferUsage.local_blks_{read|written}

2023-10-30 Thread hubert depesz lubaczewski
On Mon, Oct 30, 2023 at 10:45:05AM +0900, Michael Paquier wrote:
> On Fri, Oct 27, 2023 at 04:58:20PM +0300, Nazir Bilal Yavuz wrote:
> > I think switching it to 'shared' makes sense. That shouldn't confuse
> > existing monitoring queries much as the numbers won't change, right?
> > Also, if we keep 'shared/local' there could be similar complaints to
> > this thread in the future; so, at least adding comments can be
> > helpful.
> 
> The problem is that it may impact existing tools that do explain
> output deparsing.  One of them is https://explain.depesz.com/ that
> Hubert Depesz Lubaczewski has implemented, and it would be sad to
> break anything related to it.
> 
> I am adding Hubert in CC for comments about changing this
> "shared/local" to "shared" on a released branch.  Knowing that
> "shared" and "local" will need to be handled as separate terms in 17~
> anyway, perhaps that's not a big deal, but let's be sure.

Hi,
some things will definitely break, but that's 100% OK. The change seems
needed, and I can update my parser to deal with it :)

Best regards,

depesz





Re: Explicitly skip TAP tests under Meson if disabled

2023-10-30 Thread Tom Lane
Aleksander Alekseev  writes:
> Personally I like the change. It makes the output more explicit. In my
> use cases, not running TAP tests typically is not something I want. So
> I would appreciate being warned with a long list of bright yellow
> "SKIP" messages. If I really want to skip TAP tests, these messages are
> just informative and don't bother me.

+1 for counting such tests as "skipped" in the summary.  -1 for
emitting a message per skipped test.  If I'm intentionally not
running those tests, that would be very annoying noise, and
potentially would obscure messages I actually need to see.

(And about -100 for emitting such messages in yellow.  Doesn't
anybody who codes this stuff have a clue about vision problems?)

regards, tom lane




Re: POC, WIP: OR-clause support for indexes

2023-10-30 Thread Alexander Korotkov
On Mon, Oct 30, 2023 at 3:40 PM Robert Haas  wrote:
> On Thu, Oct 26, 2023 at 5:05 PM Peter Geoghegan  wrote:
> > On Thu, Oct 26, 2023 at 12:59 PM Robert Haas  wrote:
> > > Alexander's example seems to show that it's not that simple. If I'm
> > > reading his example correctly, with things like aid = 1, the
> > > transformation usually wins even if the number of things in the OR
> > > expression is large, but with things like aid + 1 * bid = 1, the
> > > transformation seems to lose at least with larger numbers of items. So
> > > it's not JUST the number of OR elements but also what they contain,
> > > unless I'm misunderstanding his point.
> >
> > Alexander said "Generally, I don't see why ANY could be executed
> > slower than the equivalent OR clause". I understood that this was his
> > way of expressing the following idea:
> >
> > "In principle, there is no reason to expect execution of ANY() to be
> > slower than execution of an equivalent OR clause (except for
> > noise-level differences). While it might not actually look that way
> > for every single type of plan you can imagine right now, that doesn't
> > argue for making a cost-based decision. It actually argues for fixing
> > the underlying issue, which can't possibly be due to some kind of
> > fundamental advantage enjoyed by expression evaluation with ORs".
> >
> > This is also what I think of all this.
>
> I agree with that, with some caveats, mainly that the reverse is to
> some extent also true. Maybe not completely, because arguably the
> ANY() formulation should just be straight-up easier to deal with, but
> in principle, the two are equivalent and it shouldn't matter which
> representation we pick.
>
> But practically, it may, and we need to be sure that we don't put in
> place a translation that is theoretically a win but in practice leads
> to large regressions. Avoiding regressions here is more important than
> capturing all the possible gains. A patch that wins in some scenarios
> and does nothing in others can be committed; a patch that wins in even
> more scenarios but causes serious regressions in some cases probably
> can't.

+1
Sure, I've identified two cases where the patch shows regressions [1].  The
first one (quadratic complexity of expression processing) should already be
addressed by the usage of a hash table.  The second one (planning
regression with Bitmap OR) is not yet addressed.
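To illustrate the first of those regressions in miniature: deduplicating the N
constants of a long OR list with a linear membership test costs O(N^2)
comparisons, while a hash-based lookup keeps it O(N). (A Python sketch of the
general effect only; the planner code itself is C and the details differ.)

```python
def collect_args_list(consts):
    """O(N^2): each membership test scans the whole list, as in the
    pre-fix quadratic expression processing."""
    seen = []
    for c in consts:
        if c not in seen:
            seen.append(c)
    return seen

def collect_args_hash(consts):
    """O(N): membership tests use a hash table, as in the fixed version."""
    seen, out = set(), []
    for c in consts:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out

# Both produce the same deduplicated argument list; only the cost differs.
consts = [i % 100 for i in range(10_000)]
assert collect_args_list(consts) == collect_args_hash(consts)
```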

Links
1. 
https://www.postgresql.org/message-id/CAPpHfduJtO0s9E%3DSHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: Why is DEFAULT_FDW_TUPLE_COST so insanely low?

2023-10-30 Thread Bruce Momjian
On Mon, Oct 30, 2023 at 02:22:08PM +1300, David Rowley wrote:
> If I change STD_FUZZ_FACTOR to something like 1.001 then the plans
> no longer change when I do:
> 
> alter server loopback options (add fdw_tuple_cost '0.01');
> 
> alter server loopback options (drop fdw_tuple_cost);
> 
> 
> Ordinarily, I'd not care too much about that, but I did test the
> performance of one of the plans and the new plan came out slower than
> the old one.
> 
> I'm not exactly sure how best to proceed on this in the absence of any 
> feedback.
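For context, the fuzzy cost comparison being tuned above works roughly like
this (a simplified Python sketch of the idea behind the planner's fuzzy path
comparison, not the actual C code):

```python
def fuzzily_cheaper(cost1, cost2, fuzz_factor):
    """True if cost1 beats cost2 by more than the fuzz factor."""
    return cost1 * fuzz_factor < cost2

# With the stock STD_FUZZ_FACTOR of 1.01, a 0.5% difference is "fuzzily
# the same", so neither path dominates on this criterion:
assert not fuzzily_cheaper(100.0, 100.5, 1.01)
assert not fuzzily_cheaper(100.5, 100.0, 1.01)

# Shrinking the fuzz factor toward 1.0 (e.g. 1.001) makes the same small
# difference decisive, which is why the plans stop flipping:
assert fuzzily_cheaper(100.0, 100.5, 1.001)
```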

I think you just go and change it.  Your number is better than what we
have, and if someone wants to suggest a better number, we can change it
later.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: COPY TO (FREEZE)?

2023-10-30 Thread Bruce Momjian
On Mon, Oct 30, 2023 at 03:16:58PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 29 Oct 2023 15:35:02 -0400, Bruce Momjian  wrote in 
> > You are 100% correct.  Updated patch attached.
> 
> -  errmsg("COPY force not null only available 
> using COPY FROM")));
> +  errmsg("COPY force not null cannot be used 
> with COPY TO")));
> 
> I find the term "force not null" hard to translate, especially into
> Japanese, as its literal translation doesn't align with the entire
> message.  The most recent translation for it is the literal rendition
> of "FORCE_NOT_NULL option of COPY can only be used with COPY FROM".
> 
> In short, for translation convenience, I would prefer if "force not
> null" were "FORCE_NOT_NULL".

That is a good point.  I reviewed more of the messages and added
capitalization where appropriate, patch attached.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index d12ba96497..18ecc69c33 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -224,6 +224,7 @@ COPY { table_name [ ( COPY FREEZE on
   a partitioned table.
+  This option is only allowed in COPY FROM.
  
  
   Note that all other sessions will immediately be able to see the data
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c5d7d78645..145315debe 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -671,7 +671,7 @@ ProcessCopyOptions(ParseState *pstate,
 	if (!opts_out->csv_mode && opts_out->quote != NULL)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY quote available only in CSV mode")));
+ errmsg("COPY QUOTE available only in CSV mode")));
 
 	if (opts_out->csv_mode && strlen(opts_out->quote) != 1)
 		ereport(ERROR,
@@ -687,7 +687,7 @@ ProcessCopyOptions(ParseState *pstate,
 	if (!opts_out->csv_mode && opts_out->escape != NULL)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY escape available only in CSV mode")));
+ errmsg("COPY ESCAPE available only in CSV mode")));
 
 	if (opts_out->csv_mode && strlen(opts_out->escape) != 1)
 		ereport(ERROR,
@@ -698,46 +698,52 @@ ProcessCopyOptions(ParseState *pstate,
 	if (!opts_out->csv_mode && (opts_out->force_quote || opts_out->force_quote_all))
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force quote available only in CSV mode")));
+ errmsg("COPY FORCE_QUOTE available only in CSV mode")));
 	if ((opts_out->force_quote || opts_out->force_quote_all) && is_from)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force quote only available using COPY TO")));
+ errmsg("COPY FORCE_QUOTE cannot be used with COPY FROM")));
 
 	/* Check force_notnull */
 	if (!opts_out->csv_mode && opts_out->force_notnull != NIL)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force not null available only in CSV mode")));
+ errmsg("COPY FORCE_NOT_NULL available only in CSV mode")));
 	if (opts_out->force_notnull != NIL && !is_from)
 		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force not null only available using COPY FROM")));
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY FORCE_NOT_NULL cannot be used with COPY TO")));
 
 	/* Check force_null */
 	if (!opts_out->csv_mode && opts_out->force_null != NIL)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force null available only in CSV mode")));
+ errmsg("COPY FORCE_NULL available only in CSV mode")));
 
 	if (opts_out->force_null != NIL && !is_from)
 		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY force null only available using COPY FROM")));
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY FORCE_NULL cannot be used with COPY TO")));
 
 	/* Don't allow the delimiter to appear in the null string. */
 	if (strchr(opts_out->null_print, opts_out->delim[0]) != NULL)
 		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY delimiter must not appear in the NULL specification")));
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("COPY delimiter character must not appear in the NULL specification")));
 
 	/* Don't allow the CSV quote char to appear in the null string. */
 	if (opts_out->csv_mode &&
 		strchr(opts_out->null_print, opts_out->quote[0]) != NULL)
 		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("CSV quote character must not appear in the NULL specification")));
 
+	/* Check freeze */
+	if (opts_out->freeze && !is_from)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+

Including Doxyfile and Meson script for docs into main source tree

2023-10-30 Thread Bohdan Mart

Hello, Stefan!

What do you think about this? See quoted message from John Morris.

P.S. We would also like you to provide the currently used Doxyfile, as the
Postgres Doxygen home page no longer has a link to download it.


Regards,
Bohdan.

On 14.10.23 21:54, John Morris wrote:

Thank you for trying the patch out and commenting on it.

I'm starting to think of it as a project. Here's a quick project statement.

The purpose is to generate improved Doxygen output while making maximal
use of how Postgres currently does program comments.


Thinking in terms of specific steps, and adding to what you have mentioned:

  * Configure Doxyfile so it produces output similar to previous output.
  * Only build Doxygen output if requested
  * Don't compile the filter or configure the Doxyfile if they aren't
needed
  * Include contrib in the sources to document
  * Provide options for other (non-html) output. (Which ones?)
  * Update Postgres documentation to include instructions for creating
Doxygen output.
  * Mention it in developer guidelines and provide sample code showing a
"desired" commenting style.

Does that list seem complete? I don't want people to think we're 
imposing new standards or legislating new commenting styles.  It's more 
a matter of describing what we currently do, maybe with some minor 
suggestions for improving.


John Morris






Re: Explicitly skip TAP tests under Meson if disabled

2023-10-30 Thread Aleksander Alekseev
Hi,

> Under Meson, it is not very easy to see if TAP tests have been enabled
> or disabled, if you rely on the default auto setting.  You either need
> to carefully study the meson setup output, or you notice, what a minute,
> didn't there use to be like 250 tests, not only 80?
>
> I think it would be better if we still registered the TAP tests in Meson
> even if the tap_tests option is disabled, but with a dummy command that
> registers them as skipped.  That way you get a more informative output like
>
> Ok: 78
> Expected Fail:  0
> Fail:   0
> Unexpected Pass:0
> Skipped:187
> Timeout:0
>
> which is really a more accurate representation of what the test run
> actually accomplished than "everything Ok".
>
> See attached patch for a possible implementation.  (This uses perl as a
> hard build requirement.  We are planning to do that anyway, but
> obviously other implementations, such as using python, would also be
> possible.)

I tested the patch and it works as intended.

Personally I like the change. It makes the output more explicit. In my
use cases, skipping the TAP tests is typically not something I want, so
I would appreciate being warned with a long list of bright yellow
"SKIP" messages. And if I really do want to skip the TAP tests, these
messages are just informative and don't bother me.

+1

-- 
Best regards,
Aleksander Alekseev




Re: Is this a problem in GenericXLogFinish()?

2023-10-30 Thread Robert Haas
On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila  wrote:
> > Hmm.  So my question is: do we need the cleanup lock on the write
> > buffer even if there are no tuples, and even if primary bucket and the
> > write bucket are the same?
>
> Yes, we need it to exclude any concurrent in-progress scans that could
> return incorrect tuples during bucket squeeze operation.

Amit, thanks for weighing in, but I'm not convinced. I thought we only
ever used a cleanup lock on the main bucket page to guard against
concurrent scans. Here you seem to be saying that we need a cleanup
lock on some page that may be an overflow page somewhere in the middle
of the chain, and that doesn't seem right to me.

So ... are you sure? If yes, can you provide any more detailed justification?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC, WIP: OR-clause support for indexes

2023-10-30 Thread Robert Haas
On Thu, Oct 26, 2023 at 5:05 PM Peter Geoghegan  wrote:
> On Thu, Oct 26, 2023 at 12:59 PM Robert Haas  wrote:
> > Alexander's example seems to show that it's not that simple. If I'm
> > reading his example correctly, with things like aid = 1, the
> > transformation usually wins even if the number of things in the OR
> > expression is large, but with things like aid + 1 * bid = 1, the
> > transformation seems to lose at least with larger numbers of items. So
> > it's not JUST the number of OR elements but also what they contain,
> > unless I'm misunderstanding his point.
>
> Alexander said "Generally, I don't see why ANY could be executed
> slower than the equivalent OR clause". I understood that this was his
> way of expressing the following idea:
>
> "In principle, there is no reason to expect execution of ANY() to be
> slower than execution of an equivalent OR clause (except for
> noise-level differences). While it might not actually look that way
> for every single type of plan you can imagine right now, that doesn't
> argue for making a cost-based decision. It actually argues for fixing
> the underlying issue, which can't possibly be due to some kind of
> fundamental advantage enjoyed by expression evaluation with ORs".
>
> This is also what I think of all this.

I agree with that, with some caveats, mainly that the reverse is to
some extent also true. Maybe not completely, because arguably the
ANY() formulation should just be straight-up easier to deal with, but
in principle, the two are equivalent and it shouldn't matter which
representation we pick.

But practically, it may, and we need to be sure that we don't put in
place a translation that is theoretically a win but in practice leads
to large regressions. Avoiding regressions here is more important than
capturing all the possible gains. A patch that wins in some scenarios
and does nothing in others can be committed; a patch that wins in even
more scenarios but causes serious regressions in some cases probably
can't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: DRAFT GIST support for ORDER BY

2023-10-30 Thread Michał Kłeczek



> On 30 Oct 2023, at 13:31, Matthias van de Meent 
>  wrote:
> 
>> The solution is not ideal as it requires registering “<“ and “>” operators 
>> as ordering operators in opfamily
>> (which in turn makes it possible to issue somewhat meaningless “ORDER BY a < 
>> ‘constant’)
> 
> I don't quite understand why we need to register new "<" and ">"
> operators. Can't we update the current ones?

I wasn’t precise: what is needed is adding pg_amop entries with amoppurpose = 
‘o’ for existing “<" and “>" operators.

> 
>> The problem is though that right now handling of ORDER BY column clauses is 
>> tightly coupled to BTree.
>> It would be good to refactor the code so that semantics of ORDER BY column 
>> could be more flexible.
> 
> The existence of a BTREE operator class for the type is the indicator
> that (and how) the type can be ordered - that is where PostgreSQL gets
> its methods for ordering most types. Although I agree that it's a
> quirk, I don't mind it that much as an indicator of how a type is
> ordered.
> I do agree, though, that operator classes by themselves should be able
> to say "hey, we support full ordered retrieval as well". Right now,
> that seems to be limited to btrees, but indeed a GiST index with
> btree_gist columns should be able to support the same.

Right now opfamily and strategy are set in PathKey before creating index scan 
paths.

The patch actually copies existing code from create_indexscan_plan
that finds an operator OID for (pk_opfamily, pk_strategy).
The operator is supposed to be binary with specific operand types.

To create a path:
1) do the operator OID lookup as above
2) look up the sort family of the pg_amop entry for (operator oid, index opfamily).
If the sort family is the same as pk_opfamily, we can create a path.

The side effect is that it is possible to “ORDER BY column < ‘constant’” as we 
have more ordering operators in pg_amop.

Ideally we could look up a _unary_ operator in pg_amop instead - that would make 
sense, as we are actually measuring some “absolute distance”.
But this would require more changes - createplan.c would need to decide when to 
look up a unary operator and when a binary one.


>> It would be great if someone could take a look at it.
> 
> I've not looked in detail at the patch, but here's some comments:
> 
>> --- a/contrib/btree_gist/btree_gist--1.6--1.7.sql
>> +++ b/contrib/btree_gist/btree_gist--1.6--1.7.sql
> 
> You seem to be modifying an existing migration of a released version
> of the btree_gist extension. I suggest you instead add a migration
> from 1.7 to a new version 1.8, and update the control file's default
> installed version.

Thanks. I didn’t know how to register a new migration so did it that way.
Will try to fix that.

> 
>> ORDER BY a == ORDER BY a <-> MIN_VALUE
>> and
>> ORDER BY a DESC == ORDER BY a <-> MAX_VALUE
>> 
>> This allows implementing GIST ordered scans for btree_gist datatypes.
>> 
>> This in turn makes using GIST with partitioning feasible (I have described 
>> issues with such usage in my previous e-mails - see below).
> 
> Did you take into account that GiST's internal distance function uses
> floating point, and is thus only an approximation for values that
> require more than 2^54 significant bits in their distance function?
> For example, GiST wouldn't be guaranteed to yield correct ordering of
> int8/bigint when you use `my_column <-> UINT64_MAX` because as far as
> the floating point math is concerned, 0 is about as far away from
> INT64_MAX as (say) 20 and -21.

Hmm… Good point, but it means ORDER BY <-> is broken for these types then?
The patch assumes it works correctly and just uses it for ordered scans.

—
Michal



Re: Parallel query behaving different with custom GUCs

2023-10-30 Thread Robert Haas
On Thu, Oct 26, 2023 at 3:10 AM Rushabh Lathia  wrote:
> -- RESET the custom GUC and rerun the query.
> postgres=> reset myapp.blah;
> RESET
>
> -- Query should still run, but with forcing parallel plan, throwing an error.
> postgres=> select count(*) from ptest2 where current_setting('myapp.blah') is 
> null;
> ERROR:  unrecognized configuration parameter "myapp.blah"
> CONTEXT:  parallel worker
>
> -- Disable the parallel plan and query just runs fine.
> postgres=#set max_parallel_workers_per_gather to 0;
> SET
> postgres=#select count(*) from ptest2 where current_setting('myapp.blah') is 
> null;
>  count
> ---
>  0
> (1 row)
>
> We might need another placeholder for the custom GUCs. Currently, we are
> maintaining 3 linked lists in guc.c - guc_nondef_list, guc_stack_list,
> guc_report_list and to fix the above issue either we need a 4th list or do
> changes in the existing list.

I discussed this a bit with Rushabh off-list before he posted, and was
hoping someone else would reply, but since no one has:

Formally, I think this is a bug. However, the practical impact of it
is fairly low, because you have to be using custom GUCs in your query
and you have to RESET them instead of using SET to put them back to
the default value, which I'm guessing is something that not a lot of
people do. I'm a bit concerned that adding the necessary tracking
could be expensive, and I'm not sure we want to slow down things in
normal cases to cater to this somewhat strange case. On the other
hand, maybe we can fix it without too much cost, in which case that
would be good to do.

I'm also alert to my own possible bias. Perhaps since I designed this
mechanism, I'm more prone to viewing its deficiencies as minor than a
neutral observer would be. So if anyone is sitting there reading this
and thinking "wow, I can't believe Robert doesn't think it's important
to fix this," feel free to write back and say so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: "38.10.10. Shared Memory and LWLocks" may require a clarification

2023-10-30 Thread Aleksander Alekseev
Hi,

> PFA patch v4.

I didn't like the commit message. Here is the corrected patch. Sorry
for the noise.

-- 
Best regards,
Aleksander Alekseev


v5-0001-Clarify-the-38.10.10.-Shared-Memory-and-LWLocks-s.patch
Description: Binary data


Re: "38.10.10. Shared Memory and LWLocks" may require a clarification

2023-10-30 Thread Aleksander Alekseev
Michael,

On Fri, Oct 27, 2023 at 9:52 AM Michael Paquier  wrote:
>
> On Tue, May 23, 2023 at 01:47:52PM +0300, Aleksander Alekseev wrote:
> > That's me still talking to myself :)
>
> Let's be two then.

Many thanks for your feedback.

> +
> +It is convenient to use shmem_startup_hook which 
> allows
> +placing all the code responsible for initializing shared memory in one 
> place.
> +When using shmem_startup_hook the extension still 
> needs
> +to acquire AddinShmemInitLock in order to work 
> properly
> +on all the supported platforms including Windows.
>
> Yeah, AddinShmemInitLock is useful because extensions have no base
> point outside that, and they may want to update their local variables.
> Still, this is not completely exact because EXEC_BACKEND on
> non-Windows platform would still need it, so this should be mentioned.
> Another thing is that extensions may do like autoprewarm.c, where
> the shmem area is not initialized in the startup shmem hook.  This is
> a bit cheating because this is a scenario where shmem_request_hook is
> not requested, so shmem needs overallocation, but you'd also need a
> LWLock in this case, even for non-WIN32.

Got it. Let's simply remove the "including Windows" part then.

>
> +on all the supported platforms including Windows. This is also the reason
> +why the return value of GetNamedLWLockTranche is
> +conventionally stored in shared memory instead of local process memory.
> +
>
> Not sure to follow this sentence, the result of GetNamedLWLockTranche
> is the lock, so this sentence does not seem really necessary?

To be honest, by now I don't remember what was meant here, so I
removed the sentence.

> While we're on it, why not make this part of the documentation more
> modern?  We don't mention LWLockNewTrancheId() and
> LWLockRegisterTranche() at all, so do you think that it would be worth
> adding a sample of code with that, mentioning autoprewarm.c as
> example?

Agree, these functions deserve to be mentioned in this section.

PFA patch v4.

-- 
Best regards,
Aleksander Alekseev


v4-0001-Clarify-the-38.10.10.-Shared-Memory-and-LWLocks-s.patch
Description: Binary data


Re: DRAFT GIST support for ORDER BY

2023-10-30 Thread Matthias van de Meent
On Mon, 30 Oct 2023 at 09:04, Michał Kłeczek  wrote:
>
> Hi All,
>
> Attached is a first attempt to implement GIST index (only) scans for ORDER BY 
> column clauses.

Cool!

> The solution is not ideal as it requires registering “<“ and “>” operators as 
> ordering operators in opfamily
> (which in turn makes it possible to issue somewhat meaningless “ORDER BY a < 
> ‘constant’)

I don't quite understand why we need to register new "<" and ">"
operators. Can't we update the current ones?

> The problem is though that right now handling of ORDER BY column clauses is 
> tightly coupled to BTree.
> It would be good to refactor the code so that semantics of ORDER BY column 
> could be more flexible.

The existence of a BTREE operator class for the type is the indicator
that (and how) the type can be ordered - that is where PostgreSQL gets
its methods for ordering most types. Although I agree that it's a
quirk, I don't mind it that much as an indicator of how a type is
ordered.
I do agree, though, that operator classes by themselves should be able
to say "hey, we support full ordered retrieval as well". Right now,
that seems to be limited to btrees, but indeed a GiST index with
btree_gist columns should be able to support the same.

> It would be great if someone could take a look at it.

I've not looked in detail at the patch, but here's some comments:

> --- a/contrib/btree_gist/btree_gist--1.6--1.7.sql
> +++ b/contrib/btree_gist/btree_gist--1.6--1.7.sql

You seem to be modifying an existing migration of a released version
of the btree_gist extension. I suggest you instead add a migration
from 1.7 to a new version 1.8, and update the control file's default
installed version.

> ORDER BY a == ORDER BY a <-> MIN_VALUE
> and
> ORDER BY a DESC == ORDER BY a <-> MAX_VALUE
>
> This allows implementing GIST ordered scans for btree_gist datatypes.
>
> This in turn makes using GIST with partitioning feasible (I have described 
> issues with such usage in my previous e-mails - see below).

Did you take into account that GiST's internal distance function uses
floating point, and is thus only an approximation for values that
require more than 2^54 significant bits in their distance function?
For example, GiST wouldn't be guaranteed to yield correct ordering of
int8/bigint when you use `my_column <-> UINT64_MAX` because as far as
the floating point math is concerned, 0 is about as far away from
INT64_MAX as (say) 20 and -21.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: pg_resetwal tests, logging, and docs update

2023-10-30 Thread Aleksander Alekseev
Hi,

> Here are updated versions of the remaining patches.  I took out the
> "FIXME" notes about the multipliers applying to the -c option and
> replaced them by gentler comments.  I don't intend to resolve those
> issues here.

The patch LGTM. However, the postgresql:pg_resetwal test suite doesn't
pass on Windows according to cfbot. It seems to be a matter of picking a
more generic regular expression:

```
at C:/cirrus/src/bin/pg_resetwal/t/001_basic.pl line 54.
'pg_resetwal: error: could not change directory to
"foo": No such file or directory
  doesn't match '(?^:error: could not read permissions of directory)'
```

Should we simply use something like:

```
qr/error: could not (read|change).* directory/
```

... instead?

-- 
Best regards,
Aleksander Alekseev




Re: Open a streamed block for transactional messages during decoding

2023-10-30 Thread Amit Kapila
On Mon, Oct 30, 2023 at 2:17 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, October 30, 2023 12:20 PM Amit Kapila  
> wrote:
> >
> > On Thu, Oct 26, 2023 at 2:01 PM Zhijie Hou (Fujitsu) 
> > 
> > wrote:
> > >
> > > On Thursday, October 26, 2023 12:42 PM Amit Kapila
> >  wrote:
> > > >
> > > > On Tue, Oct 24, 2023 at 5:27 PM Zhijie Hou (Fujitsu)
> > > > 
> > > > wrote:
> > > > >
> > > > > While reviewing the test_decoding code, I noticed that when
> > > > > skip_empty_xacts option is specified, it doesn't open the
> > > > > streaming
> > > > block( e.g.
> > > > > pg_output_stream_start) before streaming the transactional MESSAGE
> > > > > even if it's the first change in a streaming block.
> > > > >
> > > > > It looks inconsistent with what we do when streaming DML changes(e.g.
> > > > > pg_decode_stream_change()).
> > > > >
> > > > > Here is a small patch to open the stream block in this case.
> > > > >
> > > >
> > > > The change looks good to me though I haven't tested it yet. BTW, can
> > > > we change the comment: "Output stream start if we haven't yet, but
> > > > only for the transactional case." to "Output stream start if we
> > > > haven't yet for transactional messages"?
> > >
> > > Thanks for the review and I changed this as suggested.
> > >
> >
> > --- a/contrib/test_decoding/expected/stream.out
> > +++ b/contrib/test_decoding/expected/stream.out
> > @@ -29,7 +29,10 @@ COMMIT;
> >  SELECT data FROM pg_logical_slot_get_changes('regression_slot',
> > NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', 
> > '1');
> > data
> >  --
> > + opening a streamed block for transaction
> >   streaming message: transactional: 1 prefix: test, sz: 50
> > + closing a streamed block for transaction aborting streamed
> > + (sub)transaction
> >
> > I was analyzing the reason for the additional message: "aborting streamed
> > (sub)transaction" in the above test and it seems to be due to the below 
> > check in
> > the function pg_decode_stream_abort():
> >
> > if (data->skip_empty_xacts && !xact_wrote_changes) return;
> >
> > Before the patch, we won't be setting the 'xact_wrote_changes' flag in 
> > txndata
> > which is fixed now. So, this looks okay to me. However, I have another
> > observation in this code which is that for aborts or subtransactions, we 
> > are not
> > checking the flag 'stream_wrote_changes', so we may end up emitting the
> > abort message even when no actual change has been streamed. I haven't tried
> > to generate a test to verify this observation, so I could be wrong as well 
> > but it is
> > worth analyzing such cases.
>
> I have confirmed that the mentioned case is possible(steps[1]): the
> sub-transaction doesn't output any data, but the stream abort for this
> sub-transaction will still be sent.
>
> But I think this may not be problematic behavior, as even pgoutput can
> behave similarly, e.g. if all the changes are filtered out by a row filter or
> table filter, then the stream abort will still be sent. The subscriber will
> skip handling the STREAM ABORT if the aborted txn was not applied.
>
> And if we want to fix this, in the output plugin we need to record whether we
> have sent any changes for each sub-transaction so that we can decide whether
> to send the following stream abort or not. We cannot use
> 'stream_wrote_changes' because it's a per-streamed-block flag and there could
> be several streamed blocks for one sub-txn. It looks a bit complicated to me.
>

I agree with your analysis. So, pushed the existing patch. BTW, sorry,
by mistake I used Peter's name as author.

-- 
With Regards,
Amit Kapila.




Not deleted mentions of the cleared path

2023-10-30 Thread Alena Rybakina

Hi, hackers!

I have already written about the problem of InvalidPath [0] appearing. I 
investigated this and found an error in the add_path() function: when we 
reject a path, we free the path's memory but do not delete the various 
mentions of it (for example, in the parent relation, as in the example 
below).


Thus, we may face the problem of accessing the freed memory.

I demonstrated this below using gdb when I call a query after running a 
test in test/regression/sql/create_misc.sql:


Query:

-- That ALTER TABLE should have added TOAST tables.

SELECT relname, reltoastrelid <> 0 AS has_toast_table
   FROM pg_class
   WHERE oid::regclass IN ('a_star', 'c_star')
   ORDER BY 1;

--UPDATE b_star*
--   SET a = text 'gazpacho'
--   WHERE aa > 4;

SELECT class, aa, a FROM a_star*;


gdb:

0x7ff3f7325fda in epoll_wait (epfd=5, events=0x55bf9ee75c38, 
maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30

30  ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) b /home/alena/postgrespro_3/src/backend/optimizer/util/pathnode.c:620
Breakpoint 1 at 0x55bf9cfe4c65: file pathnode.c, line 621.
(gdb) c
Continuing.

Breakpoint 1, add_path (parent_rel=0x55bf9ef7f5c0, 
new_path=0x55bf9ef7f4e0) at pathnode.c:621

621 if (!IsA(new_path, IndexPath))
(gdb) n
622   pfree(new_path);
(gdb) n
624 }
(gdb) p *new_path
$1 = {type = T_Invalid, pathtype = T_Invalid, parent = 
0x7f7f7f7f7f7f7f7f, pathtarget = 0x7f7f7f7f7f7f7f7f,
  param_info = 0x7f7f7f7f7f7f7f7f, parallel_aware = 127, parallel_safe 
= 127, parallel_workers = 2139062143,
  rows = 1.3824172084878715e+306, startup_cost = 
1.3824172084878715e+306, total_cost = 1.3824172084878715e+306,

  pathkeys = 0x7f7f7f7f7f7f7f7f}
(gdb) p new_path
$2 = (Path *) 0x55bf9ef7f4e0

(gdb) p ((ProjectionPath *)((SortPath*)parent_rel->pathlist->elements 
[0].ptr_value)->subpath)->path->parent->cheapest_startup_path

$20 = (struct Path *) 0x55bf9ef7f4e0

(gdb) p *((ProjectionPath *)((SortPath*)parent_rel->pathlist->elements 
[0].ptr_value)->subpath)->path->parent->cheapest_startup_path
$17 = {type = T_Invalid, pathtype = T_Invalid, parent = 
0x7f7f7f7f7f7f7f7f, pathtarget = 0x7f7f7f7f7f7f7f7f,
  param_info = 0x7f7f7f7f7f7f7f7f, parallel_aware = 127, parallel_safe 
= 127, parallel_workers = 2139062143,
  rows = 1.3824172084878715e+306, startup_cost = 
1.3824172084878715e+306, total_cost = 1.3824172084878715e+306,

  pathkeys = 0x7f7f7f7f7f7f7f7f}

(gdb) p (Path*)(((ProjectionPath 
*)((SortPath*)parent_rel->pathlist->elements 
[0].ptr_value)->subpath)->path->parent->pathlist->elements[1].ptr_value)

$21 = (Path *) 0x55bf9ef7f4e0

(gdb) p *(Path*)(((ProjectionPath 
*)((SortPath*)parent_rel->pathlist->elements 
[0].ptr_value)->subpath)->path->parent->pathlist->elements[1].ptr_value)
$19 = {type = T_Invalid, pathtype = T_Invalid, parent = 
0x7f7f7f7f7f7f7f7f, pathtarget = 0x7f7f7f7f7f7f7f7f,
  param_info = 0x7f7f7f7f7f7f7f7f, parallel_aware = 127, parallel_safe 
= 127, parallel_workers = 2139062143,
  rows = 1.3824172084878715e+306, startup_cost = 
1.3824172084878715e+306, total_cost = 1.3824172084878715e+306,

  pathkeys = 0x7f7f7f7f7f7f7f7f}
(gdb)


The same problem may exist in the add_partial_path() function.

Unfortunately, I have not yet been able to find a problematic query with 
the described case, but I have prepared a patch to fix this problem.


What do you think?

0. 
https://www.postgresql.org/message-id/flat/01aa76cc-93e1-4de9-ab34-b3a890bf7...@postgrespro.ru


--
Regards,
Alena Rybakina
Postgres Professional
From b9483800eb9d8dc9bafb114c602c63d5cf596ee7 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Mon, 30 Oct 2023 14:22:21 +0300
Subject: [PATCH] Fix rejection of a released path: we shouldn't only free the
 path, we should also delete all mentions of it; otherwise we will face
 accesses to released memory.

---
 src/backend/optimizer/util/pathnode.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0b1d17b9d33..4df9d34c9f1 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -619,7 +619,19 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 	{
 		/* Reject and recycle the new path */
 		if (!IsA(new_path, IndexPath))
-			pfree(new_path);
+		{
+			if (new_path == new_path->parent->cheapest_startup_path)
+				new_path->parent->cheapest_startup_path = NULL;
+			if (new_path == new_path->parent->cheapest_total_path)
+				new_path->parent->cheapest_total_path = NULL;
+			foreach(p1, new_path->parent->pathlist)
+			{
+				Path	   *path = (Path *) lfirst(p1);
+
+				if (path == new_path)
+					new_path->parent->pathlist = foreach_delete_current(new_path->parent->pathlist, p1);
+			}
+		}
 	}
 }
 
@@ -849,7 +861,17 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	else
 	{
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		

Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-10-30 Thread Amit Kapila
On Fri, Oct 27, 2023 at 3:23 AM David Rowley  wrote:
>
> On Thu, 26 Oct 2023 at 17:00, David Rowley  wrote:
> > Thanks for looking at this again. I fixed up each of those and pushed
> > the result, mentioning the incompatibility in the commit message.
> >
> > Now that that's done, I've attached a patch which makes use of the new
> > initReadOnlyStringInfo initializer function for the original case
> > mentioned when I opened this thread. I don't think there are any
> > remaining objections to this, but I'll let it sit for a bit to see.
>
> I've just pushed the deserial function optimisation patch.
>
> I was just looking at a few other places where we might want to make
> use of initReadOnlyStringInfo.
>
> * parallel.c in HandleParallelMessages():
>
> Drilling into HandleParallelMessage(), I see the PqMsg_BackendKeyData
> case just reads a fixed number of bytes.  In some of the other
> "switch" cases, I see calls pq_getmsgrawstring() either directly or
> indirectly.  I see the counterpart to pq_getmsgrawstring() is
> pq_sendstring() which always appends the NUL char to the StringInfo,
> so I don't think not NUL terminating the received bytes is a problem
> as cstrings seem to be sent with the NUL terminator.
>
> This case just seems to handle ERROR/NOTICE messages coming from
> parallel workers. Not tuples themselves. It may not be that
> interesting a case to speed up.
>
> * applyparallelworker.c in HandleParallelApplyMessages():
>
> Drilling into HandleParallelApplyMessage(), I don't see anything there
> that needs the input StringInfo to be NUL terminated.
>

Both the above calls are used to handle ERROR/NOTICE messages from
parallel workers as you have also noticed. The comment atop
initReadOnlyStringInfo() clearly states that it is used in the
performance-critical path. So, is it worth changing these places? In
the future, this may pose the risk of this API being used
inconsistently.

-- 
With Regards,
Amit Kapila.




Re: Why is DEFAULT_FDW_TUPLE_COST so insanely low?

2023-10-30 Thread Ashutosh Bapat
Looks like the value goes back a long way, to
d0d75c402217421b691050857eb3d7af82d0c770. The comment there adds
"above and beyond cpu_tuple_cost", so it's certainly expected to be
higher than cpu_tuple_cost. I have no memories of this. But looking at
the surrounding code, I think DEFAULT_FDW_STARTUP_COST takes care of
network costs and bandwidth, so DEFAULT_FDW_TUPLE_COST is just the cost
of assembling a row from bytes on the network. That might have been
equated to assembling a row from a heap buffer.

But I think you are right: it should be comparable to the parallel
tuple cost, which at least involves IPC over something socket-like.
This will also mean that operations which reduce the number of rows
will be favoured and pushed down. That's what is desired.

-- 
Best Wishes,
Ashutosh Bapat

On Mon, Oct 30, 2023 at 6:52 AM David Rowley  wrote:
>
> On Sun, 29 Oct 2023 at 12:45, Bruce Momjian  wrote:
> > Has anything been done about this issue?
>
> Nothing has been done. I was hoping to get the attention of a few
> people who have dealt more with postgres_fdw in the past.
>
> I've attached a patch with adjusts DEFAULT_FDW_TUPLE_COST and sets it
> to 0.2.  I set it to this because my experiment in [1] showed that it
> was about 21x lower than the actual costs (on my machine with a
> loopback fdw connecting to the same instance and database using my
> example query).  Given that we have parallel_tuple_cost set to 0.1 by
> default, the network cost of a tuple from an FDW of 0.2 seems
> reasonable to me. Slightly higher is probably also reasonable, but
> given the seeming lack of complaints, I think I'd rather err on the
> low side.
>
> Changing it to 0.2, I see 4 plans change in postgres_fdw's regression
> tests.  All of these changes are due to STD_FUZZ_FACTOR causing some
> other plan to win in add_path().
>
> For example the query EXPLAIN (VERBOSE, ANALYZE) SELECT a, sum(b),
> min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY
> 1; the plan switches from a HashAggregate to a GroupAggregate. This is
> because after increasing the DEFAULT_FDW_TUPLE_COST to 0.2 the sorted
> append child (fuzzily) costs the same as the unsorted seq scan path
> and the sorted path wins in add_path due to having better pathkeys.
> The seq scan path is then thrown away and we end up doing the Group
> Aggregate using the sorted append children.
>
> If I change STD_FUZZ_FACTOR to something like 1.001 then the plans
> no longer change when I do:
>
> alter server loopback options (add fdw_tuple_cost '0.01');
> 
> alter server loopback options (drop fdw_tuple_cost);
> 
>
> Ordinarily, I'd not care too much about that, but I did test the
> performance of one of the plans and the new plan came out slower than
> the old one.
>
> I'm not exactly sure how best to proceed on this in the absence of any 
> feedback.
>
> David
>
> [1] 
> https://postgr.es/m/caaphdvopvjjfh5c1ed2hrvddfom2depmwwiu5-f1anmyprj...@mail.gmail.com




Re: Is this a problem in GenericXLogFinish()?

2023-10-30 Thread Amit Kapila
On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier  wrote:
>
> On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:
> > Yes, we need it to exclude any concurrent in-progress scans that could
> > return incorrect tuples during bucket squeeze operation.
>
> Thanks.  So I assume that we should just set REGBUF_NO_CHANGE when the
> primary and write buffers are the same and there are no tuples to
> move.  Say with something like the attached?
>

What if the primary and write buffer are not the same but ntups is
zero? Won't we hit the assertion again in that case?

-- 
With Regards,
Amit Kapila.




Explicitly skip TAP tests under Meson if disabled

2023-10-30 Thread Peter Eisentraut
Under Meson, it is not very easy to see if TAP tests have been enabled 
or disabled, if you rely on the default auto setting.  You either need 
to carefully study the meson setup output, or you notice, wait a minute, 
didn't there use to be like 250 tests, not only 80?


I think it would be better if we still registered the TAP tests in Meson 
even if the tap_tests option is disabled, but with a dummy command that 
registers them as skipped.  That way you get a more informative output like


Ok: 78
Expected Fail:  0
Fail:   0
Unexpected Pass:0
Skipped:187
Timeout:0

which is really a more accurate representation of what the test run 
actually accomplished than "everything Ok".


See attached patch for a possible implementation.  (This uses perl as a 
hard build requirement.  We are planning to do that anyway, but 
obviously other implementations, such as using python, would also be 
possible.)From 19e78e4c5a16337c0ac4e661beb4729505736016 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 30 Oct 2023 05:32:45 -0400
Subject: [PATCH] Explicitly skip TAP tests under Meson if disabled

If the tap_tests option is disabled under Meson, the TAP tests are
currently not registered at all.  But this makes it harder to see what
is going on, why suddenly there are fewer tests than before.

Instead, register the tests anyway but with a dummy command that marks
them as skipped.  That way, the total list and count of tests is
constant whether the option is enabled or not.
---
 meson.build | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/meson.build b/meson.build
index 2d516c8f372..1795dd8159e 100644
--- a/meson.build
+++ b/meson.build
@@ -3248,10 +3248,6 @@ foreach test_dir : tests
 
   testport += 1
 elif kind == 'tap'
-  if not tap_tests_enabled
-continue
-  endif
-
   test_command = [
 perl.path(),
 '-I', meson.source_root() / 'src/test/perl',
@@ -3286,16 +3282,23 @@ foreach test_dir : tests
   onetap_p = fs.stem(onetap_p)
 endif
 
-test(test_dir['name'] / onetap_p,
-  python,
-  kwargs: test_kwargs,
-  args: testwrap_base + [
-'--testgroup', test_dir['name'],
-'--testname', onetap_p,
-'--', test_command,
-test_dir['sd'] / onetap,
-  ],
-)
+if tap_tests_enabled
+  test(test_dir['name'] / onetap_p,
+python,
+kwargs: test_kwargs,
+args: testwrap_base + [
+  '--testgroup', test_dir['name'],
+  '--testname', onetap_p,
+  '--', test_command,
+  test_dir['sd'] / onetap,
+],
+  )
+else
+  test(test_dir['name'] / onetap_p,
+   perl,
+   args: ['-e', 'print "1..0 # Skipped: TAP tests not enabled"'],
+   kwargs: test_kwargs)
+endif
   endforeach
   install_suites += test_group
 else
-- 
2.42.0



Re: pg_upgrade and logical replication

2023-10-30 Thread vignesh C
On Fri, 27 Oct 2023 at 12:09, vignesh C  wrote:
>
> On Thu, 21 Sept 2023 at 11:27, Michael Paquier  wrote:
> >
> > On Fri, Sep 15, 2023 at 03:08:21PM +0530, vignesh C wrote:
> > > On Tue, 12 Sept 2023 at 14:25, Hayato Kuroda (Fujitsu)
> > >  wrote:
> > >> Is there a possibility that the apply worker on the old cluster connects
> > >> to the publisher during the upgrade? Regarding pg_upgrade on the
> > >> publisher, we refuse TCP/IP connections from remotes and the port number
> > >> is also changed, so we can assume that the subscriber does not connect to
> > >> it. But IIUC such settings may not affect the connection source, so the
> > >> apply worker may try to connect to the publisher. Also, are there any
> > >> hazards if it happens?
> > >
> > > Yes, there is a possibility that the apply worker gets started and new
> > > transaction data is being synced from the publisher. I have made a fix
> > > not to start the launcher process in binary upgrade mode as we don't
> > > want the launcher to start apply worker during upgrade.
> >
> > Hmm.  I was wondering if 0001 is the right way to handle this case,
> > but at the end I'm OK to paint one extra isBinaryUpgrade in the code
> > path where apply launchers are registered.  I don't think that the
> > patch is complete, though.  A comment should be added in pg_upgrade's
> > server.c, exactly start_postmaster(), to tell that -b also stops apply
> > workers.  I am attaching a version updated as of the attached, that
> > I'd be OK to apply.
>
> I have added comments
>
> > I don't really think that we need to worry about a subscriber
> > connecting back to a publisher in this case, though?  I mean, each
> > postmaster instance started by pg_upgrade restricts the access to the
> > instance with unix_socket_directories set to a custom path and
> > permissions at 0700, and a subscription's connection string does not
> > know the unix path used by pg_upgrade.  I certainly agree that
> > stopping these processes could lead to inconsistencies in the data the
> > subscribers have been holding though, if we are not careful, so
> > preventing them from running is a good practice anyway.
>
> I have made the fix similar to how the publisher upgrade was done, to
> keep it consistent.
>
> > I have also reviewed 0002.  As a whole, I think that I'm OK with the
> > main approach of the patch in pg_dump to use a new type of dumpable
> > object for subscription relations that are dumped with their upgrade
> > functions after.  This still needs more work, and more documentation.
>
> Added documentation
>
> > Also, perhaps we should really have an option to control if this part
> > of the copy happens or not.  With a --no-subscription-relations for
> > pg_dump at least?
>
> Currently this is done by default in binary upgrade mode, I will add a
> separate patch to skip dump of subscription relations from upgrade and
> dump a little later.
>
> >
> > +{ oid => '4551', descr => 'add a relation with the specified relation 
> > state to pg_subscription_rel table',
> >
> > During a development cycle, any new function added needs to use an OID
> > in the range 8000-9999.  Running unused_oids will suggest new random OIDs.
>
> Modified
>
> > FWIW, I am not convinced that there is a need for two functions to add
> > an entry to pg_subscription_rel, with sole difference between both the
> > handling of a valid or invalid LSN.  We should have only one function
> > that's able to handle NULL for the LSN.  So let's remove rel_state_a
> > and rel_state_b, and have a single rel_state().  The description of
> > the SQL functions is inconsistent with the other binary upgrade ones,
> > I would suggest for the two functions
> > "for use by pg_upgrade (relation for pg_subscription_rel)"
> > "for use by pg_upgrade (remote_lsn for origin)"
>
> Removed rel_state_a and rel_state_b and updated the description accordingly
>
> > +   i_srsublsn = PQfnumber(res, "srsublsn");
> > [...]
> > +   subrinfo[cur_rel].srsublsn = pg_strdup(PQgetvalue(res, i, 
> > i_srsublsn));
> >
> > In getSubscriptionTables(), this should check for PQgetisnull()
> > because we would have a NULL value for InvalidXLogRecPtr in the
> > catalog.  Using a char* for srsublsn is OK, but just assign NULL to
> > it, then just pass a hardcoded NULL value to the function as we do in
> > other places.  So I don't quite get why this is not the same handling
> > as suboriginremotelsn.
>
> Modified
>
> >
> > getSubscriptionTables() is entirely skipped if we don't want any
> > subscriptions, if we deal with a server of 9.6 or older or if we don't
> > do binary upgrades, which is OK.
> >
> > +/*
> > + * getSubscriptionTables
> > + *   get information about subscription membership for dumpable tables.
> > + */
> > This comment is slightly misleading and should mention that this is an
> > upgrade-only path?
>
> Modified
>
> >
> > The code for dumpSubscriptionTable() is a copy-paste of
> > dumpPublicationTable(), but a lot of what you are 

Re: A recent message added to pg_upgade

2023-10-30 Thread Bharath Rupireddy
On Mon, Oct 30, 2023 at 2:31 PM Bharath Rupireddy
 wrote:
>

Never mind. Previous message was accidentally sent before I finished
writing my comments.

> Yeah. The check_hook is called even after the param is specified in
> postgresql.conf during the upgrade, so I see no problem there.
>
> > The error message, which is deemed impossible, adds an additional
> > message translation. In another thread, we are discussing the
> > reduction of translatable messages. Therefore, I suggest using elog()
> > for the condition at the very least. Whether it should be elog() or
> > Assert() remains open for discussion, as I don't have a firm stance on
> > it.
>
> I get it. I agree to go with just the assert because the GUC
> check_hook kinda tightens the screws against setting
> max_slot_wal_keep_size to a value other than -1 during the binary
> upgrade,

 A few comments on your inhibit_m_s_w_k_s_during_upgrade_2.txt:

1.
+ * All WAL files on the publisher node must be retained during an upgrade to
+ * maintain connections from downstreams.  While pg_upgrade explicitly sets

It's not just the publisher, anyone using logical slots. Also, no
downstream please. If you want, you can repurpose the comment that's
added by 29d0a77f.

/*
 * Use max_slot_wal_keep_size as -1 to prevent the WAL removal by the
 * checkpointer process.  If WALs required by logical replication slots
 * are removed, the slots are unusable.  This setting prevents the
 * invalidation of slots during the upgrade. We set this option when
 * cluster is PG17 or later because logical replication slots can only be
 * migrated since then. Besides, max_slot_wal_keep_size is added in PG13.
 */

2.
 At present, only logical slots really require
+ * this.

Can we remove the above comment as the code with SlotIsLogical(s)
explains it all?

3.
+GUC_check_errdetail("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade.");
+return false;

How about we be explicit like the following which helps users reason
about this restriction instead of them looking at the comments/docs?

GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
GUC_check_errmsg("\"max_slot_wal_keep_size\" must be set to -1 when in binary upgrade mode");
GUC_check_errdetail("A value of -1 prevents the removal of WAL required for logical slots upgrade.");
return false;

4. I think a test case to hit the error in the check hook in
003_logical_slots.pl will help a lot here - not only covers the code
but also helps demonstrate how one can reach the error.

5. I think the check_hook is better defined in xlog.c the place where
it's actually being declared and in action. IMV, there's no reason for
it to be in slot.c as it doesn't deal with any slot related
variables/structs. This also avoids an unnecessary "utils/guc_hooks.h"
inclusion in slot.c.
+bool
+check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source)
+{

6. A possible problem with this check_hook approach is that it doesn't
let anyone set max_slot_wal_keep_size to a value other than -1 during
pg_upgrade, even if they have no logical slots or don't want to upgrade
logical slots, in which case the WAL file growth during pg_upgrade may be
huge (transiently), unless the pg_resetwal step of pg_upgrade removes it
at the end.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: PGDOCS - add more links in the pub/sub reference pages

2023-10-30 Thread Amit Kapila
On Mon, Oct 16, 2023 at 6:15 AM Peter Smith  wrote:
>
> Thanks for pushing the 0001 patch! I am unsure what the decision was
> for the remaining patches, but anyway, here they are again (rebased).
>

To keep the link names the same in logical replication-related
commands, I have pushed your patch.

-- 
With Regards,
Amit Kapila.




RE: PATCH: document for regression test forgets libpq test

2023-10-30 Thread Ryo Matsumura (Fujitsu)
Hi,

> Wednesday, September 6, 2023 2:06 AM Bruce Momjian  wrote:
> Yes, good point.  I modified the patch, attached, and applied it to all
> supported versions.

Thank you.  # I forgot to send mail.

Best Regards
Ryo Matsumura

> -Original Message-
> From: Bruce Momjian 
> Sent: Wednesday, September 6, 2023 2:06 AM
> To: Matsumura, Ryo/松村 量 
> Cc: pgsql-hack...@postgresql.org
> Subject: Re: PATCH: document for regression test forgets libpq test
> 
> On Fri, Sep  1, 2023 at 08:01:47AM +, Ryo Matsumura (Fujitsu) wrote:
> > Hi,
> >
> > I found a small mistake in document in 33.1.3. Additional Test Suites.
> >
> > > The additional tests that can be invoked this way include:
> > The list doesn't include interface/libpq/test.
> >
> > I attached patch.
> 
> Yes, good point.  I modified the patch, attached, and applied it to all
> supported versions.
> 
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
> 
>   Only you can decide what is important to you.




RE: A recent message added to pg_upgade

2023-10-30 Thread Hayato Kuroda (Fujitsu)
Dear Bharath,

> Will the check_hook approach work correctly?

I tested using the first version and it worked well (the setting was
rejected). Please see the attached log, which recorded the output. The lines
below were copied from the server log and show that max_slot_wal_keep_size
must not be >= 0.

```
waiting for server to start....2023-10-30 08:53:32.529 GMT [6903] FATAL:  invalid value for parameter "max_slot_wal_keep_size": 1
 stopped waiting
pg_ctl: could not start server
```

> I haven't checked that by
> myself, but I see InitializeGUCOptions() getting called before
> IsBinaryUpgrade is set to true and the passed-in config options ('c')
> are parsed.

I thought the key point was that user-defined options are placed after the
"-b" option, so the check_hook could work as we expected. Thoughts?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

$ pg_upgrade -b /usr/local/pgsql/bin/ -B /usr/local/pgsql/bin/ -d data_N1/ -D data_N2/ -r -U postgres -o "-c max_slot_wal_keep_size=1MB"
Performing Consistency Checks
-
Checking cluster versions ok

*failure*
Consult the last few lines of "data_N2/pg_upgrade_output.d/20231030T085332.388/log/pg_upgrade_server.log" for the probable cause of the failure.

connection to server on socket "/home/hayato/test/pgdump/.s.PGSQL.50432" failed: No such file or directory
Is the server running locally and accepting connections on that socket?


could not connect to source postmaster started with the command:
"/usr/local/pgsql/bin/pg_ctl" -w -l 
"data_N2/pg_upgrade_output.d/20231030T085332.388/log/pg_upgrade_server.log" -D 
"data_N1" -o "-p 50432 -b -c max_slot_wal_keep_size=-1 -c 
max_slot_wal_keep_size=1MB -c listen_addresses='' -c 
unix_socket_permissions=0700 -c 
unix_socket_directories='/home/hayato/test/pgdump'" start
Failure, exiting

pg_upgrade_server.log
Description: pg_upgrade_server.log


Re: A recent message added to pg_upgade

2023-10-30 Thread Bharath Rupireddy
On Mon, Oct 30, 2023 at 1:42 PM Kyotaro Horiguchi
 wrote:
>
> At Mon, 30 Oct 2023 12:38:47 +0530, Bharath Rupireddy 
>  wrote in
> > Will the check_hook approach work correctly? I haven't checked that by
> > myself, but I see InitializeGUCOptions() getting called before
> > IsBinaryUpgrade is set to true and the passed-in config options ('c')
> > are parsed.
>
> I'm not sure about the wanted behavior exactly, but the fact you
> pointed out doesn't matter because the check is required after parsing
> the command line options. On the other hand, I'm not sure about the
> behavior where a setting in postgresql.conf is rejected.

Yeah. The check_hook is called even after the param is specified in
postgresql.conf during the upgrade, so I see no problem there.

> > If the check_hook approach works correctly, I think we must add a test
> > hitting the error in check_max_slot_wal_keep_size for the
> > IsBinaryUpgrade case. And, I agree with Amit to have a detailed
> > messaging with GUC_check_errmsg/GUC_check_errdetail. Also, IMV,
> > leaving the error message in InvalidatePossiblyObsoleteSlot() there
> > (if required with a better wording as discussed initially in this
> > thread) does no harm. Actually, it acts as another safety net given
> > that max_slot_wal_keep_size GUC is reloadable via SIGHUP.
>
> The error message, which is deemed impossible, adds an additional
> message translation. In another thread, we are discussing the
> reduction of translatable messages. Therefore, I suggest using elog()
> for the condition at the very least. Whether it should be elog() or
> Assert() remains open for discussion, as I don't have a firm stance on
> it.

I get it. I agree to go with just the assert because the GUC
check_hook kinda tightens the screws against setting
max_slot_wal_keep_size to a value other than -1 during the binary
upgrade,

A few comments on your inhibit_m_s_w_k_s_during_upgrade_2.txt:

1.



--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: Open a streamed block for transactional messages during decoding

2023-10-30 Thread Zhijie Hou (Fujitsu)
On Monday, October 30, 2023 12:20 PM Amit Kapila  
wrote:
> 
> On Thu, Oct 26, 2023 at 2:01 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Thursday, October 26, 2023 12:42 PM Amit Kapila
>  wrote:
> > >
> > > On Tue, Oct 24, 2023 at 5:27 PM Zhijie Hou (Fujitsu)
> > > 
> > > wrote:
> > > >
> > > > While reviewing the test_decoding code, I noticed that when
> > > > skip_empty_xacts option is specified, it doesn't open the
> > > > streaming
> > > block( e.g.
> > > > pg_output_stream_start) before streaming the transactional MESSAGE
> > > > even if it's the first change in a streaming block.
> > > >
> > > > It looks inconsistent with what we do when streaming DML changes(e.g.
> > > > pg_decode_stream_change()).
> > > >
> > > > Here is a small patch to open the stream block in this case.
> > > >
> > >
> > > The change looks good to me though I haven't tested it yet. BTW, can
> > > we change the comment: "Output stream start if we haven't yet, but
> > > only for the transactional case." to "Output stream start if we
> > > haven't yet for transactional messages"?
> >
> > Thanks for the review and I changed this as suggested.
> >
> 
> --- a/contrib/test_decoding/expected/stream.out
> +++ b/contrib/test_decoding/expected/stream.out
> @@ -29,7 +29,10 @@ COMMIT;
>  SELECT data FROM pg_logical_slot_get_changes('regression_slot',
> NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'stream-changes', 
> '1');
> data
>  --
> + opening a streamed block for transaction
>   streaming message: transactional: 1 prefix: test, sz: 50
> + closing a streamed block for transaction
> + aborting streamed (sub)transaction
> 
> I was analyzing the reason for the additional message: "aborting streamed
> (sub)transaction" in the above test and it seems to be due to the below check 
> in
> the function pg_decode_stream_abort():
> 
> if (data->skip_empty_xacts && !xact_wrote_changes) return;
> 
> Before the patch, we won't be setting the 'xact_wrote_changes' flag in txndata
> which is fixed now. So, this looks okay to me. However, I have another
> observation in this code which is that for aborts or subtransactions, we are 
> not
> checking the flag 'stream_wrote_changes', so we may end up emitting the
> abort message even when no actual change has been streamed. I haven't tried
> to generate a test to verify this observation, so I could be wrong as well 
> but it is
> worth analyzing such cases.

I have confirmed that the mentioned case is possible(steps[1]): the
sub-transaction doesn't output any data, but the stream abort for this
sub-transaction will still be sent.

But I think this may not be problematic behavior, as even pgoutput can
behave similarly, e.g. if all the changes are filtered by a row filter or
table filter, then the stream abort will still be sent. The subscriber will
skip handling the STREAM ABORT if the aborted txn was not applied.

And if we want to fix this, in the output plugin we need to record whether we
have sent any changes for each sub-transaction so that we can decide whether
to send the following stream abort or not. We cannot use
'stream_wrote_changes' because it's a per-streamed-block flag and there could
be several streamed blocks for one sub-txn. It looks a bit complicated to me.


[1]
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 
'test_decoding');
BEGIN;
savepoint p1;
CREATE TABLE test(a int);
INSERT INTO test VALUES(1);
savepoint p2;
CREATE TABLE test2(a int);
ROLLBACK TO SAVEPOINT p2;
COMMIT;

SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 
'skip-empty-xacts', '1', 'include-xids', '1', 'stream-changes', '1');

   data
--
 opening a streamed block for transaction TXN 734
 streaming change for TXN 734
 closing a streamed block for transaction TXN 734
 aborting streamed (sub)transaction TXN 736
 committing streamed transaction TXN 734

Best Regards,
Hou zj





Re: Add new option 'all' to pg_stat_reset_shared()

2023-10-30 Thread Bharath Rupireddy
On Mon, Oct 30, 2023 at 1:47 PM Kyotaro Horiguchi
 wrote:
>
> At Mon, 30 Oct 2023 16:35:19 +0900, torikoshia  
> wrote in
> > Hi,
> >
> > After 96f052613f3, we have below 6 types of parameter for
> > pg_stat_reset_shared().
> >
> >   "archiver", "bgwriter", "checkpointer", "io", "recovery_prefetch",
> >   "wal"
> >
> > How about adding a new option 'all' to delete all targets above?
> >
> > I imagine there are cases where people want to initialize all of them
> > at the same time in addition to initializing one at a time.
>
> FWIW, I fairly often wanted it, but forgot about that:p

Isn't calling pg_stat_reset_shared() for all stats types helping you
out? Is there any problem with it? Can you be more specific about the
use-case?

IMV, I don't see any point in adding another pseudo (rather non-existent)
shared stats target, which might confuse users - it's easy enough to
specify pg_stat_reset_shared('all') to clear things out when someone
actually doesn't want to reset everything; an accidental usage of the
'all' option will reset all shared memory stats.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add new option 'all' to pg_stat_reset_shared()

2023-10-30 Thread Kyotaro Horiguchi
At Mon, 30 Oct 2023 16:35:19 +0900, torikoshia  
wrote in 
> Hi,
> 
> After 96f052613f3, we have below 6 types of parameter for
> pg_stat_reset_shared().
> 
>   "archiver", "bgwriter", "checkpointer", "io", "recovery_prefetch",
>   "wal"
> 
> How about adding a new option 'all' to delete all targets above?
> 
> I imagine there are cases where people want to initialize all of them
> at the same time in addition to initializing one at a time.

FWIW, I fairly often wanted it, but forgot about that:p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A recent message added to pg_upgade

2023-10-30 Thread Kyotaro Horiguchi
At Mon, 30 Oct 2023 12:38:47 +0530, Bharath Rupireddy 
 wrote in 
> Will the check_hook approach work correctly? I haven't checked that by
> myself, but I see InitializeGUCOptions() getting called before
> IsBinaryUpgrade is set to true and the passed-in config options ('c')
> are parsed.

I'm not sure about the wanted behavior exactly, but the fact you
pointed out doesn't matter because the check is required after parsing
the command line options. On the other hand, I'm not sure about the
behavior where a setting in postgresql.conf is rejected.

> If the check_hook approach works correctly, I think we must add a test
> hitting the error in check_max_slot_wal_keep_size for the
> IsBinaryUpgrade case. And, I agree with Amit to have a detailed
> messaging with GUC_check_errmsg/GUC_check_errdetail. Also, IMV,
> leaving the error message in InvalidatePossiblyObsoleteSlot() there
> (if required with a better wording as discussed initially in this
> thread) does no harm. Actually, it acts as another safety net given
> that max_slot_wal_keep_size GUC is reloadable via SIGHUP.

The error message, which is deemed impossible, adds an additional
message translation. In another thread, we are discussing the
reduction of translatable messages. Therefore, I suggest using elog()
for the condition at the very least. Whether it should be elog() or
Assert() remains open for discussion, as I don't have a firm stance on
it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




DRAFT GIST support for ORDER BY

2023-10-30 Thread Michał Kłeczek
Hi All,

Attached is a first attempt to implement GIST index (only) scans for ORDER BY column clauses.

The idea is that ordering by a column is, for some datatypes, a special case of ordering by distance:

ORDER BY a == ORDER BY a <-> MIN_VALUE
and
ORDER BY a DESC == ORDER BY a <-> MAX_VALUE

This allows implementing GIST ordered scans for btree_gist datatypes. This in turn makes using GIST with partitioning feasible (I have described issues with such usage in my previous e-mails - see below).

The solution is not ideal, as it requires registering "<" and ">" operators as ordering operators in the opfamily (which in turn makes it possible to issue the somewhat meaningless "ORDER BY a < 'constant'").

The problem, though, is that right now the handling of ORDER BY column clauses is tightly coupled to BTree. It would be good to refactor the code so that the semantics of ORDER BY column could be more flexible.

It would be great if someone could take a look at it.

Thanks,
Michal

DRAFT-gist-orderby.patch
Description: Binary data
On 24 Oct 2023, at 13:22, Michał Kłeczek  wrote:

Hi,

Some time ago I provided some details on the issues we face when trying to use GIST and partitioning at the same time, in the postgresql-general mailing list thread "GIST index and ORDER BY" (postgresql.org).

We decided to go with the solution to partition our table by:
RANGE ('2100-01-01' <-> operation_date).

While it (somewhat) solves the partition pruning issues described above, there is another problem: it is impossible to create a unique constraint on the partitioned table. So now we cannot use INSERT … ON CONFLICT (…) DO UPDATE.

My question to hackers: would it be feasible to implement an ORDER BY column GIST index (only) scan for types with a total order and sensible greatest and least values?

Thanks,
Michal
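For readers who want to see the claimed equivalence spelled out, here is a hedged SQL sketch (it assumes btree_gist is installed and that the patch's opfamily changes are in place; the integer bounds stand in for MIN_VALUE/MAX_VALUE):

```sql
-- Hypothetical illustration of the equivalence the patch relies on,
-- assuming btree_gist plus the patched opfamily:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE t (a int);
CREATE INDEX ON t USING gist (a);

-- With the patch, these pairs should be able to use the same ordered
-- GIST scan, because for a totally ordered type:
SELECT * FROM t ORDER BY a;        -- == ORDER BY a <-> (-2147483648)
SELECT * FROM t ORDER BY a DESC;   -- == ORDER BY a <-> 2147483647
```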

Re: A performance issue with Memoize

2023-10-30 Thread Richard Guo
On Thu, Oct 26, 2023 at 12:07 PM Andrei Lepikhov 
wrote:

> Do you've thought about the case, fixed with the commit 1db5667? As I
> see, that bugfix still isn't covered by regression tests. Could your
> approach of a PARAM_EXEC slot reusing break that case?


Hm, I don't think so.  The issue fixed by commit 1db5667 was caused by
sharing PARAM_EXEC slots between different levels of NestLoop.  AFAICS
it's safe to share PARAM_EXEC slots within the same level of NestLoop.

The change here is about sharing PARAM_EXEC slots between subquery's
subplan_params and outer-relation variables, which happens within the
same level of NestLoop.

Actually, even without this change, we'd still share PARAM_EXEC slots
between subquery's subplan_params and outer-relation variables in some
cases.  As an example, consider

explain (costs off)
select * from t t1 left join
(t t2 left join
lateral (select t1.a as t1a, t2.a as t2a, * from t t3) s
on t2.b = s.b)
on t1.b = s.b and t1.a = t2.a;
  QUERY PLAN
---
 Nested Loop Left Join
   ->  Seq Scan on t t1
   ->  Nested Loop
 Join Filter: (t1.a = t2.a)
 ->  Seq Scan on t t2
 ->  Subquery Scan on s
   Filter: ((t1.b = s.b) AND (t2.b = s.b))
   ->  Seq Scan on t t3
(8 rows)

For outer-relation Var 't1.a' from qual 't1.a = t2.a', it shares
PARAM_EXEC slot 0 with the PlannerParamItem for 't1.a' within the
subquery (from its targetlist).

Did you notice a case that the change here breaks?

Hi Tom, could you share your insights on this issue and the proposed
fix?

Thanks
Richard


Re: A recent message added to pg_upgade

2023-10-30 Thread Kyotaro Horiguchi
At Mon, 30 Oct 2023 03:36:41 +, "Zhijie Hou (Fujitsu)" 
 wrote in 
> Thanks for the diff and I think the approach basically works.
> 
> One notable behavior of this approach it will reject the GUC setting even if 
> there
> are no slots on old cluster or user set the value to a big enough value which
> doesn't cause invalidation. The behavior doesn't look bad to me but just 
> mention it
> for reference.

Indeed.  pg_upgrade anyway sets the variable to -1 irrespective of the
slot's existence, and I see no justification for allowing users to
forcibly change it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: A recent message added to pg_upgade

2023-10-30 Thread Kyotaro Horiguchi
At Mon, 30 Oct 2023 08:51:18 +0530, Amit Kapila  wrote 
in 
> I think we can simply change that error message to assert if we want
> to go with the check hook idea of yours. BTW, can we add
> GUC_check_errdetail() with a better message as some of the other check
> function uses? Also, I guess we can add some comments or at least
> refer to the existing comments to explain the reason of such a check.

Definitely.  I've attached the following changes.

1. Added GUC_check_errdetail() to the check function.

2. Added a comment to the check function (based on my knowledge about
  the feature).

3. Altered the ereport() into Assert() in
   InvalidatePossiblyObsoleteSlot().  I considered removing the
   !SlotIsLogical() condition since pg_upgrade always sets
   max_slot_wal_keep_size to -1, but I left it unchanged.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 99823df3c7..f7dc966600 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -52,6 +52,7 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
+#include "utils/guc_hooks.h"
 
 /*
  * Replication slot on-disk data structure.
@@ -111,6 +112,26 @@ static void RestoreSlotFromDisk(const char *name);
 static void CreateSlotOnDisk(ReplicationSlot *slot);
 static void SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel);
 
+/*
+ * GUC check_hook for max_slot_wal_keep_size
+ *
+ * All WAL files on the publisher node must be retained during an upgrade to
+ * maintain connections from downstreams.  While pg_upgrade explicitly sets
+ * this variable to -1, there are ways for users to modify it. Ensure it
+ * remains unchanged for safety.  See InvalidatePossiblyObsoleteSlot() and
+ * start_postmaster() in pg_upgrade for more details.
+ */
+bool
+check_max_slot_wal_keep_size(int *newval, void **extra, GucSource source)
+{
+   if (IsBinaryUpgrade && *newval != -1)
+   {
+		GUC_check_errdetail("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade.");
+   return false;
+   }
+   return true;
+}
+
 /*
  * Report shared-memory space needed by ReplicationSlotsShmemInit.
  */
@@ -1424,18 +1445,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			SpinLockRelease(&s->mutex);
 
/*
-* The logical replication slots shouldn't be invalidated as
-* max_slot_wal_keep_size GUC is set to -1 during the upgrade.
-*
-* The following is just a sanity check.
+		 * check_max_slot_wal_keep_size() ensures max_slot_wal_keep_size is set
+		 * to -1, so, slot invalidation for logical slots shouldn't happen
+		 * during an upgrade. At present, only logical slots really require
+		 * this.
 		 */
-   if (*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)
-   {
-			ereport(ERROR,
-					errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-					errmsg("replication slots must not be invalidated during the upgrade"),
-					errhint("\"max_slot_wal_keep_size\" must be set to -1 during the upgrade"));
-   }
+   Assert (!*invalidated || !SlotIsLogical(s) || !IsBinaryUpgrade);
 
if (active_pid != 0)
{
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7605eff9b9..818ffdb2ae 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2845,7 +2845,7 @@ struct config_int ConfigureNamesInt[] =
},
 		&max_slot_wal_keep_size_mb,
-1, -1, MAX_KILOBYTES,
-   NULL, NULL, NULL
+   check_max_slot_wal_keep_size, NULL, NULL
},
 
{
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 2a191830a8..3d74483f44 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -84,6 +84,8 @@ extern bool check_maintenance_io_concurrency(int *newval, void **extra,
 extern void assign_maintenance_io_concurrency(int newval, void *extra);
 extern bool check_max_connections(int *newval, void **extra, GucSource source);
 extern bool check_max_wal_senders(int *newval, void **extra, GucSource source);
+extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
+										 GucSource source);
 extern void assign_max_wal_size(int newval, void *extra);
 extern bool check_max_worker_processes(int *newval, void **extra,
 									   GucSource source);


Add new option 'all' to pg_stat_reset_shared()

2023-10-30 Thread torikoshia

Hi,

After 96f052613f3, we have below 6 types of parameter for 
pg_stat_reset_shared().


  "archiver", "bgwriter", "checkpointer", "io", "recovery_prefetch",  
"wal"


How about adding a new option 'all' to delete all targets above?

I imagine there are cases where people want to initialize all of them at 
the same time in addition to initializing one at a time.
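For reference, resetting everything today means one call per target; the proposal would collapse this to a single call (the 'all' form below is hypothetical until the patch lands):

```sql
-- What one has to write today to reset all six existing targets:
SELECT pg_stat_reset_shared('archiver');
SELECT pg_stat_reset_shared('bgwriter');
SELECT pg_stat_reset_shared('checkpointer');
SELECT pg_stat_reset_shared('io');
SELECT pg_stat_reset_shared('recovery_prefetch');
SELECT pg_stat_reset_shared('wal');

-- Proposed shorthand (hypothetical):
SELECT pg_stat_reset_shared('all');
```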


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2023-10-30 Thread Pavel Borisov
Hi, Alexander!


On Wed, 25 Oct 2023 at 00:13, Alexander Korotkov  wrote:
>
> On Wed, Sep 28, 2022 at 11:44 AM Aleksander Alekseev
>  wrote:
> > > I think, this patch was marked as "Waiting on Author", probably, by 
> > > mistake. Since recent changes were done without any significant code 
> > > changes and the CF bot is now happy again.
> > >
> > > I'm going to move it to RfC, could I? If not, please tell why.
> >
> > I restored the "Ready for Committer" state. I don't think it's a good
> > practice to change the state every time the patch has a slight
> > conflict or something. This is not helpful at all. Such things happen
> > quite regularly and typically are fixed in a couple of days.
>
> This patch seems useful to me.  I went through the thread, it seems
> that all the critiques are addressed.
>
> I've rebased this patch.  Also, I've run perltidy for the tests, split
> long errmsg() into errmsg(), errdetail() and errhint(), and done other
> minor enhancements.
>
> I think this patch is ready to go.  I'm going to push it if there are
> no objections.
>
> --
> Regards,
> Alexander Korotkov

It's very good that this long-standing patch is finally committed. Thanks a lot!

Regards,
Pavel Borisov




Add BF member koel-like indentation checks to SanityCheck CI

2023-10-30 Thread Bharath Rupireddy
Hi,

How about adding code indentation checks (like what BF member koel runs)
to the SanityCheck CI task? This would catch indentation problems well
before things are committed, so that developers can find them in their
own CI runs, and it would help developers learn the postgres code
indentation conventions. It would also save committers time: git log |
grep 'koel' | wc -l gives me 11 commits and git log | grep 'indentation'
| wc -l gives me 97. Most, if not all, of these commits went into fixing
code indentation problems that could have been reported earlier and
fixed by the developers/patch authors themselves.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-10-30 Thread Michael Paquier
On Fri, Oct 27, 2023 at 09:31:10AM -0400, David Steele wrote:
> That sounds like the right plan to me. Nice and simple.

I'll tackle that in a separate thread with a patch registered for the
upcoming CF of November.

> I'm still +1 for the patch as it stands.

I have been reviewing the patch, and applied portions of it as of
dc5bd388 and 1ffdc03c; they're quite independent pieces.  After
that, the remaining bits of the patch that change the behavior are
straightforward.  While on it, I have also written a commit message
for it, as per the attached.
--
Michael
From 26a8432fe3ab8426e7797d85d19b0fe69d3384c9 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 30 Oct 2023 16:02:52 +0900
Subject: [PATCH v4] Require recovery.signal or standby.signal when reading a
 backup_label

Historically, the startup process uses two static variables to control
whether archive recovery should happen, when either recovery.signal or
standby.signal is present in the data folder at the beginning of
recovery:
- ArchiveRecoveryRequested, set to "true" if either .signal file exists.
This controls whether archive recovery has been requested.
- InArchiveRecovery, which controls whether recovery should switch to
archive mode during recovery.  It is initially "false" but switched to
"true" during recovery in two cases:
-- Just after finding and reading a valid backup_label file, close to
the beginning of recovery.
-- When no backup_label file is found, it remains "false" to force crash
recovery, where all the WAL located in pg_wal/ is read first.  When all
the local WAL records have been consumed, it is changed to "true" to
switch to archive recovery mode, equivalent to a base backup.

Now comes the fact that recovery has an old logical problem: when a
backup_label is found but no .signal file is defined, the startup
process finishes in a state where ArchiveRecoveryRequested is "false"
but InArchiveRecovery is "true".  This has proven to cause various
issues, which got worse since restart points can run during crash
recovery for the benefit of being able to start and stop instances
without doing crash recovery from its initial point (7ff23c6d277d):
- Some standby states needed by archive recovery would not be set,
causing correctness issues, like assertion failures during recovery.
- recoveryWakeupLatch would not be set.
- Some GUCs are messed up, like hot_standby, which depends on
ArchiveRecoveryRequested.

This configuration was possible when recovering from a base backup taken
by pg_basebackup without -R.  Note that the documentation requires at
least recovery.signal to be set to restore from a backup, but the
startup process was not making this policy explicit.  In most cases,
one would have been able to complete recovery, but that's really a
matter of luck, as it depends on the workload of the origin server.
The change has the advantage of simplifying the logic of the archive
recovery case, as InArchiveRecovery now requires ArchiveRecoveryRequested
to be set.
Recovering a self-contained backup now requires a recovery.signal, with
a restore_command set in postgresql.conf (or related conf file).

One test of pg_basebackup and one test of pg_rewind need to be tweaked
to avoid the FATAL introduced by this patch when a backup_label is found
without a .signal file, even if the intention of both is to have a
recovery.signal set.

Reviewed-by: David Steele, Bowen Shi
Discussion: https://postgr.es/m/zarvomifjze7f...@paquier.xyz
Discussion: https://postgr.es/m/Y/Q/17rpys7yg...@paquier.xyz
Discussion: https://postgr.es/m/Y/v0c+3W89NBT/i...@paquier.xyz
---
 src/backend/access/transam/xlogrecovery.c | 22 ---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |  3 ++-
 src/bin/pg_rewind/t/008_min_recovery_point.pl |  1 +
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6156a..0088dfec2e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -125,14 +125,18 @@ static TimeLineID curFileTLI;
 
 /*
  * When ArchiveRecoveryRequested is set, archive recovery was requested,
- * ie. signal files were present.  When InArchiveRecovery is set, we are
- * currently recovering using offline XLOG archives.  These variables are only
- * valid in the startup process.
+ * ie. signal files or backup_label were present.  When InArchiveRecovery is
+ * set, we are currently recovering using offline XLOG archives.  These
+ * variables are only valid in the startup process.  Note that the presence of
+ * a backup_label file requires the presence of one of the two .signal files
+ * to enforce archive recovery.
  *
  * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
  * currently performing crash recovery using only XLOG files in pg_wal, but
  * will switch to using offline XLOG archives as soon as we reach the end of
  * WAL in pg_wal.
+ *
+ * InArchiveRecovery should never be set without ArchiveRecoveryRequested
+ * also being set.

Re: A recent message added to pg_upgade

2023-10-30 Thread Bharath Rupireddy
On Mon, Oct 30, 2023 at 8:51 AM Amit Kapila  wrote:
>
> > This discussion seems a bit off from my point. I suggested adding
> > a check for that setting when IsBinaryUpgrade is true at the GUC
> > level, as shown in the attached patch. I believe Álvaro made a similar
> > suggestion.  While the error message is somewhat succinct, I think it
> > is sufficient given the low possibility of the scenario and the fact
> > that it cannot occur inadvertently.
> >
>
> I think we can simply change that error message to an assert if we want
> to go with your check-hook idea. BTW, can we add
> GUC_check_errdetail() with a better message, as some of the other check
> functions do? Also, I guess we can add some comments, or at least
> refer to the existing comments, to explain the reason for such a check.

Will the check_hook approach work correctly? I haven't checked it
myself, but I see InitializeGUCOptions() getting called before
IsBinaryUpgrade is set to true and the passed-in config options ('c')
are parsed.

If the check_hook approach works correctly, I think we must add a test
hitting the error in check_max_slot_wal_keep_size for the
IsBinaryUpgrade case. And I agree with Amit on having detailed
messages via GUC_check_errmsg/GUC_check_errdetail. Also, IMV,
leaving the error message in InvalidatePossiblyObsoleteSlot()
(with better wording, as discussed initially in this thread, if
required) does no harm. Actually, it acts as another safety net, given
that the max_slot_wal_keep_size GUC is reloadable via SIGHUP.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce a new view for checkpointer related stats

2023-10-30 Thread Bharath Rupireddy
On Mon, Oct 30, 2023 at 6:19 AM Michael Paquier  wrote:
>
> On Fri, Oct 27, 2023 at 10:23:34AM +0530, Bharath Rupireddy wrote:
> > +1. Changed in the attached v10-001. FWIW, having a test case in
> > stats.sql emitting this error message and hint would have helped here.
> > If okay, I can add one.
>
> That may be something to do.  At least it was missed on this thread,
> even if we don't add a new pgstat shared type every day.

Right. Adding test coverage for the error case is no bad thing, IMV
(https://coverage.postgresql.org/src/backend/utils/adt/pgstatfuncs.c.gcov.html).
Here's the attached 0001 patch for that.

> > PS: I'll park the SLRU flush related patch aside until the approach is
> > finalized. I'm posting the pg_stat_checkpointer patch as v10-0001.
>
> +-- Test that reset_shared with checkpointer specified as the stats type works
> +SELECT stats_reset AS checkpointer_reset_ts FROM pg_stat_checkpointer \gset
> +SELECT pg_stat_reset_shared('checkpointer');
> +SELECT stats_reset > :'checkpointer_reset_ts'::timestamptz FROM 
> pg_stat_checkpointer;
> +SELECT stats_reset AS checkpointer_reset_ts FROM pg_stat_checkpointer \gset
>
> Note that you have forgotten to update the test of
> pg_stat_reset_shared(NULL) to check for the value of
> checkpointer_reset_ts.  I've added an extra SELECT to check that for
> pg_stat_checkpointer, and applied v8.

Oh, thanks for taking care of this. While at it, I noticed that
there's no coverage for pg_stat_reset_shared('recovery_prefetch') and
XLogPrefetchResetStats():
https://coverage.postgresql.org/src/backend/access/transam/xlogprefetcher.c.gcov.html.
Most of the recovery_prefetch code is covered by recovery/streaming
related tests, but the reset-stats part is missing. So, I've added
coverage for it in the attached 0002.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Add-error-case-test-for-pg_stat_reset_shared-with.patch
Description: Binary data


v1-0002-Add-test-for-resetting-recovery_prefetch-shared-s.patch
Description: Binary data


Re: COPY TO (FREEZE)?

2023-10-30 Thread Kyotaro Horiguchi
At Sun, 29 Oct 2023 15:35:02 -0400, Bruce Momjian  wrote in 
> You are 100% correct.  Updated patch attached.

-errmsg("COPY force not null only available using COPY FROM")));
+errmsg("COPY force not null cannot be used with COPY TO")));

I find the term "force not null" hard to translate, especially into
Japanese, as its literal translation doesn't align with the rest of the
message.  The most recent translation for it is a literal rendition
of "FORCE_NOT_NULL option of COPY can only be used with COPY FROM".

In short, for translation convenience, I would prefer it if "force not
null" were spelled "FORCE_NOT_NULL".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center