Re: post-freeze damage control

2024-04-11 Thread David Steele

On 4/11/24 20:26, Tomas Vondra wrote:

On 4/11/24 03:52, David Steele wrote:

On 4/11/24 10:23, Tom Kincaid wrote:


The extensive Beta process we have can be used to build confidence we
need in a feature that has extensive review and currently has no known
issues or outstanding objections.


I did have objections, here [1] and here [2]. I think the complexity,
space requirements, and likely performance issues involved in restores
are going to be a real problem for users. Some of these can be addressed
in future releases, but I can't escape the feeling that what we are
releasing here is half-baked.


I do not think it's half-baked. I certainly agree there are limitations,
and there's all kinds of bells and whistles we could add, but I think
the fundamental infrastructure is correct and a meaningful step forward.
Would I wish it to handle .tar for example? Sure I would. But I think
it's something we can add in the future - if we require all of this to
happen in a single release, it'll never happen.


I'm not sure that I really buy this argument, anyway. It is not uncommon 
for significant features to spend years in development before they are 
committed. This feature went from first introduction to commit in just 
over six months. Obviously Robert had been working on it for a while, 
but for a feature this large six months is a sprint.


Regards,
-David




Re: Can't find not null constraint, but \d+ shows that

2024-04-11 Thread Tender Wang
Alvaro Herrera  于2024年4月11日周四 22:48写道:

> On 2024-Apr-11, Alvaro Herrera wrote:
>
> > Well, I think you were right that we should try to handle the situation
> > of unmarking attnotnull as much as possible, to decrease the chances
> > that the problematic situation occurs.  That means, we can use the
> > earlier code to handle DROP COLUMN when it causes a PK to be dropped --
> > even though we still need to handle the situation of an attnotnull flag
> > set with no pg_constraint row.  I mean, we still have to handle DROP
> > DOMAIN correctly (and there might be other cases that I haven't thought
> > about) ... but surely this is a much less common situation than the one
> > you reported.  So I'll merge everything and post an updated patch.
>
> Here's roughly what I'm thinking.  If we drop a constraint, we can still
> reset attnotnull in RemoveConstraintById(), but only after checking that
> it's not a generated column or a replica identity.  If they are, we
> don't error out -- just skip the attnotnull update.
>
> Now, about the code to allow ALTER TABLE DROP NOT NULL in case there's
> no pg_constraint row, I think at this point it's mostly dead code,
> because it can only happen when you have a replica identity or generated
> column ... and the DROP NOT NULL should still prevent you from dropping
> the flag anyway.  But the case can still arise, if you change the
> replica identity or ALTER TABLE ALTER COLUMN DROP DEFAULT, respectively.
>
> I'm still not ready with this -- still not convinced about the new AT
> pass.


Yeah, at first, I was also hesitant. Two reasons convinced me.
in ATPostAlterTypeParse()
-
        else if (cmd->subtype == AT_SetAttNotNull)
        {
            /*
             * The parser will create AT_AttSetNotNull subcommands for
             * columns of PRIMARY KEY indexes/constraints, but we need
             * not do anything with them here, because the columns'
             * NOT NULL marks will already have been propagated into
             * the new table definition.
             */
        }
---
The new table definition continues to use the old column's not-null marking,
so nothing needs to be done here.
If we reset NOT NULL marks in RemoveConstraintById() when the PK is dropped
indirectly, we need to do something here or somewhere else.

Except for the AT_SetAttNotNull type, the other types add an AT pass to
tab->subcmds. Because the not-null marking should be added before the index is
re-added, there is no suitable AT pass in the current AlterTablePass, so a new
AT pass ahead of AT_PASS_OLD_INDEX is needed.

Another reason is that it can use the ALTER TABLE framework to set not-null.
That looks simpler and better than hardcoding the re-installation of not-null
in some function.
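For context, a minimal example (hypothetical table) of the command that
exercises this path: changing the type of a PK column makes ALTER TABLE drop
and re-create the PK index internally, which is exactly where the ordering of
AT passes and the column's not-null marking matter.

-- sketch only, table name is made up
CREATE TABLE att_pass_demo (id int PRIMARY KEY, val text);
ALTER TABLE att_pass_demo ALTER COLUMN id TYPE bigint;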

-- 
Tender Wang
OpenPie:  https://en.openpie.com/


Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 21:41:39 -0700, Andres Freund wrote:
> FWIW, I just reproduced the scenario with signals. I added tracking of the
> total time actually slept and lost to SpinDelayStatus, and added a function to
> trigger a wait on a spinlock.
>
> To wait less, I set max_standby_streaming_delay=0.1, but that's just for
> easier testing in isolation. In reality that could have been reached before
> the spinlock is even acquired.
>
> On a standby, while a recovery conflict is happening:
> PANIC:  XX000: stuck spinlock detected at crashme, path/to/file:line, after 
> 4.38s, lost 127.96s
>
>
> So right now it's really not hard to trigger the stuck-spinlock logic
> completely spuriously.  This doesn't just happen with hot standby, there are
> plenty other sources of lots of signals being sent.

Oh my. There's a workload that completely trivially hits this, without even
trying hard. LISTEN/NOTIFY.

PANIC:  XX000: stuck spinlock detected at crashme, file:line, after 0.72s, 
lost 133.027159s

Yes, it really triggered in less than 1ms. That was with just one session
doing NOTIFYs from a client. There are plenty of users that send NOTIFY from
triggers, which afaics will result in much higher rates of signals being sent.
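As a rough sketch of how to reproduce that kind of traffic (channel and file
names are made up), a one-line pgbench script sends one signal per transaction
to any session that has executed LISTEN:

-- notify.sql, run with e.g.: pgbench -n -f notify.sql -T 60
-- while another session has executed: LISTEN spam;
NOTIFY spam;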


Even with a bit less NOTIFY traffic, this very obviously gets into the
territory where plain scheduling delays will trigger the stuck spinlock logic.

Greetings,

Andres




Re: POC: GROUP BY optimization

2024-04-11 Thread Andrei Lepikhov

On 4/12/24 06:44, Tom Lane wrote:

If this patch were producing better results I'd be more excited
about putting more work into it.  But on the basis of what I'm
seeing right now, I think maybe we ought to give up on it.
First, thanks for the deep review - sometimes, only a commit gives us a 
chance to get such an observation :))).
On a broader note, introducing automatic group-by-order choosing is a 
step towards teaching the optimiser to handle poorly tuned incoming 
queries. While it's true that this may initially impact performance, 
it's crucial to weigh the potential benefits. So, beforehand, we should 
agree: Is it worth it?
If yes, I would point out how often hashing doesn't work for grouping: 
sometimes because of estimation errors, sometimes because the grouping 
already has sorted input, and sometimes in analytical queries where the 
planner doesn't have enough memory for hashing. In analytical cases, 
being smart with features like IncrementalSort and this one is sometimes 
the only way to speed up such queries.
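As a hypothetical example of the kind of case meant here (table and index are
made up): with an index on (a, b), a GROUP BY written as "b, a" can be
reordered by the planner to "a, b", so a sorted or incremental-sort path can
be used instead of hashing or an extra sort:

-- sketch only
CREATE TABLE groupby_demo (a int, b int, amount numeric);
CREATE INDEX ON groupby_demo (a, b);
SELECT a, b, sum(amount) FROM groupby_demo GROUP BY b, a;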
About the low efficiency: remember the previous version of the GROUP-BY 
optimisation - we disagreed on operator costs and the cost model in 
general. In the current version, we went the opposite way, adding small 
features step by step. The current commit contains an integral part of 
the feature and is designed for safely testing the approach and then 
adding more profitable parts, like choosing the group-by order according 
to distinct values or unique indexes on the grouping columns.
I have gone through the code, steered by the issues you explained in 
detail. I see seven issues. Two of them definitely should be scrutinised 
right now, and I'm ready to do that.


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Alexander Lakhin

Hi Andres,

12.04.2024 07:41, Andres Freund wrote:


FWIW, I just reproduced the scenario with signals. I added tracking of the
total time actually slept and lost to SpinDelayStatus, and added a function to
trigger a wait on a spinlock.

To wait less, I set max_standby_streaming_delay=0.1, but that's just for
easier testing in isolation. In reality that could have been reached before
the spinlock is even acquired.

On a standby, while a recovery conflict is happening:
PANIC:  XX000: stuck spinlock detected at crashme, path/to/file:line, after 
4.38s, lost 127.96s


So right now it's really not hard to trigger the stuck-spinlock logic
completely spuriously.  This doesn't just happen with hot standby, there are
plenty other sources of lots of signals being sent.


I managed to trigger that logic when trying to construct a reproducer
for bug #18426.

With the following delays added:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1776,6 +1776,7 @@ retry:
  */
 if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
 {
+pg_usleep(30L);
 UnlockBufHdr(buf, buf_state);
 LWLockRelease(oldPartitionLock);
 /* safety check: should definitely not be our *own* pin */
@@ -5549,6 +5550,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, 
uint32 set_flag_bits,

 Assert(buf_state & BM_IO_IN_PROGRESS);

+pg_usleep(300);
 buf_state &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
 if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
 buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);

and /tmp/temp.config:
bgwriter_delay = 10

TEMP_CONFIG=/tmp/temp.config make -s check -C src/test/recovery 
PROVE_TESTS="t/032*"
fails for me on iterations 22, 23, 37:
2024-04-12 05:00:17.981 UTC [762336] PANIC:  stuck spinlock detected at 
WaitBufHdrUnlocked, bufmgr.c:5726

I haven't investigated this case yet.

Best regards,
Alexander




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 20:47:37 -0700, Andres Freund wrote:
> > So there's that.  But that's not an argument that we need to be in a
> > hurry to timeout; if the built-in reaction time is less than perhaps
> > 10 minutes you're still miles ahead of the manual solution.
> 
> The current timeout is of a hard to determine total time, due to the
> increasing and wrapping around wait times, but it's normally longer than 60s,
> unless you're interrupted by a lot of signals. 1000 sleeps between 1000 and
> 100 us.
> 
> I think we should make the timeout something predictable and probably somewhat
> longer.

FWIW, I just reproduced the scenario with signals. I added tracking of the
total time actually slept and lost to SpinDelayStatus, and added a function to
trigger a wait on a spinlock.

To wait less, I set max_standby_streaming_delay=0.1, but that's just for
easier testing in isolation. In reality that could have been reached before
the spinlock is even acquired.

On a standby, while a recovery conflict is happening:
PANIC:  XX000: stuck spinlock detected at crashme, path/to/file:line, after 
4.38s, lost 127.96s


So right now it's really not hard to trigger the stuck-spinlock logic
completely spuriously.  This doesn't just happen with hot standby, there are
plenty other sources of lots of signals being sent.


Tracking the total amount of time spent sleeping doesn't require any
additional syscalls, due to the nanosleep()...

Greetings,

Andres Freund




further improving roles_is_member_of()

2024-04-11 Thread Nathan Bossart
(moved to a new thread)

On Thu, Mar 21, 2024 at 04:31:45PM -0400, Tom Lane wrote:
> I wrote:
>> ... I still see the problematic GRANT taking ~250ms, compared
>> to 5ms in v15.  roles_is_member_of is clearly on the hook for that.
> 
> Ah: looks like that is mainly the fault of the list_append_unique_oid
> calls in roles_is_member_of.  That's also an O(N^2) cost of course,
> though with a much smaller constant factor.
> 
> I don't think we have any really cheap way to de-duplicate the role
> OIDs, especially seeing that it has to be done on-the-fly within the
> collection loop, and the order of roles_list is at least potentially
> interesting.  Not sure how to make further progress without a lot of
> work.

I looked at this one again because I suspect these "thousands of roles"
cases are going to continue to appear.  Specifically, I tried to convert
the cached roles lists to hash tables to avoid the list_member_oid linear
searches.  That actually was pretty easy to do.  The most complicated part
is juggling a couple of lists to keep track of the roles we need to iterate
over in roles_is_member_of().

AFAICT the only catch is that select_best_grantor() seems to depend on the
ordering of roles_list so that it prefers a "closer" role.  To deal with
that, I added a "depth" field to the entry struct that can be used as a
tiebreaker.  I'm not certain that this is good enough, though.

As shown in the attached work-in-progress patch, this actually ends up
removing more code than it adds, and it seems to provide similar results to
HEAD (using the benchmark from the previous thread [0]).  I intend to test
it with many more roles to see if it's better in more extreme cases.

[0] https://postgr.es/m/341186.1711037256%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 01fb5ac0d7394af6a3036011c33087b12f2809ce Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Apr 2024 22:08:53 -0500
Subject: [PATCH v1 1/1] use hash tables for role caches (work in progress)

---
 src/backend/utils/adt/acl.c  | 256 ++-
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 113 insertions(+), 144 deletions(-)

diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index dc10b4a483..d873ff8d13 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -55,6 +55,12 @@ typedef struct
 	AclMode		value;
 } priv_map;
 
+typedef struct
+{
+	Oid			role;
+	int			depth;
+} CachedRoleEntry;
+
 /*
  * We frequently need to test whether a given role is a member of some other
  * role.  In most of these tests the "given role" is the same, namely the
@@ -76,17 +82,9 @@ enum RoleRecurseType
 	ROLERECURSE_SETROLE = 2		/* recurse through grants with set_option */
 };
 static Oid	cached_role[] = {InvalidOid, InvalidOid, InvalidOid};
-static List *cached_roles[] = {NIL, NIL, NIL};
+static HTAB *cached_roles[] = {NULL, NULL, NULL};
 static uint32 cached_db_hash;
 
-/*
- * If the list of roles gathered by roles_is_member_of() grows larger than the
- * below threshold, a Bloom filter is created to speed up list membership
- * checks.  This threshold is set arbitrarily high to avoid the overhead of
- * creating the Bloom filter until it seems likely to provide a net benefit.
- */
-#define ROLES_LIST_BLOOM_THRESHOLD 1024
-
 static const char *getid(const char *s, char *n, Node *escontext);
 static void putid(char *p, const char *s);
 static Acl *allocacl(int n);
@@ -4926,53 +4924,6 @@ RoleMembershipCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
 	cached_role[ROLERECURSE_SETROLE] = InvalidOid;
 }
 
-/*
- * A helper function for roles_is_member_of() that provides an optimized
- * implementation of list_append_unique_oid() via a Bloom filter.  The caller
- * (i.e., roles_is_member_of()) is responsible for freeing bf once it is done
- * using this function.
- */
-static inline List *
-roles_list_append(List *roles_list, bloom_filter **bf, Oid role)
-{
-	unsigned char *roleptr = (unsigned char *) &role;
-
-	/*
-	 * If there is a previously-created Bloom filter, use it to try to
-	 * determine whether the role is missing from the list.  If it says yes,
-	 * that's a hard fact and we can go ahead and add the role.  If it says
-	 * no, that's only probabilistic and we'd better search the list.  Without
-	 * a filter, we must always do an ordinary linear search through the
-	 * existing list.
-	 */
-	if ((*bf && bloom_lacks_element(*bf, roleptr, sizeof(Oid))) ||
-		!list_member_oid(roles_list, role))
-	{
-		/*
-		 * If the list is large, we take on the overhead of creating and
-		 * populating a Bloom filter to speed up future calls to this
-		 * function.
-		 */
-		if (*bf == NULL &&
-			list_length(roles_list) > ROLES_LIST_BLOOM_THRESHOLD)
-		{
-			*bf = bloom_create(ROLES_LIST_BLOOM_THRESHOLD * 10, work_mem, 0);
-			foreach_oid(roleid, roles_list)
-bloom_add_element(*bf, (unsigned char *) &roleid, sizeof(Oid));
-		}
-
-		/*
-		 * Finally, add

Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Tom Lane
Andres Freund  writes:
> On 2024-04-11 23:15:38 -0400, Tom Lane wrote:
>> On the third hand, it's still true that we have no comparable
>> behavior for any other source of system lockups, and it's difficult
>> to make a case that stuck spinlocks really need more concern than
>> other kinds of bugs.

> Spinlocks are somewhat more finicky though, compared to e.g. lwlocks that are
> released on error. Lwlocks also take care to e.g. hold off interrupts so code
> doesn't just jump out of a section with lwlocks held.

Yeah.  I don't think that unifying spinlocks with lwlocks is a great
idea, precisely because lwlocks have these other small overheads
in the name of bug detection/prevention.  It seems to me that the
division between spinlocks and lwlocks and heavyweight locks is
basically a good idea that matches up well with the actual
requirements of some different parts of our code.  The underlying
implementations can change (and have), but the idea of successively
increasing amounts of protection against caller error seems sound
to me.

If you grant that concept, then the idea that spinlocks need more
protection against stuck-ness than the higher-overhead levels do
seems mighty odd.

regards, tom lane




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 23:15:38 -0400, Tom Lane wrote:
> I wrote:
> > ... But Robert's question remains: how does
> > PANIC'ing after awhile make anything better?  I flat out don't
> > believe the idea that having a backend stuck on a spinlock
> > would otherwise go undetected.
> 
> Oh, wait.  After thinking a bit longer I believe I recall the argument
> for this behavior: it automates recovery from a genuinely stuck
> spinlock.  If we waited forever, the only way out of that is for a
> DBA to kill -9 the stuck process, which has exactly the same end
> result as a PANIC, except that it takes a lot longer to put the system
> back in service and perhaps rousts somebody, or several somebodies,
> out of their warm beds to fix it.  If you don't have a DBA on-call
> 24x7 then that answer looks even worse.

Precisely.  And even if you have a DBA on call 24x7, they need to know that
they need to react to something.

Today you can automate getting notified of crash-restarts, by using
  restart_after_crash = false
and restarting somewhere outside of postgres.

Imo that's the only sensible setting for larger production environments,
although I'm sure not everyone agrees with that.


> So there's that.  But that's not an argument that we need to be in a
> hurry to timeout; if the built-in reaction time is less than perhaps
> 10 minutes you're still miles ahead of the manual solution.

The current timeout is of a hard to determine total time, due to the
increasing and wrapping around wait times, but it's normally longer than 60s,
unless you're interrupted by a lot of signals. 1000 sleeps between 1000 and
100 us.

I think we should make the timeout something predictable and probably somewhat
longer.


> On the third hand, it's still true that we have no comparable
> behavior for any other source of system lockups, and it's difficult
> to make a case that stuck spinlocks really need more concern than
> other kinds of bugs.

Spinlocks are somewhat more finicky though, compared to e.g. lwlocks that are
released on error. Lwlocks also take care to e.g. hold off interrupts so code
doesn't just jump out of a section with lwlocks held.

Greetings,

Andres Freund




Re: Synchronizing slots from primary to standby

2024-04-11 Thread Amit Kapila
On Thu, Apr 11, 2024 at 5:04 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, April 11, 2024 12:11 PM Amit Kapila  
> wrote:
>
> >
> > 2.
> > - if (remote_slot->restart_lsn < slot->data.restart_lsn)
> > + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
> >   elog(ERROR,
> >   "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> >
> > Can we be more specific in this message? How about splitting it into
> > error_message as "cannot synchronize local slot \"%s\"" and then errdetail 
> > as
> > "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's
> > LSN(%X/%X)"?
>
> Your version looks better. Since the above two messages both have errdetail, I
> used the ereport(ERROR, errmsg_internal(), errdetail_internal(), ...) style in
> the patch, which is equivalent to elog(ERROR, ...) but has an additional
> detail message.
>

makes sense.

> Here is V5 patch set.
>

I think we should move the check that prevents advancing the slot when
either the remote_slot's restart_lsn or its catalog_xmin is less than the
local slot's into update_local_synced_slot(), as we want to prevent updating
the slot in those cases even during slot synchronization.

-- 
With Regards,
Amit Kapila.




Re: Can't find not null constraint, but \d+ shows that

2024-04-11 Thread Tender Wang
jian he  于2024年4月12日周五 10:12写道:

> On Thu, Apr 11, 2024 at 10:48 PM Alvaro Herrera 
> wrote:
> >
> >
> > I'm still not ready with this -- still not convinced about the new AT
> > pass.  Also, I want to add a test for the pg_dump behavior, and there's
> > an XXX comment.
> >
> Now I am more confused...
>
> +-- make sure attnotnull is reset correctly when a PK is dropped
> indirectly,
> +-- or kept if there's a reason for that
> +CREATE TABLE notnull_tbl1 (c0 int, c1 int, PRIMARY KEY (c0, c1));
> +ALTER TABLE  notnull_tbl1 DROP c1;
> +\d+ notnull_tbl1
> +                                Table "public.notnull_tbl1"
> + Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> +--------+---------+-----------+----------+---------+---------+--------------+-------------
> + c0     | integer |           |          |         | plain   |              |
> +
> +DROP TABLE notnull_tbl1;
>
> For the same query, MySQL keeps "c0" as not null:
> mysql https://dbfiddle.uk/_ltoU7PO
>
> For PostgreSQL:
> https://dbfiddle.uk/ZHJXEqL1
> from 9.3 to 16 (click the link (https://dbfiddle.uk/ZHJXEqL1), then
> click "9.3" to choose which version you like),
> all of them keep the remaining column "c0" as not null.
>
> latest
> 0001-Better-handle-indirect-constraint-drops.patch make c0 attnotnull be
> false.
>
> previous patches:
> v2-0001-Handle-ALTER-.-DROP-NOT-NULL-when-no-pg_constrain.patch  make
> c0 attnotnull be true.
> 0001-Correctly-reset-attnotnull-when-constraints-dropped-.patch make
> c0 attnotnull be false.
>

I'm not sure that the SQL standard specifies what a database must do in this
case.
If the standard does not specify it, then it depends on each database vendor's
decision.

Some people like the not-null retained; other people may like the not-null
removed. I think it will be OK if people can drop the not-null, or add it
back again, after dropping the PK.
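For example, with the table from the quoted test (a trivial sketch), either
direction is a one-line follow-up once the PK is gone:

ALTER TABLE notnull_tbl1 ALTER COLUMN c0 SET NOT NULL;   -- add it back
ALTER TABLE notnull_tbl1 ALTER COLUMN c0 DROP NOT NULL;  -- or drop it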

In master, not-null is reset when we drop the PK directly. I hope dropping the
PK indirectly is consistent with dropping the PK directly.

--
Tender Wang
OpenPie:  https://en.openpie.com/


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-11 Thread Richard Guo
On Sun, Apr 7, 2024 at 10:42 PM Tomas Vondra 
wrote:

> create table t (a int, b int) with (fillfactor=10);
> insert into t select mod((i/22),2), (i/22) from generate_series(0,1000)
> S(i);
> create index on t(a);
> vacuum analyze t;
>
> set enable_indexonlyscan = off;
> set enable_seqscan = off;
> explain (analyze, verbose) select 1 from (values (1)) s(x) where exists
> (select * from t where a = x);
>
> KABOOM!


FWIW, it seems to me that this assert could be triggered in cases where,
during a join, not all inner tuples need to be scanned before skipping to the
next outer tuple.  This can happen for 'single_match' or anti-join.

The query provided by Tomas is an example of 'single_match' case.  Here
is a query for anti-join that can also trigger this assert.

explain (analyze, verbose)
select t1.a from t t1 left join t t2 on t2.a = 1 where t2.a is null;
server closed the connection unexpectedly

Thanks
Richard


Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Tom Lane
I wrote:
> ... But Robert's question remains: how does
> PANIC'ing after awhile make anything better?  I flat out don't
> believe the idea that having a backend stuck on a spinlock
> would otherwise go undetected.

Oh, wait.  After thinking a bit longer I believe I recall the argument
for this behavior: it automates recovery from a genuinely stuck
spinlock.  If we waited forever, the only way out of that is for a
DBA to kill -9 the stuck process, which has exactly the same end
result as a PANIC, except that it takes a lot longer to put the system
back in service and perhaps rousts somebody, or several somebodies,
out of their warm beds to fix it.  If you don't have a DBA on-call
24x7 then that answer looks even worse.

So there's that.  But that's not an argument that we need to be in a
hurry to timeout; if the built-in reaction time is less than perhaps
10 minutes you're still miles ahead of the manual solution.

On the third hand, it's still true that we have no comparable
behavior for any other source of system lockups, and it's difficult
to make a case that stuck spinlocks really need more concern than
other kinds of bugs.

regards, tom lane




Re: post-freeze damage control

2024-04-11 Thread David Steele




On 4/12/24 12:15, Robert Haas wrote:

On Thu, Apr 11, 2024 at 5:48 PM David Steele  wrote:

But they'll try because it is a new pg_basebackup feature and they'll
assume it is there to be used. Maybe it would be a good idea to make it
clear in the documentation that significant tooling will be required to
make it work.


I don't agree with that idea. LOTS of what we ship takes a significant
amount of effort to make it work. You may well need a connection
pooler. You may well need a failover manager which may or may not be
separate from your connection pooler. You need a backup tool. You need
a replication management tool which may or may not be separate from
your backup tool and may or may not be separate from your failover
tool. You probably need various out-of-core connections for the
programming languages you need. You may need a management tool, and
you probably need a monitoring tool. Some of the tools you might
choose to do all that stuff themselves have a whole bunch of complex
dependencies. It's a mess.


The difference here is you *can* use Postgres without a connection 
pooler (I have many times) or failover (if downtime is acceptable) but 
most people would agree that you really *need* backup.


The backup tool should be clear and easy to use or misery will 
inevitably result. pg_basebackup is difficult enough to use and automate 
because it has no notion of a repository, no expiration, and no WAL 
handling, just to name a few things. Now there is an even more advanced 
feature that is even harder to use. So, no, I really don't think this 
feature is practically usable by the vast majority of end users.



Now, if someone were to say that we ought to talk about these issues
in our documentation and maybe give people some ideas about how to get
started, I would likely be in favor of that, modulo the small
political problem that various people would want their solution to be
the canonical one to which everyone gets referred. But I think it's
wrong to pretend like this feature is somehow special, that it's
somehow more raw or unfinished than tons of other things. I actually
think it's significantly *better* than a lot of other things. If we
add a disclaimer to the documentation saying "hey, this new
incremental backup feature is half-finished garbage!", and meanwhile
the documentation still says "hey, you can use cp as your
archive_command," then we have completely lost our minds.


Fair point on cp, but that just points to an overall lack in our 
documentation and built-in backup/recovery tools in general.



I also think that you're being more negative about this than the facts
justify. As I said to several colleagues today, I *fully* acknowledge
that you have a lot more practical experience in this area than I do,
and a bunch of good ideas. I was really pleased to see you talking
about how it would be good if these tools worked on tar files - and I
completely agree, and I hope that will happen, and I hope to help in
making that happen. I think there are a bunch of other problems too,
only some of which I can guess at. However, I think saying that this
feature is not realistically intended to be used by end-users or that
they will not be able to do so is over the top, and is actually kind
of insulting. 


It is not meant to be insulting, but I still believe it to be true. 
After years of working with users on backup problems I think I have a 
pretty good bead on what the vast majority of admins are capable of 
and/or willing to do. Making this feature work is pretty high above that 
bar.


If the primary motivation is to provide a feature that can be integrated 
with third party tools, as Tomas suggests, then I guess usability is 
somewhat moot. But you are insisting that is not the case and I just 
don't see it that way.



There has been more enthusiasm for this feature on this
mailing list and elsewhere than I've gotten for anything I've
developed in years. And I don't think that's because all of the people
who have expressed enthusiasm are silly geese who don't understand how
terrible it is.


No doubt there is enthusiasm. It's a great feature to have. In 
particular I think the WAL summarizer is cool. But I do think the 
shortcomings are significant and that will become very apparent when 
people start to implement. The last minute effort to add COW support is 
an indication of problems that people will see in the field.


Further, I do think some less than ideal design decisions were made. In 
particular, I think sidelining manifests, i.e. making them optional, is 
not a good choice. This has led directly to the issue we see in [1]. If 
we require a manifest to make an incremental backup, why make it 
optional for combine?


This same design decision has led us to have "marker files" for 
zero-length files and unchanged files, which just seems extremely 
wasteful when these could be noted in the manifest. There are good 
reasons for writing everything out in a full backup, but for an 
inc

Re: pg_upgrde failed : logical replication : alter_subscription_add_log

2024-04-11 Thread Amit Kapila
On Fri, Apr 12, 2024 at 6:18 AM Perumal Raj  wrote:
>
> I am trying to upgrade PostgreSQL (RHEL 7) from version 13.7 to 15.6 using 
> pglogical.
> My Standby(destination) machine has following rpms,
>
> postgresql13-pglogical-3.7.16-1.el7.x86_64
> pglogical_15-2.4.3-1.rhel7.x86_64
>
> And Primary(Source) has ,
>
> postgresql13-pglogical-3.7.16-1.el7.x86_64
>
> pg_upgrade check mode went fine , but it failed while running real mode.
>
> pg_restore: while PROCESSING TOC:
> pg_restore: from TOC entry 13027; 1255 3375648004 FUNCTION 
> alter_subscription_add_log("text", "text", boolean, "regclass", "text"[], 
> "text"[]) postgres
> pg_restore: error: could not execute query: ERROR:  could not find function 
> "pglogical_alter_subscription_add_log" in file 
> "/usr/pgsql-15/lib/pglogical.so"
> Command was: CREATE FUNCTION 
> "pglogical"."alter_subscription_add_log"("sub_name" "text", "log_name" 
> "text", "log_to_file" boolean DEFAULT true, "log_to_table" "regclass" DEFAULT 
> NULL::"regclass", "conflict_type" "text"[] DEFAULT NULL::"text"[], 
> "conflict_resolution" "text"[] DEFAULT NULL::"text"[]) RETURNS boolean
> LANGUAGE "c"
> AS '$libdir/pglogical', 'pglogical_alter_subscription_add_log';
>
> -- For binary upgrade, handle extension membership the hard way
> ALTER EXTENSION "pglogical" ADD FUNCTION 
> "pglogical"."alter_subscription_add_log"("sub_name" "text", "log_name" 
> "text", "log_to_file" boolean, "log_to_table" "regclass", "conflict_type" 
> "text"[], "conflict_resolution" "text"[]);
>
> Am I missing any packages?
>

We don't maintain pglogical, so it is difficult to answer, but looking at the
error (ERROR:  could not find function
"pglogical_alter_subscription_add_log" in file
"/usr/pgsql-15/lib/pglogical.so"), it seems that the required function
is not present in pglogical.so. It is possible that the arguments
have changed in a newer version of pglogical, or something like
that. You need to check with the maintainers of pglogical.

-- 
With Regards,
Amit Kapila.




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Tom Lane
Robert Haas  writes:
> On Thu, Apr 11, 2024 at 5:30 PM Andres Freund  wrote:
>> By far the most of the stuck spinlocks I've seen were due to bugs in
>> out-of-core extensions. Absurdly enough, the next common thing probably is 
>> due
>> to people using gdb to make an uninterruptible process break out of some 
>> code,
>> without a crash-restart, accidentally doing so while a spinlock is held.

> Hmm, interesting. I'm glad I haven't seen those extensions. But I
> think I have seen cases of people attaching gdb to grab a backtrace to
> debug some problem in production, and failing to detach it within 60
> seconds.

I don't doubt that there are extensions with bugs of this ilk
(and I wouldn't bet an appendage that there aren't such bugs
in core, either).  But Robert's question remains: how does
PANIC'ing after awhile make anything better?  I flat out don't
believe the idea that having a backend stuck on a spinlock
would otherwise go undetected.

regards, tom lane




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Robert Haas
On Thu, Apr 11, 2024 at 5:30 PM Andres Freund  wrote:
> I don't think that's a particularly apt comparison. If you have spinlocks that
> cannot be acquired within tens of seconds, you're in a really bad situation,
> regardless of whether you crash-restart or not.

I agree with that. I just don't think panicking makes it better.

> > In all seriousness, I'd really like to understand what experience
> > you've had that makes this check seem useful. Because I think all of
> > my experiences with it have been bad. If they weren't, the last good
> > one was a very long time ago.
>
> By far the most of the stuck spinlocks I've seen were due to bugs in
> out-of-core extensions. Absurdly enough, the next common thing probably is due
> to people using gdb to make an uninterruptible process break out of some code,
> without a crash-restart, accidentally doing so while a spinlock is held.

Hmm, interesting. I'm glad I haven't seen those extensions. But I
think I have seen cases of people attaching gdb to grab a backtrace to
debug some problem in production, and failing to detach it within 60
seconds.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-11 Thread shveta malik
On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
 wrote:
>
> On Fri, Apr 5, 2024 at 10:31 AM shveta malik  wrote:
> >
> > Please find v2. Changes are:
>
> Thanks for the patch. Here are some comments.

Thanks for reviewing.
>
> 1. Can we have a clear saying in the shmem variable who's syncing at
> the moment? Is it a slot sync worker or a backend via SQL function?
> Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"
>
> typedef enum SlotSyncSource
> {
> SLOT_SYNC_NONE,
> SLOT_SYNC_WORKER,
> SLOT_SYNC_BACKEND,
> } SlotSyncSource;
>
> Then, the check in ShutDownSlotSync can be:
>
> +   /*
> +* Return if neither the slot sync worker is running nor the function
> +* pg_sync_replication_slots() is executing.
> +*/
> +   if ((SlotSyncCtx->pid == InvalidPid) &&
> SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND)
> {
>
> 2.
> SyncReplicationSlots(WalReceiverConn *wrconn)
>  {
> +/*
> + * Startup process signaled the slot sync to stop, so if meanwhile user
> + * has invoked slot sync SQL function, simply return.
> + */
> +SpinLockAcquire(&SlotSyncCtx->mutex);
> +if (SlotSyncCtx->stopSignaled)
> +{
> +ereport(LOG,
> +errmsg("skipping slot synchronization as slot sync
> shutdown is signaled during promotion"));
> +
>
> Unless I'm missing something, I think this can't detect if the backend
> via SQL function is already half-way through syncing in
> synchronize_one_slot. So, better move this check to (or also have it
> in) the slot sync loop that calls synchronize_one_slot. To avoid
> spinlock acquisitions, we can perhaps do this check when we acquire
> the spinlock for the synced flag.

If the sync via SQL function is already half-way, then promotion
should wait for it to finish. I don't think it is a good idea to move
the check to synchronize_one_slot().  The sync-call should either not
start (if it noticed the promotion) or finish the sync and then let
promotion proceed. But I would like to know others' opinion on this.

>
> /* Search for the named slot */
> if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
> {
> boolsynced;
>
> SpinLockAcquire(&slot->mutex);
> synced = slot->data.synced;
> << get SlotSyncCtx->stopSignaled here >>
> SpinLockRelease(&slot->mutex);
>
> << do the slot sync skip check here if (stopSignaled) >>
>
> 3. Can we have a test or steps at least to check the consequences
> manually one can get if slot syncing via SQL function is happening
> during the promotion?
>
> IIUC, we need to ensure there is no backend acquiring it and
> performing sync while the slot sync worker is shutting down/standby
> promotion is occurring. Otherwise, some of the slots can get resynced
> and some are not while we are shutting down the slot sync worker as
> part of the standby promotion which might leave the slots in an
> inconsistent state.

I do not think that we can reach a state (except in some error
scenario) where some of the slots are synced while the rest are not
during a *particular* sync-cycle only because a promotion is going on in
parallel. (And yes, we need to fix the race condition stated by Amit
up-thread for this statement to be true.)

thanks
Shveta




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-11 Thread Robert Haas
On Thu, Apr 11, 2024 at 9:54 PM Alexander Korotkov  wrote:
> I think we shouldn't unconditionally copy schema name and
> relpersistence from the parent table.  Instead we should throw the
> error on a mismatch like CREATE TABLE ... PARTITION OF ... does.  I'm
> working on revising this fix.

We definitely shouldn't copy the schema name from the parent table. It
should be possible to schema-qualify the new partition names, and if
you don't, then the search_path should determine where they get
placed.

But I am inclined to think that relpersistence should be copied. It's
weird that you split an unlogged partition and you get logged
partitions.

One of the things I dislike about this type of feature -- not this
implementation specifically, but just this kind of idea in general --
is that the syntax mentions a whole bunch of tables but in a way where
you can't set their properties. Persistence, reloptions, whatever.
There's just no place to mention any of that stuff - and if you wanted
to create a place, you'd have to invent special syntax for each
separate thing. That's why I think it's good that the normal way of
creating a partition is CREATE TABLE .. PARTITION OF. Because that
way, we know that the full power of the CREATE TABLE statement is
always available, and you can set anything that you could set for a
table that is not a partition.
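For instance (a generic sketch, names invented), the CREATE TABLE form lets
you pick persistence and storage options per partition, which the SPLIT/MERGE
syntax has no place for:

CREATE UNLOGGED TABLE measurements_2024
    PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
    WITH (fillfactor = 70);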

Of course, that is not to say that some people won't like to have a
feature of this sort. I expect they will. The approach does have some
drawbacks, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-11 Thread shveta malik
On Sat, Apr 6, 2024 at 11:49 AM Amit Kapila  wrote:
>
> On Fri, Apr 5, 2024 at 10:31 AM shveta malik  wrote:
> >
> > Please find v2. Changes are:
> > 1) Rebased the patch as there was conflict due to recent commit 6f132ed.
> > 2) Added an Assert in update_synced_slots_inactive_since() to ensure
> > that the slot does not have active_pid.
> > 3) Improved commit msg and comments.
> >
>
> Few comments:
> ==
> 1.
>  void
>  SyncReplicationSlots(WalReceiverConn *wrconn)
>  {
> + /*
> + * Startup process signaled the slot sync to stop, so if meanwhile user
> + * has invoked slot sync SQL function, simply return.
> + */
> + SpinLockAcquire(&SlotSyncCtx->mutex);
> + if (SlotSyncCtx->stopSignaled)
> + {
> + ereport(LOG,
> + errmsg("skipping slot synchronization as slot sync shutdown is
> signaled during promotion"));
> +
> + SpinLockRelease(&SlotSyncCtx->mutex);
> + return;
> + }
> + SpinLockRelease(&SlotSyncCtx->mutex);
>
> There is a race condition with this code. Say during promotion
> ShutDownSlotSync() is just before setting this flag and the user has
> invoked pg_sync_replication_slots() and passed this check but still
> didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
> recognize that there is slot sync going on in parallel, and slot sync
> wouldn't know that the promotion is in progress.

Did you mean that now, the promotion *would not* recognize...

I see, I will fix this.

> The current coding for slot sync worker ensures there is no such race
> by checking the stopSignaled and setting the pid together under
> spinlock. I think we need to move the setting of the syncing flag
> similarly. Once we do that probably checking SlotSyncCtx->syncing
> should be sufficient in ShutDownSlotSync(). If we change the location
> of setting the 'syncing' flag then please ensure its cleanup as we
> currently do in slotsync_failure_callback().

Sure, let me review.

> 2.
> @@ -1395,6 +1395,7 @@ update_synced_slots_inactive_since(void)
>   if (s->in_use && s->data.synced)
>   {
>   Assert(SlotIsLogical(s));
> + Assert(s->active_pid == 0);
>
> We can add a comment like: "The slot must not be acquired by any process"
>
> --
> With Regards,
> Amit Kapila.




Re: post-freeze damage control

2024-04-11 Thread Robert Haas
On Thu, Apr 11, 2024 at 5:48 PM David Steele  wrote:
> But they'll try because it is a new pg_basebackup feature and they'll
> assume it is there to be used. Maybe it would be a good idea to make it
> clear in the documentation that significant tooling will be required to
> make it work.

I don't agree with that idea. LOTS of what we ship takes a significant
amount of effort to make it work. You may well need a connection
pooler. You may well need a failover manager which may or may not be
separate from your connection pooler. You need a backup tool. You need
a replication management tool which may or may not be separate from
your backup tool and may or may not be separate from your failover
tool. You probably need various out-of-core connections for the
programming languages you need. You may need a management tool, and
you probably need a monitoring tool. Some of the tools you might
choose to do all that stuff themselves have a whole bunch of complex
dependencies. It's a mess.

Now, if someone were to say that we ought to talk about these issues
in our documentation and maybe give people some ideas about how to get
started, I would likely be in favor of that, modulo the small
political problem that various people would want their solution to be
the canonical one to which everyone gets referred. But I think it's
wrong to pretend like this feature is somehow special, that it's
somehow more raw or unfinished than tons of other things. I actually
think it's significantly *better* than a lot of other things. If we
add a disclaimer to the documentation saying "hey, this new
incremental backup feature is half-finished garbage!", and meanwhile
the documentation still says "hey, you can use cp as your
archive_command," then we have completely lost our minds.

I also think that you're being more negative about this than the facts
justify. As I said to several colleagues today, I *fully* acknowledge
that you have a lot more practical experience in this area than I do,
and a bunch of good ideas. I was really pleased to see you talking
about how it would be good if these tools worked on tar files - and I
completely agree, and I hope that will happen, and I hope to help in
making that happen. I think there are a bunch of other problems too,
only some of which I can guess at. However, I think saying that this
feature is not realistically intended to be used by end-users or that
they will not be able to do so is over the top, and is actually kind
of insulting. There has been more enthusiasm for this feature on this
mailing list and elsewhere than I've gotten for anything I've
developed in years. And I don't think that's because all of the people
who have expressed enthusiasm are silly geese who don't understand how
terrible it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Should we add a compiler warning for large stack frames?

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 15:07:11 -0700, Andres Freund wrote:
> On 2024-04-11 16:35:58 -0400, Tom Lane wrote:
> > Indeed.  I recall reading, not long ago, some Linux kernel docs to the
> > effect that automatic stack growth is triggered by a reference into
> > the page just below what is currently mapped as your stack, and
> > therefore allocating a stack frame greater than one page has the
> > potential to cause SIGSEGV rather than the desired stack extension.
> > (If you feel like digging in the archives, I think this was brought
> > up in the last round of lets-add-some-more-check_stack_depth-calls.)
>
> I think it's more than a single page, but I'm not entirely sure either. I
> think some compilers inject artificial stack accesses when extending the stack
> by a lot, but I don't remember the details.
>
> There certainly was the issue that strict memory overcommit does not reliably
> work with larger stack extensions.
>
> Could be worth writing a test program for...

It looks like it's a mess.

In the good cases the kernel doesn't map anything within ulimit -R of the
stack, and the stack is extended whenever memory in that range is accessed.
Nothing is mapped into that region unless MAP_FIXED is used.

However, in some cases linux maps the heap and the stack fairly close to each
other at program startup. I've observed this with an executable compiled with
-static-pie and executed with randomization disabled (via setarch -R).  In
that case the the layout at program start is

...
77fff000-78021000 rw-p  00:00 0  [heap]
7ffdd000-7000 rw-p  00:00 0  [stack]

Here the start of the heap and the end of the stack are only 128MB apart. The
heap grows upwards, the stack downwards.

Which means that if glibc allocates a bunch of memory via sbrk() and the stack
grows, they clash into each other.


I think this may be a glibc bug. If I compile with musl instead, this doesn't
happen, because musl stops using sbrk() style allocations before stack and
program break get too close to each other.

Greetings,

Andres Freund




Re: Can't find not null constraint, but \d+ shows that

2024-04-11 Thread jian he
On Thu, Apr 11, 2024 at 10:48 PM Alvaro Herrera  wrote:
>
>
> I'm still not ready with this -- still not convinced about the new AT
> pass.  Also, I want to add a test for the pg_dump behavior, and there's
> an XXX comment.
>
Now I am more confused...

+-- make sure attnotnull is reset correctly when a PK is dropped indirectly,
+-- or kept if there's a reason for that
+CREATE TABLE notnull_tbl1 (c0 int, c1 int, PRIMARY KEY (c0, c1));
+ALTER TABLE  notnull_tbl1 DROP c1;
+\d+ notnull_tbl1
+                                Table "public.notnull_tbl1"
+ Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
+--------+---------+-----------+----------+---------+---------+--------------+-------------
+ c0     | integer |           |          |         | plain   |              |
+
+DROP TABLE notnull_tbl1;

For the same query, MySQL keeps "c0" as not null:
mysql https://dbfiddle.uk/_ltoU7PO

For PostgreSQL:
https://dbfiddle.uk/ZHJXEqL1
from 9.3 to 16 (click the link (https://dbfiddle.uk/ZHJXEqL1), then
click "9.3" to choose which version you like),
all of them keep the remaining column "c0" as not null.

latest
0001-Better-handle-indirect-constraint-drops.patch make c0 attnotnull be false.

previous patches:
v2-0001-Handle-ALTER-.-DROP-NOT-NULL-when-no-pg_constrain.patch  make
c0 attnotnull be true.
0001-Correctly-reset-attnotnull-when-constraints-dropped-.patch make
c0 attnotnull be false.




Re: Allow non-superuser to cancel superuser tasks.

2024-04-11 Thread Kyotaro Horiguchi
At Fri, 12 Apr 2024 09:10:35 +0900, Michael Paquier  wrote 
in 
> On Thu, Apr 11, 2024 at 04:55:59PM +0500, Kirill Reshke wrote:
> > The test doesn't fail because pg_terminate_backend actually meets his
> > point: autovac is killed. But while dying, autovac also receives
> > segfault. Thats because of injections points:
> > 
> > (gdb) bt
> > #0  0x56361c3379ea in tas (lock=0x7fbcb9632224 <error: Cannot access memory at address 0x7fbcb9632224>) at
> > ../../../../src/include/storage/s_lock.h:228
> > #1  ConditionVariableCancelSleep () at condition_variable.c:238
...
> > #3  0x56361c330a40 in CleanupProcSignalState (status=<optimized out>, arg=<optimized out>) at procsignal.c:240
> > #4  0x56361c328801 in shmem_exit (code=code@entry=1) at ipc.c:276
> > #9  0x56361c3378d7 in ConditionVariableTimedSleep
> > (cv=0x7fbcb9632224, timeout=timeout@entry=-1,
...
> I can see this stack trace as well.  Capturing a bit more than your
> own stack, this is crashing in the autovacuum worker while waiting on
> a condition variable when processing a ProcessInterrupts().
> 
> That may point to a legit bug with condition variables in this
> context, actually?  From what I can see, issuing a signal on a backend
> process waiting with a condition variable is able to process the
> interrupt correctly.

ProcSignalInit sets up CleanupProcSignalState to be called via
on_shmem_exit. If the CV is allocated in a dsm segment, shmem_exit
should have detached the region for the CV.  CV cleanup code should be
invoked via before_shmem_exit.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-11 Thread Alexander Korotkov
On Thu, Apr 11, 2024 at 8:00 PM Alexander Lakhin  wrote:
> 11.04.2024 16:27, Dmitry Koval wrote:
> >
> > Added correction (and test), see 
> > v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.
> >
>
> Thank you for the correction, but may be an attempt to merge into implicit
> pg_temp should fail just like CREATE TABLE ... PARTITION OF ... does?
>
> Please look also at another anomaly with schemas:
> CREATE SCHEMA s1;
> CREATE TABLE t (i int) PARTITION BY RANGE (i);
> CREATE TABLE tp_0_2 PARTITION OF t
>FOR VALUES FROM (0) TO (2);
> ALTER TABLE t SPLIT PARTITION tp_0_2 INTO
>(PARTITION s1.tp0 FOR VALUES FROM (0) TO (1), PARTITION s1.tp1 FOR VALUES 
> FROM (1) TO (2));
> results in:
> \d+ s1.*
> Did not find any relation named "s1.*"
> \d+ tp*
>Table "public.tp0"
> ...
> Table "public.tp1"

+1
I think we shouldn't unconditionally copy schema name and
relpersistence from the parent table.  Instead we should throw the
error on a mismatch like CREATE TABLE ... PARTITION OF ... does.  I'm
working on revising this fix.
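For reference, this is the behavior being matched (a sketch; the exact error
wording may differ):

CREATE TEMP TABLE split_demo (i int) PARTITION BY RANGE (i);
CREATE TABLE split_demo_p1 PARTITION OF split_demo FOR VALUES FROM (0) TO (10);
-- ERROR:  cannot create a permanent relation as partition of temporary relation "split_demo"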

--
Regards,
Alexander Korotkov




Re: Simplify documentation related to Windows builds

2024-04-11 Thread Michael Paquier
On Fri, Mar 22, 2024 at 01:34:43PM -0400, Robert Haas wrote:
> I'm not very knowledgeable about building software about Windows in
> general, but on the rare occasions that I've done it, it was MUCH
> harder to figure out where to get things like Perl that it is on Linux
> or macOS machines. On Linux, your package manager probably knows about
> everything you need, and if it doesn't, you can probably fix that by
> adding an additional RPM repository to your configuration or using
> something like CPAN to find Perl modules that your OS package manager
> doesn't have. On macOS, you can install homebrew or macports and then
> get most things from there. But on Windows you have to go download
> installers individually for everything you need, and there's lots of
> installers on the Internet, and not all of them are prepared by
> equally friendly people, and not all of them necessarily work for
> building PostgreSQL.

Yeah.  These days I personally just go through stuff like Chocolatey
or msys2 to get all my dependencies, or even a minimal set of them.  I
suspect that most folks hanging around on pgsql-hackers do that as
well.  Relying on individual MSIs with many dependencies has the large
downside of causing these to become easily outdated.  When using the
build scripts with src/tools/msvc, now gone, I've had a bunch of these
in my environment hosts.

As of the buildfarm, we have currently (All Hail Andrew for
maintaining most of these):
- fairywren, with strawberry perl and msys.  OpenSSL is from a MSI.  It
uses meson.
- drongo, with a 64b version of OpenSSL installed with a MSI.  It uses
meson and chocolatey.
- lorikeet, cygwin, which is an ecosystem of its own.  OpenSSL has
been installed from a MSI, there's a System32 path.
- hamerkop, with meson.  OpenSSL is installed from strawberry, not a
separate MSI.  Python37 points to a custom MSI.

So, yes, you're right that completely removing this list may be too
aggressive for the end-user.  As far as I can see, there are a few
things that stand out:
- Diff is not mentioned in the list of dependencies on the meson page,
and it may not exist by default on Windows.  I think that we should
add it.
- We don't use activeperl anymore in the buildfarm, and recommending
it is not a good idea based on the state of the project.  If we don't
remove the entry, I would switch it to point to msys perl or even
strawberry perl.  Andres has expressed concerns about the strawberry
part, so perhaps mentioning only msys perl would be enough?
- The portion of the docs about command line editing with psql, cygwin
being mentioned as an exception, does not apply AFAIK.
- Mentioning more the packaging options that exist to not have to
install individual MSIs would be a good addition.

> So I think that it's pretty darn helpful to have some installation
> instructions in the documentation for stuff like this, just like I
> think it's useful that in the documentation index we tell people how
> to get the doc toolchain working on various platforms. I understand
> the concern about seeming to endorse particular Perl distributions or
> other software bundles, but I also don't like the idea of telling
> people something that boils down to "hey, it's possible to get this to
> compile on Windows, and we know some methods that do work, but we're
> not going to tell you what they are because we don't want to endorse
> anything so ... good luck!". If we know a set of things that work, I
> think we should list them in the documentation and just say that we're
> not endorsing the use of these particular distributions but just
> telling you that we've tested with them. And then I think we should
> update that as things change.

+  <para>
+   On Windows, you can find packages for build dependencies using
+   <ulink url="https://www.msys2.org/">MSYS2</ulink>
+   or <ulink url="https://chocolatey.org/">Chocolatey</ulink>.
+  </para>

The last patch I've sent has this diff.  Were you thinking about
completing this list with more options and adding more command-level
instructions about how to set up these environments in our docs?  We
could just point to anything provided by these projects.  As far as I
can see, MSYS2 and chocolatey are the interesting ones to mention, and
these are used in the buildfarm at some extent.
--
Michael


signature.asc
Description: PGP signature


Re: Should we add a compiler warning for large stack frames?

2024-04-11 Thread Tom Lane
Andres Freund  writes:
> On 2024-04-11 16:35:58 -0400, Tom Lane wrote:
>> Indeed.  I recall reading, not long ago, some Linux kernel docs to the
>> effect that automatic stack growth is triggered by a reference into
>> the page just below what is currently mapped as your stack, and
>> therefore allocating a stack frame greater than one page has the
>> potential to cause SIGSEGV rather than the desired stack extension.

> I think it's more than a single page, but I'm not entirely sure either. I
> think some compilers inject artificial stack accesses when extending the stack
> by a lot, but I don't remember the details.

Hmm.  You're right that I was misremembering the typical RAM page
size.  The kernel must be allowing stack frames bigger than 4K,
or we'd see problems everywhere.  I wonder how much bigger ...

> frame size  warnings
> 4096        155
> 8192        111
> 16384       36
> 32768       14
> 65536       8

> Suggests that starting somewhere around 16-32k might be reasonable?

I'm hesitant to touch more than a handful of places on the strength
of the info we've got; either it's wasted work or it's not enough
work, and we don't know which.  Might be time for some
experimentation.

regards, tom lane




Re: SET ROLE documentation improvement

2024-04-11 Thread Michael Paquier
On Thu, Apr 11, 2024 at 02:21:49PM -0500, Nathan Bossart wrote:
> No objections here.  I'll give this a few days for others to comment.  I'm
> not particularly interested in back-patching this since it's arguably not
> fixing anything that's incorrect, but if anyone really wants me to, I will.

HEAD looks fine based on what I'm reading in the patch.  If there are
more voices in favor of a backpatch, it could always happen later.
--
Michael


signature.asc
Description: PGP signature


Re: session username in default psql prompt?

2024-04-11 Thread jian he
On Tue, Mar 26, 2024 at 7:14 AM Andrew Dunstan  wrote:
>
>
>
> On Mon, Mar 25, 2024 at 9:14 AM Jelte Fennema-Nio  wrote:
>>
>> On Mon, 25 Mar 2024 at 14:06, Robert Haas  wrote:
>> > On Mon, Mar 25, 2024 at 4:30 AM Jelte Fennema-Nio  
>> > wrote:
>> > > That problem seems easy to address by adding a newline into the
>> > > default prompt.
>> >
>> > Ugh. Please, no!
>>
>> I guess it's partially a matter of taste, but personally I'm never
>> going back to a single line prompt. It's so nice for zoomed-in demos
>> that your SQL queries don't get broken up.
>
>
>
> Very much a matter of taste. I knew when I saw your suggestion there would 
> be some kickback. If horizontal space is at a premium, vertical space is 
> doubly so, I suspect.
>

The change (session username in the default psql prompt) is quite visible.
Maybe this time we can conduct a poll,
but in a way that reaches more people?




pg_upgrade failed : logical replication : alter_subscription_add_log

2024-04-11 Thread Perumal Raj
Hi Community

I am trying to upgrade PostgreSQL (RHEL 7) from version 13.7 to 15.6 using
pglogical.
My Standby(destination) machine has following rpms,


postgresql13-pglogical-3.7.16-1.el7.x86_64
pglogical_15-2.4.3-1.rhel7.x86_64

And Primary(Source) has ,


postgresql13-pglogical-3.7.16-1.el7.x86_64
pg_upgrade check mode went fine, but it failed while running in real mode.

pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 13027; 1255 3375648004 FUNCTION
alter_subscription_add_log("text", "text", boolean, "regclass", "text"[],
"text"[]) postgres
pg_restore: error: could not execute query: ERROR:  could not find function
"pglogical_alter_subscription_add_log" in file
"/usr/pgsql-15/lib/pglogical.so"
Command was: CREATE FUNCTION
"pglogical"."alter_subscription_add_log"("sub_name" "text", "log_name"
"text", "log_to_file" boolean DEFAULT true, "log_to_table" "regclass"
DEFAULT NULL::"regclass", "conflict_type" "text"[] DEFAULT NULL::"text"[],
"conflict_resolution" "text"[] DEFAULT NULL::"text"[]) RETURNS boolean
LANGUAGE "c"
AS '$libdir/pglogical', 'pglogical_alter_subscription_add_log';

-- For binary upgrade, handle extension membership the hard way
ALTER EXTENSION "pglogical" ADD FUNCTION
"pglogical"."alter_subscription_add_log"("sub_name" "text", "log_name"
"text", "log_to_file" boolean, "log_to_table" "regclass", "conflict_type"
"text"[], "conflict_resolution" "text"[]);

Am I missing any packages?

Thanks,


Re: Support a wildcard in backtrace_functions

2024-04-11 Thread Michael Paquier
On Wed, Apr 10, 2024 at 11:07:00AM +0200, Jelte Fennema-Nio wrote:
> I think patch 0002 should be considered an Open Item for PG17. Since
> it's proposing changing the name of the newly introduced
> backtrace_on_internal_error GUC. If we want to change it in this way,
> we should do it before the release and preferably before the beta.

Indeed, it makes little sense to redesign this at the beginning of v18
if we're unhappy with the current outcome of v17.  So tweaking that is
worth considering at this stage.

log_backtrace speaks a bit more to me as a name for this stuff because
it logs a backtrace.  Now, there is consistency on HEAD as well
because these GUCs are all prefixed with "backtrace_".  Would
something like a backtrace_mode where we have an enum rather than a
boolean be better?  One thing would be to redesign the existing GUC as
having two values on HEAD as of:
- "none", to log nothing.
- "internal", to log backtraces for internal errors.

Then this could be extended with more modes, to discuss in future
releases as new features.
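
To make that shape concrete, here is a minimal sketch of how the enum
could be declared (my illustration, not from any posted patch; the
LogBacktraceMode/LOGBACKTRACE_* names are assumed):

    /* assumed names, for illustration only */
    typedef enum
    {
        LOGBACKTRACE_NONE,          /* log no backtraces */
        LOGBACKTRACE_INTERNAL       /* log backtraces for internal errors */
    } LogBacktraceMode;

    static const struct config_enum_entry backtrace_mode_options[] = {
        {"none", LOGBACKTRACE_NONE, false},
        {"internal", LOGBACKTRACE_INTERNAL, false},
        {NULL, 0, false}
    };

Adding a new mode later would then just be a matter of adding an enum
member and an entry to the options array.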

What you are suggesting with backtrace_min_level is an entirely new
feature.  Perhaps using an extra GUC to control the interactions of
the modes that can be assigned to the primary GUC "log_backtrace" in
your 0002 is better, but all that sounds like v18 material at this
stage.  The code that resolves the interactions between the existing
GUC and the new "level" GUC does not use LOGBACKTRACE_ALL.  Perhaps it
should use a case/switch.

+   gettext_noop("Each level includes all the levels that follow it. 
The later"
+" the level, the fewer backtraces are created."),

> I added it to the Open Items list[1] so we don't forget to at least
> decide on this.
> 
> [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

Thanks.
--
Michael


signature.asc
Description: PGP signature


Re: Allow non-superuser to cancel superuser tasks.

2024-04-11 Thread Michael Paquier
On Thu, Apr 11, 2024 at 04:55:59PM +0500, Kirill Reshke wrote:
> The test doesn't fail because pg_terminate_backend actually does its
> job: autovac is killed. But while dying, autovac also receives a
> segfault. That's because of injection points:
> 
> (gdb) bt
> #0  0x56361c3379ea in tas (lock=0x7fbcb9632224 <error: Cannot
> access memory at address 0x7fbcb9632224>) at
> ../../../../src/include/storage/s_lock.h:228
> #1  ConditionVariableCancelSleep () at condition_variable.c:238
> #2  0x56361c337e4b in ConditionVariableBroadcast
> (cv=0x7fbcb66f498c) at condition_variable.c:310
> #3  0x56361c330a40 in CleanupProcSignalState (status=<optimized
> out>, arg=<optimized out>) at procsignal.c:240
> #4  0x56361c328801 in shmem_exit (code=code@entry=1) at ipc.c:276
> #5  0x56361c3288fc in proc_exit_prepare (code=code@entry=1) at ipc.c:198
> #6  0x56361c3289bf in proc_exit (code=code@entry=1) at ipc.c:111
> #7  0x56361c49ffa8 in errfinish (filename=<optimized out>,
> lineno=<optimized out>, funcname=0x56361c654370 <__func__.16>
> "ProcessInterrupts") at elog.c:592
> #8  0x56361bf7191e in ProcessInterrupts () at postgres.c:3264
> #9  0x56361c3378d7 in ConditionVariableTimedSleep
> (cv=0x7fbcb9632224, timeout=timeout@entry=-1,
> wait_event_info=117440513) at condition_variable.c:196
> #10 0x56361c337d0b in ConditionVariableTimedSleep
> (wait_event_info=<optimized out>, timeout=-1, cv=<optimized out>) at
> condition_variable.c:135
> #11 ConditionVariableSleep (cv=<optimized out>,
> wait_event_info=<optimized out>) at condition_variable.c:98
> 
> discovered because of
> # Release injection point.
> $node->safe_psql('postgres',
> "SELECT injection_point_detach('autovacuum-worker-start');");
> added
> 
> v4 also suffers from that. I will try to fix that.

I can see this stack trace as well.  Capturing a bit more than your
own stack, this is crashing in the autovacuum worker while waiting on
a condition variable when processing a ProcessInterrupts().

That may point to a legit bug with condition variables in this
context, actually?  From what I can see, a backend process waiting on a
condition variable is normally able to process an interrupt triggered by
a signal correctly.
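
For reference, the canonical wait pattern looks roughly like this (cv,
the predicate and the wait-event value are placeholders in this sketch):

    ConditionVariablePrepareToSleep(cv);
    while (!condition_is_satisfied())       /* placeholder predicate */
        ConditionVariableSleep(cv, wait_event_info);
    ConditionVariableCancelSleep();

The interrupt processing visible in the backtrace happens inside
ConditionVariableTimedSleep(), i.e. while the process is still
registered as a waiter, which is what makes the exit-time cleanup path
interesting here.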
--
Michael


signature.asc
Description: PGP signature


Re: POC: GROUP BY optimization

2024-04-11 Thread Tom Lane
I'm late to the party, and I apologize for not having paid attention
to this thread for months ... but I am wondering why 0452b461b got
committed.  It seems to add a substantial amount of complexity and
planner cycles in return for not much.  The reason why I'm dubious
about it can be found in this comment that the patch deleted:

- * In principle it might be interesting to consider other orderings of the
- * GROUP BY elements, which could match the sort ordering of other
- * possible plans (eg an indexscan) and thereby reduce cost.  We don't
- * bother with that, though.  Hashed grouping will frequently win anyway.

Indeed, if you remove the

SET enable_hashagg = off;
SET enable_seqscan = off;

lines added to aggregates.sql, you find that every single one of the
test cases added there changes plan, except for the one case that is
there to prove that the planner *doesn't* pick this type of plan.
That shows that the planner doesn't believe any of the plans developed
here are cheaper than other alternatives it would normally consider
(usually HashAgg).  So it sure seems like we're spending a lot of
effort on uninteresting plans.  Maybe this is an artifact of testing
against toy tables and a useful win could be shown on bigger tables,
but I don't see any indication in the thread that anyone actually
demonstrated that.

I might hold still for this patch anyway if the patch were simpler
and more obviously correct, but there are scary bits in it:

* The very first hunk causes get_eclass_for_sort_expr to have
side-effects on existing EquivalenceClass data structures.
What is the argument that that's not going to cause problems?
At the very least it introduces asymmetry, in that the first
caller wins the chance to change the EC's ec_sortref and later
callers won't be able to.

* I'm pretty unconvinced by group_keys_reorder_by_pathkeys (which
I notice has already had one band-aid added to it since commit).
In particular, it seems to believe that the pathkeys and clauses
lists match one-for-one, but I seriously doubt that that invariant
remains guaranteed after the cleanup steps

/* append the remaining group pathkeys (will be treated as not sorted) */
*group_pathkeys = list_concat_unique_ptr(new_group_pathkeys,
 *group_pathkeys);
*group_clauses = list_concat_unique_ptr(new_group_clauses,
*group_clauses);

For that to be reliable, the SortGroupClauses added to
new_group_clauses in the main loop have to be exactly those
that are associated with the same pathkeys in the old lists.
I doubt that that's necessarily true in the presence of redundant
grouping clauses.  (Maybe we can't get here with any redundant
grouping clauses, but still, we don't really guarantee uniqueness of
SortGroupClauses, much less that they are never copied which is what
you need if you want to believe that pointer equality is sufficient
for de-duping here.  PathKeys are explicitly made to be safe to compare
pointer-wise, but I know of no such guarantee for SortGroupClauses.)
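
As a hypothetical illustration of that last point (not planner code;
assume sgc points to some existing SortGroupClause):

    List   *a = list_make1(copyObject(sgc));
    List   *b = list_make1(copyObject(sgc));   /* equal(), but a distinct copy */

    /* pointer-based de-dup keeps both entries ... */
    Assert(list_length(list_concat_unique_ptr(a, b)) == 2);
    /* ... whereas equal()-based de-dup would have kept only one */

So the concatenation above only removes duplicates that are literally
the same SortGroupClause object, not ones that merely compare equal().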

* It seems like root->processed_groupClause no longer has much to do
with the grouping behavior, which is scary given how much code still
believes that it does.  I suspect there are bugs lurking there, and
am not comforted by the fact that the patch changed exactly nothing
in the pathnodes.h documentation of that field.  This comment looks
pretty silly now too:

/* Preprocess regular GROUP BY clause, if any */
root->processed_groupClause = list_copy(parse->groupClause);

What "preprocessing" is going on there?  This comment was adequate
before, when the code line invoked preprocess_groupclause which had
a bunch of commentary; but now it has to stand on its own and it's
not doing a great job of that.

* Speaking of pathnodes.h, PathKeyInfo is a pretty uninformative node
type name, and the comments provided for it are not going to educate
anybody.  What is the "association" between the pathkeys and clauses?
I guess the clauses are supposed to be SortGroupClauses, but not all
pathkeys match up to SortGroupClauses, so what then?  This is very
underspecified, and fuzzy thinking about data structures tends to lead
to bugs.

So I'm quite afraid that there are still bugs lurking here.
What's more, given that the plans this patch makes apparently
seldom win when you don't put a thumb on the scales, it might
take a very long time to isolate those bugs.  If the patch
produces a faulty plan (e.g. not sorted properly) we'd never
notice if that plan isn't chosen, and even if it is chosen
it probably wouldn't produce anything as obviously wrong as
a crash.

If this patch were producing better results I'd be more excited
about putting more work into it.  But on the basis of what I'm
seeing right now, I think maybe we ought to give up on it.

regards, tom lane




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
On Thu, Apr 11, 2024 at 11:27 PM Robert Haas  wrote:
> On Thu, Apr 11, 2024 at 1:46 PM Alexander Korotkov  
> wrote:
> > I understand that I'm the bad guy of this release, not sure if my
> > opinion counts.
> >
> > But what is going on here?  I hope this work is targeting pg18.
> > Otherwise, do I get this right that this post feature-freeze works on
> > designing a new API?  Yes, 27bc1772fc masked the problem.  But it was
> > committed on Mar 30.  So that couldn't justify why the proper API
> > wasn't designed in time.  Are we judging different commits with the
> > same criteria?
>
> I mean, Andres already said that the cleanup was needed possibly in
> 17, and possibly in 18.
>
> As far as fairness is concerned, you'll get no argument from me if you
> say the streaming read stuff was all committed far later than it
> should have been. I said that in the very first email I wrote on the
> "post-feature freeze cleanup" thread. But if you're going to argue
> that there's no opportunity for anyone to adjust patches that were
> sideswiped by the reverts of your patches, and that if any such
> adjustments seem advisable we should just revert the sideswiped
> patches entirely, I don't agree with that, and I don't see why anyone
> would agree with that. I think it's fine to have the discussion, and
> if the result of that discussion is that somebody says "hey, we want
> to do X in 17 for reason Y," then we can discuss that proposal on its
> merits, taking into account the answers to questions like "why wasn't
> this done before the freeze?" and "is that adjustment more or less
> risky than just reverting?" and "how about we just leave it alone for
> now and deal with it next release?".

I don't think 041b96802e could be sideswiped by 27bc1772fc.  The "Use
streaming I/O in ANALYZE" patch has the same issue before 27bc1772fc,
which was committed on Mar 30.  So, in the worst case, 27bc1772fc
steals a week of work.  I can imagine that without 27bc1772fc, a new API
could have been proposed days before FF.  This means I saved patch authors
from what you call, in my case, a "desperate rush".  Huh!

> > IMHO, 041b96802e should be just reverted.
>
> IMHO, it's too early to decide that, because we don't know what change
> concretely is going to be proposed, and there has been no discussion
> of why that change, whatever it is, belongs in this release or next
> release.
>
> I understand that you're probably not feeling great about being asked
> to revert a bunch of stuff here, and I do think it is a fair point to
> make that we need to be even-handed and not overreact. Just because
> you had some patches that had some problems doesn't mean that
> everything that got touched by the reverts can or should be whacked
> around a whole bunch more post-freeze, especially since that stuff was
> *also* committed very late, in haste, way closer to feature freeze
> than it should have been. At the same time, it's also important to
> keep in mind that our goal here is not to punish people for being bad,
> or to reward them for being good, or really to make any moral
> judgements at all, but to produce a quality release. I'm sure that,
> where possible, you'd prefer to fix bugs in a patch you committed
> rather than revert the whole thing as soon as anyone finds any
> problem. I would also prefer that, both for your patches, and for
> mine. And everyone else deserves that same consideration.

I expressed my thoughts about producing a better release without a
desperate rush post-FF in my reply to Andres [2].

Links.
1. 
https://www.postgresql.org/message-id/CA%2BTgmobZUnJQaaGkuoeo22Sydf9%3DmX864W11yZKd6sv-53-aEQ%40mail.gmail.com
2. 
https://www.postgresql.org/message-id/CAPpHfdt%2BcCj6j6cR5AyBThP6SyDf6wxAz4dU-0NdXjfpiFca7Q%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: Add notes to pg_combinebackup docs

2024-04-11 Thread David Steele




On 4/11/24 20:51, Tomas Vondra wrote:

On 4/11/24 02:01, David Steele wrote:


I have a hard time seeing this feature as being very useful, especially
for large databases, until pg_combinebackup works on tar (and compressed
tar). Right now restoring an incremental requires at least twice the
space of the original cluster, which is going to take a lot of users by
surprise.


I do agree it'd be nice if pg_combinebackup worked with .tar directly,
without having to extract the directories first. No argument there, but
as I said in the other thread, I believe that's something we can add
later. That's simply how incremental development works.


OK, sure, but if the plan is to make it practical later doesn't that 
make the feature something to be avoided now?



I know you have made some improvements here for COW filesystems, but my
experience is that Postgres is generally not run on such filesystems,
though that is changing a bit.


I'd say XFS is a pretty common choice, for example. And it's one of the
filesystems that work great with pg_combinebackup.


XFS has certainly advanced more than I was aware.


However, who says this has to be the filesystem the Postgres instance
runs on? Who in their right mind put backups on the same volume as the
instance anyway? At which point it can be a different filesystem, even
if it's not ideal for running the database.


My experience is these days backups are generally placed in object 
stores. Sure, people are still using NFS but admins rarely have much 
control over those volumes. They may or may not be COW filesystems.



FWIW I think it's fine to tell users that to minimize the disk space
requirements, they should use a CoW filesystem and --copy-file-range.
The docs don't say that currently, that's true.


That would probably be a good addition to the docs.


All of this also depends on how people do the restore. With the CoW
stuff they can do a quick (and small) copy on the backup server, and
then copy the result to the actual instance. Or they can do restore on
the target directly (e.g. by mounting a r/o volume with backups), in
which case the CoW won't really help.


And again, this all requires a significant amount of setup and tooling. 
Obviously I believe good backup requires effort but doing this right 
gets very complicated due to the limitations of the tool.



But yeah, having to keep the backups as expanded directories is not
great, I'd love to have .tar. Not necessarily because of the disk space
(in my experience the compression in filesystems works quite well for
this purpose), but mostly because it's more compact and allows working
with backups as a single piece of data (e.g. it's much clearer what the
checksum of a single .tar is, compared to a directory).


But again, object stores are commonly used for backup these days and 
billing is based on data stored rather than any compression that can be 
done on the data. Of course, you'd want to store the compressed tars in 
the object store, but that does mean storing an expanded copy somewhere 
to do pg_combinebackup.


But if the argument is that all this can/will be fixed in the future, I 
guess the smart thing for users to do is wait a few releases for 
incremental backups to become a practical feature.


Regards,
-David




Re: Should we add a compiler warning for large stack frames?

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 16:35:58 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > d8f5acbdb9b made me wonder if we should add a compiler option to warn when
> > stack frames are large.  gcc compatible compilers have -Wstack-usage=limit, 
> > so
> > that's not hard.
>
> > Huge stack frames are somewhat dangerous, as they can defeat our stack-depth
> > checking logic. There are also some cases where large stack frames defeat
> > stack-protector logic by compilers/libc/os.
>
> Indeed.  I recall reading, not long ago, some Linux kernel docs to the
> effect that automatic stack growth is triggered by a reference into
> the page just below what is currently mapped as your stack, and
> therefore allocating a stack frame greater than one page has the
> potential to cause SIGSEGV rather than the desired stack extension.
> (If you feel like digging in the archives, I think this was brought
> up in the last round of lets-add-some-more-check_stack_depth-calls.)

I think it's more than a single page, but I'm not entirely sure either. I
think some compilers inject artificial stack accesses when extending the stack
by a lot, but I don't remember the details.

There certainly was the issue that strict memory overcommit does not reliably
work with larger stack extensions.

Could be worth writing a test program for...
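
Something along these lines might do as a starting point (a rough
sketch, not a polished test; whether it faults will depend on the
kernel's stack-growth heuristics and on whether the compiler emits
stack probes):

    __attribute__((noinline)) static int
    big_frame(int depth)
    {
        volatile char buf[512 * 1024];  /* frame far larger than one page */

        /* first touch lands far below the currently mapped stack */
        buf[0] = (char) depth;
        buf[sizeof(buf) - 1] = (char) depth;
        return depth > 0 ? big_frame(depth - 1) : buf[0];
    }

    int
    main(void)
    {
        return big_frame(8) & 0xff;
    }

Running it with different frame sizes (and with/without
-fstack-clash-protection) would show where the SIGSEGV threshold
actually sits on a given kernel.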


> > I don't really have an opinion about the concrete warning limit to use.
>
> Given the above, I'm tempted to say we should make it 8K.

Hm, why 8k? That's 2x the typical page size, isn't it?


> But I know we have a bunch of places that allocate page images as stack
> space, so maybe that'd be too painful.

8k does generate a fair number of warnings, 111 here.  I think it might also
be hard to ensure that inlining doesn't end up creating bigger stack frames.

frame size  warnings
4096155
8192111
16384   36
32768   14
65536   8

Suggests that starting somewhere around 16-32k might be reasonable?

One issue is of course that configuring a larger blocksize or wal_blocksize
will trigger more warnings. I guess we'd need to set the limit based on
Max(blocksize, wal_blocksize) * 2 or such.

Greetings,

Andres Freund




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
On Fri, Apr 12, 2024 at 12:04 AM Andres Freund  wrote:
> On 2024-04-11 20:46:02 +0300, Alexander Korotkov wrote:
> > I hope this work is targeting pg18.
>
> I think anything of the scope discussed by Melanie would be very clearly
> targeting 18. For 17, I don't know yet whether we should revert the the
> ANALYZE streaming read user (041b96802ef), just do a bit of comment polishing,
> or some other small change.
>
> One oddity is that before 041b96802ef, the opportunities for making the
> interface cleaner were less apparent, because c6fc50cb4028 increased the
> coupling between analyze.c and the way the table storage works.

Thank you for pointing this out about c6fc50cb4028, I've missed this.

> > Otherwise, do I get this right that this post feature-freeze works on
> > designing a new API?  Yes, 27bc1772fc masked the problem.  But it was
> > committed on Mar 30.
>
> Note that there were versions of the patch that were targeting the
> pre-27bc1772fc interface.

Sure, I've checked this before writing.  It looks quite similar to the
result of applying my revert patch [1] to the head.

Let me describe my view over the current situation.

1) If we just apply my revert patch and leave c6fc50cb4028 and
041b96802ef in the tree, then we get our table AM API narrowed.  As
you expressed, the current API requires block numbers to be 1:1 with
the actual physical on-disk location [2].  It is no secret that I think
the current API is quite restrictive.  And we're making the ANALYZE
interface narrower than it has been since 737a292b5de.  Frankly speaking, I
don't think this is acceptable.

2) Pushing down the read stream and prefetch to the heap AM runs into
the difficulties described in [3], [4].  That's quite a significant piece of work to be
done post FF.

In token of all of the above, is the in-tree state that bad? (if we
abstract the way 27bc1772fc and dd1f6b0c17 were committed).

The in-tree state provides quite a general API for analyze, supporting
even non-block storages.  There is a way to reuse existing
acquire_sample_rows() for table AMs, which have block numbers 1:1 with
the actual physical on-disk location.  It requires some cleanup for
comments and docs, but does not require us to redesing the API post
FF.

Links.
1. 
https://www.postgresql.org/message-id/CAPpHfdvuT6DnguzaV-M1UQ2whYGDojaNU%3D-%3DiHc0A7qo9HBEJw%40mail.gmail.com
2. 
https://www.postgresql.org/message-id/20240410212117.mxsldz2w6htrl36v%40awork3.anarazel.de
3. 
https://www.postgresql.org/message-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG%2BROg%40mail.gmail.com
4. 
https://www.postgresql.org/message-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY%3D0R_oG5LHBH7Gw%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-11 Thread David Steele

On 4/11/24 20:26, Tomas Vondra wrote:


On 4/11/24 03:52, David Steele wrote:

On 4/11/24 10:23, Tom Kincaid wrote:


The extensive Beta process we have can be used to build confidence we
need in a feature that has extensive review and currently has no known
issues or outstanding objections.


I did have objections, here [1] and here [2]. I think the complexity,
space requirements, and likely performance issues involved in restores
are going to be a real problem for users. Some of these can be addressed
in future releases, but I can't escape the feeling that what we are
releasing here is half-baked.


I haven't been part of those discussions, and that part of the thread is
a couple months old already, so I'll share my view here instead.

I do not think it's half-baked. I certainly agree there are limitations,
and there's all kinds of bells and whistles we could add, but I think
the fundamental infrastructure is correct and a meaningful step forward.
Would I wish it to handle .tar for example? Sure I would. But I think
it's something we can add in the future - if we require all of this to
happen in a single release, it'll never happen.


Fair enough, but the current release is extremely limited and it would 
be best if that was well understood by users.



FWIW that discussion also mentions stuff that I think the feature should
not do. In particular, I don't think the ambition was (or should be) to
make pg_basebackup into a stand-alone tool. I always saw pg_basebackup
more as an interface to "backup steps" correctly rather than a complete
backup solution that'd manage backup registry, retention, etc.


Right -- this is exactly my issue. pg_basebackup was never easy to use 
as a backup solution and this feature makes it significantly more 
complicated. Complicated enough that it would be extremely difficult for 
most users to utilize in a meaningful way.


But they'll try because it is a new pg_basebackup feature and they'll 
assume it is there to be used. Maybe it would be a good idea to make it 
clear in the documentation that significant tooling will be required to 
make it work.


Regards,
-David




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 16:46:23 -0400, Robert Haas wrote:
> On Thu, Apr 11, 2024 at 3:52 PM Andres Freund  wrote:
> > My suspicion is that most of the false positives are caused by lots of 
> > signals
> > interrupting the pg_usleep()s. Because we measure the number of delays, not
> > the actual time since we've been waiting for the spinlock, signals
> > interrupting pg_usleep() trigger can very significantly shorten the amount 
> > of
> > time until we consider a spinlock stuck.  We should fix that.
> 
> I mean, go nuts. But <dons asbestos pants, 2 pair of asbestos socks, 3
> asbestos shirts, 2 asbestos jackets, and then hides inside of a flame-proof
> capsule at the bottom of the Pacific ocean> this is just another thing like query hints,
> where everybody says "oh, the right thing to do is fix X or Y or Z and
> then you won't need it". But of course it never actually gets fixed
> well enough that people stop having problems in the real world. And
> eventually we look like a developer community that cares more about
> our own opinion about what is right than what the experience of real
> users actually is.

I don't think that's a particularly apt comparison. If you have spinlocks that
cannot be acquired within tens of seconds, you're in a really bad situation,
regardless of whether you crash-restart or not.

Whereas with hints, you might actually be operating perfectly normally when
using hints.  Never using the wrong plan is also just an order of magnitude
harder and fuzzier problem than ensuring we don't wait for spinlocks for a
long time.


> In all seriousness, I'd really like to understand what experience
> you've had that makes this check seem useful. Because I think all of
> my experiences with it have been bad. If they weren't, the last good
> one was a very long time ago.

By far the most of the stuck spinlocks I've seen were due to bugs in
out-of-core extensions. Absurdly enough, the next common thing probably is due
to people using gdb to make an uninterruptible process break out of some code,
without a crash-restart, accidentally doing so while a spinlock is held.

Greetings,

Andres Freund




Re: [Patch] add multiple client certificate selection feature

2024-04-11 Thread Cary Huang
Hello 



I would like to share an updated patch that adds a feature to libpq to 
automatically select the best client certificate to send to the server (if it 
requests one). This feature is inspired by this email discussion years ago: 
https://www.postgresql.org/message-id/200905081539.n48Fdl2Y003286%40no.baka.org,
 which makes it easier for a single client to communicate TLS with multiple 
TLS-enabled PostgreSQL servers with different certificate setups.



Instead of specifying just one sslcert, sslkey, or sslpassword, this patch 
allows multiple to be specified and libpq is able to pick the matching one to 
send to the PostgreSQL server based on the trusted CA names sent during TLS 
handshake.
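
For anyone curious how the selection can be done at the OpenSSL level, here is
a rough sketch of the idea (my illustration, not the patch's actual code; the
candidates[] array and its fields are assumed to have been filled in from the
configured sslcert/sslkey lists): a client-certificate callback compares each
candidate's issuer against the acceptable CA names the server sends in its
CertificateRequest.

    #include <openssl/ssl.h>
    #include <openssl/x509.h>
    #include <openssl/evp.h>

    typedef struct
    {
        X509       *cert;
        EVP_PKEY   *key;
    } cert_candidate;

    static cert_candidate candidates[8];    /* assumed: loaded elsewhere */
    static int  n_candidates;

    static int
    pick_client_cert(SSL *ssl, X509 **out_cert, EVP_PKEY **out_key)
    {
        STACK_OF(X509_NAME) *ca_names = SSL_get_client_CA_list(ssl);

        for (int i = 0; i < n_candidates; i++)
        {
            X509_NAME  *issuer = X509_get_issuer_name(candidates[i].cert);

            for (int j = 0; j < sk_X509_NAME_num(ca_names); j++)
            {
                if (X509_NAME_cmp(issuer, sk_X509_NAME_value(ca_names, j)) == 0)
                {
                    *out_cert = candidates[i].cert;
                    *out_key = candidates[i].key;
                    return 1;       /* use this certificate */
                }
            }
        }
        return 0;                   /* no match: send no certificate */
    }

    /* registered with SSL_CTX_set_client_cert_cb(ctx, pick_client_cert) */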



If anyone finds it useful and would like to give it as try, I wrote a blog on 
how to test and verify this feature here: 
https://www.highgo.ca/2024/03/28/procedure-to-multiple-client-certificate-feature/



thank you



Best regards



Cary Huang

v3-0001-multiple_client_certificate_selection_support.patch
Description: Binary data


Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 16:46:10 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2024-04-11 16:11:40 -0400, Tom Lane wrote:
> >> We wouldn't need to fix it, if we simply removed the NUM_DELAYS
> >> limit.  Whatever kicked us off the sleep doesn't matter, we might
> >> as well go check the spinlock.
> 
> > I suspect we should fix it regardless of whether we keep NUM_DELAYS. We
> > shouldn't increase cur_delay faster just because a lot of signals are coming
> > in.
> 
> I'm unconvinced there's a problem there.

Obviously that's a different aspect than efficiency, but in local, admittedly
extreme, testing I've seen stuck spinlocks being detected in a fraction of the
normally expected time. A spinlock that ends up sleeping for close to a second
after a relatively short amount of time surely isn't good for predictable
performance.

IIRC the bad case was on a hot standby, with some recovery conflict causing
the startup process to send a lot of signals.


> Also, what would you do about this that wouldn't involve adding kernel calls
> for gettimeofday?  Admittedly, if we only do that when we're about to sleep,
> maybe it's not so awful; but it's still adding complexity that I'm
> unconvinced is warranted.

At least on !windows, pg_usleep() uses nanosleep(), which, when interrupted by
a signal, can return the remaining time until the expiration of the timer.
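
As a rough illustration of what the fix could look like (my sketch, not
a proposed patch): have the sleep report how long it actually slept, so
the caller can accumulate real elapsed time instead of counting delays.

    #include <time.h>

    /* Sleep up to "usec" microseconds; return the microseconds actually slept. */
    static long
    sleep_measured(long usec)
    {
        struct timespec req;
        struct timespec rem;

        req.tv_sec = usec / 1000000L;
        req.tv_nsec = (usec % 1000000L) * 1000L;

        if (nanosleep(&req, &rem) == 0)
            return usec;            /* slept the whole interval */

        /* interrupted by a signal: credit only the time actually slept */
        return usec - (rem.tv_sec * 1000000L + rem.tv_nsec / 1000L);
    }

A caller like perform_spin_delay() could then give up only once the
accumulated total exceeds some wall-clock budget, rather than after a
fixed number of (possibly signal-shortened) sleeps.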

I suspect that on windows computing the time when a signal arrived wouldn't be
expensive, compared to all the other overhead implied by our signal handling
emulation.

Greetings,

Andres Freund




Re: Table AM Interface Enhancements

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 20:46:02 +0300, Alexander Korotkov wrote:
> I hope this work is targeting pg18.

I think anything of the scope discussed by Melanie would be very clearly
targeting 18. For 17, I don't know yet whether we should revert the
ANALYZE streaming read user (041b96802ef), just do a bit of comment polishing,
or some other small change.

One oddity is that before 041b96802ef, the opportunities for making the
interface cleaner were less apparent, because c6fc50cb4028 increased the
coupling between analyze.c and the way the table storage works.


> Otherwise, do I get this right that this post feature-freeze works on
> designing a new API?  Yes, 27bc1772fc masked the problem.  But it was
> committed on Mar 30.

Note that there were versions of the patch that were targeting the
pre-27bc1772fc interface.

Greetings,

Andres Freund




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Robert Haas
On Thu, Apr 11, 2024 at 3:52 PM Andres Freund  wrote:
> My suspicion is that most of the false positives are caused by lots of signals
> interrupting the pg_usleep()s. Because we measure the number of delays, not
> the actual time since we've been waiting for the spinlock, signals
> interrupting pg_usleep() trigger can very significantly shorten the amount of
> time until we consider a spinlock stuck.  We should fix that.

I mean, go nuts. But <dons asbestos pants, 2 pair of asbestos socks, 3
asbestos shirts, 2 asbestos jackets, and then hides inside of a flame-proof
capsule at the bottom of the Pacific ocean> this is just another thing like query hints,
where everybody says "oh, the right thing to do is fix X or Y or Z and
then you won't need it". But of course it never actually gets fixed
well enough that people stop having problems in the real world. And
eventually we look like a developer community that cares more about
our own opinion about what is right than what the experience of real
users actually is.

> I don't think that's a fair description of the situation. It supposes that the
> alternative to the PANIC is that the problem is detected and resolved some
> other way. But, depending on the spinlock, the problem will not be detected by
> automated checks for the system being up. IME you end up with a system that's
> degraded in a complicated hard to understand way, rather than one that's just
> down.

I'm interested to read that you've seen this actually happen and that
you got that result. What I would have thought would happen is that,
within a relatively short period of time, every backend in the system
would pile up waiting for that spinlock and the whole system would
become completely unresponsive. I mean, I know it depends on exactly
which spinlock it is. But, I would have thought that if this was
happening, it would be happening because some regular backend died in
a weird way, and if that is indeed what happened, then it's likely
that the other backends are doing similar kinds of work, because
that's how application workloads typically behave, so they'll probably
all hit the part of the code where they need that spinlock too, and
now everybody's just spinning.

If it's something like a WAL receiver mutex or the checkpointer mutex
or even a parallel query mutex, then I guess it would look different.
But even then, what I'd expect to see is all backends of type X pile
up on the stuck mutex, and when you check 'ps' or 'top',  you go "oh
hmm, all my WAL receivers are at 100% CPU" and you get a backtrace or
an strace and you go "hmm". Now, I agree that in this kind of scenario
where only some backends lock up, automated checks are not necessarily
going to notice the problem - but a PANIC is hardly better. Now you
just have a system that keeps PANICing, which liveness checks aren't
necessarily going to notice either.

In all seriousness, I'd really like to understand what experience
you've had that makes this check seem useful. Because I think all of
my experiences with it have been bad. If they weren't, the last good
one was a very long time ago.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Tom Lane
Andres Freund  writes:
> On 2024-04-11 16:11:40 -0400, Tom Lane wrote:
>> We wouldn't need to fix it, if we simply removed the NUM_DELAYS
>> limit.  Whatever kicked us off the sleep doesn't matter, we might
>> as well go check the spinlock.

> I suspect we should fix it regardless of whether we keep NUM_DELAYS. We
> shouldn't increase cur_delay faster just because a lot of signals are coming
> in.

I'm unconvinced there's a problem there.  Also, what would you do
about this that wouldn't involve adding kernel calls for gettimeofday?
Admittedly, if we only do that when we're about to sleep, maybe it's
not so awful; but it's still adding complexity that I'm unconvinced
is warranted.

regards, tom lane




Re: Should we add a compiler warning for large stack frames?

2024-04-11 Thread Tom Lane
Andres Freund  writes:
> d8f5acbdb9b made me wonder if we should add a compiler option to warn when
> stack frames are large.  gcc compatible compilers have -Wstack-usage=limit, so
> that's not hard.

> Huge stack frames are somewhat dangerous, as they can defeat our stack-depth
> checking logic. There are also some cases where large stack frames defeat
> stack-protector logic by compilers/libc/os.

Indeed.  I recall reading, not long ago, some Linux kernel docs to the
effect that automatic stack growth is triggered by a reference into
the page just below what is currently mapped as your stack, and
therefore allocating a stack frame greater than one page has the
potential to cause SIGSEGV rather than the desired stack extension.
(If you feel like digging in the archives, I think this was brought
up in the last round of lets-add-some-more-check_stack_depth-calls.)

> Warnings in src/bin aren't as interesting as warnings in backend code, as the
> latter is much more likely to be "exposed" to deep stacks and could be
> vulnerable due to stack overflows.

Probably, but it's still the case that such code is relying on the
original stack allocation being large enough already.

> I don't really have an opinion about the concrete warning limit to use.

Given the above, I'm tempted to say we should make it 8K.  But I know
we have a bunch of places that allocate page images as stack space,
so maybe that'd be too painful.

regards, tom lane




Re: Table AM Interface Enhancements

2024-04-11 Thread Robert Haas
On Thu, Apr 11, 2024 at 1:46 PM Alexander Korotkov  wrote:
> I understand that I'm the bad guy of this release, not sure if my
> opinion counts.
>
> But what is going on here?  I hope this work is targeting pg18.
> Otherwise, do I get this right that this post feature-freeze works on
> designing a new API?  Yes, 27bc1772fc masked the problem.  But it was
> committed on Mar 30.  So that couldn't justify why the proper API
> wasn't designed in time.  Are we judging different commits with the
> same criteria?

I mean, Andres already said that the cleanup was needed possibly in
17, and possibly in 18.

As far as fairness is concerned, you'll get no argument from me if you
say the streaming read stuff was all committed far later than it
should have been. I said that in the very first email I wrote on the
"post-feature freeze cleanup" thread. But if you're going to argue
that there's no opportunity for anyone to adjust patches that were
sideswiped by the reverts of your patches, and that if any such
adjustments seem advisable we should just revert the sideswiped
patches entirely, I don't agree with that, and I don't see why anyone
would agree with that. I think it's fine to have the discussion, and
if the result of that discussion is that somebody says "hey, we want
to do X in 17 for reason Y," then we can discuss that proposal on its
merits, taking into account the answers to questions like "why wasn't
this done before the freeze?" and "is that adjustment more or less
risky than just reverting?" and "how about we just leave it alone for
now and deal with it next release?".

> IMHO, 041b96802e should be just reverted.

IMHO, it's too early to decide that, because we don't know what change
concretely is going to be proposed, and there has been no discussion
of why that change, whatever it is, belongs in this release or next
release.

I understand that you're probably not feeling great about being asked
to revert a bunch of stuff here, and I do think it is a fair point to
make that we need to be even-handed and not overreact. Just because
you had some patches that had some problems doesn't mean that
everything that got touched by the reverts can or should be whacked
around a whole bunch more post-freeze, especially since that stuff was
*also* committed very late, in haste, way closer to feature freeze
than it should have been. At the same time, it's also important to
keep in mind that our goal here is not to punish people for being bad,
or to reward them for being good, or really to make any moral
judgements at all, but to produce a quality release. I'm sure that,
where possible, you'd prefer to fix bugs in a patch you committed
rather than revert the whole thing as soon as anyone finds any
problem. I would also prefer that, both for your patches, and for
mine. And everyone else deserves that same consideration.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 16:11:40 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2024-04-11 15:24:28 -0400, Robert Haas wrote:
> >> Or, rip out the whole, whole mechanism and just don't PANIC.
>
> > I continue to believe that that'd be a quite bad idea.
>
> I'm warming to it myself.
>
> > My suspicion is that most of the false positives are caused by lots of 
> > signals
> > interrupting the pg_usleep()s. Because we measure the number of delays, not
> > the actual time since we've been waiting for the spinlock, signals
> > interrupting pg_usleep() trigger can very significantly shorten the amount 
> > of
> > time until we consider a spinlock stuck.  We should fix that.
>
> We wouldn't need to fix it, if we simply removed the NUM_DELAYS
> limit.  Whatever kicked us off the sleep doesn't matter, we might
> as well go check the spinlock.

I suspect we should fix it regardless of whether we keep NUM_DELAYS. We
shouldn't increase cur_delay faster just because a lot of signals are coming
in.  If it were just user triggered signals it'd probably not be worth
worrying about, but we do sometimes send a lot of signals ourselves...


> Also, you propose in your other message replacing spinlocks with lwlocks.
> Whatever the other merits of that, I notice that we have no timeout or
> "stuck lwlock" detection.

True. And that's not great. But at least lwlocks can be identified in
pg_stat_activity, which does help some.

Greetings,

Andres Freund




Re: Should we add a compiler warning for large stack frames?

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 15:16:57 -0400, Andrew Dunstan wrote:
> On 2024-04-11 Th 15:01, Andres Freund wrote:
> > [1345/1940 42  69%] Compiling C object 
> > src/bin/pg_verifybackup/pg_verifybackup.p/pg_verifybackup.c.o
> > ../../../../../home/andres/src/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:
> >  In function 'verify_file_checksum':
> > ../../../../../home/andres/src/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:842:1:
> >  warning: stack usage is 131232 bytes [-Wstack-usage=]
> >842 | verify_file_checksum(verifier_context *context, manifest_file *m,
> >| ^~~~
> > 
> This one's down to me. I asked Robert some time back why we were using a
> very conservative buffer size, and he agreed we could probably make it
> larger, but the number chosen is mine, not his. It was a completely
> arbitrary choice.
> 
> I'm happy to reduce it, but it's not clear to me why we care that much for a
> client binary. There's no stack depth checking going on here.

There isn't - but that doesn't necessarily mean it's great to just use a
lot of stack. It can even mean that you ought to be more careful, because
you'll just crash, rather than get an error about having exceeded the stack
size.  When the stack size is increased by more than a few pages, the OS /
compiler defenses for detecting that don't work as well, if you're unlucky.


128k is probably not going to be an issue in practice. However, it also seems
not great from a performance POV to use this much stack in a function that's
called fairly often.  I'd allocate the buffer in verify_backup_checksums() and
reuse it across all to-be-checked files.
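
A hedged sketch of that suggestion (the iteration helper and the extra
buffer parameter are mine, for illustration only -- the real code walks
the manifest's file hash table):

    #define READ_CHUNK_SIZE     (128 * 1024)    /* whatever size ends up chosen */

    static void
    verify_backup_checksums(verifier_context *context)
    {
        uint8      *buffer = pg_malloc(READ_CHUNK_SIZE);
        manifest_file *m;

        /* hypothetical helper standing in for the real hash-table iteration */
        while ((m = next_file_to_verify(context)) != NULL)
            verify_file_checksum(context, m, buffer);   /* buffer passed in */

        pg_free(buffer);
    }

That keeps the large allocation off the stack and, since it happens
once, also avoids re-allocating it per file.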


Greetings,

Andres Freund




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Tom Lane
Andres Freund  writes:
> On 2024-04-11 15:24:28 -0400, Robert Haas wrote:
>> Or, rip out the whole, whole mechanism and just don't PANIC.

> I continue to believe that that'd be a quite bad idea.

I'm warming to it myself.

> My suspicion is that most of the false positives are caused by lots of signals
> interrupting the pg_usleep()s. Because we measure the number of delays, not
> the actual time since we've been waiting for the spinlock, signals
> interrupting pg_usleep() trigger can very significantly shorten the amount of
> time until we consider a spinlock stuck.  We should fix that.

We wouldn't need to fix it, if we simply removed the NUM_DELAYS
limit.  Whatever kicked us off the sleep doesn't matter, we might
as well go check the spinlock.

Also, you propose in your other message replacing spinlocks with
lwlocks.  Whatever the other merits of that, I notice that we have
no timeout or "stuck lwlock" detection.  So that would basically
remove the stuck-spinlock behavior in an indirect way, without
adding any safety measures that would justify thinking that it's
less likely we needed stuck-lock detection than before.

regards, tom lane




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-10 21:52:59 -0400, Tom Lane wrote:
> Less drastically, I wonder if we should call finish_spin_delay
> at all in these off-label uses of perform_spin_delay.  What
> we're trying to measure there is the behavior of TAS() spin loops,
> and I'm not sure that what LWLockWaitListLock and the bufmgr
> callers are doing should be assumed to have timing behavior
> identical to that.

I think the difference between individual spinlocks is bigger than between
spinlocks and lwlock/buffer header locks.


I think we probably should move away from having any spinlocks.  I tried to
just replace them with lwlocks, but today the overhead is too big. The issue
isn't primarily the error handling or such, but the fact that rwlocks are more
expensive than simple exclusive locks. The main differences are:

1) On x86 a spinlock release just needs a compiler barrier, but an rwlock
   needs an atomic op.

2) Simple mutex acquisition is easily done with an atomic-exchange, which is
   much harder for an rwlock (as the lock has more states, so just setting to
   "locked" and checking the return value isn't sufficient). Once a lock is
   contended, a atomic compare-exchange ends up with lots of retries due to
   concurrent changes.

It might be that we can still get away with just removing spinlocks - to my
knowledge we have one very heavily contended performance critical spinlock,
XLogCtl->Insert->insertpos_lck.  I think Thomas and I have come up with a way
to do away with that spinlock.


OTOH, there are plenty other lwlocks where we pay the price of lwlocks being
an rwlock, but just use the exclusive mode. So perhaps we should just add a
exclusive-only lwlock variant.


Greetings,

Andres Freund




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Tom Lane
Robert Haas  writes:
> On Wed, Apr 10, 2024 at 9:53 PM Tom Lane  wrote:
>> Maybe we should rip out the whole mechanism and hard-wire
>> spins_per_delay at 1000 or so.

> Or, rip out the whole, whole mechanism and just don't PANIC.

By that you mean "remove the NUM_DELAYS limit", right?  We still ought
to sleep sometimes if we don't get a spinlock promptly, so that seems
fairly orthogonal to the other points raised in this thread.

Having said that, it's not an insane suggestion.  Our coding rules
for spinlocks are tight enough that a truly stuck spinlock should
basically never happen, and certainly it basically never happens in
developer testing (IME anyway, maybe other people break things at
that level more often).  Besides, if it does happen it wouldn't be
different in a user's eyes from other infinite or near-infinite loops.
I don't think we could risk putting in a CHECK_FOR_INTERRUPTS, so
getting out of it would require "kill -9" or so; but it's hardly
unheard-of for other problematic loops to not have such a check.

regards, tom lane




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Andres Freund
Hi,

On 2024-04-11 15:24:28 -0400, Robert Haas wrote:
> On Wed, Apr 10, 2024 at 9:53 PM Tom Lane  wrote:
> > Maybe we should rip out the whole mechanism and hard-wire
> > spins_per_delay at 1000 or so.
> 
> Or, rip out the whole, whole mechanism and just don't PANIC.

I continue to believe that that'd be a quite bad idea.


My suspicion is that most of the false positives are caused by lots of signals
interrupting the pg_usleep()s. Because we measure the number of delays, not
the actual time since we've been waiting for the spinlock, signals
interrupting pg_usleep() can very significantly shorten the amount of
time until we consider a spinlock stuck.  We should fix that.



> To believe that the PANIC is the right idea, we have to suppose that
> we have stuck-spinlock bugs that people actually hit, but that those
> people don't hit them often enough to care, as long as the system
> resets when the spinlock gets stuck, instead of hanging. I can't
> completely rule out the existence of either such bugs or such people,
> but I'm not aware of having encountered them.

I don't think that's a fair description of the situation. It supposes that the
alternative to the PANIC is that the problem is detected and resolved some
other way. But, depending on the spinlock, the problem will not be detected by
automated checks for the system being up. IME you end up with a system that's
degraded in a complicated hard to understand way, rather than one that's just
down.

Greetings,

Andres Freund




Re: Issue with the PRNG used by Postgres

2024-04-11 Thread Robert Haas
On Wed, Apr 10, 2024 at 9:53 PM Tom Lane  wrote:
> Maybe we should rip out the whole mechanism and hard-wire
> spins_per_delay at 1000 or so.

Or, rip out the whole, whole mechanism and just don't PANIC.

I'm not 100% sure that's better, but I think it's worth considering.
The thing is, if you're panicking regularly, that's almost as bad as
being down, because you're going to lose all of your connections and
have to run recovery before they can be reestablished. And if you're
panicking once in a blue moon, the PANIC actually prevents you from
easily getting a stack trace that might help you figure out what's
happening. And of course if it's not really a stuck spinlock but just
a very slow system, which I think is an extremely high percentage of
real-world cases, then it's completely worthless at best.

To believe that the PANIC is the right idea, we have to suppose that
we have stuck-spinlock bugs that people actually hit, but that those
people don't hit them often enough to care, as long as the system
resets when the spinlock gets stuck, instead of hanging. I can't
completely rule out the existence of either such bugs or such people,
but I'm not aware of having encountered them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: allow changing autovacuum_max_workers without restarting

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 03:37:23PM +, Imseih (AWS), Sami wrote:
>> My concern with this approach is that other background workers could use up
>> all the slots and prevent autovacuum workers from starting
> 
> That's a good point; the current settings do not guarantee that you
> get a worker for the purpose if none are available.
> For example, with max_parallel_workers_per_gather you may have 2 workers planned
> and 0 launched.
> 
>> unless of
>> course we reserve autovacuum_max_workers slots for _only_ autovacuum
>> workers. I'm not sure if we want to get these parameters tangled up like
>> this, though...
> 
> This will be confusing to describe and we will be reserving autovac workers
> implicitly, rather than explicitly with a new GUC.

Yeah, that's probably a good reason to give autovacuum its own worker pool.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SET ROLE documentation improvement

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 01:38:37PM -0400, Robert Haas wrote:
> On Thu, Apr 11, 2024 at 1:03 PM Nathan Bossart  
> wrote:
>> While it's fresh on my mind, I very hastily hacked together a draft of what
>> I believe is the remaining work.
> 
> That looks fine to me. And if others agree, I think it's fine to just
> commit this now, post-freeze. It's only a doc change, and a
> back-patchable one if you want to go that route.

No objections here.  I'll give this a few days for others to comment.  I'm
not particularly interested in back-patching this since it's arguably not
fixing anything that's incorrect, but if anyone really wants me to, I will.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Should we add a compiler warning for large stack frames?

2024-04-11 Thread Andrew Dunstan


On 2024-04-11 Th 15:01, Andres Freund wrote:

Hi,

d8f5acbdb9b made me wonder if we should add a compiler option to warn when
stack frames are large.  gcc compatible compilers have -Wstack-usage=limit, so
that's not hard.

Huge stack frames are somewhat dangerous, as they can defeat our stack-depth
checking logic. There are also some cases where large stack frames defeat
stack-protector logic by compilers/libc/os.

It's not always obvious how large the stack will be. Even if you look at all
the sizes of the variables defined in a function, inlining can increase that
substantially.

Here are all the cases a limit of 64k finds.


[1345/1940 42  69%] Compiling C object 
src/bin/pg_verifybackup/pg_verifybackup.p/pg_verifybackup.c.o
../../../../../home/andres/src/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:
 In function 'verify_file_checksum':
../../../../../home/andres/src/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:842:1:
 warning: stack usage is 131232 bytes [-Wstack-usage=]
   842 | verify_file_checksum(verifier_context *context, manifest_file *m,
   | ^~~~

This one's down to me. I asked Robert some time back why we were using a 
very conservative buffer size, and he agreed we could probably make it 
larger, but the number chosen is mine, not his. It was a completely 
arbitrary choice.


I'm happy to reduce it, but it's not clear to me why we care that much 
for a client binary. There's no stack depth checking going on here.


cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Should we add a compiler warning for large stack frames?

2024-04-11 Thread Andres Freund
Hi,

d8f5acbdb9b made me wonder if we should add a compiler option to warn when
stack frames are large.  gcc compatible compilers have -Wstack-usage=limit, so
that's not hard.

Huge stack frames are somewhat dangerous, as they can defeat our stack-depth
checking logic. There are also some cases where large stack frames defeat
stack-protector logic by compilers/libc/os.

It's not always obvious how large the stack will be. Even if you look at all
the sizes of the variables defined in a function, inlining can increase that
substantially.

Here are all the cases a limit of 64k finds.

Unoptimized build:
[138/1940 42   7%] Compiling C object 
src/common/libpgcommon_shlib.a.p/blkreftable.c.o
../../../../../home/andres/src/postgresql/src/common/blkreftable.c: In function 
'WriteBlockRefTable':
../../../../../home/andres/src/postgresql/src/common/blkreftable.c:474:1: 
warning: stack usage is 65696 bytes [-Wstack-usage=]
  474 | WriteBlockRefTable(BlockRefTable *brtab,
  | ^~
[173/1940 42   8%] Compiling C object src/common/libpgcommon.a.p/blkreftable.c.o
../../../../../home/andres/src/postgresql/src/common/blkreftable.c: In function 
'WriteBlockRefTable':
../../../../../home/andres/src/postgresql/src/common/blkreftable.c:474:1: 
warning: stack usage is 65696 bytes [-Wstack-usage=]
  474 | WriteBlockRefTable(BlockRefTable *brtab,
  | ^~
[281/1940 42  14%] Compiling C object 
src/common/libpgcommon_srv.a.p/blkreftable.c.o
../../../../../home/andres/src/postgresql/src/common/blkreftable.c: In function 
'WriteBlockRefTable':
../../../../../home/andres/src/postgresql/src/common/blkreftable.c:474:1: 
warning: stack usage is 65696 bytes [-Wstack-usage=]
  474 | WriteBlockRefTable(BlockRefTable *brtab,
  | ^~
[1311/1940 42  67%] Compiling C object 
src/bin/pg_basebackup/pg_basebackup.p/pg_basebackup.c.o
../../../../../home/andres/src/postgresql/src/bin/pg_basebackup/pg_basebackup.c:
 In function 'BaseBackup':
../../../../../home/andres/src/postgresql/src/bin/pg_basebackup/pg_basebackup.c:1753:1:
 warning: stack usage might be 66976 bytes [-Wstack-usage=]
 1753 | BaseBackup(char *compression_algorithm, char *compression_detail,
  | ^~
[1345/1940 42  69%] Compiling C object 
src/bin/pg_verifybackup/pg_verifybackup.p/pg_verifybackup.c.o
../../../../../home/andres/src/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:
 In function 'verify_file_checksum':
../../../../../home/andres/src/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:842:1:
 warning: stack usage is 131232 bytes [-Wstack-usage=]
  842 | verify_file_checksum(verifier_context *context, manifest_file *m,
  | ^~~~
[1349/1940 42  69%] Compiling C object 
src/bin/pg_waldump/pg_waldump.p/pg_waldump.c.o
../../../../../home/andres/src/postgresql/src/bin/pg_waldump/pg_waldump.c: In 
function 'main':
../../../../../home/andres/src/postgresql/src/bin/pg_waldump/pg_waldump.c:792:1:
 warning: stack usage might be 105104 bytes [-Wstack-usage=]
  792 | main(int argc, char **argv)
  | ^~~~
[1563/1940 42  80%] Compiling C object 
contrib/pg_walinspect/pg_walinspect.so.p/pg_walinspect.c.o
../../../../../home/andres/src/postgresql/contrib/pg_walinspect/pg_walinspect.c:
 In function 'GetWalStats':
../../../../../home/andres/src/postgresql/contrib/pg_walinspect/pg_walinspect.c:762:1:
 warning: stack usage is 104624 bytes [-Wstack-usage=]
  762 | GetWalStats(FunctionCallInfo fcinfo, XLogRecPtr start_lsn, XLogRecPtr 
end_lsn,
  | ^~~
[1581/1940 42  81%] Compiling C object 
src/test/modules/test_dsa/test_dsa.so.p/test_dsa.c.o
../../../../../home/andres/src/postgresql/src/test/modules/test_dsa/test_dsa.c: 
In function 'test_dsa_resowners':
../../../../../home/andres/src/postgresql/src/test/modules/test_dsa/test_dsa.c:64:1:
 warning: stack usage is 80080 bytes [-Wstack-usage=]
   64 | test_dsa_resowners(PG_FUNCTION_ARGS)
  | ^~


There is one warning that is only visible in an optimized build;
otherwise the precise amount of stack usage just differs somewhat:
[1165/2017 42  57%] Compiling C object 
src/backend/postgres_lib.a.p/tsearch_spell.c.o
../../../../../home/andres/src/postgresql/src/backend/tsearch/spell.c: In 
function 'NIImportAffixes':
../../../../../home/andres/src/postgresql/src/backend/tsearch/spell.c:1425:1: 
warning: stack usage might be 74080 bytes [-Wstack-usage=]
 1425 | NIImportAffixes(IspellDict *Conf, const char *filename)
  | ^~~


Warnings in src/bin aren't as interesting as warnings in backend code, as the
latter is much more likely to be "exposed" to deep stacks and could be
vulnerable to stack overflows.

I did verify this would have warned about d8f5acbdb9b^:

[1/2 1  50%] Compiling C object 
src/backend/postgres_lib.a.p/backup_basebackup_incremental.c.o
../../../../../home/andres/src/postgresql/src/backend/backup/basebackup_incremental.c:
 In function 'GetFileBackupMethod':
../../../../../home/andres/src/postgr

Re: RFC: Additional Directory for Extensions

2024-04-11 Thread David E. Wheeler
On Apr 4, 2024, at 1:20 PM, David E. Wheeler  wrote:

> I’ve added some docs based on your GUC description; updated patch attached.

Here’s a rebase.

I realize this probably isn’t going to happen for 17, given the freeze, but I 
would very much welcome feedback and pointers to address concerns about 
providing a second directory for extensions and DSOs. Quite a few people have 
talked about the need for this in the Extension Mini Summits[1], so I’m sure I 
could get some collaborators to make improvements or look at a different 
approach.

Best,

David

[1] 
https://justatheory.com/2024/02/extension-ecosystem-summit/#extension-ecosystem-mini-summit




v3-0001-Add-extension_directory-GUC.patch
Description: Binary data




Re: Table AM Interface Enhancements

2024-04-11 Thread Melanie Plageman
On Thu, Apr 11, 2024 at 12:19 PM Melanie Plageman
 wrote:
>
> On Wed, Apr 10, 2024 at 5:21 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote:
> > > This brings up a question about the prefetching. We never had to have
> > > this discussion for sequential scan streaming read because it didn't
> > > (and still doesn't) do prefetching. But, if we push the streaming read
> > > code down into the heap AM layer, it will be doing the prefetching.
> > > So, do we remove the prefetching from acquire_sample_rows() and expect
> > > other table AMs to implement it themselves or use the streaming read
> > > API?
> >
> > The prefetching added to acquire_sample_rows was quite narrowly tailored to
> > something heap-like - it pretty much required that block numbers to be 1:1
> > with the actual physical on-disk location for the specific AM.  So I think
> > it's pretty much required for this to be pushed down.
> >
> > Using a read stream is a few lines for something like this, so I'm not 
> > worried
> > about it. I guess we could have a default implementation for block based 
> > AMs,
> > similar what we have around table_block_parallelscan_*, but not sure it's
> > worth doing that, the complexity is much lower than in the
> > table_block_parallelscan_ case.
>
> This makes sense.
>
> I am working on pushing streaming ANALYZE into heap AM code, and I ran
> into a few roadblocks.
>
> If we want ANALYZE to make the ReadStream object in heap_beginscan()
> (like the read stream implementation of heap sequential and TID range
> scans do), I don't see any way around changing the scan_begin table AM
> callback to take a BufferAccessStrategy at the least (and perhaps also
> the BlockSamplerData).

I will also say that, had this been 6 months ago, I would probably
suggest we restructure ANALYZE's table AM interface to accommodate
read stream setup and to address a few other things I find odd about
the current code. For example, I think creating a scan descriptor for
the analyze scan in acquire_sample_rows() is quite odd. It seems like
it would be better done in the relation_analyze callback. The
relation_analyze callback saves some state like the callbacks for
acquire_sample_rows() and the Buffer Access Strategy. But at least in
the heap implementation, it just saves them in static variables in
analyze.c. It seems like it would be better to save them in a useful
data structure that could be accessed later. We have access to pretty
much everything we need at that point (in the relation_analyze
callback). I also think heap's implementation of
table_beginscan_analyze() doesn't need most of
heap_beginscan()/initscan(), so doing this instead of something
ANALYZE specific seems more confusing than helpful.

- Melanie




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
Hi!

On Thu, Apr 11, 2024 at 7:19 PM Melanie Plageman
 wrote:
> On Wed, Apr 10, 2024 at 5:21 PM Andres Freund  wrote:
> > On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote:
> > > This brings up a question about the prefetching. We never had to have
> > > this discussion for sequential scan streaming read because it didn't
> > > (and still doesn't) do prefetching. But, if we push the streaming read
> > > code down into the heap AM layer, it will be doing the prefetching.
> > > So, do we remove the prefetching from acquire_sample_rows() and expect
> > > other table AMs to implement it themselves or use the streaming read
> > > API?
> >
> > The prefetching added to acquire_sample_rows was quite narrowly tailored to
> > something heap-like - it pretty much required that block numbers to be 1:1
> > with the actual physical on-disk location for the specific AM.  So I think
> > it's pretty much required for this to be pushed down.
> >
> > Using a read stream is a few lines for something like this, so I'm not 
> > worried
> > about it. I guess we could have a default implementation for block based 
> > AMs,
> > similar what we have around table_block_parallelscan_*, but not sure it's
> > worth doing that, the complexity is much lower than in the
> > table_block_parallelscan_ case.
>
> This makes sense.
>
> I am working on pushing streaming ANALYZE into heap AM code, and I ran
> into a few roadblocks.
>
> If we want ANALYZE to make the ReadStream object in heap_beginscan()
> (like the read stream implementation of heap sequential and TID range
> scans do), I don't see any way around changing the scan_begin table AM
> callback to take a BufferAccessStrategy at the least (and perhaps also
> the BlockSamplerData).
>
> read_stream_begin_relation() doesn't just save the
> BufferAccessStrategy in the ReadStream, it uses it to set various
> other things in the ReadStream object. callback_private_data (which in
> ANALYZE's case is the BlockSamplerData) is simply saved in the
> ReadStream, so it could be set later, but that doesn't sound very
> clean to me.
>
> As such, it seems like a cleaner alternative would be to add a table
> AM callback for creating a read stream object that takes the
> parameters of read_stream_begin_relation(). But, perhaps it is a bit
> late for such additions.
>
> It also opens us up to the question of whether or not sequential scan
> should use such a callback instead of making the read stream object in
> heap_beginscan().
>
> I am happy to write a patch that does any of the above. But, I want to
> raise these questions, because perhaps I am simply missing an obvious
> alternative solution.

I understand that I'm the bad guy of this release, not sure if my
opinion counts.

But what is going on here?  I hope this work is targeting pg18.
Otherwise, do I get this right that this post-feature-freeze work is about
designing a new API?  Yes, 27bc1772fc masked the problem.  But it was
committed on Mar 30.  So that couldn't justify why the proper API
wasn't designed in time.  Are we judging different commits with the
same criteria?

IMHO, 041b96802e should be just reverted.

--
Regards,
Alexander Korotkov




Re: SET ROLE documentation improvement

2024-04-11 Thread Robert Haas
On Thu, Apr 11, 2024 at 1:03 PM Nathan Bossart  wrote:
> On Thu, Apr 11, 2024 at 11:48:30AM -0500, Nathan Bossart wrote:
> > On Thu, Apr 11, 2024 at 11:36:52AM -0400, Robert Haas wrote:
> >> I suggest that we close the existing CF entry as committed and
> >> somebody can start a new one for whatever remains. I think that will
> >> be less confusing.
> >
> > Done: https://commitfest.postgresql.org/48/4923/.
>
> While it's fresh on my mind, I very hastily hacked together a draft of what
> I believe is the remaining work.

That looks fine to me. And if others agree, I think it's fine to just
commit this now, post-freeze. It's only a doc change, and a
back-patchable one if you want to go that route.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
On Thu, Apr 11, 2024 at 8:11 PM Jeff Davis  wrote:
> On Wed, 2024-04-10 at 15:19 +0300, Alexander Korotkov wrote:
> > 1) 9bd99f4c26 comprises the reworked patch after working with notes
> > from Jeff Davis.  I agree it would be better to wait for him to
> > express explicit agreement.  Before reverting this, I would prefer to
> > hear his opinion.
>
> On this particular feature, I had tried it in the past myself, and
> there were a number of minor frustrations and I left it unfinished. I
> quickly recognized that commit c95c25f9af was too simple to work.
>
> Commit 9bd99f4c26 looked substantially better, but I was surprised to
> see it committed so soon after the redesign. I thought a revert was a
> likely outcome, but I put it on my list of things to review more deeply
> in the next couple weeks so I could give productive feedback.

Thank you for your feedback, Jeff.

> It would benefit from more discussion in v18, and I apologize for not
> getting involved earlier when the patch still could have made it into
> v17.

I believe you don't have to apologize.  It's definitely not your fault
that I've committed this patch in this shape.

--
Regards,
Alexander Korotkov




Re: Table AM Interface Enhancements

2024-04-11 Thread Jeff Davis
On Wed, 2024-04-10 at 15:19 +0300, Alexander Korotkov wrote:
> 1) 9bd99f4c26 comprises the reworked patch after working with notes
> from Jeff Davis.  I agree it would be better to wait for him to
> express explicit agreement.  Before reverting this, I would prefer to
> hear his opinion.

On this particular feature, I had tried it in the past myself, and
there were a number of minor frustrations and I left it unfinished. I
quickly recognized that commit c95c25f9af was too simple to work.

Commit 9bd99f4c26 looked substantially better, but I was surprised to
see it committed so soon after the redesign. I thought a revert was a
likely outcome, but I put it on my list of things to review more deeply
in the next couple weeks so I could give productive feedback.

It would benefit from more discussion in v18, and I apologize for not
getting involved earlier when the patch still could have made it into
v17.

Regards,
Jeff Davis





Re: SET ROLE documentation improvement

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 11:48:30AM -0500, Nathan Bossart wrote:
> On Thu, Apr 11, 2024 at 11:36:52AM -0400, Robert Haas wrote:
>> I suggest that we close the existing CF entry as committed and
>> somebody can start a new one for whatever remains. I think that will
>> be less confusing.
> 
> Done: https://commitfest.postgresql.org/48/4923/.

While it's fresh on my mind, I very hastily hacked together a draft of what
I believe is the remaining work.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From bb3aa06b6e55b403489afafcc8b7608516fd7b40 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Apr 2024 12:01:21 -0500
Subject: [PATCH v3 1/1] further improvements to SET ROLE docs

---
 doc/src/sgml/ref/set_role.sgml | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/set_role.sgml b/doc/src/sgml/ref/set_role.sgml
index 083e6dc6ea..9557bb77ab 100644
--- a/doc/src/sgml/ref/set_role.sgml
+++ b/doc/src/sgml/ref/set_role.sgml
@@ -37,7 +37,10 @@ RESET ROLE
written as either an identifier or a string literal.
After SET ROLE, permissions checking for SQL commands
is carried out as though the named role were the one that had logged
-   in originally.
+   in originally.  Note that SET ROLE and
+   SET SESSION AUTHORIZATION are exceptions; permissions
+   checks for those continue to use the current session user and the initial
+   session user (the authenticated user), respectively.
   
 
   
@@ -88,11 +91,6 @@ RESET ROLE
exercised either with or without SET ROLE.
   
 
-  
-   Note that when a superuser chooses to SET ROLE to a
-   non-superuser role, they lose their superuser privileges.
-  
-
   
SET ROLE has effects comparable to
SET SESSION AUTHORIZATION, but the privilege
-- 
2.25.1



Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-11 Thread Alexander Lakhin



11.04.2024 16:27, Dmitry Koval wrote:


Added correction (and test), see 
v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.



Thank you for the correction, but maybe an attempt to merge into the implicit
pg_temp should fail, just like CREATE TABLE ... PARTITION OF ... does?

Please look also at another anomaly with schemas:
CREATE SCHEMA s1;
CREATE TABLE t (i int) PARTITION BY RANGE (i);
CREATE TABLE tp_0_2 PARTITION OF t
  FOR VALUES FROM (0) TO (2);
ALTER TABLE t SPLIT PARTITION tp_0_2 INTO
  (PARTITION s1.tp0 FOR VALUES FROM (0) TO (1), PARTITION s1.tp1 FOR VALUES 
FROM (1) TO (2));
results in:
\d+ s1.*
Did not find any relation named "s1.*"
\d+ tp*
  Table "public.tp0"
...
   Table "public.tp1"
...

Best regards,
Alexander




Re: SET ROLE documentation improvement

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 11:36:52AM -0400, Robert Haas wrote:
> I suggest that we close the existing CF entry as committed and
> somebody can start a new one for whatever remains. I think that will
> be less confusing.

Done: https://commitfest.postgresql.org/48/4923/.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Another WaitEventSet resource leakage in back branches

2024-04-11 Thread Andres Freund
Hi,

On 2024-03-22 21:15:45 +0900, Etsuro Fujita wrote:
> While working on [1], I noticed $SUBJECT: WaitLatchOrSocket in back
> branches is ignoring the possibility of failing partway through, too.
> I added a PG_FINALLY block to that function, like commit 555276f85.
> Patch attached.

Could you expand a bit on the concrete scenario you're worried about here?
PG_TRY/CATCH aren't free, so adding something like this to a quite common
path, in the back branches, without a concrete analysis as to why it's needed,
seems a bit scary.
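
To spell out the cost for anyone following along: PG_TRY-style blocks are
built on sigsetjmp(), so every entry into the guarded section pays for
saving a jump context. A rough standalone illustration of that shape (not
the actual PG_TRY/PG_FINALLY macros, which also save and restore the
error-handling stack):

#include <setjmp.h>
#include <stdio.h>

static sigjmp_buf exception_buf;

/* Stand-in for ereport(ERROR, ...): jump back to the enclosing "try". */
static void
throw_error(void)
{
    siglongjmp(exception_buf, 1);
}

static void
guarded_wait(void)
{
    if (sigsetjmp(exception_buf, 0) == 0)   /* roughly what PG_TRY does on entry */
    {
        /* the guarded body; pretend an error happens mid-wait */
        throw_error();
    }
    else
    {
        /* roughly where the PG_CATCH/PG_FINALLY cleanup would run on error */
        printf("cleaned up after error\n");
    }
}

int
main(void)
{
    guarded_wait();
    return 0;
}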

Greetings,

Andres Freund




Re: Table AM Interface Enhancements

2024-04-11 Thread Melanie Plageman
On Wed, Apr 10, 2024 at 5:21 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote:
> > This brings up a question about the prefetching. We never had to have
> > this discussion for sequential scan streaming read because it didn't
> > (and still doesn't) do prefetching. But, if we push the streaming read
> > code down into the heap AM layer, it will be doing the prefetching.
> > So, do we remove the prefetching from acquire_sample_rows() and expect
> > other table AMs to implement it themselves or use the streaming read
> > API?
>
> The prefetching added to acquire_sample_rows was quite narrowly tailored to
> something heap-like - it pretty much required that block numbers to be 1:1
> with the actual physical on-disk location for the specific AM.  So I think
> it's pretty much required for this to be pushed down.
>
> Using a read stream is a few lines for something like this, so I'm not worried
> about it. I guess we could have a default implementation for block based AMs,
> similar what we have around table_block_parallelscan_*, but not sure it's
> worth doing that, the complexity is much lower than in the
> table_block_parallelscan_ case.

This makes sense.

I am working on pushing streaming ANALYZE into heap AM code, and I ran
into a few roadblocks.

If we want ANALYZE to make the ReadStream object in heap_beginscan()
(like the read stream implementation of heap sequential and TID range
scans do), I don't see any way around changing the scan_begin table AM
callback to take a BufferAccessStrategy at the least (and perhaps also
the BlockSamplerData).

read_stream_begin_relation() doesn't just save the
BufferAccessStrategy in the ReadStream, it uses it to set various
other things in the ReadStream object. callback_private_data (which in
ANALYZE's case is the BlockSamplerData) is simply saved in the
ReadStream, so it could be set later, but that doesn't sound very
clean to me.

As such, it seems like a cleaner alternative would be to add a table
AM callback for creating a read stream object that takes the
parameters of read_stream_begin_relation(). But, perhaps it is a bit
late for such additions.
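
Roughly, such a callback could look something like the sketch below. The
PostgreSQL types are stubbed out so that it compiles on its own, and both
the callback name and the parameter list (modeled loosely on
read_stream_begin_relation()) are assumptions for illustration, not an
actual proposal:

#include <stddef.h>

/* Stand-ins for the real PostgreSQL types, just so this compiles on its own. */
typedef struct RelationData *Relation;
typedef struct BufferAccessStrategyData *BufferAccessStrategy;
typedef struct ReadStream ReadStream;
typedef unsigned int BlockNumber;
typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
                                                void *callback_private_data,
                                                void *per_buffer_data);

/*
 * Hypothetical TableAmRoutine member: let the AM build the ReadStream
 * itself, given everything read_stream_begin_relation() would need.
 */
typedef ReadStream *(*scan_begin_read_stream_function) (Relation rel,
                                                        int flags,
                                                        BufferAccessStrategy bstrategy,
                                                        ReadStreamBlockNumberCB next_block_cb,
                                                        void *callback_private_data,
                                                        size_t per_buffer_data_size);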

It also opens us up to the question of whether or not sequential scan
should use such a callback instead of making the read stream object in
heap_beginscan().

I am happy to write a patch that does any of the above. But, I want to
raise these questions, because perhaps I am simply missing an obvious
alternative solution.

- Melanie




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-11 Thread Heikki Linnakangas

On 11/04/2024 18:09, Alexander Korotkov wrote:

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas  wrote:

On 07/04/2024 00:52, Alexander Korotkov wrote:

* At first, we check that pg_wal_replay_wait() is called in a non-atomic
* context.  That is, a procedure call isn't wrapped into a transaction,
* another procedure call, or a function call.
*


It's pretty unfortunate to have all these restrictions. It would be nice
to do:

select pg_wal_replay_wait('0/1234'); select * from foo;


This works for me, except that it needs "call" not "select".

# call pg_wal_replay_wait('0/1234'); select * from foo;
CALL
  i
---
(0 rows)


If you do that from the psql prompt, it works because psql parses and sends 
it as two separate round-trips. Try:


psql postgres -p 5433 -c "call pg_wal_replay_wait('0/4101BBB8'); select 1"
ERROR:  pg_wal_replay_wait() must be only called in non-atomic context
DETAIL:  Make sure pg_wal_replay_wait() isn't called within a 
transaction, another procedure, or a function.
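
The same thing is easy to reproduce from C, since a single PQexec() ships
the whole string as one simple-protocol query, just like psql -c does, so
the CALL ends up in an atomic context. A standalone sketch (connection
parameters are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Placeholder connection string; port 5433 matches the example above. */
    PGconn     *conn = PQconnectdb("dbname=postgres port=5433");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Both statements travel as one simple-protocol query, like psql -c. */
    res = PQexec(conn, "call pg_wal_replay_wait('0/4101BBB8'); select 1");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "%s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}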



This assumption that PortalRunUtility() can tolerate us popping the
snapshot sounds very fishy. I haven't looked at what's going on there,
but doesn't sound like a great assumption.


This is what PortalRunUtility() says about this.

/*
  * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
  * under us, so don't complain if it's now empty.  Otherwise, our snapshot
  * should be the top one; pop it.  Note that this could be a different
  * snapshot from the one we made above; see EnsurePortalSnapshotExists.
  */

So, if the vacuum pops a snapshot when it needs to run without a
snapshot, then it's probably OK for other utilities.  But I agree this
decision needs some consensus.


Ok, so it's at least somewhat documented that it's fine.


Overall, this feature doesn't feel quite ready for v17, and IMHO should
be reverted. It's a nice feature, so I'd love to have it fixed and
reviewed early in the v18 cycle.


Thank you for your review.  I've reverted this. Will repost this for early v18.


Thanks Alexander for working on this.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: allow changing autovacuum_max_workers without restarting

2024-04-11 Thread Imseih (AWS), Sami
> My concern with this approach is that other background workers could use up
> all the slots and prevent autovacuum workers from starting

That's a good point; the current settings do not guarantee that you
get a worker for the purpose if none are available. For example, with
max_parallel_workers_per_gather you may have 2 workers planned
and 0 launched.

> unless of
> course we reserve autovacuum_max_workers slots for _only_ autovacuum
> workers. I'm not sure if we want to get these parameters tangled up like
> this, though...

This will be confusing to describe and we will be reserving autovac workers
implicitly, rather than explicitly with a new GUC.

Regards,

Sami  





Re: SET ROLE documentation improvement

2024-04-11 Thread Robert Haas
On Thu, Apr 11, 2024 at 10:03 AM Nathan Bossart
 wrote:
> On Thu, Apr 11, 2024 at 10:33:15AM +0900, Michael Paquier wrote:
> > On Tue, Apr 09, 2024 at 09:21:39AM +0300, Andrey M. Borodin wrote:
> >> Can I ask you please to help me with determining status of CF item
> >> [0]. Is it committed or there's something to move to next CF?
> >
> > Only half of the patch has been applied as of 3330a8d1b792.  Yurii and
> > Nathan, could you follow up with the rest?  Moving the patch to the
> > next CF makes sense, and the last thread update is rather recent.
>
> AFAICT there are two things remaining:
>
> * Remove the "they lose their superuser privileges" note.
> * Note that SET ROLE and SET SESSION AUTHORIZATION are exceptions.
>
> While I think these are good changes, I don't sense any urgency here, so
> I'm treating this as v18 material at this point.

I suggest that we close the existing CF entry as committed and
somebody can start a new one for whatever remains. I think that will
be less confusing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




some LLVM function checks missing in meson

2024-04-11 Thread Peter Eisentraut
I have been checking the pg_config.h generated by configure and meson to 
see if there is anything materially different.  I found that


HAVE_DECL_LLVMCREATEGDBREGISTRATIONLISTENER and
HAVE_DECL_LLVMCREATEPERFJITEVENTLISTENER

are missing on the meson side.

Something like the below would appear to fix that:

diff --git a/meson.build b/meson.build
index 43fad5323c0..cdfd31377d1 100644
--- a/meson.build
+++ b/meson.build
@@ -2301,6 +2301,14 @@ decl_checks += [
   ['pwritev', 'sys/uio.h'],
 ]

+# Check presence of some optional LLVM functions.
+if llvm.found()
+  decl_checks += [
+['LLVMCreateGDBRegistrationListener', 'llvm-c/ExecutionEngine.h'],
+['LLVMCreatePerfJITEventListener', 'llvm-c/ExecutionEngine.h'],
+  ]
+endif
+
 foreach c : decl_checks
   func = c.get(0)
   header = c.get(1)

I don't know what these functions do, but the symbols are used in the 
source code.  Thoughts?





Re: Allow non-superuser to cancel superuser tasks.

2024-04-11 Thread Kirill Reshke
On Thu, 11 Apr 2024 at 19:07, Nathan Bossart  wrote:
>
> On Thu, Apr 11, 2024 at 04:55:59PM +0500, Kirill Reshke wrote:
> > Posting updated version of this patch with comments above addressed.
>
> I look for a commitfest entry for this one, but couldn't find it.  Would
> you mind either creating one

Done: https://commitfest.postgresql.org/48/4922/




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-11 Thread Alexander Korotkov
Hi, Heikki!

Thank you for your interest in the subject.

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas  wrote:
> On 07/04/2024 00:52, Alexander Korotkov wrote:
> > On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera  
> > wrote:
> >> I'm still concerned that WaitLSNCleanup is only called in ProcKill.
> >> Does this mean that if a process throws an error while waiting, it'll
> >> not get cleaned up until it exits?  Maybe this is not a big deal, but it
> >> seems odd.
> >
> > I've added WaitLSNCleanup() to the AbortTransaction().  Just pushed
> > that together with the improvements upthread.
>
> Race condition:
>
> 1. backend: pg_wal_replay_wait('0/1234') is called. It calls WaitForLSN
> 2. backend: WaitForLSN calls addLSNWaiter('0/1234'). It adds the backend
> process to the LSN heap and returns
> 3. replay:  rm_redo record '0/1234'
> 4. backend: WaitForLSN enters for-loop, calls GetXLogReplayRecPtr()
> 5. backend: current replay LSN location is '0/1234', so we exit the loop
> 6. replay: calls WaitLSNSetLatches()
>
> In a nutshell, it's possible for the loop in WaitForLSN to exit without
> cleaning up the process from the heap. I was able to hit that by adding
> a delay after the addLSNWaiter() call:
>
> > TRAP: failed Assert("!procInfo->inHeap"), File: 
> > "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
> > postgres: heikki postgres [local] 
> > CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
> > postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
> > postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
> > postgres: heikki postgres [local] 
> > CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
> > postgres: heikki postgres [local] 
> > CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
> > postgres: heikki postgres [local] 
> > CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]
>
> I think there's a similar race condition if the timeout is reached at
> the same time that the startup process wakes up the process.

Thank you for catching this.  I think WaitForLSN() just needs to call
deleteLSNWaiter() unconditionally after exit from the loop.

> >* At first, we check that pg_wal_replay_wait() is called in a 
> > non-atomic
> >* context.  That is, a procedure call isn't wrapped into a 
> > transaction,
> >* another procedure call, or a function call.
> >*
>
> It's pretty unfortunate to have all these restrictions. It would be nice
> to do:
>
> select pg_wal_replay_wait('0/1234'); select * from foo;

This works for me, except that it needs "call" not "select".

# call pg_wal_replay_wait('0/1234'); select * from foo;
CALL
 i
---
(0 rows)

> in a single multi-query call, to avoid the round-trip to the client. You
> can avoid it with libpq or protocol level pipelining, too, but it's more
> complicated.
>
> >* Secondly, according to PlannedStmtRequiresSnapshot(), even in an 
> > atomic
> >* context, CallStmt is processed with a snapshot.  Thankfully, we 
> > can pop
> >* this snapshot, because PortalRunUtility() can tolerate this.
>
> This assumption that PortalRunUtility() can tolerate us popping the
> snapshot sounds very fishy. I haven't looked at what's going on there,
> but doesn't sound like a great assumption.

This is what PortalRunUtility() says about this.

/*
 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
 * should be the top one; pop it.  Note that this could be a different
 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
 */

So, if the vacuum pops a snapshot when it needs to run without a
snapshot, then it's probably OK for other utilities.  But I agree this
decision needs some consensus.

> If recovery ends while a process is waiting for an LSN to arrive, does
> it ever get woken up?

If the recovery target action is promote, then the user will get an error.
If the recovery target action is shutdown, then the connection will get
interrupted.  If the recovery target action is pause, then waiting will
continue during the pause.  Not sure about the latter case.

> The docs could use some copy-editing, but just to point out one issue:
>
> > There are also procedures to control the progress of recovery.
>
> That's copy-pasted from an earlier sentence at the table that lists
> functions like pg_promote(), pg_wal_replay_pause(), and
> pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control the
> progress of recovery like those functions do, it only causes the calling
> backend to wait.
>
> Overall, this feature doesn't feel quite ready for v17, and IMHO should
> be reverted. It's a nice feature, so I'd love to have it fixed and
> reviewed early in the v18 cycle.

Thank you for your review.  I've reverted this. Will repost this for early v18.

--
Regards,
Alexander Korotkov




Re: allow changing autovacuum_max_workers without restarting

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 09:42:40AM -0500, Nathan Bossart wrote:
> On Thu, Apr 11, 2024 at 02:24:18PM +, Imseih (AWS), Sami wrote:
>> max_worker_processes defines a pool of max # of background workers allowed.
>> parallel workers and extensions that spin  up background workers all utilize 
>> from 
>> this pool. 
>> 
>> Should autovacuum_max_workers be able to utilize from max_worker_processes 
>> also?
>> 
>> This will allow autovacuum_max_workers to be dynamic while the user only has
>> to deal with an already existing GUC. We may want to increase the default 
>> value
>> for max_worker_processes as part of this.
> 
> My concern with this approach is that other background workers could use up
> all the slots and prevent autovacuum workers from starting, unless of
> course we reserve autovacuum_max_workers slots for _only_ autovacuum
> workers.  I'm not sure if we want to get these parameters tangled up like
> this, though...

I see that the logical replication launcher process uses this pool, but we
take special care to make sure it gets a slot:

/*
 * Register the apply launcher.  It's probably a good idea to call this
 * before any modules had a chance to take the background worker slots.
 */
ApplyLauncherRegister();

I'm not sure there's another way to effectively reserve slots that would
work for the autovacuum workers (which need to restart to connect to
different databases), so that would need to be invented.  We'd probably
also want to fail startup if autovacuum_max_workers > max_worker_processes,
which seems like it has the potential to cause problems when folks first
upgrade to v18.

Furthermore, we might have to convert autovacuum workers to background
worker processes for this to work.  I've admittedly wondered about whether
we should do that eventually, anyway, but it'd expand the scope of this
work quite a bit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Can't find not null constraint, but \d+ shows that

2024-04-11 Thread Alvaro Herrera
On 2024-Apr-11, Alvaro Herrera wrote:

> Well, I think you were right that we should try to handle the situation
> of unmarking attnotnull as much as possible, to decrease the chances
> that the problematic situation occurs.  That means, we can use the
> earlier code to handle DROP COLUMN when it causes a PK to be dropped --
> even though we still need to handle the situation of an attnotnull flag
> set with no pg_constraint row.  I mean, we still have to handle DROP
> DOMAIN correctly (and there might be other cases that I haven't thought
> about) ... but surely this is a much less common situation than the one
> you reported.  So I'll merge everything and post an updated patch.

Here's roughly what I'm thinking.  If we drop a constraint, we can still
reset attnotnull in RemoveConstraintById(), but only after checking that
it's not a generated column or a replica identity.  If they are, we
don't error out -- just skip the attnotnull update.

Now, about the code to allow ALTER TABLE DROP NOT NULL in case there's
no pg_constraint row, I think at this point it's mostly dead code,
because it can only happen when you have a replica identity or generated
column ... and the DROP NOT NULL should still prevent you from dropping
the flag anyway.  But the case can still arise, if you change the
replica identity or ALTER TABLE ALTER COLUMN DROP DEFAULT, respectively.

I'm still not ready with this -- still not convinced about the new AT
pass.  Also, I want to add a test for the pg_dump behavior, and there's
an XXX comment.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)
>From ce8fff0ccb568a601ac9f765da7b7891066f522b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 11 Apr 2024 12:29:34 +0200
Subject: [PATCH] Better handle indirect constraint drops

It is possible for certain cases to remove constraints without
maintaining the attnotnull in its correct state; for example if you drop
a column that's part of the primary key, and the remaining columns don't
have not-null constraints, then we should reset the attnotnull flags for
those other columns, but we don't.  We can handle those cases better by
doing the attnotnull reset in RemoveConstraintById() instead of in
dropconstraint_internal().

However, there are some cases where we must not do so.  For example if
those other columns are in replica identity indexes or are generated
identity columns, we must keep the flag, even though it results in the
catalog inconsistency that no constraint supports the flag.

Because of these exceptions, we also include a pg_dump hack to include a
not-null constraint when the attnotnull flag is set even if no
pg_constraint row exists.  This part is undesirable but necessary,
because failing to handle the case can result in unrestorable dumps.
---
 src/backend/catalog/pg_constraint.c   | 116 +-
 src/backend/commands/tablecmds.c  | 108 ++--
 src/bin/pg_dump/pg_dump.c |  30 --
 src/test/regress/expected/constraints.out |  73 ++
 src/test/regress/sql/constraints.sql  |  34 +++
 5 files changed, 301 insertions(+), 60 deletions(-)

diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 604280d322..c48acf00b0 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -19,6 +19,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/heap.h"
@@ -916,6 +917,8 @@ RemoveConstraintById(Oid conId)
 	Relation	conDesc;
 	HeapTuple	tup;
 	Form_pg_constraint con;
+	bool		dropping_pk = false;
+	List	   *unconstrained_cols = NIL;
 
 	conDesc = table_open(ConstraintRelationId, RowExclusiveLock);
 
@@ -940,7 +943,9 @@ RemoveConstraintById(Oid conId)
 		/*
 		 * We need to update the relchecks count if it is a check constraint
 		 * being dropped.  This update will force backends to rebuild relcache
-		 * entries when we commit.
+		 * entries when we commit.  For not-null and primary key constraints,
+		 * obtain the list of columns affected, so that we can reset their
+		 * attnotnull flags below.
 		 */
 		if (con->contype == CONSTRAINT_CHECK)
 		{
@@ -967,6 +972,36 @@ RemoveConstraintById(Oid conId)
 
 			table_close(pgrel, RowExclusiveLock);
 		}
+		else if (con->contype == CONSTRAINT_NOTNULL)
+		{
+			unconstrained_cols = list_make1_int(extractNotNullColumn(tup));
+		}
+		else if (con->contype == CONSTRAINT_PRIMARY)
+		{
+			Datum		adatum;
+			ArrayType  *arr;
+			int			numkeys;
+			bool		isNull;
+			int16	   *attnums;
+
+			dropping_pk = true;
+
+			adatum = heap_getattr(tup, Anum_pg_constraint_conkey,
+  RelationGetDescr(conDesc), &isNull);
+			if (isNull)
+elog(ERROR, "null conkey for constraint %u", c

Re: allow changing autovacuum_max_workers without restarting

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 02:24:18PM +, Imseih (AWS), Sami wrote:
> max_worker_processes defines a pool of max # of background workers allowed.
> parallel workers and extensions that spin  up background workers all utilize 
> from 
> this pool. 
> 
> Should autovacuum_max_workers be able to utilize from max_worker_processes 
> also?
> 
> This will allow autovacuum_max_workers to be dynamic while the user only has
> to deal with an already existing GUC. We may want to increase the default 
> value
> for max_worker_processes as part of this.

My concern with this approach is that other background workers could use up
all the slots and prevent autovacuum workers from starting, unless of
course we reserve autovacuum_max_workers slots for _only_ autovacuum
workers.  I'm not sure if we want to get these parameters tangled up like
this, though...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow non-superuser to cancel superuser tasks.

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 04:55:59PM +0500, Kirill Reshke wrote:
>> It's feeling more natural here to check that we have a superuser-owned
>> backend first, and then do a lookup of the process type.
> 
>> I figured since there's no reason to rely on that behavior, we might as
>> well do a bit of future-proofing in case autovacuum workers are ever not
>> run as InvalidOid.
> 
> I have combined them into this:
> 
> if ((!OidIsValid(proc->roleId) || superuser_arg(proc->roleId))
> && pgstat_get_backend_type(GetNumberFromPGProc(proc)) == B_AUTOVAC_WORKER)
> 
> This is both future-proofing and natural, I suppose. The downside of this
> is double-checking the condition (!OidIsValid(proc->roleId) ||
> superuser_arg(proc->roleId)), but I think that is OK for the sake of
> simplicity.

If we want to retain the check, IMO we might as well combine the first two
blocks like Anthony proposed:

    if (!OidIsValid(proc->roleId) || superuser_arg(proc->roleId))
    {
        ProcNumber  procNumber = GetNumberFromPGProc(proc);
        PgBackendStatus *procStatus = pgstat_get_beentry_by_proc_number(procNumber);

        if (procStatus && procStatus->st_backendType == B_AUTOVAC_WORKER &&
            !has_privs_of_role(GetUserId(), ROLE_PG_SIGNAL_AUTOVAC_WORKER))
            return SIGNAL_BACKEND_NOAUTOVAC;
        else if (!superuser())
            return SIGNAL_BACKEND_NOSUPERUSER;
    }

+  
+   pg_signal_autovacuum_worker
+   Allow signaling autovacuum worker backend to cancel or terminate
+  

I think we need to be more specific about what pg_cancel_backend() and
pg_terminate_backend() do for autovacuum workers.  The code offers some
clues:

/*
 * SIGINT is used to signal canceling the current table's vacuum; 
SIGTERM
 * means abort and exit cleanly, and SIGQUIT means abandon ship.
 */
pqsignal(SIGINT, StatementCancelHandler);
pqsignal(SIGTERM, die);

+/* --
+ * pgstat_get_backend_type() -
+ *
+ * Return the backend type of the backend for the given proc number.
+ * --
+ */
+BackendType
+pgstat_get_backend_type(ProcNumber procNumber)
+{
+   PgBackendStatus *ret;
+
+   ret = pgstat_get_beentry_by_proc_number(procNumber);
+
+   if (!ret)
+   return B_INVALID;
+
+   return ret->st_backendType;
+}

I'm not sure we really need to introduce a new function for this.  I
avoided using it in my example snippet above.  But, maybe it'll come in
handy down the road...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: allow changing autovacuum_max_workers without restarting

2024-04-11 Thread Imseih (AWS), Sami
> I frequently hear about scenarios where users with thousands upon thousands
> of tables realize that autovacuum is struggling to keep up.  When they
> inevitably go to bump up autovacuum_max_workers, they discover that it
> requires a server restart (i.e., downtime) to take effect, causing further
> frustration.  For this reason, I think $SUBJECT is a desirable improvement.
> I spent some time looking for past discussions about this, and I was
> surprised to not find any, so I thought I'd give it a try.

I have not reviewed the patch in detail yet, but +1 to the idea.
It's not just thousands of tables that suffer from this.
If a user has a few large tables hogging the autovac workers, then other
tables don't get the autovac cycles they require. Users are then forced
to run manual vacuums, which adds complexity to their operations.

> The attached proof-of-concept patch demonstrates what I have in mind.
> Instead of trying to dynamically change the global process table, etc., I'm
> proposing that we introduce a new GUC that sets the effective maximum
> number of autovacuum workers that can be started at any time.

max_worker_processes defines a pool of the max # of background workers allowed.
Parallel workers and extensions that spin up background workers all draw from
this pool.

Should autovacuum_max_workers be able to utilize from max_worker_processes also?

This will allow autovacuum_max_workers to be dynamic while the user only has
to deal with an already existing GUC. We may want to increase the default value
for max_worker_processes as part of this.
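
To illustrate the general shape being discussed, here is a purely toy,
self-contained sketch (all names and numbers are made up; this is not the
proof-of-concept patch): the slot pool stays fixed at startup, while the
number of autovacuum workers actually launched is capped by a value that
can be re-read after a configuration reload.

#include <stdio.h>

#define RESERVED_WORKER_SLOTS 16        /* fixed at startup, like a hard ceiling */

static int  autovacuum_max_workers = 3; /* the reloadable cap in the proposal */
static int  running_workers = 0;

static void
maybe_launch_worker(void)
{
    /* Re-read the (reloadable) cap every time we consider launching. */
    int         cap = autovacuum_max_workers;

    if (cap > RESERVED_WORKER_SLOTS)
        cap = RESERVED_WORKER_SLOTS;

    if (running_workers < cap)
    {
        running_workers++;
        printf("launched worker %d of %d\n", running_workers, cap);
    }
    else
        printf("cap of %d reached, not launching\n", cap);
}

int
main(void)
{
    maybe_launch_worker();      /* launches: 1 of 3 */
    autovacuum_max_workers = 5; /* pretend a config reload raised the cap */
    maybe_launch_worker();      /* launches: 2 of 5 */
    return 0;
}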


Regards,

Sami
Amazon Web Services (AWS)



Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-11 Thread Alexander Korotkov
Hi, Dmitry!

On Thu, Apr 11, 2024 at 4:27 PM Dmitry Koval  wrote:
> 1.
> Alexander Lakhin sent a question about index name after MERGE (partition
> name is the same as one of the merged partitions):
>
> start of quote
> I'm also confused by an index name after MERGE:
> CREATE TABLE t (i int) PARTITION BY RANGE (i);
>
> CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
> CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
>
> CREATE INDEX tidx ON t(i);
> ALTER TABLE t MERGE PARTITIONS (tp_1_2, tp_0_1) INTO tp_1_2;
> \d+ t*
>
>   Table "public.tp_1_2"
>   Column |  Type   | Collation | Nullable | Default | Storage |
> Compression | Stats target | Description
> +-+---+--+-+-+-+--+-
>   i  | integer |   |  | | plain   |
> |  |
> Partition of: t FOR VALUES FROM (0) TO (2)
> Partition constraint: ((i IS NOT NULL) AND (i >= 0) AND (i < 2))
> Indexes:
>  "merge-16385-3A14B2-tmp_i_idx" btree (i)
>
> Is the name "merge-16385-3A14B2-tmp_i_idx" valid, or is it something
> temporary?
> end of quote
>
> Fix for this case added to file
> v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.
>
> 
>
> 2.
>  >It seems to me that v2-0001-Fix-for-SPLIT-MERGE-partitions-of-
>  >temporary-table.patch is not complete either.
>
> Added correction (and test), see
> v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.

Thank you, I'll review this later today.

--
Regards,
Alexander Korotkov




Re: post-freeze damage control

2024-04-11 Thread Alena Rybakina

Hi!

I also worked on this patch, and until your explanation I didn’t fully 
understand why this implementation of removing duplicate OR expressions 
was wrong.


Thank you, now I understand it!

I agree with Andrei Lepikhov's explanation regarding the difficulties in 
moving the patch to another place, and with the explanation of Alexander 
Korotkov and Peter Geoghegan regarding the need to apply this transformation.


Let me just add that initially this patch tried to solve a problem where 
50,000 OR expressions were generated and running that query with a 
BitmapScan plan was problematic. I wrote more about this here [0].


If the patch can be improved and you can tell me how (to be honest, I don’t 
quite understand how to do it yet), then I’ll be happy to work on it too.


[0] 
https://www.postgresql.org/message-id/052172e4-6d75-8069-3179-26de339dca03%40postgrespro.ru


--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Allow non-superuser to cancel superuser tasks.

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 04:55:59PM +0500, Kirill Reshke wrote:
> Posting updated version of this patch with comments above addressed.

I look for a commitfest entry for this one, but couldn't find it.  Would
you mind either creating one or, if I've somehow missed it, pointing me to
the existing entry?

https://commitfest.postgresql.org/48/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SET ROLE documentation improvement

2024-04-11 Thread Nathan Bossart
On Thu, Apr 11, 2024 at 10:33:15AM +0900, Michael Paquier wrote:
> On Tue, Apr 09, 2024 at 09:21:39AM +0300, Andrey M. Borodin wrote:
>> Can I ask you please to help me with determining status of CF item
>> [0]. Is it committed or there's something to move to next CF?
> 
> Only half of the patch has been applied as of 3330a8d1b792.  Yurii and
> Nathan, could you follow up with the rest?  Moving the patch to the
> next CF makes sense, and the last thread update is rather recent.

AFAICT there are two things remaining:

* Remove the "they lose their superuser privileges" note.
* Note that SET ROLE and SET SESSION AUTHORIZATION are exceptions.

While I think these are good changes, I don't sense any urgency here, so
I'm treating this as v18 material at this point.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Fix possible dereference null pointer (src/backend/replication/logical/reorderbuffer.c)

2024-04-11 Thread Ranier Vilela
On Thu, Apr 11, 2024 at 09:54, Heikki Linnakangas wrote:

> On 11/04/2024 15:03, Ranier Vilela wrote:
> > On Wed, Apr 10, 2024 at 18:28, Heikki Linnakangas wrote:
> >
> > On 10/04/2024 21:07, Ranier Vilela wrote:
> >  > Hi,
> >  >
> >  > Per Coverity.
> >  >
> >  > The function ReorderBufferTXNByXid,
> >  > can return NULL when the parameter *create* is false.
> >  >
> >  > In the functions ReorderBufferSetBaseSnapshot
> >  > and ReorderBufferXidHasBaseSnapshot,
> >  > the second call to ReorderBufferTXNByXid,
> >  > pass false to *create* argument.
> >  >
> >  > In the function ReorderBufferSetBaseSnapshot,
> >  > fixed passing true as argument to always return
> >  > a valid ReorderBufferTXN pointer.
> >  >
> >  > In the function ReorderBufferXidHasBaseSnapshot,
> >  > fixed by checking if the pointer is NULL.
> >
> > If it's a "known subxid", the top-level XID should already have its
> > ReorderBufferTXN entry, so ReorderBufferTXNByXid() should never return
> NULL.
> >
> > There are several conditions for not returning NULL;
> > I think trusting that it never happens is unsafe.
>
> Well, you could make it an elog(ERROR, ..) instead. But the point is
> that it should not happen, and if it does for some reason, that's very
> surprising and there is a bug somewhere. In that case, we should *not*
> just blindly create it and proceed as if everything was OK.
>
Thanks for the clarification.
I will then suggest improving robustness
without hiding any possible errors that may occur.

v1 patch attached.

best regards,
Ranier Vilela


v1-fix-possible-dereference-null-pointer-reorderbuffer.patch
Description: Binary data


Re: Typos in the code and README

2024-04-11 Thread Daniel Gustafsson
> On 11 Apr 2024, at 15:29, David Rowley  wrote:
> 
> On Fri, 12 Apr 2024, 1:05 am Daniel Gustafsson,  > wrote:
>> Now that the tree has settled down a bit post-freeze I ran some tooling to
>> check spelling.  I was primarily interested in docs and README* which were
> mostly free from simple typos, while the code had some in various comments 
>> and
>> one in code.  The attached fixes all that I came across (not cross-referenced
>> against ongoing reverts or any other fixup threads but will be before pushing
>> of course).
> 
> 
> I see you've corrected "iif" to "if". It should be "iff".

Gotcha, will fix.  I opted for "if" due to recent threads where the use of
"iff" was discouraged due to not being commonly known, but that was in doc/ and
not code comments.

--
Daniel Gustafsson



Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 10:46 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Heikki,
>
> I also prototyped the idea, which has almost the same shape.
> I attached just in case, but we may not have to see.
>
> Few comments based on the experiment.

Thank you for reviewing the patch. I didn't include the following
suggestions since firstly I wanted to just fix binaryheap part while
keeping other parts. If we need these changes, we can do them in
separate commits as fixes.

>
> ```
> +   /* txn_heap is ordered by transaction size */
> +   buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, 
> NULL);
> ```
>
> I think the pairing heap should be in the same MemoryContext as the buffer.
> Can we add MemoryContextSwitchTo()?

pairingheap_allocate() allocates a tiny amount of memory for the
pairing heap, and its memory usage doesn't grow even when adding more
data. And since it's allocated in the logical decoding context, its
lifetime is also fine. So I'm not sure it's worth including it in
rb->context for better memory accountability.

>
> ```
> +   /* Update the max-heap */
> +   if (oldsize != 0)
> +   pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +   pairingheap_add(rb->txn_heap, &txn->txn_node);
> ...
> +   /* Update the max-heap */
> +   pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +   if (txn->size != 0)
> +   pairingheap_add(rb->txn_heap, &txn->txn_node);
> ```
>
> Since the number of stored transactions does not affect the insert
> operation, we may be able to add the node while creating the
> ReorderBufferTXN and remove it while cleaning up. This can
> reduce branches in ReorderBufferChangeMemoryUpdate().

I think it also means that we need to remove the entry while cleaning
up even if it doesn't have any changes, which is done in O(log n). I
feel the current approach of not storing transactions with size 0
in the heap is better, and I'm not sure that reducing these branches
really contributes to the performance improvements.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Typos in the code and README

2024-04-11 Thread David Rowley
On Fri, 12 Apr 2024, 1:05 am Daniel Gustafsson,  wrote:

> Now that the tree has settled down a bit post-freeze I ran some tooling to
> check spelling.  I was primarily interested in docs and README* which were
> mostly free from simple typos, while the code had some in various comments
> and
> one in code.  The attached fixes all that I came across (not
> cross-referenced
> against ongoing reverts or any other fixup threads but will be before
> pushing
> of course).
>

I see you've corrected "iif" to "if". It should be "iff".

David

>


TerminateOtherDBBackends code comments inconsistency.

2024-04-11 Thread Kirill Reshke
Hi hackers!

While working on [0] I noticed this comment in
TerminateOtherDBBackends function:

/*
* Check whether we have the necessary rights to terminate other
* sessions. We don't terminate any session until we ensure that we
* have rights on all the sessions to be terminated. These checks are
* the same as we do in pg_terminate_backend.
*
* In this case we don't raise some warnings - like "PID %d is not a
* PostgreSQL server process", because for us already finished session
* is not a problem.
*/

This statement is not true after 3a9b18b.
"These checks are the same as we do in pg_terminate_backend."

But the code is still correct, I assume... or not? In fact, we are
killing autovacuum workers which are working with a given database
(proc->roleId == 0), which is OK in that case. Are there any other
cases when proc->roleId == 0 but we should not be able to kill such a
process?


[0] 
https://www.postgresql.org/message-id/flat/CALdSSPiOLADNe1ZS-1dnQXWVXgGAC6%3D3TBKABu9C3_97YGOoMg%40mail.gmail.com#4e869bc4b6ad88afe70e4484ffd256e2




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-11 Thread Dmitry Koval

Hi!

1.
Alexander Lakhin sent a question about index name after MERGE (partition 
name is the same as one of the merged partitions):


start of quote
I'm also confused by an index name after MERGE:
CREATE TABLE t (i int) PARTITION BY RANGE (i);

CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);

CREATE INDEX tidx ON t(i);
ALTER TABLE t MERGE PARTITIONS (tp_1_2, tp_0_1) INTO tp_1_2;
\d+ t*

 Table "public.tp_1_2"
 Column |  Type   | Collation | Nullable | Default | Storage | 
Compression | Stats target | Description

+-+---+--+-+-+-+--+-
 i  | integer |   |  | | plain   | 
   |  |

Partition of: t FOR VALUES FROM (0) TO (2)
Partition constraint: ((i IS NOT NULL) AND (i >= 0) AND (i < 2))
Indexes:
"merge-16385-3A14B2-tmp_i_idx" btree (i)

Is the name "merge-16385-3A14B2-tmp_i_idx" valid or it's something 
temporary?

end of quote

Fix for this case added to file 
v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.




2.
>It seems to me that v2-0001-Fix-for-SPLIT-MERGE-partitions-of-
>temporary-table.patch is not complete either.

Added correction (and test), see 
v3-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch.


--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.comFrom 58eb4abd4f065b6aa423bbddf62acd1799eba22e Mon Sep 17 00:00:00 2001
From: Koval Dmitry 
Date: Wed, 10 Apr 2024 18:54:05 +0300
Subject: [PATCH v3 1/2] Fix for SPLIT/MERGE partitions of temporary table

---
 src/backend/commands/tablecmds.c  | 30 ---
 src/backend/parser/parse_utilcmd.c|  2 +-
 src/test/regress/expected/partition_merge.out | 87 ++-
 src/test/regress/expected/partition_split.out | 29 +++
 src/test/regress/sql/partition_merge.sql  | 70 ++-
 src/test/regress/sql/partition_split.sql  | 23 +
 6 files changed, 227 insertions(+), 14 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a98a0af48..2769be55be 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -21145,17 +21145,20 @@ moveSplitTableRows(Relation rel, Relation splitRel, 
List *partlist, List *newPar
  * createPartitionTable: create table for a new partition with given name
  * (newPartName) like table (modelRelName)
  *
- * Emulates command: CREATE TABLE  (LIKE 
+ * Emulates command: CREATE [TEMP] TABLE  (LIKE 
  * INCLUDING ALL EXCLUDING INDEXES EXCLUDING IDENTITY)
  */
 static void
-createPartitionTable(RangeVar *newPartName, RangeVar *modelRelName,
+createPartitionTable(Relation rel, RangeVar *newPartName, RangeVar 
*modelRelName,
 AlterTableUtilityContext *context)
 {
CreateStmt *createStmt;
TableLikeClause *tlc;
PlannedStmt *wrapper;
 
+   newPartName->relpersistence = rel->rd_rel->relpersistence;
+   newPartName->schemaname = modelRelName->schemaname;
+
createStmt = makeNode(CreateStmt);
createStmt->relation = newPartName;
createStmt->tableElts = NIL;
@@ -21291,7 +21294,7 @@ ATExecSplitPartition(List **wqueue, AlteredTableInfo 
*tab, Relation rel,
SinglePartitionSpec *sps = (SinglePartitionSpec *) 
lfirst(listptr);
RelationnewPartRel;
 
-   createPartitionTable(sps->name, parentName, context);
+   createPartitionTable(rel, sps->name, parentName, context);
 
/* Open the new partition and acquire exclusive lock on it. */
newPartRel = table_openrv(sps->name, AccessExclusiveLock);
@@ -21491,7 +21494,8 @@ ATExecMergePartitions(List **wqueue, AlteredTableInfo 
*tab, Relation rel,
mergePartName = 
makeRangeVar(get_namespace_name(RelationGetNamespace(rel)),
 
tmpRelName, -1);
}
-   createPartitionTable(mergePartName,
+   createPartitionTable(rel,
+mergePartName,
 
makeRangeVar(get_namespace_name(RelationGetNamespace(rel)),
  
RelationGetRelationName(rel), -1),
 context);
@@ -21507,12 +21511,6 @@ ATExecMergePartitions(List **wqueue, AlteredTableInfo 
*tab, Relation rel,
/* Copy data from merged partitions to new partition. */
moveMergedTablesRows(rel, mergingPartitionsList, newPartRel);
 
-   /*
-* Attach a new partition to the partitioned table. wqueue = NULL:
-* verification for each cloned constraint is not need.
-*/
-   attachPartitionTable(NULL, rel, newP

Re: Table AM Interface Enhancements

2024-04-11 Thread Alexander Korotkov
Hi Andres,

On Wed, Apr 10, 2024 at 7:52 PM Andres Freund  wrote:
> On 2024-04-08 14:54:46 -0400, Robert Haas wrote:
> > Exactly how much is getting reverted here? I see these, all since March 
> > 23rd:
>
> IMO:
>
>
> > dd1f6b0c17 Provide a way block-level table AMs could re-use
> > acquire_sample_rows()
>
> Should be reverted.
>
>
> > 9bd99f4c26 Custom reloptions for table AM
>
> Hm. There are some oddities here:
>
> - It doesn't seem great that relcache.c now needs to know about the default
>   values for all kinds of reloptions.
>
> - why is there table_reloptions() and tableam_reloptions()?
>
> - Why does extractRelOptions() need a TableAmRoutine parameter, extracted by a
>   caller, instead of doing that work itself?
>
>
>
> > 97ce821e3e Fix the parameters order for
> > TableAmRoutine.relation_copy_for_cluster()
>
> Shouldn't be, this is a clear fix.
>
>
> > b1484a3f19 Let table AM insertion methods control index insertion
>
> I'm not sure. I'm not convinced this is right, nor the opposite. If the
> tableam takes control of index insertion, shouldn't nodeModifyTuple know this
> earlier, so it doesn't prepare a bunch of index insertion state?  Also,
> there's pretty much no motivating explanation in the commit.
>
>
> > 27bc1772fc Generalize relation analyze in table AM interface
>
> Should be reverted.
>
>
> > 87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
>
> Strongly suspect this should be reverted. The last time this was committed it
> was far from ready. It's very easy to cause corruption due to subtle bugs in
> this area.
>
>
> > c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
>
> If the AM returns a different slot, who is responsible for cleaning it up? And
> how is creating a new slot for every insert not going to be a measurable
> overhead?
>
>
> > 02eb07ea89 Allow table AM to store complex data structures in rd_amcache
>
> I am doubtful this is right.  Is it really sufficient to have a callback for
> freeing? What happens when relcache entries are swapped as part of a rebuild?
> That works for "flat" caches, but I don't immediately see how it works for
> more complicated data structures.  At least from the commit message it's hard
> to evaluate how this is actually intended to be used.

Thank you for your feedback.  I've reverted all of the above.

--
Regards,
Alexander Korotkov




Re: Typos in the code and README

2024-04-11 Thread Andrew Dunstan



On 2024-04-11 Th 09:05, Daniel Gustafsson wrote:

Now that the tree has settled down a bit post-freeze I ran some tooling to
check spelling.  I was primarily interested in docs and README* which were
mostly free from simple typos, while the code had some in various comments and
one in code.  The attached fixes all that I came across (not cross-referenced
against ongoing reverts or any other fixup threads but will be before pushing
of course).




I have these covered:


src/test/modules/test_json_parser/README  | 2 +-
.../test_json_parser/test_json_parser_incremental.c   | 4 ++--
src/test/modules/test_json_parser/test_json_parser_perf.c | 2 +-


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Typos in the code and README

2024-04-11 Thread Daniel Gustafsson
Now that the tree has settled down a bit post-freeze I ran some tooling to
check spelling.  I was primarily interested in docs and README* which were
mostly free from simple typos, while the code had some in various comments and
one in code.  The attached fixes all that I came across (not cross-referenced
against ongoing reverts or any other fixup threads but will be before pushing
of course).

--
Daniel Gustafsson



v1-0001-Fix-typos.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Heikki Linnakangas

On 11/04/2024 11:20, Masahiko Sawada wrote:

On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada  wrote:


We can see 2% ~ 3% performance regressions compared to the current
HEAD, but it's much smaller than I expected. Given that we can make
the code simple, I think we can go with this direction.


Pushed the patch and reverted binaryheap changes.


Thank you!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Fix possible dereference null pointer (src/backend/replication/logical/reorderbuffer.c)

2024-04-11 Thread Heikki Linnakangas

On 11/04/2024 15:03, Ranier Vilela wrote:
On Wed, Apr 10, 2024 at 18:28, Heikki Linnakangas <hlinn...@iki.fi> wrote:


On 10/04/2024 21:07, Ranier Vilela wrote:
 > Hi,
 >
 > Per Coverity.
 >
 > The function ReorderBufferTXNByXid,
 > can return NULL when the parameter *create* is false.
 >
 > In the functions ReorderBufferSetBaseSnapshot
 > and ReorderBufferXidHasBaseSnapshot,
 > the second call to ReorderBufferTXNByXid,
 > pass false to *create* argument.
 >
 > In the function ReorderBufferSetBaseSnapshot,
 > fixed passing true as argument to always return
 > a valid ReorderBufferTXN pointer.
 >
 > In the function ReorderBufferXidHasBaseSnapshot,
 > fixed by checking if the pointer is NULL.

If it's a "known subxid", the top-level XID should already have its
ReorderBufferTXN entry, so ReorderBufferTXN() should never return NULL.

There are several conditions under which it could return NULL;
I think trusting that it never does is risky.


Well, you could make it an elog(ERROR, ..) instead. But the point is 
that it should not happen, and if it does for some reason, that's very 
surprising and there is a bug somewhere. In that case, we should *not* 
just blindly create it and proceed as if everything was OK.
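
For illustration, here is a minimal sketch of that stricter variant, assuming the ReorderBufferTXNByXid() signature and the ReorderBufferSetBaseSnapshot() structure discussed in this thread (abbreviated, not the actual source):

/*
 * Sketch only: make the "should never happen" case explicit instead of
 * creating a new ReorderBufferTXN for it.
 */
void
ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
							 XLogRecPtr lsn, Snapshot snap)
{
	ReorderBufferTXN *txn;
	bool		is_new;

	/* create=true: the (sub)transaction may legitimately be unknown here */
	txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);

	if (rbtxn_is_known_subxact(txn))
	{
		TransactionId top_xid = txn->toplevel_xid;

		/* create=false: the top-level transaction must already exist */
		txn = ReorderBufferTXNByXid(rb, top_xid, false,
									NULL, InvalidXLogRecPtr, false);
		if (txn == NULL)
			elog(ERROR, "top-level transaction %u of subxact %u not found",
				 top_xid, xid);
	}

	txn->base_snapshot = snap;
	txn->base_snapshot_lsn = lsn;
	dlist_push_tail(&rb->txns_by_base_snapshot_lsn, &txn->base_snapshot_node);
}

Either way the second lookup stays create=false; the difference is only whether the impossible case trips an assertion or an elog(ERROR).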


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Fix possible dereference null pointer (src/backend/replication/logical/reorderbuffer.c)

2024-04-11 Thread Ranier Vilela
On Wed, Apr 10, 2024 at 18:28, Heikki Linnakangas wrote:

> On 10/04/2024 21:07, Ranier Vilela wrote:
> > Hi,
> >
> > Per Coverity.
> >
> > The function ReorderBufferTXNByXid,
> > can return NULL when the parameter *create* is false.
> >
> > In the functions ReorderBufferSetBaseSnapshot
> > and ReorderBufferXidHasBaseSnapshot,
> > the second call to ReorderBufferTXNByXid,
> > pass false to *create* argument.
> >
> > In the function ReorderBufferSetBaseSnapshot,
> > fixed passing true as argument to always return
> > a valid ReorderBufferTXN pointer.
> >
> > In the function ReorderBufferXidHasBaseSnapshot,
> > fixed by checking if the pointer is NULL.
>
> If it's a "known subxid", the top-level XID should already have its
> ReorderBufferTXN entry, so ReorderBufferTXN() should never return NULL.
>
There are several conditions under which it could return NULL;
I think trusting that it never does is risky.

It's not surprising if Coverity doesn't understand that, but setting the
> 'create' flag doesn't seem like the right fix.

ReorderBufferSetBaseSnapshot always expects that *txn* exists.
If the second call fails, the only solution is to create a new one, no?


> If we add "Assert(txn !=
> NULL)", does that silence it?
>
I think not. I always compile a release build (so assertions are not
compiled in) to send to Coverity.

best regards,
Ranier Vilela


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-11 Thread Alexander Lakhin

Hi Dmitry,

11.04.2024 11:59, Dmitry Koval wrote:



FWIW, I also proposed a patch earlier that fixes error messages and
comments in the split partition code


Sorry, I thought all the fixes you suggested were already included in v1-0002-Fixes-for-english-text.patch (but they 
are not).

Added missing lines to v2-0002-Fixes-for-english-text.patch.



It seems to me that 
v2-0001-Fix-for-SPLIT-MERGE-partitions-of-temporary-table.patch
is not complete either.
Take a look, please:
CREATE TABLE t (i int) PARTITION BY RANGE (i);
SET search_path = pg_temp, public;
CREATE TABLE tp_0_1 PARTITION OF t
  FOR VALUES FROM (0) TO (1);
-- fails with:
ERROR:  cannot create a temporary relation as partition of permanent relation 
"t"

But:
CREATE TABLE t (i int) PARTITION BY RANGE (i);
CREATE TABLE tp_0_1 PARTITION OF t
  FOR VALUES FROM (0) TO (1);
CREATE TABLE tp_1_2 PARTITION OF t
  FOR VALUES FROM (1) TO (2);
INSERT INTO t VALUES(0), (1);
SELECT * FROM t;
-- the expected result is:
 i
---
 0
 1
(2 rows)

SET search_path = pg_temp, public;
ALTER TABLE t
MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
-- succeeds, and
\c -
SELECT * FROM t;
-- gives:
 i
---
(0 rows)
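
To illustrate one possible direction (purely a sketch under my own assumptions, not taken from the posted patch), createPartitionTable() could reject a temporary placement explicitly, mirroring the error that CREATE TABLE ... PARTITION OF raises in the first example:

/*
 * Hypothetical guard, for discussion only: after resolving where the new
 * partition would be created, refuse a temporary placement when the
 * partitioned parent is permanent.
 */
Oid			namespaceId = RangeVarGetCreationNamespace(newPartName);

if (isTempNamespace(namespaceId) &&
	rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
	ereport(ERROR,
			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
			 errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
					RelationGetRelationName(rel))));

With such a check, the MERGE above would fail up front rather than succeed and lose the rows on reconnect.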

Please also ask your tech writers to check the contents of src/test/sql/*, if
possible (perhaps they'll fix "salesmans" and improve the grammar).

Best regards,
Alexander

Re: Allow non-superuser to cancel superuser tasks.

2024-04-11 Thread Kirill Reshke
Posting an updated version of this patch with the comments above addressed.

1) pg_signal_autovacuum -> pg_signal_autovacuum_worker, as there seem
to be no objections to that.

2)
There are comments on how to write the if statement:

> In core, do_autovacuum() is only called in a process without
> a role specified

> It's feeling more natural here to check that we have a superuser-owned
> backend first, and then do a lookup of the process type.

> I figured since there's no reason to rely on that behavior, we might as
> well do a bit of future-proofing in case autovacuum workers are ever not
> run as InvalidOid.

I have combined them into this:

if ((!OidIsValid(proc->roleId) || superuser_arg(proc->roleId))
&& pgstat_get_backend_type(GetNumberFromPGProc(proc)) == B_AUTOVAC_WORKER)

This is both future-proofing and natural, I suppose. The downside is
double-checking the condition (!OidIsValid(proc->roleId) ||
superuser_arg(proc->roleId)), but I think that is OK for the sake of
simplicity.
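
For readability this could also be wrapped in a small helper; a sketch of the same condition follows (the helper name is my own, not from the attached patch):

/*
 * Sketch of a helper expressing the condition above: the target must be an
 * autovacuum worker that is either role-less (the current behavior) or
 * superuser-owned (future-proofing).
 */
static bool
is_superuser_autovacuum_worker(PGPROC *proc)
{
	if (!OidIsValid(proc->roleId) || superuser_arg(proc->roleId))
		return pgstat_get_backend_type(GetNumberFromPGProc(proc)) == B_AUTOVAC_WORKER;

	return false;
}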

3) The pg_signal_autovacuum_worker OID was changed to a random one: 8916.

4)

> An invalid BackendType is not false, but B_INVALID.
fixed, thanks

5)

 There is pg_read_all_stats as well, so I don't see a big issue in
 requiring to be a member of this role as well for the sake of what's
 proposing here.
>>>
>>> Well, that tells you quite a bit more than just which PIDs correspond to
>>> autovacuum workers, but maybe that's good enough for now.
>>
>> That may be a good initial compromise, for now.

>Sounds good to me. I will update the documentation.

@Anthony, if you feel that a documentation update adds much value here,
please do. Given that we already know autovacuum worker PIDs from
pg_stat_progress_vacuum, I don't know how to reflect something about
pg_stat_autovac_worker in the docs, or whether it is worth it.

6)
> + INJECTION_POINT("autovacuum-start");
> Perhaps autovacuum-worker-start is more suited here

fixed, thanks

7)

> +# Copyright (c) 2022-2024, PostgreSQL Global Development Group
> [...]
> +# Copyright (c) 2024-2024, PostgreSQL Global Development Group

> These need to be cleaned up.

> +# Makefile for src/test/recovery
> +#
> +# src/test/recovery/Makefile

> This is incorrect, twice.

Cleaned up, thanks!

8)

> Not sure that there is a huge point in checking after a role that
> holds pg_signal_backend.
Ok. Removed.

Then:

> +like($psql_err, qr/ERROR: permission denied to terminate ...
> Checking only the ERROR, and not the DETAIL, should be sufficient
> here.


After removing the pg_signal_backend test case, we have only one place
where the error check is done. So I think we should keep DETAIL here to
ensure the detail is correct (it differs from the regular backend case).

9)
> +# Test signaling for pg_signal_autovacuum role.
> This needs a better documentation:

Updated. I hope the test documentation now makes it easier to understand.

10)

> +ok($terminate_with_pg_signal_av == 0, "Terminating autovacuum worker should 
> succeed with pg_signal_autovacuum role");
> Is that enough for the validation?

Added:
ok($node->log_contains(qr/FATAL: terminating autovacuum process due to
administrator command/, $offset),
"Autovacuum terminates when role is granted with pg_signal_autovacuum_worker");

11) References to the `passcheck` extension removed; error messages rephrased.

12) injection_point_detach added.

13)
> + INSERT INTO tab_int SELECT * FROM generate_series(1, 100);
> A good chunk of the test would be spent on that, but you don't need
> that many tuples to trigger an autovacuum worker as the short naptime
> is able to do it. I would recommend to reduce that to a minimum.

+1
Single tuple works.

14)

v3 suffers from a segfault:
2024-04-11 11:28:31.116 UTC [147437] 001_signal_autovacuum.pl LOG:
statement: SELECT pg_terminate_backend(147427);
2024-04-11 11:28:31.116 UTC [147427] FATAL:  terminating autovacuum
process due to administrator command
2024-04-11 11:28:31.116 UTC [147410] LOG:  server process (PID 147427)
was terminated by signal 11: Segmentation fault
2024-04-11 11:28:31.116 UTC [147410] LOG:  terminating any other
active server processes
2024-04-11 11:28:31.117 UTC [147410] LOG:  shutting down because
restart_after_crash is off
2024-04-11 11:28:31.121 UTC [147410] LOG:  database system is shut down

The test doesn't fail because pg_terminate_backend actually achieves its
goal: the autovacuum worker is killed. But while dying, the worker also
hits a segfault. That's because of injection points:


(gdb) bt
#0  0x56361c3379ea in tas (lock=0x7fbcb9632224 ) at
../../../../src/include/storage/s_lock.h:228
#1  ConditionVariableCancelSleep () at condition_variable.c:238
#2  0x56361c337e4b in ConditionVariableBroadcast
(cv=0x7fbcb66f498c) at condition_variable.c:310
#3  0x56361c330a40 in CleanupProcSignalState (status=, arg=) at procsignal.c:240
#4  0x56361c328801 in shmem_exit (code=code@entry=1) at ipc.c:276
#5  0x56361c3288fc in proc_exit_prepare (code=code@entry=1) at ipc.c:198
#6  0x56361c3289bf in proc_exit (code=code@entry=1) at ipc.

RE: Synchronizing slots from primary to standby

2024-04-11 Thread Zhijie Hou (Fujitsu)
On Thursday, April 11, 2024 12:11 PM Amit Kapila  
wrote:
> 
> On Wed, Apr 10, 2024 at 5:28 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Thursday, April 4, 2024 5:37 PM Amit Kapila 
> wrote:
> > >
> > > BTW, while thinking on this one, I
> > > noticed that in the function LogicalConfirmReceivedLocation(), we
> > > first update the disk copy, see comment [1] and then in-memory
> > > whereas the same is not true in
> > > update_local_synced_slot() for the case when snapshot exists. Now,
> > > do we have the same risk here in case of standby? Because I think we
> > > will use these xmins while sending the feedback message (in
> XLogWalRcvSendHSFeedback()).
> > >
> > > * We have to write the changed xmin to disk *before* we change
> > > * the in-memory value, otherwise after a crash we wouldn't know
> > > * that some catalog tuples might have been removed already.
> >
> > Yes, I think we have the risk on the standby, I can reproduce the case
> > that if the server crashes after updating the in-memory value and
> > before saving them to disk, the synced slot could be invalidated after
> > restarting from crash, because the necessary rows have been removed on
> > the primary. The steps can be found in [1].
> >
> > I think we'd better fix the order in update_local_synced_slot() as
> > well. I tried to make the fix in 0002, 0001 is Shveta's patch to fix
> > another issue in this thread. Since they are touching the same function, so
> attach them together for review.
> >
> 
> Few comments:
> ===
> 1.
> +
> + /* Sanity check */
> + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn)
> + ereport(LOG, errmsg("synchronized confirmed_flush for slot \"%s\"
> + differs from
> remote slot",
> +remote_slot->name),
> 
> Is there a reason to use elevel as LOG instead of ERROR? I think it should be
> elog(ERROR, .. as this is an unexpected case.

Agreed.

> 
> 2.
> - if (remote_slot->restart_lsn < slot->data.restart_lsn)
> + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
>   elog(ERROR,
>   "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> 
> Can we be more specific in this message? How about splitting it into
> error_message as "cannot synchronize local slot \"%s\"" and then errdetail as
> "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's
> LSN(%X/%X)"?

Your version looks better. Since both of the messages above have errdetail, I
used the ereport(ERROR, errmsg_internal(...), errdetail_internal(...)) style in
the patch, which is equivalent to elog(ERROR, ...) but carries an additional detail message.
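
Concretely, the check then looks roughly like this (a sketch based on the fields quoted in this thread; the exact hunk in V5 may differ slightly):

/* Reworked sanity check, per the errmsg/errdetail split suggested above. */
if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
	ereport(ERROR,
			errmsg_internal("cannot synchronize local slot \"%s\"",
							remote_slot->name),
			errdetail_internal("Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's LSN(%X/%X).",
							   LSN_FORMAT_ARGS(slot->data.confirmed_flush),
							   LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)));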

Here is V5 patch set.

Best Regards,
Hou zj


v5-0002-write-the-changed-xmin-to-disk-before-computing-g.patch
Description:  v5-0002-write-the-changed-xmin-to-disk-before-computing-g.patch


v5-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch
Description:  v5-0001-Correct-sanity-check-to-compare-confirmed_lsn.patch


Re: type in basebackup_incremental.c ?

2024-04-11 Thread Daniel Gustafsson
> On 11 Apr 2024, at 13:16, Andrew Dunstan  wrote:

> Thanks, I will include that in the cleanups I'm intending to push today.

+1, thanks!

--
Daniel Gustafsson





Re: type in basebackup_incremental.c ?

2024-04-11 Thread Andrew Dunstan



On 2024-04-11 Th 06:15, Daniel Gustafsson wrote:

On 11 Apr 2024, at 11:49, Daniel Westermann (DWE) 
 wrote:

Hi,

/*
  * we expect the find the last lines of the manifest, including the checksum,
  * in the last MIN_CHUNK bytes of the manifest. We trigger an incremental
  * parse step if we are about to overflow MAX_CHUNK bytes.
  */

Shouldn't this be:
/*
  * we expect to find the last lines of the manifest,...
  */

That sounds about right, and since it's a full sentence it should also start
with a capital 'W': "We expect to find the..".



Thanks, I will include that in the cleanups I'm intending to push today.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




