Re: Pluggable cumulative statistics

2024-07-08 Thread Bertrand Drouvot
Hi,

On Tue, Jul 09, 2024 at 10:45:05AM +0900, Michael Paquier wrote:
> On Mon, Jul 08, 2024 at 07:22:32AM +0000, Bertrand Drouvot wrote:
> > Yeah, what I meant to say is that one could think for example that's the
> > PgStatShared_Archiver size while in fact it's the PgStat_ArchiverStats size.
> > I think it's more confusing when writing the stats. Here we are manipulating
> > "snapshot" and "snapshot" offsets. It was not that confusing when reading 
> > as we
> > are manipulating "shmem" and "shared" offsets.
> > 
> > As I said, the code is fully correct, that's just the wording here that sounds
> > weird to me in the "snapshot" context.
> 
> After sleeping on it, I can see your point.  If we were to do the
> (shared_data_len -> stats_data_len) switch, could it make sense to
> rename shared_data_off to stats_data_off to have a better symmetry?
> This one is the offset of the stats data in a shmem entry, so perhaps
> shared_data_off is OK, but it feels a bit inconsistent as well.

Agree that if we were to rename one of them then the second one should be
renamed too.

I gave it a second thought, and I think it is the "data" part that leads to the
confusion (as it is too generic). What about:

shared_data_len -> shared_stats_len
shared_data_off -> shared_stats_off

That looks ok to me even in the snapshot context (shared is fine after all
because that's where the stats come from).

Attached a patch proposal doing so.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 390fdbc5b3fb1f25df809d0541507b886f1b15ec Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 9 Jul 2024 04:52:04 +
Subject: [PATCH v1] Renaming two fields in PgStat_KindInfo

Renaming shared_data_off to shared_stats_off and shared_data_len to
shared_stats_len. "data" seems too generic and started to be confusing since
b68b29bc8f makes use of the "_len" one in the snapshot context. Using "stats"
instead as that's what those fields are linked to.
---
 src/backend/utils/activity/pgstat.c | 56 ++---
 src/include/utils/pgstat_internal.h | 12 +++
 2 files changed, 34 insertions(+), 34 deletions(-)
  84.2% src/backend/utils/activity/
  15.7% src/include/utils/

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 65b937a85f..bb777423fb 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -274,8 +274,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.accessed_across_databases = true,
 
 		.shared_size = sizeof(PgStatShared_Database),
-		.shared_data_off = offsetof(PgStatShared_Database, stats),
-		.shared_data_len = sizeof(((PgStatShared_Database *) 0)->stats),
+		.shared_stats_off = offsetof(PgStatShared_Database, stats),
+		.shared_stats_len = sizeof(((PgStatShared_Database *) 0)->stats),
 		.pending_size = sizeof(PgStat_StatDBEntry),
 
 		.flush_pending_cb = pgstat_database_flush_cb,
@@ -288,8 +288,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.fixed_amount = false,
 
 		.shared_size = sizeof(PgStatShared_Relation),
-		.shared_data_off = offsetof(PgStatShared_Relation, stats),
-		.shared_data_len = sizeof(((PgStatShared_Relation *) 0)->stats),
+		.shared_stats_off = offsetof(PgStatShared_Relation, stats),
+		.shared_stats_len = sizeof(((PgStatShared_Relation *) 0)->stats),
 		.pending_size = sizeof(PgStat_TableStatus),
 
 		.flush_pending_cb = pgstat_relation_flush_cb,
@@ -302,8 +302,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.fixed_amount = false,
 
 		.shared_size = sizeof(PgStatShared_Function),
-		.shared_data_off = offsetof(PgStatShared_Function, stats),
-		.shared_data_len = sizeof(((PgStatShared_Function *) 0)->stats),
+		.shared_stats_off = offsetof(PgStatShared_Function, stats),
+		.shared_stats_len = sizeof(((PgStatShared_Function *) 0)->stats),
 		.pending_size = sizeof(PgStat_FunctionCounts),
 
 		.flush_pending_cb = pgstat_function_flush_cb,
@@ -317,8 +317,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.accessed_across_databases = true,
 
 		.shared_size = sizeof(PgStatShared_ReplSlot),
-		.shared_data_off = offsetof(PgStatShared_ReplSlot, stats),
-		.shared_data_len = sizeof(((PgStatShared_ReplSlot *) 0)->stats),
+		.shared_stats_off = offsetof(PgStatShared_ReplSlot, stats),
+		.shared_stats_len = sizeof(((PgStatShared_ReplSlot *) 0)->stats),
 
 		.reset_timestamp_cb = pgstat_replslot_reset_timestamp_cb,
 		.to_serialized_name = pgstat_replslot_to_serialized_name_cb,
@@ -333,8 +333,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.accessed_across_databases = true,
 
 		.

Re: Pluggable cumulative statistics

2024-07-08 Thread Bertrand Drouvot
Hi,

On Mon, Jul 08, 2024 at 07:22:32AM +, Bertrand Drouvot wrote:
> Except the above (which is just a Nit), 0001 LGTM.
> 

Looking at 0002:

It looks pretty straightforward, just one comment:

+   ptr = ((char *) ctl) + kind_info->shared_ctl_off;
+   kind_info->init_shmem_cb((void *) ptr);

I don't think we need to cast ptr to void when calling init_shmem_cb(). Looking
at some examples in the code, it does not look like we cast the argument to void
when a function has (void *) as a parameter (there are also examples in 0003
where it's not done, see the comments for 0003 below).

So I think removing the cast here would be more consistent.
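Just to illustrate the C rule at play here (a standalone sketch, not PostgreSQL
code): any object pointer converts implicitly to void *, so the explicit cast is
redundant:

"
#include <stdio.h>

/* a callback taking a generic pointer, like init_shmem_cb(void *ptr) */
static void
init_cb(void *ptr)
{
	printf("callback got %p\n", ptr);
}

int
main(void)
{
	char		buf[16];
	char	   *ptr = buf;

	init_cb((void *) ptr);		/* explicit cast: legal but redundant */
	init_cb(ptr);				/* implicit conversion to void *: same effect */

	return 0;
}
"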

Looking at 0003:

It looks pretty straightforward. Also for example, here:

+   fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
+   write_chunk_s(fpout, );
write_chunk(fpout, ptr, info->shared_data_len);

ptr is not cast to void when calling write_chunk() while its second parameter
is a "void *".

+   ptr = ((char *) shmem) + info->shared_ctl_off +
+   info->shared_data_off;
+
+   if (!read_chunk(fpin, ptr,

Same here for read_chunk().

I think that's perfectly fine and that we should do the same in 0002 when
calling init_shmem_cb() for consistency.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Pluggable cumulative statistics

2024-07-08 Thread Bertrand Drouvot
Hi,

On Mon, Jul 08, 2024 at 03:49:34PM +0900, Michael Paquier wrote:
> On Mon, Jul 08, 2024 at 06:39:56AM +0000, Bertrand Drouvot wrote:
> > +   for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
> > +   {
> > +   char   *ptr;
> > +   const PgStat_KindInfo *info = pgstat_get_kind_info(kind);
> > 
> > I wonder if we could avoid going through stats that are not fixed ones. What
> > about doing something like?
> > Same would apply for the read part added in 9004abf6206e.
> 
> This becomes more relevant when the custom stats are added, as this
> performs a scan across the full range of IDs supported.  So this
> choice is here for consistency, and to ease the pluggability.

Gotcha.

> 
> > 2 ===
> > 
> > +   pgstat_build_snapshot_fixed(kind);
> > +   ptr = ((char *) ) + info->snapshot_ctl_off;
> > +   write_chunk(fpout, ptr, info->shared_data_len);
> > 
> > I think that using "shared_data_len" is confusing here (was not the case in the
> > context of 9004abf6206e). I mean it is perfectly correct but the wording "shared"
> > looks weird to me when being used here. What it really is, is the size of the
> > stats. What about renaming shared_data_len with stats_data_len?
> 
> It is the stats data associated to a shared entry.  I think that's OK,
> but perhaps I'm just used to it as I've been staring at this area for
> days.

Yeah, what I meant to say is that one could think for example that's the
PgStatShared_Archiver size while in fact it's the PgStat_ArchiverStats size.
I think it's more confusing when writing the stats. Here we are manipulating
"snapshot" and "snapshot" offsets. It was not that confusing when reading as we
are manipulating "shmem" and "shared" offsets.

As I said, the code is fully correct, that's just the wording here that sounds
weird to me in the "snapshot" context.

Except the above (which is just a Nit), 0001 LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Pluggable cumulative statistics

2024-07-08 Thread Bertrand Drouvot
Hi,

On Mon, Jul 08, 2024 at 02:30:23PM +0900, Michael Paquier wrote:
> On Fri, Jul 05, 2024 at 09:35:19AM +0900, Michael Paquier wrote:
> > On Thu, Jul 04, 2024 at 11:30:17AM +, Bertrand Drouvot wrote:
> >> On Wed, Jul 03, 2024 at 06:47:15PM +0900, Michael Paquier wrote:
> >>> - PgStat_Snapshot holds an array of void* also indexed by
> >>> PGSTAT_NUM_KINDS, pointing to the fixed stats stored in the
> >>> snapshots.
> >> 
> >> Same, that's just a 96 bytes overhead (8 * PGSTAT_NUM_KINDS) as compared to now.
> > 
> > Still Andres does not seem to like that much, well ;)
> 
> Please find attached a rebased patch set labelled v4.

Thanks!

> Built-in
> fixed-numbered stats are still attached to the snapshot and shmem
> control structures, and custom fixed stats kinds are tracked in the
> same way as v3 with new members tracking data stored in
> TopMemoryContext for the snapshots and shmem for the control data.
> So, the custom and built-in stats kinds are separated into separate
> parts of the structures, including the "valid" flags for the
> snapshots.  And this avoids any redirection when looking at the
> built-in fixed-numbered stats.

Yeap.

> I've tried to address all the previous comments (there could be stuff
> I've missed while rebasing, of course).

Thanks!

> The first three patches are refactoring pieces to make the rest more
> edible, while 0004~ implement the main logic with templates in
> modules/injection_points:
> - 0001 refactors pgstat_write_statsfile() so that a loop on
> PgStat_KindInfo is used to write the data.  This is done with the
> addition of snapshot_ctl_off in PgStat_KindInfo, to point to the area
> in PgStat_Snapshot where the data is located for fixed stats.
> 9004abf6206e has done the same for the read part.

Looking at 0001:

1 ==

+   for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
+   {
+   char   *ptr;
+   const PgStat_KindInfo *info = pgstat_get_kind_info(kind);

I wonder if we could avoid going through stats that are not fixed ones. What
about doing something like?

"
for (int kind = ; kind <= ; kind++);
"

Would probably need to change the indexing logic though.

and then we could replace:

+   if (!info->fixed_amount)
+   continue;

with an assert instead. 

Same would apply for the read part added in 9004abf6206e.
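For illustration, a rough sketch of what I have in mind (the
PGSTAT_KIND_FIXED_FIRST / PGSTAT_KIND_FIXED_LAST bounds are hypothetical and
would require adjusting the kind ID ordering, as said above):

"
/*
 * Hypothetical: iterate only over the fixed-numbered kinds, assuming their IDs
 * form a contiguous range PGSTAT_KIND_FIXED_FIRST .. PGSTAT_KIND_FIXED_LAST.
 */
for (int kind = PGSTAT_KIND_FIXED_FIRST; kind <= PGSTAT_KIND_FIXED_LAST; kind++)
{
	const PgStat_KindInfo *info = pgstat_get_kind_info(kind);

	/* replaces "if (!info->fixed_amount) continue;" */
	Assert(info->fixed_amount);

	/* ... write (or read) the fixed stats for this kind as before ... */
}
"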

2 ===

+   pgstat_build_snapshot_fixed(kind);
+   ptr = ((char *) ) + info->snapshot_ctl_off;
+   write_chunk(fpout, ptr, info->shared_data_len);

I think that using "shared_data_len" is confusing here (was not the case in the
context of 9004abf6206e). I mean it is perfectly correct but the wording "shared"
looks weird to me when being used here. What it really is, is the size of the
stats. What about renaming shared_data_len with stats_data_len?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: walsender.c comment with no context is hard to understand

2024-07-08 Thread Bertrand Drouvot
Hi,

On Mon, Jul 08, 2024 at 11:20:45AM +0530, Amit Kapila wrote:
> On Mon, Jul 8, 2024 at 11:08 AM Bertrand Drouvot
>  wrote:
> >
> > On Mon, Jul 08, 2024 at 08:46:19AM +0530, Amit Kapila wrote:
> > > This sounds better but it is better to add just before we determine
> > > am_cascading_walsender as is done in the attached. What do you think?
> >
> > Thanks! LGTM.
> >
> 
> I would like to push this to HEAD only as we don't see any bug that
> this change can prevent. What do you think?
> 

Yeah, fully agree. I don't see how the previous check location could produce
any bug.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: walsender.c comment with no context is hard to understand

2024-07-07 Thread Bertrand Drouvot
Hi,

On Mon, Jul 08, 2024 at 08:46:19AM +0530, Amit Kapila wrote:
> This sounds better but it is better to add just before we determine
> am_cascading_walsender as is done in the attached. What do you think?

Thanks! LGTM.

> 
> BTW, is it possible that the promotion gets completed after we wait
> for the required WAL and before assigning am_cascading_walsender?

Yeah, I don't think there is anything that would prevent a promotion from
happening and completing here. I did a few tests (pausing the walsender with gdb
at various places and promoting the standby).

> think even if that happens we can correctly determine the required
> timeline because all the required WAL is already available, is that
> correct 

Yeah that's correct.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: walsender.c comment with no context is hard to understand

2024-07-06 Thread Bertrand Drouvot
Hi,

On Fri, Jul 05, 2024 at 11:10:00AM +0530, Amit Kapila wrote:
> On Fri, Jun 28, 2024 at 6:30 PM Bertrand Drouvot
>  wrote:
> >
> > On Fri, Jun 28, 2024 at 03:15:22PM +0530, Amit Kapila wrote:
> > > On Fri, Jun 28, 2024 at 12:55 PM Peter Smith  
> > > wrote:
> > > >
> > >
> > > I don't know whether your assumption is correct. AFAICS, those two
> > > lines should be together. Let us see if Bertrand remembers anything?
> > >
> >
> > IIRC the WalSndWaitForWal() call has been moved to ensure that we can
> > determine the timeline accurately.
> >
> 
> This part is understandable but I don't understand the part of the
> comment (This is needed to determine am_cascading_walsender accurately
> ..) atop a call to WalSndWaitForWal(). The am_cascading_walsender is
> determined based on the results of RecoveryInProgress(). Can the wait
> for WAL by using WalSndWaitForWal() change the result of
> RecoveryInProgress()?

No, but WalSndWaitForWal() must be called _before_ assigning
"am_cascading_walsender = RecoveryInProgress();". The reason is that during
a promotion, am_cascading_walsender must be assigned _after_ the walsender is
woken up (after the promotion), so that when the walsender exits
WalSndWaitForWal(), am_cascading_walsender is assigned "accurately" and so is
the timeline.

What I meant to say in this comment is that "am_cascading_walsender =
RecoveryInProgress();" must be called _after_
"flushptr = WalSndWaitForWal(targetPagePtr + reqLen);".
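In code terms, that's simply about the ordering of those two statements in
logical_read_xlog_page() (a sketch; everything else elided):

"
/* wait first: this is where the walsender can sleep across a promotion */
flushptr = WalSndWaitForWal(targetPagePtr + reqLen);

/* only now is the recovery state (and thus the timeline) accurate */
am_cascading_walsender = RecoveryInProgress();
"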

For example, swapping both lines would cause 035_standby_logical_decoding.pl
to fail during the promotion test, as the walsender would read from the
"previous" timeline and then produce things like:

"ERROR:  could not find record while sending logically-decoded data: invalid
record length at 0/6427B20: expected at least 24, got 0"

To avoid ambiguity, should we replace?

"
/*
 * Make sure we have enough WAL available before retrieving the current
 * timeline. This is needed to determine am_cascading_walsender accurately
 * which is needed to determine the current timeline.
 */
"

With:

"
/*
 * Make sure we have enough WAL available before retrieving the current
 * timeline. am_cascading_walsender must be assigned after
 * WalSndWaitForWal() (so that it is also correct when the walsender wakes
 * up after a promotion).
 */
"

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Pluggable cumulative statistics

2024-07-04 Thread Bertrand Drouvot
ed stats (by moving the "stats" part at the header of their dedicated
struct). That would mean having things like:

"
typedef struct PgStatShared_Archiver
{
PgStat_ArchiverStats stats;
/* lock protects ->reset_offset as well as stats->stat_reset_timestamp */
LWLock      lock;
uint32  changecount;
PgStat_ArchiverStats reset_offset;
} PgStatShared_Archiver;
"

Not sure that's worth it though.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Restart pg_usleep when interrupted

2024-07-02 Thread Bertrand Drouvot
Hi,

On Fri, Jun 28, 2024 at 05:50:20PM -0500, Sami Imseih wrote:
> 
> Thanks for the feedback!
> 
> > On Jun 28, 2024, at 4:34 PM, Tom Lane  wrote:
> > 
> > Sami Imseih  writes:
> >> Reattaching the patch.
> > 
> > I feel like this is fundamentally a wrong solution, for the reasons
> > cited in the comment for pg_usleep: long sleeps are a bad idea
> > because of the resulting uncertainty about whether we'll respond to
> > interrupts and such promptly.  An example here is that if we get
> > a query cancel interrupt, we should probably not insist on finishing
> > out the current sleep before responding.
> 
> The case which brought up this discussion is the pg_usleep that 
> is called within the vacuum_delay_point being interrupted. 
> 
> When I read the same code comment you cited, it sounded to me 
> that “long sleeps” are those that are in seconds or minutes. The longest 
> vacuum delay allowed is 100ms.

I think that with the proposed patch the actual wait time can be "long".

Indeed, the time between the interruptions and restarts of the nanosleep() call
will lead to drift (as mentioned in the nanosleep() man page). So, with a large
number of interruptions, the actual wait time could be way longer than the
expected wait time.

To put numbers, I did a few tests with the patch (and with v2 shared in [1]):

cost delay is 1ms and cost limit is 10.

With 50 indexes and 10 parallel workers I can see things like:

2024-07-02 08:22:23.789 UTC [2189616] LOG:  expected 1.00, actual 239.378368
2024-07-02 08:22:24.575 UTC [2189616] LOG:  expected 0.10, actual 224.331737
2024-07-02 08:22:25.363 UTC [2189616] LOG:  expected 1.30, actual 230.462793
2024-07-02 08:22:26.154 UTC [2189616] LOG:  expected 1.00, actual 225.980803

Means we waited more than the max allowed cost delay (100ms).

With 49 parallel workers, it's worse, as I can see things like:

2024-07-02 08:26:36.069 UTC [2189807] LOG:  expected 1.00, actual 1106.790136
2024-07-02 08:26:36.298 UTC [2189807] LOG:  expected 1.00, actual 218.148985

The first actual wait time is about 1 second (it has been interrupted about
16300 times during this second).

To avoid this drift, the nanosleep() man page suggests using clock_nanosleep()
with an absolute time value; that might be another idea to explore.
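For reference, a minimal standalone sketch (not a patch proposal; the function
name is illustrative) of what an absolute-deadline sleep could look like, so
that interruptions cannot stretch the total wait time:

"
#include <errno.h>
#include <time.h>

/* sleep "usec" microseconds, retrying against the same absolute deadline */
static void
sleep_absolute(long usec)
{
	struct timespec deadline;

	clock_gettime(CLOCK_MONOTONIC, &deadline);
	deadline.tv_sec += usec / 1000000L;
	deadline.tv_nsec += (usec % 1000000L) * 1000L;
	if (deadline.tv_nsec >= 1000000000L)
	{
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	/* EINTR means interrupted: go back to sleep until the same deadline */
	while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL) == EINTR)
		;
}
"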

[1]: 
https://www.postgresql.org/message-id/flat/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-07-01 Thread Bertrand Drouvot
Hi,

On Mon, Jul 01, 2024 at 09:39:17AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Wed, Jun 26, 2024 at 10:24:41AM +, Bertrand Drouvot wrote:
> > Hi,
> > 
> > On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote:
> > > Another thought for the RelationRelationId class case: we could check if there
> > > is a lock first and if there is no lock then acquire one. That way that would
> > > ensure the relation is always locked (so no "risk" anymore), but OTOH it may
> > > add "unnecessary" locking (see 2. mentioned previously).
> > 
> > Please find attached v12 implementing this idea for the RelationRelationId class
> > case. As mentioned, it may add unnecessary locking for 2. but I think that's
> > worth it to ensure that we are always on the safe side of things. This idea is
> > implemented in LockNotPinnedObjectById().
> 
> Please find attached v13, mandatory rebase due to 0cecc908e97. In passing, make
> use of CheckRelationOidLockedByMe() added in 0cecc908e97.

Please find attached v14, mandatory rebase due to 65b71dec2d5.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 31767e9eb290909943d03408d9f8e827cf17e0bd Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v14] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch takes into account any type of objects except the ones that are pinned
(they are not droppable because the system requires it).

A special case is done for objects that belong to the RelationRelationId class.
For those, we should be in one of the two following cases that would already
prevent the relation to be dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock on
its table, a table protected by a lock on a function that depends on the table...)

To avoid any risks for the RelationRelationId class case, we acquire a lock if
there is none. That may add unnecessary lock for 2. but that's worth it.

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  | 121 +++-
 src/backend/catalog/heap.c|   7 +
 src/backend/catalog/index.c   |  26 
 src/backend/catalog/objectaddress.c   |  57 
 src/backend/catalog/pg_aggregate.c|   9 ++
 src/backend/catalog/pg_attrdef.c  |   1 +
 src/backend/catalog/pg_cast.c |   5 +
 src/backend/catalog/pg_collation.c|   1 +
 src/backend/catalog/pg_constraint.c   |  26 
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c   |  40 +-
 src/backend/catalog/pg_operator.c |  19 +++
 src/backend/catalog/pg_proc.c |  17 ++-
 src/backend/catalog/pg_publication.c  |   7 +
 src/backend/catalog/pg_range.c|   6 +
 src/backend/catalog/pg_type.c |  39 ++
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/alter.c  |   4 +
 src/backend/commands/amcmds.c |   1 +
 src/backend/commands/cluster.c|   7 +
 src/backend/comman

Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING

2024-07-01 Thread Bertrand Drouvot
Hi,

On Mon, Jul 01, 2024 at 10:21:35AM -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > On Mon, Jul 01, 2024 at 06:42:46AM +, Bertrand Drouvot wrote:
> >> I think it would make sense to declare / define those functions only for
> >> assert enabled build: please find attached a tiny patch doing so.
> 
> > Not convinced that's a good idea.  What about out-of-core code that
> > may use these routines for runtime checks in non-assert builds?
> 
> Yeah.  Also, I believe it's possible for an extension that's been
> built with assertions enabled to be used with a core server that
> wasn't.  This is why, for example, ExceptionalCondition() is not
> ifdef'd away in a non-assert build.  Even if you think there's
> no use for CheckRelation[Oid]LockedByMe except in assertions,
> it'd still be plenty reasonable for an extension to call them
> in assertions.

Yeah good point, thanks for the feedback! I've withdrawn the CF entry.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING

2024-07-01 Thread Bertrand Drouvot
Hi,

On Mon, Jul 01, 2024 at 05:01:46PM +0900, Michael Paquier wrote:
> On Mon, Jul 01, 2024 at 06:42:46AM +0000, Bertrand Drouvot wrote:
> > While working on a rebase for [1] due to 0cecc908e97, I noticed that
> > CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only in
> > assertions.
> > 
> > I think it would make sense to declare / define those functions only for
> > assert enabled build: please find attached a tiny patch doing so.
> > 
> > Thoughts?
> 
> Not convinced that's a good idea.  What about out-of-core code that
> may use these routines for runtime checks in non-assert builds?

Thanks for the feedback.

Yeah that could be an issue for CheckRelationLockedByMe()
(CheckRelationOidLockedByMe() is too recent to be a concern).

Having said that, 1. out-of-core code could want to use CheckRelationOidLockedByMe()
(probably if it was already using CheckRelationLockedByMe()) and 2. I just
submitted a rebase for [1] in which I thought that using
CheckRelationOidLockedByMe() would be a good idea.

So I think that we can get rid of this proposal.

[1]: 
https://www.postgresql.org/message-id/ZoJ5RVtMziIa3TQp%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-07-01 Thread Bertrand Drouvot
Hi,

On Wed, Jun 26, 2024 at 10:24:41AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote:
> > Another thought for the RelationRelationId class case: we could check if 
> > there
> > is a lock first and if there is no lock then acquire one. That way that 
> > would
> > ensure the relation is always locked (so no "risk" anymore), but OTOH it may
> > add "unecessary" locking (see 2. mentioned previously).
> 
> Please find attached v12 implementing this idea for the RelationRelationId 
> class
> case. As mentioned, it may add unnecessary locking for 2. but I think that's
> worth it to ensure that we are always on the safe side of thing. This idea is
> implemented in LockNotPinnedObjectById().

Please find attached v13, mandatory rebase due to 0cecc908e97. In passing, make
use of CheckRelationOidLockedByMe() added in 0cecc908e97.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 461ae5d2fa037b2b9fd285c2a328c8bc785eaec8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v13] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch takes into account any type of objects except the ones that are pinned
(they are not droppable because the system requires it).

A special case is done for objects that belong to the RelationRelationId class.
For those, we should be in one of the two following cases that would already
prevent the relation to be dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock on
its table, a table protected by a lock on a function that depends on the table...)

To avoid any risks for the RelationRelationId class case, we acquire a lock if
there is none. That may add unnecessary lock for 2. but that's worth it.

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  | 121 +++-
 src/backend/catalog/heap.c|   7 +
 src/backend/catalog/index.c   |  26 
 src/backend/catalog/objectaddress.c   |  57 
 src/backend/catalog/pg_aggregate.c|   9 ++
 src/backend/catalog/pg_attrdef.c  |   1 +
 src/backend/catalog/pg_cast.c |   5 +
 src/backend/catalog/pg_collation.c|   1 +
 src/backend/catalog/pg_constraint.c   |  26 
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c   |  40 +-
 src/backend/catalog/pg_operator.c |  19 +++
 src/backend/catalog/pg_proc.c |  17 ++-
 src/backend/catalog/pg_publication.c  |   7 +
 src/backend/catalog/pg_range.c|   6 +
 src/backend/catalog/pg_type.c |  39 ++
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/alter.c  |   4 +
 src/backend/commands/amcmds.c |   1 +
 src/backend/commands/cluster.c|   7 +
 src/backend/commands/event_trigger.c  |   1 +
 src/backend/commands/extension.c  |   5 +
 src/backend/commands/foreigncmds.c|   7 +
 src/backend/commands/functioncmds.c   |   6 +
 src/backend/commands/indexcmds.c  |   2 +
 src/ba

Re: Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING

2024-07-01 Thread Bertrand Drouvot
Hi,

On Mon, Jul 01, 2024 at 12:35:34PM +0530, Bharath Rupireddy wrote:
> Hi,
> 
> On Mon, Jul 1, 2024 at 12:12 PM Bertrand Drouvot
>  wrote:
> >
> > Hi hackers,
> >
> > While working on a rebase for [1] due to 0cecc908e97, I noticed that
> > CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only in
> > assertions.
> >
> > I think it would make sense to declare / define those functions only for
> > assert enabled build: please find attached a tiny patch doing so.
> >
> > Thoughts?
> 
> If turning the CheckRelationXXXLocked() compile for non-assert builds,
> why not do the same for LWLockHeldByMe, LWLockAnyHeldByMe and
> LWLockHeldByMeInMode that are debug-only and being used in asserts?
> While it might reduce the compiled binary size a bit for release
> builds, we may have to be cautious about external or out of core
> modules using them.

Thanks for the feedback.

CheckRelationOidLockedByMe() is new (it has been added in 0cecc908e97). While
its counterpart CheckRelationLockedByMe() has been around for a few years (added
in 2018 by commit b04aeb0a053), I thought it would make sense to surround both
of them.

While it's true that we could also surround LWLockHeldByMe() (added in
e6cba71503f, 2004, with a signature change in ea9df812d85, 2014),
LWLockAnyHeldByMe() (added in eed959a457e, 2022) and LWLockHeldByMeInMode()
(added in 016abf1fb83, 2016), I'm not sure we should (due to their "age" and, as
you said, we have to be cautious about out-of-core modules / extensions that may
use them).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Surround CheckRelation[Oid]LockedByMe() with USE_ASSERT_CHECKING

2024-07-01 Thread Bertrand Drouvot
Hi hackers,

While working on a rebase for [1] due to 0cecc908e97, I noticed that
CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only in
assertions.

I think it would make sense to declare / define those functions only for
assert enabled build: please find attached a tiny patch doing so.

Thoughts?

[1]: 
https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From bee765581b6b356a6c9c2e4407243f1622b8e84a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 1 Jul 2024 05:38:30 +
Subject: [PATCH v1] Surround CheckRelation[Oid]LockedByMe() with
 USE_ASSERT_CHECKING

As CheckRelationLockedByMe() and CheckRelationOidLockedByMe() are used only in
assertions, declare / define those functions only for assert enabled build.
---
 src/backend/storage/lmgr/lmgr.c | 2 ++
 src/include/storage/lmgr.h  | 2 ++
 2 files changed, 4 insertions(+)
  50.0% src/backend/storage/lmgr/
  50.0% src/include/storage/

diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 094522acb41..234cab4f7af 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -318,6 +318,7 @@ UnlockRelation(Relation relation, LOCKMODE lockmode)
 	LockRelease(, lockmode, false);
 }
 
+#ifdef USE_ASSERT_CHECKING
 /*
  *		CheckRelationLockedByMe
  *
@@ -352,6 +353,7 @@ CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode, bool orstronger)
 
 	return LockHeldByMe(, lockmode, orstronger);
 }
+#endif
 
 /*
  *		LockHasWaitersRelation
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index ce15125ac3b..3d5f989ae80 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -46,10 +46,12 @@ extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void LockRelation(Relation relation, LOCKMODE lockmode);
 extern bool ConditionalLockRelation(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelation(Relation relation, LOCKMODE lockmode);
+#ifdef USE_ASSERT_CHECKING
 extern bool CheckRelationLockedByMe(Relation relation, LOCKMODE lockmode,
 	bool orstronger);
 extern bool CheckRelationOidLockedByMe(Oid relid, LOCKMODE lockmode,
 	   bool orstronger);
+#endif
 extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
-- 
2.34.1



Re: Track the amount of time waiting due to cost_delay

2024-06-30 Thread Bertrand Drouvot
Hi,

On Fri, Jun 28, 2024 at 08:07:39PM +, Imseih (AWS), Sami wrote:
> > 46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports
> > progress, and we currently don't handle interrupts by restarting the sleep with
> > the remaining time. nanosleep does provide the ability to restart with the
> > remaining time [1], but I don't think it's worth the effort to ensure more
> > accurate vacuum delays for the leader process.
> 
> After discussing offline with Bertrand, it may be better to have 
> a solution to deal with the interrupts and allows the sleep to continue to
> completion. This will simplify this patch and will be useful 
> for other cases in which parallel workers need to send a message
> to the leader. This is the thread [1] for that discussion.
> 
> [1] 
> https://www.postgresql.org/message-id/01000190606e3d2a-116ead16-84d2-4449-8d18-5053da66b1f4-00%40email.amazonses.com
> 

Yeah, I think it would make sense to put this thread on hold until we know more
about the outcome of [1] (mentioned above).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: walsender.c comment with no context is hard to understand

2024-06-28 Thread Bertrand Drouvot
Hi,

On Fri, Jun 28, 2024 at 03:15:22PM +0530, Amit Kapila wrote:
> On Fri, Jun 28, 2024 at 12:55 PM Peter Smith  wrote:
> >
> > On Fri, Jun 28, 2024 at 4:18 PM Amit Kapila  wrote:
> > >
> > > On Fri, Jun 28, 2024 at 5:15 AM Peter Smith  wrote:
> > > >
> > > > On Thu, Jun 27, 2024 at 3:44 PM Michael Paquier  
> > > > wrote:
> > > > >
> > > > > On Wed, Jun 26, 2024 at 02:30:26PM +0530, vignesh C wrote:
> > > > > > On Mon, 3 Jun 2024 at 11:21, Peter Smith  
> > > > > > wrote:
> > > > > >> Perhaps the comment should say something like it used to:
> > > > > > >> /* Fail if there is not enough WAL available. This can happen during
> > > > > >> shutdown. */
> > > > > >
> > > > > > Agree with this, +1 for this change.
> > > > >
> > > > > That would be an improvement.  Would you like to send a patch with all
> > > > > the areas you think could stand for improvements?
> > > > > --
> > > >
> > > > OK, I attached a patch equivalent of the suggestion in this thread.
> > > >
> > >
> > > Shouldn't the check for flushptr (if (flushptr < targetPagePtr +
> > > reqLen)) be moved immediately after the call to WalSndWaitForWal().
> > > The comment seems to suggests the same: "Make sure we have enough WAL
> > > available before retrieving the current timeline. .."
> > >
> >
> > Yes, as I wrote in the first post, those lines did once used to be
> > adjacent in logical_read_xlog_page.
> >
> > I also wondered if they still belonged together, but I opted for the
> > safest option of fixing only the comment instead of refactoring old
> > code when no problem had been reported.
> >
> > AFAICT these lines became separated due to a 2023 patch [1], and you
> > were one of the reviewers of that patch, so I assumed the code
> > separation was deemed OK at that time. Unless it was some mistake that
> > slipped past multiple reviewers?
> >
> 
> I don't know whether your assumption is correct. AFAICS, those two
> lines should be together. Let us see if Bertrand remembers anything?
> 

IIRC the WalSndWaitForWal() call has been moved to ensure that we can determine
the timeline accurately. I agree with Amit that it would make more sense to
move the (flushptr < targetPagePtr + reqLen) check just after the flushptr
assignment. I don't recall that we discussed any reason for not doing so.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: New standby_slot_names GUC in PG 17

2024-06-26 Thread Bertrand Drouvot
Hi,

On Wed, Jun 26, 2024 at 09:15:48AM +, Zhijie Hou (Fujitsu) wrote:
> Renamed these to the names suggested by Amit.
> 
> Attach the v2 patch set which addressed above and removed
> the changes in release-17.sgml according to the comment from Amit.
> 

Thanks! LGTM.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-26 Thread Bertrand Drouvot
Hi,

On Fri, Jun 21, 2024 at 01:22:43PM +, Bertrand Drouvot wrote:
> Another thought for the RelationRelationId class case: we could check if there
> is a lock first and if there is no lock then acquire one. That way that would
> ensure the relation is always locked (so no "risk" anymore), but OTOH it may
> add "unecessary" locking (see 2. mentioned previously).

Please find attached v12 implementing this idea for the RelationRelationId class
case. As mentioned, it may add unnecessary locking for 2. but I think that's
worth it to ensure that we are always on the safe side of things. This idea is
implemented in LockNotPinnedObjectById().

A few remarks:

- there is one place where the relation is not visible (even if
CommandCounterIncrement() is used). That's in TypeCreate(), because the new
relation Oid is _not_ added to pg_class yet.
Indeed, in heap_create_with_catalog(), AddNewRelationType() is called before
AddNewRelationTuple(). I put a comment in this part of the code explaining why
it's not necessary to call LockRelationOid() here.

- some namespace-related stuff is removed from
"test_oat_hooks/expected/alter_table.out".
That's due to the logic in cachedNamespacePath() and the fact that the same
namespace-related stuff is added earlier in alter_table.out.

- the patch touches 37 .c files, but that's mainly due to the fact that
LockNotPinnedObjectById() has to be called in a lot of places.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From d6c91b19a585d0a6f5c0abacb9c59cc5acd795f7 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v12] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch takes into account any type of objects except the ones that are pinned
(they are not droppable because the system requires it).

A special case is done for objects that belong to the RelationRelationId class.
For those, we should be in one of the two following cases that would already
prevent the relation to be dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock on
its table, a table protected by a lock on a function that depends on the table...)

To avoid any risks for the RelationRelationId class case, we acquire a lock if
there is none. That may add unnecessary lock for 2. but that's worth it.

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 contrib/test_decoding/expected/twophase.out   |   3 +-
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  | 125 -
 src/backend/catalog/heap.c|   7 +
 src/backend/catalog/index.c   |  26 
 src/backend/catalog/objectaddress.c   |  57 
 src/backend/catalog/pg_aggregate.c|   9 ++
 src/backend/catalog/pg_attrdef.c  |   1 +
 src/backend/catalog/pg_cast.c |   5 +
 src/backend/catalog/pg_collation.c|   1 +
 src/backend/catalog/pg_constraint.c   |  26 
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c   |  36 -
 src/backend/catalog/pg_operator.c |  19 +++
 src/backend/catalog/pg_proc.c |  17 ++-
 src/backend/catalog/pg_publication.c  |   7 +
 src/backend/catalog/pg_range.c|   6 +
 src/backend/catalog/pg_typ

Re: New standby_slot_names GUC in PG 17

2024-06-26 Thread Bertrand Drouvot
Hi,

On Wed, Jun 26, 2024 at 11:39:45AM +0530, Amit Kapila wrote:
> On Wed, Jun 26, 2024 at 10:19 AM Bertrand Drouvot
>  wrote:
> >
> >
> > 2 
> >
> > Should we rename StandbySlotNamesConfigData too?
> >
> 
> How about SyncStandbySlotsConfigData?
> 
> > 3 
> >
> > Should we rename SlotExistsInStandbySlotNames too?
> >
> 
> Similarly SlotExistsInSyncStandbySlots?
> 
> > 4 
> >
> > Should we rename validate_standby_slots() too?
> >
> 
> And validate_sync_standby_slots()?
> 

Thanks!

All of the above proposals sound good to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: New standby_slot_names GUC in PG 17

2024-06-25 Thread Bertrand Drouvot
Hi,

On Wed, Jun 26, 2024 at 04:17:45AM +, Zhijie Hou (Fujitsu) wrote:
> On Wednesday, June 26, 2024 9:40 AM Masahiko Sawada  
> wrote:
> > 
> > On Tue, Jun 25, 2024 at 5:32 PM Amit Kapila 
> > wrote:
> > >
> > > I feel synchronized better indicates the purpose because we ensure
> > > such slots are synchronized before we process changes for logical
> > > failover slots. We already have a 'failover' option for logical slots
> > > which could make things confusing if we add 'failover' where physical
> > > slots need to be specified.
> > 
> > Agreed. So +1 for synchronized_standby_slots.
> 
> +1.
> 
> Since there is a consensus on this name, I am attaching the patch to rename
> the GUC to synchronized_standby_slots. I have confirmed that the regression
> tests and pgindent passed for the patch.
> 

Thanks for the patch!

A few comments:

1 

In the commit message:

"
The standby_slot_names GUC is intended to allow specification of physical
standby slots that must be synchronized before they are visible to
subscribers
"

Not sure that wording is correct; if we feel the need to explain the GUC,
maybe repeat some wording from bf279ddd1c?

2 

Should we rename StandbySlotNamesConfigData too?

3 

Should we rename SlotExistsInStandbySlotNames too?

4 

Should we rename validate_standby_slots() too?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-25 Thread Bertrand Drouvot
Hi,

On Tue, Jun 25, 2024 at 01:12:16AM +, Imseih (AWS), Sami wrote:

Thanks for the feedback!

> >> 2. the leader being interrupted while waiting is also already happening on master
> >> due to the pgstat_progress_parallel_incr_param() calls in
> >> parallel_vacuum_process_one_index() (that have been added in
> >> 46ebdfe164). It has been the case "only" 36 times during my test case.
> 
> 46ebdfe164 will interrupt the leader's sleep every time a parallel worker reports
> progress, and we currently don't handle interrupts by restarting the sleep with
> the remaining time. nanosleep does provide the ability to restart with the
> remaining time [1], but I don't think it's worth the effort to ensure more
> accurate vacuum delays for the leader process.

+1. I don't think it's necessary to have a 100% accurate delay for all the
times the delay is involved. I think that's a heuristic parameter (along with
the cost limit). What matters in the end is by how much you've been able to
pause the whole vacuum (and not on a sleep-by-sleep basis).

> > 1. Having a time based only approach to throttle 
> 
> I do agree with a time based approach overall.
> 
> 
> > 1.1) the more parallel workers is used, the less the impact of the leader on
> > the vacuum index phase duration/workload is (because the repartition is done
> > on more processes).
> 
> Did you mean " because the vacuum is done on more processes"? 

Yes.

> When a leader is operating on a large index(s) during the entirety
> of the vacuum operation, wouldn't more parallel workers end up
> interrupting the leader more often?

That's right but my point was about the impact on the "whole" duration time and
"whole" workload (leader + workers included) and not about the number of times the
leader is interrupted. If there are, say, 100 workers then interrupting the leader
(1 process out of 101) is probably less of an issue as it means that there is a
lot of work to be done to have those 100 workers busy. I don't think the size of
the index the leader is vacuuming has an impact. I think that having the leader
vacuuming a 100 GB index or 100 x 1GB indexes is the same (as long as all the
other workers are active during all that time).

> > 3. A 1 second reporting "throttling" looks a reasonable threshold as:
> 
> > 3.1 the idea is to have a significant impact when the leader could have been
> > interrupted say hundred/thousand times per second.
> 
> > 3.2 it does not make that much sense for any tools to sample 
> > pg_stat_progress_vacuum
> > multiple times per second (so a one second reporting granularity seems ok).
> 
> I feel 1 second may still be too frequent. 

Maybe we'll need more measurements but this is what my test case is made of:

vacuum_cost_delay = 1
vacuum_cost_limit = 10
8 parallel workers, 1 leader
21 indexes (about 1GB each, one 40MB), all in memory

lead to:

With a 1 second reporting frequency, the leader has been interrupted about 2500
times over 8m39s, leading to about the same time as on master (8m43s).

> What about 10 seconds ( or 30 seconds )? 

I'm not sure (may need more measurements) but it would probably complicate the
reporting a bit (as with the current v3 we'd miss reporting the indexes that
take less time than the threshold to complete).

> I think this metric in particular will be mainly useful for vacuum runs that are
> running for minutes or more, making reporting every 10 or 30 seconds
> still useful.

Agree. OTOH, one could be interested in diagnosing what happened during, say, a
5-second peak in I/O resource consumption/latency. Sampling pg_stat_progress_vacuum
at a 1-second interval and seeing by how much the vacuum has been paused during
that time could help too (especially if it is made of a lot of parallel workers
that could lead to a lot of I/O). But it would miss data if we were reporting at
a larger interval.

> It just occurred to me also that pgstat_progress_parallel_incr_param 
> should have a code comment that it will interrupt a leader process and
> cause activity such as a sleep to end early.

Good point, I'll add a comment for it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: New standby_slot_names GUC in PG 17

2024-06-25 Thread Bertrand Drouvot
Hi,

On Tue, Jun 25, 2024 at 10:24:41AM +0530, Amit Kapila wrote:
> On Tue, Jun 25, 2024 at 8:20 AM Masahiko Sawada  wrote:
> >
> > On Tue, Jun 25, 2024 at 11:21 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > I agree the current name seems too generic and the suggested 'synchronized_standby_slots'
> > > is better than the current one.
> > >
> > > Some other ideas could be:
> > >
> > > synchronize_slots_on_standbys: it indicates that the standbys that enabled
> > > slot sync should be listed in this GUC.
> > >
> > > logical_replication_wait_slots: it means the logical replication(logical
> > > Walsender process) will wait for these slots to advance the confirm flush
> > > lsn before proceeding.
> >
> > I feel that the name that has some connection to "logical replication"
> > also sounds good. Let me add some ideas:
> >
> > - logical_replication_synchronous_standby_slots (might be too long)
> > - logical_replication_synchronous_slots
> >
> 
> I see your point about keeping logical_replication in the name but
> that could also lead one to think that this list can contain logical
> slots.

Agree, and we may add the same functionality for physical replication slots
in the future too (it has been discussed in the thread [1]). So I don't think
"logical" should be part of the name.

> OTOH, there is some value in keeping '_standby_' in the name as
> that is more closely associated with physical standby's and this list
> contains physical slots corresponding to physical standby's. So, my
> preference is in order as follows: synchronized_standby_slots,
> wait_for_standby_slots, logical_replication_wait_slots,
> logical_replication_synchronous_slots, and
> logical_replication_synchronous_standby_slots.
> 

I like the idea of having "synchronize[d]" in the name as it makes one think of
the feature it is linked to [2]. The slots mentioned in this parameter are
linked to the "primary_slot_name" parameter on the standby, so what about:

synchronized_primary_slot_names 

It makes it clear that it is somehow linked to "primary_slot_name" and that we
want them to be in sync.

So I'd vote for (in that order):

synchronized_primary_slot_names, synchronized_standby_slots

[1]: 
https://www.postgresql.org/message-id/bb437218-73bc-34c3-b8fb-8c1be4ddaec9%40enterprisedb.com
[2]: 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=93db6cbda037f1be9544932bd9a785dabf3ff712

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-24 Thread Bertrand Drouvot
Hi,

On Sat, Jun 22, 2024 at 12:48:33PM +, Bertrand Drouvot wrote:
> 1. vacuuming indexes time has been longer on master because with v2, the leader
> has been interrupted 342605 times while waiting, then making v2 "faster".
> 
> 2. the leader being interrupted while waiting is also already happening on master
> due to the pgstat_progress_parallel_incr_param() calls in
> parallel_vacuum_process_one_index() (that have been added in
> 46ebdfe164). It has been the case "only" 36 times during my test case.
> 
> I think that 2. is less of a concern but I think that 1. is something that needs
> to be addressed because the leader process is not honouring its cost delay wait
> time in a noticeable way (at least during my test case).
> 
> I did not think of a proposal yet, just sharing my investigation as to why
> v2 has been faster than master during the vacuuming indexes phase.

I think that a reasonable approach is to make the reporting from the parallel
workers to the leader less aggressive (i.e., make it occur less frequently).

Please find attached v3, that:

- ensures that there is at least 1 second between 2 reports, per parallel
worker, to the leader.

- ensures that the reported delayed time is still correct (keep track of the
delayed time between 2 reports).

- does not add any extra pg_clock_gettime_ns() calls (as compared to v2).
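To make the logic easier to follow, here is a simplified sketch of that
per-worker throttling (the names used below, e.g. worker_report_delay and
PROGRESS_VACUUM_TIME_DELAYED, are illustrative and not necessarily the exact
ones used in the patch):

"
/* per parallel worker state */
static int64 last_report_ns = 0;		/* timestamp of the last report */
static int64 delay_not_reported_ms = 0; /* delayed time accumulated since then */

static void
worker_report_delay(int64 delayed_ms, int64 now_ns)
{
	/* now_ns is a timestamp the caller already has: no extra clock call */
	delay_not_reported_ms += delayed_ms;

	/* report to the leader at most once per second, per worker */
	if (now_ns - last_report_ns >= INT64CONST(1000000000))
	{
		pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED,
											delay_not_reported_ms);
		delay_not_reported_ms = 0;
		last_report_ns = now_ns;
	}
}
"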

Remarks:

1. Having a time-based-only approach to throttle the reporting of the parallel
workers sounds reasonable. I don't think that the number of parallel workers has
to come into play as:

 1.1) the more parallel workers are used, the smaller the impact of the leader on
 the vacuum index phase duration/workload (because the work is distributed across
 more processes).

 1.2) the fewer parallel workers there are, the less the leader will be interrupted
 (fewer parallel workers would report their delayed time).

2. The throttling is not based on the cost limit as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The throttling is not based on the actual cost delay value because the leader
could be interrupted at the beginning, the middle or whatever part of the wait and
we are more interested in the frequency of the interrupts.

3. A 1 second reporting "throttling" looks like a reasonable threshold as:

 3.1 the idea is to have a significant impact when the leader could have been
interrupted, say, hundreds/thousands of times per second.

 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).

With this approach in place (v3 attached applied), during my test case:

- the leader has been interrupted about 2500 times (instead of about 345000
times with v2)

- the vacuum index phase duration is very close to the master one (it has been
4 seconds faster (over a 8 minutes 40 seconds duration time), instead of 3
minutes faster with v2).

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 99f417c0bcd7c29e126fdccdd6214ea37db67379 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 24 Jun 2024 08:43:26 +
Subject: [PATCH v3] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

In the case of a parallel worker, to avoid interrupting the leader too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report has been done more than 1 second ago.

Having a time-based-only approach to throttle the reporting of the parallel
workers sounds reasonable.

Indeed when deciding about the throttling:

1. The number of parallel workers should not come into play:

 1.1) the more parallel workers are used, the smaller the leader's impact on
 the vacuum index phase duration/workload (because the work is spread across
 more processes).

 1.2) the fewer parallel workers there are, the less the leader will be
 interrupted (fewer parallel workers would report their delayed time).

2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the middle, or any other part of the wait, and we are more
interested in the frequency of the interrupts.

4. A 1 second reporting "throttling" looks like a reasonable threshold as:

 4.1 the idea is to have a significant impact when the leader could have been
interrupted say hundreds/thousands of times per second.

 4.2 it does n

Re: Track the amount of time waiting due to cost_delay

2024-06-22 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 11:56:26AM +, Bertrand Drouvot wrote:
> == Observations 
> ===
> 
> As compare to v2:
> 
> 1. scanning heap time is about the same
> 2. vacuuming indexes time is about 3 minutes longer on master
> 3. vacuuming heap time is about the same

I had a closer look to understand why the vacuuming indexes time has been about
3 minutes longer on master.

During the vacuuming indexes phase, the leader helps vacuum the indexes
until it reaches WaitForParallelWorkersToFinish() (meaning that once all the
remaining indexes are being handled by the parallel workers, the leader has
nothing more to do and so it waits for the parallel workers to finish).

During the time the leader process is involved in indexes vacuuming it is
also subject to wait due to cost delay.

But with v2 applied, the leader may be interrupted by the parallel workers while
it is waiting (due to the new pgstat_progress_parallel_incr_param() calls that
the patch is adding).

So, with v2 applied, the leader is waiting less (as interrupted while waiting)
when being involved in indexes vacuuming and that's why v2 is "faster" than
master.

To put some numbers on this, I counted the number of times the leader's
pg_usleep() has been interrupted (by counting the number of times nanosleep()
returned a value < 0 in pg_usleep()). Here they are:

v2: the leader has been interrupted about 342605 times
master: the leader has been interrupted about 36 times

The ones on master are mainly coming from the 
pgstat_progress_parallel_incr_param() 
calls in parallel_vacuum_process_one_index().

The additional ones on v2 are coming from the 
pgstat_progress_parallel_incr_param()
calls added in vacuum_delay_point().
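
For reference, the counting was done with ad-hoc instrumentation along these
lines (a debug-only sketch; the counter is made up and the real pg_usleep()
obviously does not carry it):

/* src/port/pgsleep.c, non-Windows path, with a made-up debug counter */
static uint64 interrupted_sleeps = 0;

void
pg_usleep(long microsec)
{
	if (microsec > 0)
	{
		struct timespec delay;

		delay.tv_sec = microsec / 1000000L;
		delay.tv_nsec = (microsec % 1000000L) * 1000;
		if (nanosleep(&delay, NULL) < 0)
			interrupted_sleeps++;	/* sleep was cut short (e.g. by a signal) */
	}
}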

== Conclusion ==

1. vacuuming indexes time has been longer on master because with v2, the leader
has been interrupted 342605 times while waiting, then making v2 "faster".

2. the leader being interrupted while waiting is also already happening on 
master
due to the pgstat_progress_parallel_incr_param() calls in
parallel_vacuum_process_one_index() (that have been added in 
46ebdfe164). It has been the case "only" 36 times during my test case.

I think that 2. is less of a concern but I think that 1. is something that needs
to be addressed because the leader process is not honouring its cost delay wait
time in a noticeable way (at least during my test case).

I did not think of a proposal yet, just sharing my investigation as to why
v2 has been faster than master during the vacuuming indexes phase.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-21 Thread Bertrand Drouvot
Hi,

On Wed, Jun 19, 2024 at 02:11:50PM +, Bertrand Drouvot wrote:
> To sum up, I did not see any cases that did not lead to 1. or 2., so I think
> it's safe to not add an extra lock for the RelationRelationId case. If, for 
> any
> reason, there is still cases that are outside 1. or 2. then they may lead to
> orphaned dependencies linked to the RelationRelationId class. I think that's
> fine to take that "risk" given that a. that would not be worst than currently
> and b. I did not see any of those in our fleet currently (while I have seen a 
> non
> negligible amount outside of the RelationRelationId case).

Another thought for the RelationRelationId class case: we could check if there
is a lock first and, if there is no lock, then acquire one. That would
ensure the relation is always locked (so no "risk" anymore), but OTOH it may
add "unnecessary" locking (see 2. mentioned previously).
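
A rough sketch of the idea (the helper name is made up; the real code would
ask the lock manager whether we already hold a lock on the relation):

static void
LockRelationOidIfNotHeld(Oid relid)
{
	/* hypothetical check against the local lock table */
	if (!relation_lock_held_by_me(relid, AccessShareLock))
		LockRelationOid(relid, AccessShareLock);
}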

I think I do prefer this approach, to be on the safe side of things. What do
you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Pluggable cumulative statistics

2024-06-20 Thread Bertrand Drouvot
Hi,

On Thu, Jun 20, 2024 at 09:46:30AM +0900, Michael Paquier wrote:
> On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote:
> > - How should the persistence of the custom stats be achieved?
> > Callbacks to give custom stats kinds a way to write/read their data,
> > push everything into a single file, or support both?
> > - Should this do like custom RMGRs and assign to each stats kinds ID
> > that are set in stone rather than dynamic ones?

> These two questions have been itching me in terms of how it would work
> for extension developers, after noticing that custom RMGRs are used
> more than I thought:
> https://wiki.postgresql.org/wiki/CustomWALResourceManagers
> 
> The result is proving to be nicer, shorter by 300 lines in total and
> much simpler when it comes to think about the way stats are flushed
> because it is possible to achieve the same result as the first patch
> set without manipulating any of the code paths doing the read and
> write of the pgstats file.

I think it makes sense to follow the same "behavior" as the custom
WAL resource managers. That, indeed, looks much simpler than v1.

> In terms of implementation, pgstat.c's KindInfo data is divided into
> two parts, for efficiency:
> - The exiting in-core stats with designated initializers, renamed as
> built-in stats kinds.
> - The custom stats kinds are saved in TopMemoryContext,

Agree that a backend lifetime memory area is fine for that purpose.

> and can only
> be registered with shared_preload_libraries.  The patch reserves a set
> of 128 harcoded slots for all the custom kinds making the lookups for
> the KindInfos quite cheap.

+   MemoryContextAllocZero(TopMemoryContext,
+  
sizeof(PgStat_KindInfo *) * PGSTAT_KIND_CUSTOM_SIZE);

and that's only 8 * PGSTAT_KIND_CUSTOM_SIZE bytes in total.
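
To make sure I read it right, here is roughly how I picture the custom slots
handling (just a sketch based on the snippets above; PGSTAT_KIND_CUSTOM_MIN and
the function names are assumptions on my side):

static const PgStat_KindInfo **pgstat_kind_custom_infos = NULL;

static void
init_custom_kind_slots(void)
{
	if (pgstat_kind_custom_infos == NULL)
		pgstat_kind_custom_infos = (const PgStat_KindInfo **)
			MemoryContextAllocZero(TopMemoryContext,
								   sizeof(PgStat_KindInfo *) * PGSTAT_KIND_CUSTOM_SIZE);
}

static const PgStat_KindInfo *
lookup_custom_kind_info(PgStat_Kind kind)
{
	uint32		idx = kind - PGSTAT_KIND_CUSTOM_MIN;	/* assumed lower bound */

	Assert(idx < PGSTAT_KIND_CUSTOM_SIZE);
	return pgstat_kind_custom_infos[idx];	/* NULL if not registered */
}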

I had a quick look at the patches (have in mind to do more):

> With that in mind, the patch set is more pleasant to the eye, and the
> attached v2 consists of:
> - 0001 and 0002 are some cleanups, same as previously to prepare for
> the backend-side APIs.

0001 and 0002 look pretty straightforward at a quick look.

> - 0003 adds the backend support to plug-in custom stats.

1 ===

It looks to me that there is a mix of "in core" and "built-in" to name the
non-custom stats. Maybe it's worth just using one?

From what I can see (and as you said above) this is mainly inspired by the custom
resource managers, and 2 === and 3 === are probably copy/paste consequences.

2 ===

+   if (pgstat_kind_custom_infos[idx] != NULL &&
+   pgstat_kind_custom_infos[idx]->name != NULL)
+   ereport(ERROR,
+   (errmsg("failed to register custom cumulative 
statistics \"%s\" with ID %u", kind_info->name, kind),
+errdetail("Custom resource manager \"%s\" 
already registered with the same ID.",
+  
pgstat_kind_custom_infos[idx]->name)));

s/Custom resource manager/Custom cumulative statistics/

3 ===

+   ereport(LOG,
+   (errmsg("registered custom resource manager \"%s\" with 
ID %u",
+   kind_info->name, kind)));

s/custom resource manager/custom cumulative statistics/

> - 0004 includes documentation.

Did not look yet.

> - 0005 is an example of how to use them, with a TAP test providing
> coverage.

Did not look yet.

As I said, I have in mind to do a more in-depth review. I noted the above while
doing a quick read of the patches, so I thought it makes sense to share them
now while at it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Pluggable cumulative statistics

2024-06-20 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 04:59:50PM +0900, Michael Paquier wrote:
> Hi all,
> 
> 2) Make the shmem pgstats pluggable so as it is possible for extensions
> to register their own stats kinds.

Thanks for the patch! I like the idea of having custom stats (something along
those lines has also been mentioned in [1]).

> 2) is actually something that can be used for more things than
> just pg_stat_statements, because people love extensions and
> statistics (spoiler: I do).

+1

> * Making custom stats data persistent is an interesting problem, and
> there are a couple of approaches I've considered:
> ** Allow custom kinds to define callbacks to read and write data from
> a source they'd want, like their own file through a fd.  This has the
> disadvantage to remove the benefit of c) above.
> ** Store everything in the existing stats file, adding one type of
> entry like 'S' and 'N' with a "custom" type, where the *name* of the
> custom stats kind is stored instead of its ID computed from shared
> memory.

What about having 2 files?

- One is the existing stats file
- One "predefined" for all the custom stats (so what you've done minus the
in-core stats). This one would not be configurable and the extensions will
not need to know about it.

Would that remove the benefit from c) that you mentioned up-thread?

[1]: 
https://www.postgresql.org/message-id/20220818195124.c7ipzf6c5v7vxymc%40awork3.anarazel.de

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-19 Thread Bertrand Drouvot
Hi,

On Mon, Jun 17, 2024 at 05:57:05PM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Jun 17, 2024 at 12:24:46PM -0400, Robert Haas wrote:
> > So to try to sum up here: I'm not sure I agree with this design. But I
> > also feel like the design is not as clear and consistently implemented
> > as it could be. So even if I just ignored the question of whether it's
> > the right design, it feels like we're a ways from having something
> > potentially committable here, because of issues like the ones I
> > mentioned in the last paragraph.
> > 
> 
> Agree. I'll now move on with the "XXX Do we need a lock for 
> RelationRelationId?"
> comments that I put in v10 (see [1]) and study all the cases around them.

A. I went through all of them, did some testing for all, and reached the
conclusion that we must be in one of the two following cases that would already
prevent the relation from being dropped:

1. The relation is already locked (could be an existing relation or a relation
that we are creating).

2. The relation is protected indirectly (i.e. an index protected by a lock on
its table, a table protected by a lock on a function that depends on
the table...).

So we don't need to add a lock if this is a RelationRelationId object class for
the cases above.

As a consequence, I replaced the "XXX" related comments that were in v10 with
another set of comments in v11 (attached) like "No need to call
LockRelationOid() (through LockNotPinnedObject())". The reason is to make it
clear in the code and also to ease the review.

B. I explained in [1] (while sharing v10) that the object locking is now outside
of the dependency code except for (and I explained why):

recordDependencyOnExpr() 
recordDependencyOnSingleRelExpr()
makeConfigurationDependencies()

So I also did some testing, on the RelationRelationId case, for those and I
reached the same conclusion as the one shared above.

For A. and B. the testing has been done by adding an "ereport(WARNING.." at
those places when a RelationRelationId is involved. Then I ran "make check"
and went to the failed tests (outputs were not the expected ones due to the
extra "WARNING"), reproduced them with gdb and checked for the lock on the
relation producing the "WARNING". All of those were linked to 1. or 2.
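
To be clear, the instrumentation was nothing more than something like the
following, placed at the call sites discussed above, where "object" is the
ObjectAddress at hand (the message wording here is made up):

	if (object->classId == RelationRelationId)
		ereport(WARNING,
				(errmsg("handling a dependency on relation %u without an explicit lock",
						object->objectId)));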

Note that adding an assertion on an existing lock would not work for the cases
described in 2.

So, I'm now confident that we must be in 1. or 2. but it's also possible
that I've missed some cases (though I think the way the testing has been done is
not that weak).

To sum up, I did not see any cases that did not lead to 1. or 2., so I think
it's safe to not add an extra lock for the RelationRelationId case. If, for any
reason, there is still cases that are outside 1. or 2. then they may lead to
orphaned dependencies linked to the RelationRelationId class. I think that's
fine to take that "risk" given that a. that would not be worst than currently
and b. I did not see any of those in our fleet currently (while I have seen a 
non
negligible amount outside of the RelationRelationId case).

> Once done, I think that it will easier to 1.remove ambiguity, 2.document and
> 3.do the "right" thing regarding the RelationRelationId object class.
> 

Please find attached v11, where I added more detailed comments in the commit
message and also in the code (I also removed the useless check on
AuthMemRelationId).

[1]: 
https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 50e11432960e0ad5d940d2e7d9557fc4770d8262 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v11] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the n

Re: Avoid orphaned objects dependencies, take 3

2024-06-17 Thread Bertrand Drouvot
Hi,

On Mon, Jun 17, 2024 at 12:24:46PM -0400, Robert Haas wrote:
> On Fri, Jun 14, 2024 at 3:54 AM Bertrand Drouvot
>  wrote:
> > > Ah, right. So, I was assuming that, with either this version of your
> > > patch or the earlier version, we'd end up locking the constraint
> > > itself. Was I wrong about that?
> >
> > The child contraint itself is not locked when going through
> > ConstraintSetParentConstraint().
> >
> > While at it, let's look at a full example and focus on your concern.
> 
> I'm not at the point of having a concern yet, honestly. I'm trying to
> understand the design ideas. The commit message just says that we take
> a conflicting lock, but it doesn't mention which object types that
> principle does or doesn't apply to. I think the idea of skipping it
> for cases where it's redundant with the relation lock could be the
> right idea, but if that's what we're doing, don't we need to explain
> the principle somewhere? And shouldn't we also apply it across all
> object types that have the same property?

Yeah, I still need to deeply study this area and document it. 

> Along the same lines:
> 
> + /*
> + * Those don't rely on LockDatabaseObject() when being dropped (see
> + * AcquireDeletionLock()). Also it looks like they can not produce
> + * orphaned dependent objects when being dropped.
> + */
> + if (object->classId == RelationRelationId || object->classId ==
> AuthMemRelationId)
> + return;
> 
> "It looks like X cannot happen" is not confidence-inspiring.

Yeah, it is not. It is just a "feeling" that I need to work on to remove
any ambiguity and/or adjust the code as needed.

> At the
> very least, a better comment is needed here. But also, that relation
> has no exception for AuthMemRelationId, only for RelationRelationId.
> And also, the exception for RelationRelationId doesn't imply that we
> don't need a conflicting lock here: the special case for
> RelationRelationId in AcquireDeletionLock() is necessary because the
> lock tag format is different for relations than for other database
> objects, not because we don't need a lock at all. If the handling here
> were really symmetric with what AcquireDeletionLock(), the coding
> would be to call either LockRelationOid() or LockDatabaseObject()
> depending on whether classid == RelationRelationId.

Agree.

> Now, that isn't
> actually necessary, because we already have relation-locking calls
> elsewhere, but my point is that the rationale this commit gives for
> WHY it isn't necessary seems to me to be wrong in multiple ways.

Agree. I'm not done with that part yet (should have made it more clear).

> So to try to sum up here: I'm not sure I agree with this design. But I
> also feel like the design is not as clear and consistently implemented
> as it could be. So even if I just ignored the question of whether it's
> the right design, it feels like we're a ways from having something
> potentially committable here, because of issues like the ones I
> mentioned in the last paragraph.
> 

Agree. I'll now move on with the "XXX Do we need a lock for RelationRelationId?"
comments that I put in v10 (see [1]) and study all the cases around them.

Once done, I think that it will be easier to 1. remove ambiguity, 2. document, and
3. do the "right" thing regarding the RelationRelationId object class.

[1]: 
https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Allow logical failover slots to wait on synchronous replication

2024-06-17 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 10:00:46AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Jun 10, 2024 at 09:25:10PM -0500, Nathan Bossart wrote:
> > On Mon, Jun 10, 2024 at 03:51:05PM -0700, John H wrote:
> > > The existing 'standby_slot_names' isn't great for users who are running
> > > clusters with quorum-based synchronous replicas. For instance, if
> > > the user has  synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a
> > > bit tedious to have to reconfigure the standby_slot_names to set it to
> > > the most updated 3 sync replicas whenever different sync replicas start
> > > lagging. In the event that both GUCs are set, 'standby_slot_names' takes
> > > precedence.
> > 
> > Hm.  IIUC you'd essentially need to set standby_slot_names to "A,B,C,D,E"
> > to get the desired behavior today.  That might ordinarily be okay, but it
> > could cause logical replication to be held back unnecessarily if one of the
> > replicas falls behind for whatever reason.  A way to tie standby_slot_names
> > to synchronous replication instead does seem like it would be useful in
> > this case.
> 
> FWIW, I have the same understanding and also think your proposal would be
> useful in this case.

A few random comments about v1:

1 

+   int mode = SyncRepWaitMode;

It's set to SyncRepWaitMode and then never changed. Worth getting rid of "mode"?

2 

+   static XLogRecPtr   lsn[NUM_SYNC_REP_WAIT_MODE] = 
{InvalidXLogRecPtr};

I did some testing and saw that the lsn[] values were not always set to
InvalidXLogRecPtr right after. It looks like, in that case, we should
avoid setting the lsn[] values at compile time. Then, what about one of the following?

1. remove the "static". 

or

2. keep the static but set the lsn[] values after its declaration.
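
To illustrate option 2, that would give something like (only a sketch of the
initialization part, inside the function):

	static XLogRecPtr	lsn[NUM_SYNC_REP_WAIT_MODE];

	/* explicitly reset the values on each call instead of relying on the
	 * compile-time initializer */
	for (int i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
		lsn[i] = InvalidXLogRecPtr;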

3 

-   /*
-* Return false if not all the standbys have caught up to the specified
-* WAL location.
-*/
-   if (caught_up_slot_num != standby_slot_names_config->nslotnames)
-   return false;
+   if (!XLogRecPtrIsInvalid(lsn[mode]) && lsn[mode] >= 
wait_for_lsn)
+   return true;

The lsn[] values have (or should have, see 2 above) just been initialized to
InvalidXLogRecPtr, so XLogRecPtrIsInvalid(lsn[mode]) will always return
true. I think this check is not needed then.

4 

> > > I did some very brief pgbench runs to compare the latency. Client instance
> > > was running pgbench and 10 logical clients while the Postgres box hosted
> > > the writer and 5 synchronous replicas.

> > > There's a hit to TPS

Out of curiosity, did you compare with standby_slot_names_from_syncrep set to 
off
and standby_slot_names not empty?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-17 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 04:52:09PM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Jun 13, 2024 at 10:49:34AM -0400, Robert Haas wrote:
> > On Fri, Jun 7, 2024 at 4:41 AM Bertrand Drouvot
> >  wrote:
> > > Do you still find the code hard to maintain with v9?
> > 
> > I don't think it substantially changes my concerns as compared with
> > the earlier version.
> 
> Thanks for the feedback, I'll give it more thoughts.

Please find attached v10 that puts the object locking outside of the dependency
code.

It's done that way except for:

recordDependencyOnExpr() 
recordDependencyOnSingleRelExpr()
makeConfigurationDependencies()

The reason is that I think that it would need part of the logic that is inside
the above functions to be duplicated and I'm not sure that's worth it.

For example, we would probably need to:

- make additional calls to find_expr_references_walker() 
- make additional scan on the config map

It's also not done outside of recordDependencyOnCurrentExtension() as:

1. I think it is clear enough that way (as it is clear that the lock is taken on
an ExtensionRelationId object class).
2. why include "commands/extension.h" in more places (locking would
depend on "creating_extension" and "CurrentExtensionObject"), given 1.?

Remarks:

1. depLockAndCheckObject() and friends in v9 have been renamed to
LockNotPinnedObject() and friends (as the vast majority of their calls are now
done outside of the dependency code).

2. regarding the concern around RelationRelationId (discussed in [1]), v10 adds
a comment "XXX Do we need a lock for RelationRelationId?" at the places we
may want to lock this object class. I have not thought about it yet (but will);
I only added this comment at multiple places.

I think that v10 is easier to follow (as compared to v9) as we can easily see
which object class we'll put a lock on.

Thoughts?

[1]: 
https://www.postgresql.org/message-id/Zmv3TPfJAyQXhIdu%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From dc75e3255803617d21c55d77dc218904bd729d81 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v10] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place before the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  | 109 ++-
 src/backend/catalog/heap.c|   8 ++
 src/backend/catalog/index.c   |  16 +++
 src/backend/catalog/objectaddress.c   |  57 
 src/backend/catalog/pg_aggregate.c|   9 ++
 src/backend/catalog/pg_attrdef.c  |   1 +
 src/backend/catalog/pg_cast.c |   5 +
 src/backend/catalog/pg_collation.c|   1 +
 src/backend/catalog/pg_constraint.c   |  11 ++
 src/backend/catalog/pg_conversion.c   |   2 +
 src/backend/catalog/pg_depend.c   |  27 +++-
 src/backend/catalog/pg_operator.c |  19 +++
 src/backend/catalog/pg_proc.c |  17 ++-
 src/backend/catalog/pg_publication.c  |   5 +
 src/backend/catalog/pg_range.c|   6 +
 src/backend/catalog/pg_type.c |  33 +
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/alter.c  |   4 +
 src/backend/commands/amcmds.c |  

Re: Avoid orphaned objects dependencies, take 3

2024-06-14 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 02:27:45PM -0400, Robert Haas wrote:
> On Thu, Jun 13, 2024 at 12:52 PM Bertrand Drouvot
>  wrote:
> > > > table_open(childRelId, ...) would lock any "ALTER TABLE  
> > > > DROP CONSTRAINT"
> > > > already. Not sure I understand your concern here.
> > >
> > > I believe this is not true. This would take a lock on the table, not
> > > the constraint itself.
> >
> > I agree that it would not lock the constraint itself. What I meant to say 
> > is that
> > , nevertheless, the constraint can not be dropped. Indeed, the "ALTER TABLE"
> > necessary to drop the constraint (ALTER TABLE  DROP CONSTRAINT) 
> > would
> > be locked by the table_open(childRelId, ...).
> 
> Ah, right. So, I was assuming that, with either this version of your
> patch or the earlier version, we'd end up locking the constraint
> itself. Was I wrong about that?

The child constraint itself is not locked when going through
ConstraintSetParentConstraint().

While at it, let's look at a full example and focus on your concern.

Let's do that with this gdb file:

"
$ cat gdb.txt
b dependency.c:1542
command 1
printf "Will return for: classId %d and objectId %d\n", 
object->classId, object->objectId
c
end
b dependency.c:1547 if object->classId == 2606
command 2
printf "Will lock constraint: classId %d and objectId %d\n", 
object->classId, object->objectId
c
end
"

knowing that:

"
Line 1542 is the return here in depLockAndCheckObject() (your concern):

if (object->classId == RelationRelationId || object->classId == 
AuthMemRelationId)
return;

Line 1547 is the lock here in depLockAndCheckObject():

/* assume we should lock the whole object not a sub-object */
LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
"

So, with gdb attached to a session let's:

1. Create the parent relation

CREATE TABLE upsert_test (
a   INT,
b   TEXT
) PARTITION BY LIST (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
Will return for: classId 1259 and objectId 16384
---

Oid 16384 is upsert_test, so I think the return (dependency.c:1542) is fine as
we are creating the object (it can't be dropped as not visible to anyone else).

2. Create another relation (will be the child)

CREATE TABLE upsert_test_2 (b TEXT, a int);

gdb produces:

---
Will return for: classId 1259 and objectId 16391
Will return for: classId 1259 and objectId 16394
Will return for: classId 1259 and objectId 16394
Will return for: classId 1259 and objectId 16391
---

Oid 16391 is upsert_test_2
Oid 16394 is pg_toast_16391

so I think the return (dependency.c:1542) is fine as we are creating those
objects (can't be dropped as not visible to anyone else).

3. Attach the partition

ALTER TABLE upsert_test ATTACH PARTITION upsert_test_2 FOR VALUES IN (2);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
---

That's fine because we had already locked the relation 16384 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().

4. Add a constraint on the child relation

ALTER TABLE upsert_test_2 add constraint bdtc2 UNIQUE (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16391
Will lock constraint: classId 2606 and objectId 16397
---

That's fine because we had already locked the relation 16391 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().

Oid 16397 is the constraint we're creating (bdtc2).

5. Add a constraint on the parent relation (this goes through
ConstraintSetParentConstraint())

ALTER TABLE upsert_test add constraint bdtc1 UNIQUE (a);

gdb produces:

---
Will return for: classId 1259 and objectId 16384
Will lock constraint: classId 2606 and objectId 16399
Will return for: classId 1259 and objectId 16398
Will return for: classId 1259 and objectId 16391
Will lock constraint: classId 2606 and objectId 16399
Will return for: classId 1259 and objectId 16391
---

Regarding "Will return for: classId 1259 and objectId 16384":
That's fine because we had already locked the relation 16384 through
AlterTableLookupRelation()->RangeVarGetRelidExtended()->LockRelationOid().

Regarding "Will lock constraint: classId 2606 and objectId 16399":
Oid 16399 is the constraint that we're creating.

Regarding "Will return for: classId 1259 and objectId 16398":
That's fine because Oid 16398 is an index that we're creating while creating
the constraint (so the index can't be dropped as not visible to anyone else).

Regarding "Will return for: classId 1259 and objectId 16391":
That's fine because 16384 would already be locked (as mentioned above). And
I think that's fine because trying to drop "upsert_test_2" (ak

Re: Avoid orphaned objects dependencies, take 3

2024-06-13 Thread Bertrand Drouvot
Hi,

On Thu, Jun 13, 2024 at 10:49:34AM -0400, Robert Haas wrote:
> On Fri, Jun 7, 2024 at 4:41 AM Bertrand Drouvot
>  wrote:
> > Do you still find the code hard to maintain with v9?
> 
> I don't think it substantially changes my concerns as compared with
> the earlier version.

Thanks for the feedback, I'll give it more thoughts.

> 
> > > but we're not similarly careful about other operations e.g.
> > > ConstraintSetParentConstraint is called by DefineIndex which calls
> > > table_open(childRelId, ...) first, but there's no logic in DefineIndex
> > > to lock the constraint.
> >
> > table_open(childRelId, ...) would lock any "ALTER TABLE  DROP 
> > CONSTRAINT"
> > already. Not sure I understand your concern here.
> 
> I believe this is not true. This would take a lock on the table, not
> the constraint itself.

I agree that it would not lock the constraint itself. What I meant to say is
that, nevertheless, the constraint can not be dropped. Indeed, the "ALTER TABLE"
necessary to drop the constraint (ALTER TABLE  DROP CONSTRAINT) would
be locked by the table_open(childRelId, ...).

That's why I don't understand your concern with this particular example. But
anyway, I'll double check your related concern:

+ if (object->classId == RelationRelationId || object->classId ==
AuthMemRelationId)
+ return;

in depLockAndCheckObject(). 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-13 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 08:26:23AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:
> 
> > Thank you for the proposal and the patch. I understand the motivation
> > of this patch.
> 
> Thanks for looking at it!
> 
> > Beside the point Nathan mentioned, I'm slightly worried
> > that massive parallel messages could be sent to the leader process
> > when the cost_limit value is low.
> 
> I see, I can/will do some testing in this area and share the numbers.

Here is the result of the test. It has been launched several times and it
produced the same (surprising) result each time.

== Context 

The testing has been done with this relation (large_tbl) and its indexes:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
 FROM pg_buffercache b JOIN pg_class c
 ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database
  WHERE datname = current_database()))
 JOIN pg_namespace n ON n.oid = c.relnamespace
 GROUP BY n.nspname, c.relname
 ORDER BY 3 DESC
 LIMIT 22;

 nspname |  relname   | buffers
-++-
 public  | large_tbl  |  80
 public  | large_tbl_filler13 |  125000
 public  | large_tbl_filler6  |  125000
 public  | large_tbl_filler5  |  125000
 public  | large_tbl_filler3  |  125000
 public  | large_tbl_filler15 |  125000
 public  | large_tbl_filler4  |  125000
 public  | large_tbl_filler20 |  125000
 public  | large_tbl_filler18 |  125000
 public  | large_tbl_filler14 |  125000
 public  | large_tbl_filler8  |  125000
 public  | large_tbl_filler11 |  125000
 public  | large_tbl_filler19 |  125000
 public  | large_tbl_filler7  |  125000
 public  | large_tbl_filler1  |  125000
 public  | large_tbl_filler12 |  125000
 public  | large_tbl_filler9  |  125000
 public  | large_tbl_filler17 |  125000
 public  | large_tbl_filler16 |  125000
 public  | large_tbl_filler10 |  125000
 public  | large_tbl_filler2  |  125000
 public  | large_tbl_pkey |5486
(22 rows)

All of them completely fit in memory (to avoid I/O read latency during the
vacuum).

The config, outside of default is:

max_wal_size = 4GB
shared_buffers = 30GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 8
max_parallel_workers = 10
vacuum_cost_limit = 10
autovacuum_vacuum_cost_limit = 10

My system is not overloaded, has enough resources to run this test and only this
test is running.

== Results 

== With v2 (attached) applied on master

postgres=# VACUUM (PARALLEL 8) large_tbl;
VACUUM
Time: 1146873.016 ms (19:06.873)

The duration is split this way:

Vacuum phase: cumulative time (cumulative time delayed)
===
scanning heap: 00:08:16.414628 (time_delayed: 444370)
vacuuming indexes: 00:14:55.314699 (time_delayed: 2545293)
vacuuming heap: 00:19:06.814617 (time_delayed: 2767540)

I sampled active sessions from pg_stat_activity (at a one second interval); here is
the summary during the vacuuming indexes phase (ordered by count):

 leader_pid |  pid   |   wait_event   | count
+++---
 452996 | 453225 | VacuumDelay|   366
 452996 | 453223 | VacuumDelay|   363
 452996 | 453226 | VacuumDelay|   362
 452996 | 453224 | VacuumDelay|   361
 452996 | 453222 | VacuumDelay|   359
 452996 | 453221 | VacuumDelay|   359
| 452996 | VacuumDelay|   331
| 452996 | CPU|30
 452996 | 453224 | WALWriteLock   |23
 452996 | 453222 | WALWriteLock   |20
 452996 | 453226 | WALWriteLock   |20
 452996 | 453221 | WALWriteLock   |19
| 452996 | WalSync|18
 452996 | 453225 | WALWriteLock   |18
 452996 | 453223 | WALWriteLock   |16
| 452996 | WALWriteLock   |15
 452996 | 453221 | CPU|14
 452996 | 453222 | CPU|14
 452996 | 453223 | CPU|12
 452996 | 453224 | CPU|10
 452996 | 453226 | CPU|10
 452996 | 453225 | CPU| 8
 452996 | 453223 | WalSync| 4
 452996 | 453221 | WalSync| 2
 452996 | 453226 | WalWrite   | 2
 452996 | 453221 | WalWrite   | 1
| 452996 | ParallelFinish | 1
 452996 | 453224 | WalSync| 1
 452996 | 453225 | WalSync| 1
 452996 | 453222 | WalWrite   | 1
 452996 | 453225 | WalWrite   | 1
 452996 | 453222 | WalSync| 1
 452996 | 4

Re: relfilenode statistics

2024-06-12 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 03:35:23PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 10 Jun 2024 08:09:56 +0000, Bertrand Drouvot 
>  wrote in 
> > 
> > My idea was to move all that is in pg_statio_all_tables to relfilenode stats
> > and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) 
> > ensure
> > the user can still retrieve the stats from pg_statio_all_tables in such a 
> > way
> > that it survives a rewrite, 3) provide dedicated APIs to retrieve
> > relfilenode stats but only for the current relfilenode, 4) document this
> > behavior. This is what the POC patch is doing for heap_blks_written (would
> > need to do the same for heap_blks_read and friends) except for the 
> > documentation
> > part. What do you think?
> 
> In my opinion,

Thanks for looking at it!

> it is certainly strange that bufmgr is aware of
> relation kinds, but introducing relfilenode stats to avoid this skew
> doesn't seem to be the best way, as it invites inconclusive arguments
> like the one raised above. The fact that we transfer counters from old
> relfilenodes to new ones indicates that we are not really interested
> in counts by relfilenode. If that's the case, wouldn't it be simpler
> to call pgstat_count_relation_buffer_read() from bufmgr.c and then
> branch according to relkind within that function? If you're concerned
> about the additional branch, some ingenuity may be needed.

That may be doable for "read" activities but what about write activities?
Do you mean not relying on relfilenode stats for reads but relying on 
relfilenode
stats for writes?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-12 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 01:13:48PM -0400, Robert Haas wrote:
> On Tue, Jun 11, 2024 at 5:49 AM Bertrand Drouvot
>  wrote:
> > As we can see the actual wait time is 30ms less than the intended wait time 
> > with
> > this simple test. So I still think we should go with 1) actual wait time 
> > and 2)
> > report the number of waits (as mentioned in [1]). Does that make sense to 
> > you?
> 
> I like the idea of reporting the actual wait time better,

+1

> provided
> that we verify that doing so isn't too expensive. I think it probably
> isn't, because in a long-running VACUUM there is likely to be disk
> I/O, so the CPU overhead of a few extra gettimeofday() calls should be
> fairly low by comparison.

Agree.

> I wonder if there's a noticeable hit when
> everything is in-memory. I guess probably not, because with any sort
> of normal configuration, we shouldn't be delaying after every block we
> process, so the cost of those gettimeofday() calls should still be
> getting spread across quite a bit of real work.

I did some testing, with:

shared_buffers = 12GB
vacuum_cost_delay = 1
autovacuum_vacuum_cost_delay = 1
max_parallel_maintenance_workers = 0
max_parallel_workers = 0

added to a default config file.

A table and all its indexes were fully in memory, the numbers are:

postgres=# SELECT n.nspname, c.relname, count(*) AS buffers
 FROM pg_buffercache b JOIN pg_class c
 ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database
  WHERE datname = current_database()))
 JOIN pg_namespace n ON n.oid = c.relnamespace
 GROUP BY n.nspname, c.relname
 ORDER BY 3 DESC
 LIMIT 11;

 nspname |  relname  | buffers
-+---+-
 public  | large_tbl |  80
 public  | large_tbl_pkey|5486
 public  | large_tbl_filler7 |1859
 public  | large_tbl_filler4 |1859
 public  | large_tbl_filler1 |1859
 public  | large_tbl_filler6 |1859
 public  | large_tbl_filler3 |1859
 public  | large_tbl_filler2 |1859
 public  | large_tbl_filler5 |1859
 public  | large_tbl_filler8 |1859
 public  | large_tbl_version |1576
(11 rows)


The observed timings when vacuuming this table are:

On master:

vacuum phase: cumulative duration
-

scanning heap: 00:00:37.808184
vacuuming indexes: 00:00:41.808176
vacuuming heap: 00:00:54.808156

On master patched with actual time delayed:

vacuum phase: cumulative duration
-

scanning heap: 00:00:36.502104 (time_delayed: 22202)
vacuuming indexes: 00:00:41.002103 (time_delayed: 23769)
vacuuming heap: 00:00:54.302096 (time_delayed: 34886)

As we can see there is no noticeable degradation, while the vacuum entered
this instrumentation code path about 34886 times (cost_delay was set to 1).

> That said, I'm not sure this experiment shows a real problem with the
> idea of showing intended wait time. It does establish the concept that
> repeated signals can throw our numbers off, but 30ms isn't much of a
> discrepancy.

Yeah, the idea was just to show how easy it is to create a 30ms discrepancy.

> I'm worried about being off by a factor of two, or an
> order of magnitude. I think we still don't know if that can happen,
> but if we're going to show actual wait time anyway, then we don't need
> to explore the problems with other hypothetical systems too much.

Agree.

> I'm not convinced that reporting the number of waits is useful. If we
> were going to report a possibly-inaccurate amount of actual waiting,
> then also reporting the number of waits might make it easier to figure
> out when the possibly-inaccurate number was in fact inaccurate. But I
> think it's way better to report an accurate amount of actual waiting,
> and then I'm not sure what we gain by also reporting the number of
> waits.

Sami shared his thoughts in [1] and [2] and so did I in [3]. If some of us still
don't think that reporting the number of waits is useful then we can probably
start without it.

[1]: 
https://www.postgresql.org/message-id/0EA474B6-BF88-49AE-82CA-C1A9A3C17727%40amazon.com
[2]: 
https://www.postgresql.org/message-id/E12435E2-5FCA-49B0-9ADB-0E7153F95E2D%40amazon.com
[3]: 
https://www.postgresql.org/message-id/ZmmGG4e%2BqTBD2kfn%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-12 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 02:48:30PM -0400, Robert Haas wrote:
> On Tue, Jun 11, 2024 at 2:47 PM Nathan Bossart  
> wrote:
> > I'm struggling to think of a scenario in which the number of waits would be
> > useful, assuming you already know the amount of time spent waiting.

If we provide the actual time spent waiting, providing the number of waits would
allow one to see if there is a difference between the actual time and the
intended time (i.e., number of waits * cost_delay, should the cost_delay be the
same during the vacuum duration). That should trigger some thoughts if the
difference is large enough.

I think that what we are doing here is to somehow add instrumentation around the
"WAIT_EVENT_VACUUM_DELAY" wait event. If we were to add instrumentation for wait
events (generally speaking) we'd probably also expose the number of waits per
wait event (in addition to the time waited).
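
Something like (purely hypothetical sketch, not a proposal yet):

typedef struct WaitEventStats
{
	uint64		count;			/* number of times the wait event was entered */
	uint64		total_wait_us;	/* cumulative time spent waiting, in microseconds */
} WaitEventStats;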

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-12 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 11:40:36AM -0500, Nathan Bossart wrote:
> On Tue, Jun 11, 2024 at 07:25:11AM +0000, Bertrand Drouvot wrote:
> > So I think that in v2 we could: 1) measure the actual wait time instead, 2)
> > count the number of times the vacuum slept. We could also 3) reports the
> > effective cost limit (as proposed by Nathan up-thread) (I think that 3) 
> > could
> > be misleading but I'll yield to majority opinion if people think it's not).
> 
> I still think the effective cost limit would be useful, if for no other
> reason than to help reinforce that it is distributed among the autovacuum
> workers.

I also think it can be useful; my concern is more about putting this information
in pg_stat_progress_vacuum. What about Sawada-san's proposal in [1]? (we could
create a new view that would contain those data: per-worker dobalance, cost_limit,
cost_delay, active, and failsafe).

[1]: 
https://www.postgresql.org/message-id/CAD21AoDOu%3DDZcC%2BPemYmCNGSwbgL1s-5OZkZ1Spd5pSxofWNCw%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Allow logical failover slots to wait on synchronous replication

2024-06-11 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 09:25:10PM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 03:51:05PM -0700, John H wrote:
> > The existing 'standby_slot_names' isn't great for users who are running
> > clusters with quorum-based synchronous replicas. For instance, if
> > the user has  synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a
> > bit tedious to have to reconfigure the standby_slot_names to set it to
> > the most updated 3 sync replicas whenever different sync replicas start
> > lagging. In the event that both GUCs are set, 'standby_slot_names' takes
> > precedence.
> 
> Hm.  IIUC you'd essentially need to set standby_slot_names to "A,B,C,D,E"
> to get the desired behavior today.  That might ordinarily be okay, but it
> could cause logical replication to be held back unnecessarily if one of the
> replicas falls behind for whatever reason.  A way to tie standby_slot_names
> to synchronous replication instead does seem like it would be useful in
> this case.

FWIW, I have the same understanding and also think your proposal would be
useful in this case.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
> I'm always suspicious of this sort of thing. I tend to find nothing
> gives me the right answer unless I assume that the actual sleep times
> are randomly and systematically different from the intended sleep
> times but arbitrarily large amounts. I think we should at least do
> some testing: if we measure both the intended sleep time and the
> actual sleep time, how close are they? Does it change if the system is
> under crushing load (which might elongate sleeps) or if we spam
> SIGUSR1 against the vacuum process (which might shorten them)?

Though I (now) think that it would make sense to record the actual delay time 
instead (see [1]), I think it's interesting to do some testing as you suggested.

With record_actual_time.txt (attached) applied on top of v1, we can see the
intended and actual wait time.

On my system, with "no load at all" except the vacuum running, I see no difference:

 Tue Jun 11 09:22:06 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |41107 |   41107 | 
00:00:42.301851
(1 row)

 Tue Jun 11 09:22:07 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |42076 |   42076 | 
00:00:43.301848
(1 row)

 Tue Jun 11 09:22:08 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |43045 |   43045 | 
00:00:44.301854
(1 row)

But if I launch pg_reload_conf() 10 times in a row, I can see:

 Tue Jun 11 09:22:09 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |44064 |   44034 | 
00:00:45.302965
(1 row)

 Tue Jun 11 09:22:10 2024 (every 1s)

  pid  | relid | phase | time_delayed | actual_time_delayed |
duration
---+---+---+--+-+-
 54754 | 16385 | scanning heap |45033 |   45003 | 
00:00:46.301858


As we can see the actual wait time is 30ms less than the intended wait time with
this simple test. So I still think we should go with 1) actual wait time and 2)
report the number of waits (as mentioned in [1]). Does that make sense to you?


[1]: 
https://www.postgresql.org/message-id/Zmf712A5xcOM9Hlg%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index 1345e99dcb..e4ba8de00a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
 S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
 S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
 S.param8 AS indexes_total, S.param9 AS indexes_processed,
-S.param10 AS time_delayed
+S.param10 AS time_delayed, S.param11 AS actual_time_delayed
 FROM pg_stat_get_progress_info('VACUUM') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2551408a86..bbb5002efe 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2381,18 +2381,29 @@ vacuum_delay_point(void)
/* Nap if appropriate */
if (msec > 0)
{
+   instr_time  delay_start;
+   instr_time  delay_time;
+
if (msec > vacuum_cost_delay * 4)
msec = vacuum_cost_delay * 4;
 
pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+   INSTR_TIME_SET_CURRENT(delay_start);
pg_usleep(msec * 1000);
+   INSTR_TIME_SET_CURRENT(delay_time);
pgstat_report_wait_end();
/* Report the amount of time we slept */
+   INSTR_TIME_SUBTRACT(delay_time, delay_start);
if (VacuumSharedCostBalance != NULL)
+   {

pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+   
pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_ACTUAL_TIME_DELA

Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
Hi,

On Tue, Jun 11, 2024 at 04:07:05PM +0900, Masahiko Sawada wrote:

> Thank you for the proposal and the patch. I understand the motivation
> of this patch.

Thanks for looking at it!

> Beside the point Nathan mentioned, I'm slightly worried
> that massive parallel messages could be sent to the leader process
> when the cost_limit value is low.

I see, I can/will do some testing in this area and share the numbers.

> 
> FWIW when I want to confirm the vacuum delay effect, I often use the
> information from the DEBUG2 log message in VacuumUpdateCosts()
> function. Exposing these data (per-worker dobalance, cost_lmit,
> cost_delay, active, and failsafe) somewhere in a view might also be
> helpful for users for checking vacuum delay effects.

Do you mean adding time_delayed to pg_stat_progress_vacuum, and cost_limit plus the
other data you mentioned above to another dedicated view?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 05:58:13PM -0400, Robert Haas wrote:
> On Mon, Jun 10, 2024 at 11:36 AM Nathan Bossart
>  wrote:
> > Hm.  Should we measure the actual time spent sleeping, or is a rough
> > estimate good enough?  I believe pg_usleep() might return early (e.g., if
> > the process is signaled) or late, so this field could end up being
> > inaccurate, although probably not by much.  If we're okay with millisecond
> > granularity, my first instinct is that what you've proposed is fine, but I
> > figured I'd bring it up anyway.
> 
> I bet you could also sleep for longer than planned, throwing the
> numbers off in the other direction.

Thanks for looking at it! Agree, that's how I read "or late" from Nathan's
comment above.

> I'm always suspicious of this sort of thing. I tend to find nothing
> gives me the right answer unless I assume that the actual sleep times
> are randomly and systematically different from the intended sleep
> times but arbitrarily large amounts. I think we should at least do
> some testing: if we measure both the intended sleep time and the
> actual sleep time, how close are they? Does it change if the system is
> under crushing load (which might elongate sleeps) or if we spam
> SIGUSR1 against the vacuum process (which might shorten them)?

OTOH Sami proposed in [1] to count the number of times the vacuum went into
delay. I think that's a good idea. His idea makes me think that (in addition to
the number of waits) it would then make sense to measure the "actual" sleep time
(and not the intended one), so that one could measure the difference between
the intended wait time (number of waits * cost delay, if it does not change
during the vacuum duration) and the actual measured wait time.

So I think that in v2 we could: 1) measure the actual wait time instead, 2)
count the number of times the vacuum slept. We could also 3) report the
effective cost limit (as proposed by Nathan up-thread) (I think that 3) could
be misleading but I'll yield to majority opinion if people think it's not).

Thoughts?


[1]: 
https://www.postgresql.org/message-id/A0935130-7C4B-4094-B6E4-C7D5086D50EF%40amazon.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 08:12:46PM +, Imseih (AWS), Sami wrote:
> >> This sounds like useful information to me.
> 
> > Thanks for looking at it!
> 
> The  VacuumDelay is the only visibility available to
> gauge the cost_delay. Having this information
> advertised by pg_stat_progress_vacuum as is being proposed
> is much better.

Thanks for looking at it!

> However, I also think that the
> "number of times"  the vacuum went into delay will be needed
> as well. Both values will be useful to tune cost_delay and cost_limit. 

Yeah, I think that's a good idea. With v1 one could figure out how many times
the delay has been triggered but that does not work anymore if: 1) cost_delay
changed during the vacuum duration or 2) the patch changes the way time_delayed
is measured/reported (meaning it gets the actual wait time and not the theoretical
time as v1 does).

> 
> It may also make sense to accumulate the total_time in delay
> and the number of times delayed in a cumulative statistics [0]
> view to allow a user to trend this information overtime.
> I don't think this info fits in any of the existing views, i.e.
> pg_stat_database, so maybe a new view for cumulative
> vacuum stats may be needed. This is likely a separate
> discussion, but calling it out here.

+1

> >> IIUC you'd need to get information from both pg_stat_progress_vacuum and
> >> pg_stat_activity in order to know what percentage of time was being spent
> >> in cost delay.  Is that how you'd expect for this to be used in practice?
> 
> > Yeah, one could use a query such as:
> 
> > select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p 
> > JOIN pg_stat_activity a using (pid)
> 
> Maybe all  progress views should just expose the 
> "beentry->st_activity_start_timestamp " 
> to let the user know when the current operation began.

Yeah maybe, I think this is likely a separate discussion too, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Bertrand Drouvot
On Mon, Jun 10, 2024 at 02:20:16PM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 05:48:22PM +0000, Bertrand Drouvot wrote:
> > On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
> >> I wonder if we should also
> >> surface the effective cost limit for each autovacuum worker.
> > 
> > I'm not sure about it as I think that it could be misleading: one could 
> > query
> > pg_stat_progress_vacuum and conclude that the time_delayed he is seeing is
> > due to _this_ cost_limit. But that's not necessary true as the cost_limit 
> > could
> > have changed multiple times since the vacuum started. So, unless there is
> > frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed 
> > and
> > the cost_limit could be misleadind IMHO.
> 
> Well, that's true for the delay, too, right (at least as of commit
> 7d71d3d)?

Yeah right, but the patch exposes the total amount of time the vacuum has
been delayed (not the cost_delay per se), which does not sound misleading to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-10 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 10:36:42AM -0500, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 06:05:13AM +0000, Bertrand Drouvot wrote:
> > During the last pgconf.dev I attended Robert´s presentation about 
> > autovacuum and
> > it made me remember of an idea I had some time ago: $SUBJECT
> 
> This sounds like useful information to me.

Thanks for looking at it!

> I wonder if we should also
> surface the effective cost limit for each autovacuum worker.

I'm not sure about it as I think that it could be misleading: one could query
pg_stat_progress_vacuum and conclude that the time_delayed they are seeing is
due to _this_ cost_limit. But that's not necessarily true as the cost_limit could
have changed multiple times since the vacuum started. So, unless there is
frequent sampling on pg_stat_progress_vacuum, displaying the time_delayed and
the cost_limit could be misleading IMHO.

> > Currently one can change [autovacuum_]vacuum_cost_delay and
> > [auto vacuum]vacuum_cost_limit but has no reliable way to measure the 
> > impact of
> > the changes on the vacuum duration: one could observe the vacuum duration
> > variation but the correlation to the changes is not accurate (as many others
> > factors could impact the vacuum duration (load on the system, i/o 
> > latency,...)).
> 
> IIUC you'd need to get information from both pg_stat_progress_vacuum and
> pg_stat_activity in order to know what percentage of time was being spent
> in cost delay.  Is that how you'd expect for this to be used in practice?

Yeah, one could use a query such as:

select p.*, now() - a.xact_start as duration from pg_stat_progress_vacuum p 
JOIN pg_stat_activity a using (pid)

for example. Would it be worth providing an example somewhere in the doc?

> > pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
> > pg_usleep(msec * 1000);
> > pgstat_report_wait_end();
> > +   /* Report the amount of time we slept */
> > +   if (VacuumSharedCostBalance != NULL)
> > +   
> > pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> > +   else
> > +   
> > pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
> 
> Hm.  Should we measure the actual time spent sleeping, or is a rough
> estimate good enough?  I believe pg_usleep() might return early (e.g., if
> the process is signaled) or late, so this field could end up being
> inaccurate, although probably not by much.  If we're okay with millisecond
> granularity, my first instinct is that what you've proposed is fine, but I
> figured I'd bring it up anyway.

Thanks for bringing that up! I had the same thought when writing the code and
came to the same conclusion. I think that's a good enough estimation, especially
during a long-running vacuum (which is probably the case where users care the
most).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: relfilenode statistics

2024-06-10 Thread Bertrand Drouvot
Hi,

On Fri, Jun 07, 2024 at 09:24:41AM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 11:17 PM Andres Freund  wrote:
> > If we just want to keep prior stats upon arelation rewrite, we can just copy
> > the stats from the old relfilenode.  Or we can decide that those stats don't
> > really make sense anymore, and start from scratch.
> 
> I think we need to think carefully about what we want the user
> experience to be here. "Per-relfilenode stats" could mean "sometimes I
> don't know the relation OID so I want to use the relfilenumber
> instead, without changing the user experience" or it could mean "some
> of these stats actually properly pertain to the relfilenode rather
> than the relation so I want to associate them with the right object
> and that will affect how the user sees things." We need to decide
> which it is. If it's the former, then we need to examine whether the
> goal of hiding the distinction between relfilenode stats and relation
> stats from the user is in fact feasible. If it's the latter, then we
> need to make sure the whole patch reflects that design, which would
> include e.g. NOT copying stats from the old to the new relfilenode,
> and which would also include documenting the behavior in a way that
> will be understandable to users.

Thanks for sharing your thoughts!

Let's take the current heap_blks_read as an example: it currently survives
a relation rewrite and I guess we don't want to change the existing user
experience for it.

Now say we want to add "heap_blks_written" (like in this POC patch); then I think
it makes sense for the user to 1) query this new stat from the same place as the
existing heap_blks_read (pg_statio_all_tables) and 2) have the same experience
as far as the relation rewrite is concerned (the previous stats are kept).

To achieve the rewrite behavior we could:

1) copy the stats from the OLD relfilenode to the relation (like in the POC patch)
2) copy the stats from the OLD relfilenode to the NEW one (could be in a dedicated
field)

The pros of 1) are that the behavior is consistent with the current heap_blks_read
and that the user could still see the current relfilenode stats (through a new API)
if they want to.

> In my experience, the worst thing you can do in cases like this is be
> somewhere in the middle. Then you tend to end up with stuff like: the
> difference isn't supposed to be something that the user knows or cares
> about, except that they do have to know and care because you haven't
> thoroughly covered up the deception, and often they have to reverse
> engineer the behavior because you didn't document what was really
> happening because you imagined that they wouldn't notice.

My idea was to move all that is in pg_statio_all_tables to relfilenode stats
and 1) add new stats to pg_statio_all_tables (like heap_blks_written), 2) ensure
the user can still retrieve the stats from pg_statio_all_tables in such a way
that it survives a rewrite, 3) provide dedicated APIs to retrieve
relfilenode stats but only for the current relfilenode, 4) document this
behavior. This is what the POC patch is doing for heap_blks_written (would
need to do the same for heap_blks_read and friends) except for the documentation
part. What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Proposal to add page headers to SLRU pages

2024-06-10 Thread Bertrand Drouvot
Hi,

On Tue, Mar 19, 2024 at 06:48:33AM +, Li, Yong wrote:
> 
> Unfortunately, the test requires a setup of two different versions of PG. I 
> am not
> aware of an existing test infrastructure which can run automated tests using 
> two
> PGs. I did manually verify the output of pg_upgrade. 

I think there is something in t/002_pg_upgrade.pl (see src/bin/pg_upgrade/TESTING)
that could be used to run automated tests using an old and a current version.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-10 Thread Bertrand Drouvot
Hi,

On Mon, Jun 10, 2024 at 03:39:34PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 06:29:17AM +0000, Bertrand Drouvot wrote:
> > Thanks for the report! I think it makes sense to add it to the list of known
> > failures.
> > 
> > One way to deal with those corner cases could be to make use of injection 
> > points
> > around places where RUNNING_XACTS is emitted, thoughts?
> 
> Ah.  You mean to force a wait in the code path generating the standby
> snapshots for the sake of this test?

Yeah.

>  That's interesting, I like it.

Great, will look at it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-10 Thread Bertrand Drouvot
Hi Alexander,

On Sat, Jun 08, 2024 at 07:00:00AM +0300, Alexander Lakhin wrote:
> Hello Bertrand and Michael,
> 
> 23.01.2024 11:07, Bertrand Drouvot wrote:
> > On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> > 
> > > Anyway, that's not the end of it.  What should we do for snapshot
> > > snapshot records coming from the bgwriter?
> > What about?
> > 
> > 3) depending on how stabilized this test (and others that suffer from 
> > "random"
> > xl_running_xacts) is, then think about the bgwriter.
> 
> A recent buildfarm failure [1] reminds me of that remaining question.
> 
> It's hard to determine from this info, why row_removal_activeslot was not
> invalidated, but running this test on a slowed down Windows VM, I (still)
> get the same looking failures caused by RUNNING_XACTS appeared just before
> `invalidating obsolete replication slot "row_removal_inactiveslot"`.
> So I would consider this failure as yet another result of bgwriter activity
> and add it to the list of known failures as such...

Thanks for the report! I think it makes sense to add it to the list of known
failures.

One way to deal with those corner cases could be to make use of injection points
around the places where RUNNING_XACTS is emitted. Thoughts?
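
For instance, something along these lines (purely hypothetical sketch, the
injection point name is made up): a point placed in LogStandbySnapshot(), just
before the xl_running_xacts record is written, to which the test could attach a
"wait" via the injection_points module while the invalidation scenario runs:

/* hypothetical injection point, right before emitting RUNNING_XACTS */
INJECTION_POINT("standby-log-running-xacts");
recptr = LogCurrentRunningXacts(running);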

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Track the amount of time waiting due to cost_delay

2024-06-10 Thread Bertrand Drouvot
Hi hackers,

During the last pgconf.dev I attended Robert’s presentation about autovacuum and
it reminded me of an idea I had some time ago: $SUBJECT

Please find attached a patch doing so by adding a new field (named "time_delayed")
to the pg_stat_progress_vacuum view.

Currently one can change [autovacuum_]vacuum_cost_delay and
[autovacuum_]vacuum_cost_limit but has no reliable way to measure the impact of
the changes on the vacuum duration: one could observe the vacuum duration
variation, but the correlation to the changes is not accurate (as many other
factors could impact the vacuum duration (load on the system, I/O latency, ...)).

This new field reports the time that the vacuum has to sleep due to cost delay:
it could be useful to 1) measure the impact of the current cost_delay and
cost_limit settings and 2) experiment with new values (and then help with the
decision making for those parameters).

The patch is relatively small thanks to the work that has been done in
f1889729dd (to allow parallel workers to report to the leader).

[1]: 
https://www.pgevents.ca/events/pgconfdev2024/schedule/session/29-how-autovacuum-goes-wrong-and-can-we-please-make-it-stop-doing-that/

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 750dfc26cd6fcf5a5618c3fe35fc42d5b5c66f00 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 6 Jun 2024 12:35:57 +
Subject: [PATCH v1] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

Bump catversion because this changes the definition of pg_stat_progress_vacuum.
---
 doc/src/sgml/monitoring.sgml | 11 +++
 src/backend/catalog/system_views.sql |  3 ++-
 src/backend/commands/vacuum.c|  6 ++
 src/include/catalog/catversion.h |  2 +-
 src/include/commands/progress.h  |  1 +
 src/test/regress/expected/rules.out  |  3 ++-
 6 files changed, 23 insertions(+), 3 deletions(-)
  47.6% doc/src/sgml/
   3.7% src/backend/catalog/
  26.8% src/backend/commands/
   7.5% src/include/catalog/
   4.1% src/include/commands/
  10.0% src/test/regress/expected/

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 053da8d6e4..cdd0f0e533 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6290,6 +6290,17 @@ FROM pg_stat_get_backend_idset() AS backendid;
cleaning up indexes.
   
  
+
+ 
+  
+   time_delayed bigint
+  
+  
+   Total amount of time spent in milliseconds waiting due to vacuum_cost_delay
+   or autovacuum_vacuum_cost_delay. In case of parallel
+   vacuum the reported time is across all the workers and the leader.
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53047cab5f..1345e99dcb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
 S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
 S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
 S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
-S.param8 AS indexes_total, S.param9 AS indexes_processed
+S.param8 AS indexes_total, S.param9 AS indexes_processed,
+S.param10 AS time_delayed
 FROM pg_stat_get_progress_info('VACUUM') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 48f8eab202..2551408a86 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_inherits.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -2386,6 +2387,11 @@ vacuum_delay_point(void)
 		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
 		pg_usleep(msec * 1000);
 		pgstat_report_wait_end();
+		/* Report the amount of time we slept */
+		if (VacuumSharedCostBalance != NULL)
+			pgstat_progress_parallel_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
+		else
+			pgstat_progress_incr_param(PROGRESS_VACUUM_TIME_DELAYED, msec);
 
 		/*
 		 * We don't want to ignore postmaster death during very long vacuums
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f0809c0e58..40b4f1d1e4 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +5

Re: relfilenode statistics

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 08:17:36PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-06-06 12:27:49 -0400, Robert Haas wrote:
> > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot
> >  wrote:
> > > I think we should keep the stats in the relation during relfilenode 
> > > changes.
> > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes 
> > > in
> > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you 
> > > can
> > > see in the example provided up-thread the new heap_blks_written statistic 
> > > has
> > > been preserved during the TRUNCATE.
> >
> > Yeah, I think there's something weird about this design. Somehow we're
> > ending up with both per-relation and per-relfilenode counters:
> >
> > +   pg_stat_get_blocks_written(C.oid) +
> > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN
> > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END,
> > C.relfilenode) AS heap_blks_written,
> >
> > I'll defer to Andres if he thinks that's awesome, but to me it does
> > not seem right to track some blocks written in a per-relation counter
> > and others in a per-relfilenode counter.
> 
> It doesn't immediately sound awesome. Nor really necessary?
> 
> If we just want to keep prior stats upon arelation rewrite, we can just copy
> the stats from the old relfilenode.

Agreed, that's another option. But I think that would be in another field like
"cumulative_XXX", to ensure one could still retrieve stats that are "dedicated"
to this particular "new" relfilenode. Thoughts?

> Or we can decide that those stats don't
> really make sense anymore, and start from scratch.
> 
> 
> I *guess* I could see an occasional benefit in having both counter for "prior
> relfilenodes" and "current relfilenode" - except that stats get reset manually
> and upon crash anyway, making this less useful than if it were really
> "lifetime" stats.

Right, but currently they are not lost during a relation rewrite. If we decide
not to keep the relfilenode stats during a rewrite, then things like heap_blks_read
would stop surviving a rewrite (if we move them to relfilenode stats), while they
currently do.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: relfilenode statistics

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 08:38:06PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-06-03 11:11:46 +, Bertrand Drouvot wrote:
> > The main argument is that we currently don’t have writes counters for 
> > relations.
> > The reason is that we don’t have the relation OID when writing buffers out.
> > Tracking writes per relfilenode would allow us to track/consolidate writes 
> > per
> > relation (example in the v1 patch and in the message up-thread).
> > 
> > I think that adding instrumentation in this area (writes counters) could be
> > beneficial (like it is for the ones we currently have for reads).
> > 
> > Second argument is that this is also beneficial for the "Split index and
> > table statistics into different types of stats" thread (mentioned in the 
> > previous
> > message). It would allow us to avoid additional branches in some situations 
> > (like
> > the one mentioned by Andres in the link I provided up-thread).
> 
> I think there's another *very* significant benefit:
> 
> Right now physical replication doesn't populate statistics fields like
> n_dead_tup, which can be a huge issue after failovers, because there's little
> information about what autovacuum needs to do.
> 
> Auto-analyze *partially* can fix it at times, if it's lucky enough to see
> enough dead tuples - but that's not a given and even if it works, is often
> wildly inaccurate.
> 
> 
> Once we put things like n_dead_tup into per-relfilenode stats,

Hm - my idea was to populate relfilenode stats only with stats that are
somehow related to I/O activity. Which ones do you have in mind to put in
relfilenode stats?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 04:00:23PM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 1:56 AM Bertrand Drouvot
>  wrote:
> > v9 is more invasive (as it changes code in much more places) than v8 but it 
> > is
> > easier to follow (as it is now clear where the new lock is acquired).
> 
> Hmm, this definitely isn't what I had in mind. Possibly that's a sign
> that what I had in mind was dumb, but for sure it's not what I
> imagined. What I thought you were going to do was add calls like
> LockDatabaseObject(NamespaceRelationId, schemaid, 0, AccessShareLock)
> in various places, or perhaps LockRelationOid(reloid,
> AccessShareLock), or whatever the case may be.

I see what you’re saying: doing things like

LockDatabaseObject(TypeRelationId, returnType, 0, AccessShareLock);

in ProcedureCreate(), for example.

> Here you've got stuff
> like this:
> 
> - record_object_address_dependencies(, addrs_auto,
> -DEPENDENCY_AUTO);
> + lock_record_object_address_dependencies(, addrs_auto,
> + DEPENDENCY_AUTO);
> 
> ...which to me looks like the locking is still pushed down inside the
> dependency code.

Yes, but it’s now located in places where, I think, it’s easier to understand
what’s going on (as compared to v8), except maybe for:

recordDependencyOnExpr()
makeOperatorDependencies()
GenerateTypeDependencies()
makeParserDependencies()
makeDictionaryDependencies()
makeTSTemplateDependencies()
makeConfigurationDependencies()

but probably for:

heap_create_with_catalog()
StorePartitionKey()
index_create()
AggregateCreate()
CastCreate()
CreateConstraintEntry()
ProcedureCreate()
RangeCreate()
InsertExtensionTuple()
CreateTransform()
CreateProceduralLanguage()

The reasons I keep it linked to the dependency code are:

- To ensure we don’t miss anything (well, with the new Assert in place that’s
probably a tangential argument)

- It’s not only about locking the object: it’s also about 1) verifying the object
is pinned, 2) checking it still exists and 3) providing a description in the error
message if we can (in case the object does not exist anymore). Relying on an
already-built object (in the dependency code) avoids having to 1) define the
object(s) one more time or 2) create new functions that would do the same as
isObjectPinned() and getObjectDescription() with a different set of arguments.

Those may sound like weak arguments, but that has been my reasoning.

Do you still find the code hard to maintain with v9?

> 
> And you also have stuff like this:
> 
>   ObjectAddressSet(referenced, RelationRelationId, childTableId);
> + depLockAndCheckObject();
>   recordDependencyOn(, , DEPENDENCY_PARTITION_SEC);
> 
> But in depLockAndCheckObject you have:
> 
> + if (object->classId == RelationRelationId || object->classId ==
> AuthMemRelationId)
> + return;
> 
> That doesn't seem right, because then it seems like the call isn't
> doing anything, but there isn't really any reason for it to not be
> doing anything. If we're dropping a dependency on a table, then it
> seems like we need to have a lock on that table. Presumably the reason
> why we don't end up with dangling dependencies in such cases now is
> because we're careful about doing LockRelation() in the right places,

Yeah, that's what I think: we're already careful when we deal with relations.

> but we're not similarly careful about other operations e.g.
> ConstraintSetParentConstraint is called by DefineIndex which calls
> table_open(childRelId, ...) first, but there's no logic in DefineIndex
> to lock the constraint.

table_open(childRelId, ...) would lock any "ALTER TABLE  DROP 
CONSTRAINT"
already. Not sure I understand your concern here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: How about using dirty snapshots to locate dependent objects?

2024-06-06 Thread Bertrand Drouvot
On Thu, Jun 06, 2024 at 05:59:00PM +0530, Ashutosh Sharma wrote:
> Hello everyone,
> 
> At present, we use MVCC snapshots to identify dependent objects. This
> implies that if a new dependent object is inserted within a transaction
> that is still ongoing, our search for dependent objects won't include this
> recently added one. Consequently, if someone attempts to drop the
> referenced object, it will be dropped, and when the ongoing transaction
> completes, we will end up having an entry for a referenced object that has
> already been dropped. This situation can lead to an inconsistent state.
> Below is an example illustrating this scenario:
> 
> Session 1:
> - create table t1(a int);
> - insert into t1 select i from generate_series(1, 1000) i;
> - create extension btree_gist;
> - create index i1 on t1 using gist( a );
> 
> Session 2: (While the index creation in session 1 is in progress, drop the
> btree_gist extension)
> - drop extension btree_gist;
> 
> Above command succeeds and so does the create index command running in
> session 1, post this, if we try running anything on table t1, i1, it fails
> with an error: "cache lookup failed for opclass ..."
> 
> Attached is the patch that I have tried, which seems to be working for me.
> It's not polished and thoroughly tested, but just sharing here to clarify
> what I am trying to suggest. Please have a look and let me know your
> thoughts.

Thanks for the patch proposal!

The patch does not fix the reverse scenario:

- session 1: BEGIN; DROP extension btree_gist;
- session 2: create index i1 on t1 using gist( a );
- session 1: commits while session 2 is creating the index

and it does not address all the possible orphaned-dependency cases.

There is an ongoing thread (see [1]) to fix the orphaned dependencies issue.

v9 attached in [1] fixes the case you describe here.

[1]: 
https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: relfilenode statistics

2024-06-06 Thread Bertrand Drouvot
Hi,

On Mon, Jun 03, 2024 at 11:11:46AM +, Bertrand Drouvot wrote:
> Yeah, I’ll update the commit message in V2 with better explanations once I get
> feedback on V1 (should we decide to move on with the relfilenode stats idea).
> 

Please find attached v2, a mandatory rebase due to cd312adc56. In passing, it
provides a more detailed commit message (also making clear that the goal of this
patch is to start the discussion and agree on the design before moving forward).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 81d25e077c9f4eafa5304c257d1b39ee8a811628 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 16 Nov 2023 02:30:01 +
Subject: [PATCH v2] Provide relfilenode statistics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently don’t have writes counters for relations.
The reason is that we don’t have the relation OID when writing buffers out.
Tracking writes per relfilenode would allow us to track/consolidate writes per
relation.

relfilenode stats is also beneficial for the "Split index and table statistics
into different types of stats" work in progress: it would allow us to avoid
additional branches in some situations.

=== Remarks ===

This is a POC patch. There is still work to do: there is more places we should
add relfilenode counters, create more APIS to retrieve the relfilenode stats,
the patch takes care of rewrite generated by TRUNCATE but there is more to
care about like CLUSTER,VACUUM FULL.

The new logic to retrieve stats in pg_statio_all_tables has been implemented
only for the new blocks_written stat (we'd need to do the same for the existing
buffer read / buffer hit if we agree on the approach implemented here).

The goal of this patch is to start the discussion and agree on the design before
moving forward.
---
 src/backend/access/rmgrdesc/xactdesc.c|   5 +-
 src/backend/catalog/storage.c |   8 ++
 src/backend/catalog/system_functions.sql  |   2 +-
 src/backend/catalog/system_views.sql  |   5 +-
 src/backend/postmaster/checkpointer.c |   5 +
 src/backend/storage/buffer/bufmgr.c   |   6 +-
 src/backend/storage/smgr/md.c |   7 ++
 src/backend/utils/activity/pgstat.c   |  39 --
 src/backend/utils/activity/pgstat_database.c  |  12 +-
 src/backend/utils/activity/pgstat_function.c  |  13 +-
 src/backend/utils/activity/pgstat_relation.c  | 112 --
 src/backend/utils/activity/pgstat_replslot.c  |  13 +-
 src/backend/utils/activity/pgstat_shmem.c |  19 ++-
 .../utils/activity/pgstat_subscription.c  |  12 +-
 src/backend/utils/activity/pgstat_xact.c  |  60 +++---
 src/backend/utils/adt/pgstatfuncs.c   |  34 +-
 src/include/access/tableam.h  |  19 +++
 src/include/access/xact.h |   1 +
 src/include/catalog/pg_proc.dat   |  14 ++-
 src/include/pgstat.h  |  19 ++-
 src/include/utils/pgstat_internal.h   |  34 --
 src/test/recovery/t/029_stats_restart.pl  |  40 +++
 .../recovery/t/030_stats_cleanup_replica.pl   |   6 +-
 src/test/regress/expected/rules.out   |  12 +-
 src/test/regress/expected/stats.out   |  30 ++---
 src/test/regress/sql/stats.sql|  30 ++---
 src/test/subscription/t/026_stats.pl  |   4 +-
 src/tools/pgindent/typedefs.list  |   1 +
 28 files changed, 415 insertions(+), 147 deletions(-)
   4.6% src/backend/catalog/
  47.8% src/backend/utils/activity/
   6.5% src/backend/utils/adt/
   3.7% src/backend/
   3.3% src/include/access/
   3.3% src/include/catalog/
   6.2% src/include/utils/
   3.3% src/include/
  12.1% src/test/recovery/t/
   5.5% src/test/regress/expected/
   3.0% src/test/

diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index dccca201e0..c02b079645 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -319,10 +319,11 @@ xact_desc_stats(StringInfo buf, const char *label,
 		appendStringInfo(buf, "; %sdropped stats:", label);
 		for (i = 0; i < ndropped; i++)
 		{
-			appendStringInfo(buf, " %d/%u/%u",
+			appendStringInfo(buf, " %d/%u/%u/%u",
 			 dropped_stats[i].kind,
 			 dropped_stats[i].dboid,
-			 dropped_stats[i].objoid);
+			 dropped_stats[i].objoid,
+			 dropped_stats[i].relfile);
 		}
 	}
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f56b3cc0f2..db6107cd90 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -33,6 +33,7 @@
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
+#include "utils/pgstat_internal.h"
 #include "utils/rel.h"
 
 /

Re: Avoid orphaned objects dependencies, take 3

2024-06-05 Thread Bertrand Drouvot
Hi,

On Thu, May 23, 2024 at 02:10:54PM -0400, Robert Haas wrote:
> On Thu, May 23, 2024 at 12:19 AM Bertrand Drouvot
>  wrote:
> > The reason why we are using a dirty snapshot here is for the cases where we 
> > are
> > recording a dependency on a referenced object that we are creating at the 
> > same
> > time behind the scene (for example, creating a composite type while creating
> > a relation). Without the dirty snapshot, then the object we are creating 
> > behind
> > the scene (the composite type) would not be visible and we would wrongly 
> > assume
> > that it has been dropped.
> 
> The usual reason for using a dirty snapshot is that you want to see
> uncommitted work by other transactions. It sounds like you're saying
> you just need to see uncommitted work by the same transaction. If
> that's true, I think using HeapTupleSatisfiesSelf would be clearer. Or
> maybe we just need to put CommandCounterIncrement() calls in the right
> places to avoid having the problem in the first place. Or maybe this
> is another sign that we're doing the work at the wrong level.

Thanks for having discussed your concern with Tom last week during pgconf.dev
and for sharing the outcome with me. I understand your concern regarding code
maintainability with the current approach.

Please find attached v9, which:

- Acquires the lock and checks for object existence at an upper level, i.e. before
calling recordDependencyOn() and recordMultipleDependencies().

- Gets rid of the SNAPSHOT_SELF snapshot usage and relies on
CommandCounterIncrement() instead to ensure new entries are visible when
we check for object existence (for the cases where we create additional objects
behind the scenes, like a composite type while creating a relation).

- Adds an assertion in recordMultipleDependencies() to ensure that we locked the
object before recording the dependency (to ensure we don't miss any cases now that
the lock is acquired at an upper level).

A few remarks:

My first attempt was to move eliminate_duplicate_dependencies() out of
record_object_address_dependencies() so that we get the calls in this order:

eliminate_duplicate_dependencies()
depLockAndCheckObjects()
record_object_address_dependencies()

What I'm doing instead in v9 is to rename record_object_address_dependencies()
to lock_record_object_address_dependencies() and add depLockAndCheckObjects()
in it at the right place. That way the caller of
[lock_]record_object_address_dependencies() is not responsible for calling
eliminate_duplicate_dependencies() (which would have been the case with my first
attempt); see the sketch below.
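
To illustrate, the wrapper roughly looks like the following (simplified,
untested sketch; the names follow the description above and the exact v9 code
may differ, e.g. in how each referenced object is locked and checked):

/*
 * Sketch only: eliminate duplicates, lock and check each referenced object,
 * then record the dependencies as before.
 */
void
lock_record_object_address_dependencies(const ObjectAddress *depender,
                                        ObjectAddresses *referenced,
                                        DependencyType behavior)
{
    eliminate_duplicate_dependencies(referenced);

    for (int i = 0; i < referenced->numrefs; i++)
        depLockAndCheckObject(&referenced->refs[i]);

    recordMultipleDependencies(depender, referenced->refs,
                               referenced->numrefs, behavior);
}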

We need to set up the LOCKTAG before calling the new Assert in
recordMultipleDependencies(), so "#ifdef USE_ASSERT_CHECKING" is used there to
avoid setting up the LOCKTAG on a non-assert-enabled build. A sketch of what
this looks like is below.
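
Roughly along these lines (illustration only; I'm assuming a LockHeldByMe()-style
check here, the exact helper and signature used in v9 may differ):

#ifdef USE_ASSERT_CHECKING
    {
        LOCKTAG     tag;

        SET_LOCKTAG_OBJECT(tag,
                           MyDatabaseId,
                           referenced->classId,
                           referenced->objectId,
                           0);
        /* the referenced object must have been locked beforehand */
        Assert(LockHeldByMe(&tag, AccessShareLock));
    }
#endif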

v9 is more invasive (as it changes code in many more places) than v8, but it is
easier to follow (as it is now clear where the new lock is acquired).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 6b6ee40fab0b011e9858a0a25624935c59732bfd Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v9] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/aclchk.c  |   1 +
 src/backend/catalog/dependency.c  |  83 +--
 src/backend/catalog/heap.c|   7 +-
 src/b

Re: relfilenode statistics

2024-06-04 Thread Bertrand Drouvot
On Tue, Jun 04, 2024 at 09:26:27AM -0400, Robert Haas wrote:
> On Mon, Jun 3, 2024 at 7:11 AM Bertrand Drouvot
>  wrote:
> > We’d move the current buffer read and buffer hit counters from the relation 
> > stats
> > to the relfilenode stats (while still being able to retrieve them from the
> > pg_statio_all_tables/indexes views: see the example for the new 
> > heap_blks_written
> > stat added in the patch). Generally speaking, I think that tracking 
> > counters at
> > a common level (i.e relfilenode level instead of table or index level) is
> > beneficial (avoid storing/allocating space for the same counters in multiple
> > structs) and sounds more intuitive to me.
> 
> Hmm. So if I CLUSTER or VACUUM FULL the relation, the relfilenode
> changes. Does that mean I lose all of those stats? Is that a problem?
> Or is it good? Or what?

I think we should keep the stats in the relation during relfilenode changes.
As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in
table_relation_set_new_filelocator() and in pg_statio_all_tables): as you can
see in the example provided up-thread the new heap_blks_written statistic has
been preserved during the TRUNCATE. 

Please note that the v1 POC only takes care of the new heap_blks_written stat 
and
that the logic used in table_relation_set_new_filelocator() would probably need
to be applied in rebuild_relation() or such to deal with CLUSTER or VACUUM FULL.

For the relation, the new counter "blocks_written" has been added to the
PgStat_StatTabEntry struct (it's not needed in the PgStat_TableCounts one as the
relfilenode stat takes care of it). It's added in PgStat_StatTabEntry only
to copy/preserve the relfilenode stats during rewrite operations and to retrieve
the stats in pg_statio_all_tables.

Then, if later we split the relation stats into index/table stats, we'd have
blocks_written defined in fewer structs (as compared to doing the split without
relfilenode stats in place).

As mentioned up-thread, the new logic has been implemented in v1 only for the
new blocks_written stat (we'd need to do the same for the existing buffer read /
buffer hit if we agree on the approach implemented in v1).

> I also thought about the other direction. Suppose I drop the a
> relation and create a new one that gets a different relation OID but
> the same relfilenode. But I don't think that's a problem: dropping the
> relation should forcibly remove the old stats, so there won't be any
> conflict in this case.

Yeah.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: relfilenode statistics

2024-06-03 Thread Bertrand Drouvot
Hi Robert,

On Mon, May 27, 2024 at 09:10:13AM -0400, Robert Haas wrote:
> Hi Bertrand,
> 
> It would be helpful to me if the reasons why we're splitting out
> relfilenodestats could be more clearly spelled out. I see Andres's
> comment in the thread to which you linked, but it's pretty vague about
> why we should do this ("it's not nice") and whether we should do this
> ("I wonder if this is an argument for") and maybe that's all fine if
> Andres is going to be the one to review and commit this, but even if
> then it would be nice if the rest of us could follow along from home,
> and right now I can't.

Thanks for the feedback! 

You’re completely right, my previous message was missing a clear explanation as
to why I think that relfilenode stats could be useful. Let me try to fix this.

The main argument is that we currently don’t have write counters for relations.
The reason is that we don’t have the relation OID when writing buffers out.
Tracking writes per relfilenode would allow us to track/consolidate writes per
relation (example in the v1 patch and in the message up-thread).

I think that adding instrumentation in this area (write counters) could be
beneficial (like it is for the ones we currently have for reads).

The second argument is that this is also beneficial for the "Split index and
table statistics into different types of stats" thread (mentioned in the previous
message). It would allow us to avoid additional branches in some situations (like
the one mentioned by Andres in the link I provided up-thread).

If we agree that the main argument makes it worth thinking about having relfilenode
stats, then I think using them as proposed in the second argument makes sense too:

We’d move the current buffer read and buffer hit counters from the relation stats
to the relfilenode stats (while still being able to retrieve them from the
pg_statio_all_tables/indexes views: see the example for the new heap_blks_written
stat added in the patch). Generally speaking, I think that tracking counters at
a common level (i.e. relfilenode level instead of table or index level) is
beneficial (it avoids storing/allocating space for the same counters in multiple
structs) and sounds more intuitive to me.

Also, I think this opens the door to new ideas: for example, with relfilenode
statistics in place, we could probably also start thinking about tracking
checksum errors per relfilenode.

> The commit message is often a good place to spell this kind of thing
> out, because then it's included with every version of the patch you
> post, and may be of some use to the eventual committer in writing
> their commit message. The body of the email where you post the patch
> set can be fine, too.
> 

Yeah, I’ll update the commit message in V2 with better explanations once I get
feedback on V1 (should we decide to move on with the relfilenode stats idea).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




relfilenode statistics

2024-05-25 Thread Bertrand Drouvot
Hi hackers,

Please find attached a POC patch to implement $SUBJECT.

Adding relfilenode statistics has been proposed in [1]. The idea is to allow
tracking dirtied blocks, written blocks,... on a per relation basis.

The attached patch is not in a fully "polished" state yet: there are more places
where relfilenode counters should be added, more APIs to create to retrieve the
relfilenode stats, etc.

But I think that it is in a state that can be used to discuss the approach it
is implementing (so that we can agree on it or not) before moving forward.

The approach that is implemented in this patch is the following:

- A new PGSTAT_KIND_RELFILENODE is added
- A new attribute (aka relfile) has been added to PgStat_HashKey so that we
can record (dboid, spcOid and relfile) to identify a relfilenode entry
- pgstat_create_transactional() is used in RelationCreateStorage()
- pgstat_drop_transactional() is used in RelationDropStorage()
- RelationPreserveStorage() will remove the entry from the list of dropped stats

The current approach to deal with table rewrites is to:

- copy the relfilenode stats, in table_relation_set_new_filelocator(), from
the relfilenode stats entry to the shared table stats entry
- in the pg_statio_all_tables view: add the table stats entry (which contains
the "previous" relfilenode stats, due to the above, that were linked to this
relation) to the current relfilenode stats linked to the relation

An example is provided in the attached patch for the new heap_blks_written field
in pg_statio_all_tables. The outcome is:

"
postgres=# create table bdt (a int);
CREATE TABLE
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 
'bdt';
 heap_blks_written
---
 0
(1 row)

postgres=# insert into bdt select generate_series(1,1);
INSERT 0 1
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 
'bdt';
 heap_blks_written
---
 0
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 
'bdt';
 heap_blks_written
---
45
(1 row)

postgres=# truncate table bdt;
TRUNCATE TABLE
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 
'bdt';
 heap_blks_written
---
45
(1 row)

postgres=# insert into bdt select generate_series(1,1);
INSERT 0 1
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 
'bdt';
 heap_blks_written
---
45
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=# select heap_blks_written from pg_statio_all_tables where relname = 
'bdt';
 heap_blks_written
---
90
(1 row)
"

Some remarks:

- My first attempt was to call pgstat_create_transactional() and
pgstat_drop_transactional() at the same places as is done for relations, but
that did not work well (mainly due to corner cases in case of rewrite).

- Please don't pay attention to the pgstat_count_buffer_read() and
pgstat_count_buffer_hit() calls in pgstat_report_relfilenode_buffer_read()
and pgstat_report_relfilenode_buffer_hit(). Those stats will follow the same
flow as the one described above for the new heap_blks_written one (should we
agree on it).

Looking forward to your comments and feedback.

Regards,

[1]: 
https://www.postgresql.org/message-id/20231113204439.r4lmys73tessqmak%40awork3.anarazel.de

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From e102c9d15c08c638879ece26008faee58cf4a07e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 16 Nov 2023 02:30:01 +
Subject: [PATCH v1] Provide relfilenode statistics

---
 src/backend/access/rmgrdesc/xactdesc.c|   5 +-
 src/backend/catalog/storage.c |   8 ++
 src/backend/catalog/system_functions.sql  |   2 +-
 src/backend/catalog/system_views.sql  |   5 +-
 src/backend/postmaster/checkpointer.c |   5 +
 src/backend/storage/buffer/bufmgr.c   |   6 +-
 src/backend/storage/smgr/md.c |   7 ++
 src/backend/utils/activity/pgstat.c   |  39 --
 src/backend/utils/activity/pgstat_database.c  |  12 +-
 src/backend/utils/activity/pgstat_function.c  |  13 +-
 src/backend/utils/activity/pgstat_relation.c  | 112 --
 src/backend/utils/activity/pgstat_replslot.c  |  13 +-
 src/backend/utils/activity/pgstat_shmem.c |  19 ++-
 .../utils/activity/pgstat_subscription.c  |  12 +-
 src/backend/utils/activity/pgstat_xact.c  |  60 +++---
 src/backend/utils/adt/pgstatfuncs.c   |  34 +-
 src/include/access/tableam.h  |  19 +++
 src/include/access/xact.h |   1 +
 src/include/catalog/pg_proc.dat   |  14 ++-
 src/include/pgstat.h  |  19 ++-
 src/include/utils/

Re: Avoid orphaned objects dependencies, take 3

2024-05-23 Thread Bertrand Drouvot
Hi,

On Thu, May 23, 2024 at 02:10:54PM -0400, Robert Haas wrote:
> On Thu, May 23, 2024 at 12:19 AM Bertrand Drouvot
>  wrote:
> > The reason why we are using a dirty snapshot here is for the cases where we 
> > are
> > recording a dependency on a referenced object that we are creating at the 
> > same
> > time behind the scene (for example, creating a composite type while creating
> > a relation). Without the dirty snapshot, then the object we are creating 
> > behind
> > the scene (the composite type) would not be visible and we would wrongly 
> > assume
> > that it has been dropped.
> 
> The usual reason for using a dirty snapshot is that you want to see
> uncommitted work by other transactions. It sounds like you're saying
> you just need to see uncommitted work by the same transaction.

Right.

> If that's true, I think using HeapTupleSatisfiesSelf would be clearer.

Oh thanks! I did not know about the SNAPSHOT_SELF snapshot type (I should have
checked all the snapshot types first, though) and that's exactly what is needed
here.

Please find attached v8 making use of SnapshotSelf instead of a dirty snapshot.
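
For reference, a simplified sketch of what an existence check with SnapshotSelf
can look like, so that objects created earlier in the same transaction are
visible (untested illustration only; the catalog relation, OID attribute and
index names below are placeholders, not the actual v8 code, which builds them
from the object's class):

Relation    rel;
SysScanDesc scan;
ScanKeyData skey[1];
bool        exists;

rel = table_open(catalogId, AccessShareLock);

ScanKeyInit(&skey[0],
            oidAttNum,      /* OID column of that catalog */
            BTEqualStrategyNumber, F_OIDEQ,
            ObjectIdGetDatum(object->objectId));

scan = systable_beginscan(rel, oidIndexId, true, SnapshotSelf, 1, skey);
exists = HeapTupleIsValid(systable_getnext(scan));

systable_endscan(scan);
table_close(rel, AccessShareLock);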

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 2611fb874a476c1a8d665f97d973193beb70292a Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v8] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  59 
 src/backend/catalog/objectaddress.c   |  66 +
 src/backend/catalog/pg_depend.c   |  12 ++
 src/backend/utils/errcodes.txt|   1 +
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 .../expected/test_dependencies_locks.out  | 129 ++
 src/test/isolation/isolation_schedule |   1 +
 .../specs/test_dependencies_locks.spec|  89 
 9 files changed, 359 insertions(+)
  24.3% src/backend/catalog/
  45.3% src/test/isolation/expected/
  28.2% src/test/isolation/specs/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..4271ed7c5b 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,65 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */
+	LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
+
+	/* check if object still exists */
+	if (!ObjectByIdExist(object, false))
+	{
+		/*
+		 * It might be possible that we are creating it (for example creating
+		 * a compo

Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-22 Thread Bertrand Drouvot
Hi,

On Wed, May 22, 2024 at 07:01:54PM -0400, Jonathan S. Katz wrote:
> Thanks. As such I made it:
> 
> "which provides descriptions about wait events and can be combined with
> `pg_stat_activity` to give more insight into why an active session is
> waiting."
> 

Thanks! Works for me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-22 Thread Bertrand Drouvot
Hi,

On Wed, May 22, 2024 at 10:48:12AM -0400, Robert Haas wrote:
> On Wed, May 22, 2024 at 6:21 AM Bertrand Drouvot
>  wrote:
> > I started initially with [1] but it was focusing on function-schema only.
> 
> Yeah, that's what I thought we would want to do. And then just extend
> that to the other cases.
> 
> > Then I proposed [2] making use of a dirty snapshot when recording the 
> > dependency.
> > But this approach appeared to be "scary" and it was still failing to close
> > some race conditions.
> 
> The current patch still seems to be using dirty snapshots for some
> reason, which struck me as a bit odd. My intuition is that if we're
> relying on dirty snapshots to solve problems, we likely haven't solved
> the problems correctly, which seems consistent with your statement
> about "failing to close some race conditions". But I don't think I
> understand the situation well enough to be sure just yet.

The reason why we are using a dirty snapshot here is for the cases where we are
recording a dependency on a referenced object that we are creating at the same
time behind the scenes (for example, creating a composite type while creating
a relation). Without the dirty snapshot, the object we are creating behind
the scenes (the composite type) would not be visible and we would wrongly assume
that it has been dropped.

Note that the dirty snapshot is only used when the object is first
reported as "non existing" by the new ObjectByIdExist() function.

> > I think there is 2 cases here:
> >
> > First case: the "conflicting" lock mode is for one of our own lock then 
> > LockCheckConflicts()
> > would report this as a NON conflict.
> >
> > Second case: the "conflicting" lock mode is NOT for one of our own lock 
> > then LockCheckConflicts()
> > would report a conflict. But I've the feeling that the existing code would
> > already lock those sessions.
> >
> > Was your concern about "First case" or "Second case" or both?
> 
> The second case, I guess. It's bad to take a weaker lock first and
> then a stronger lock on the same object later, because it can lead to
> deadlocks that would have been avoided if the stronger lock had been
> taken at the outset.

In the example I shared up-thread that would be the opposite: session 1 would
take an AccessExclusiveLock on the object before taking an AccessShareLock
during changeDependencyFor().

> Here, it seems like things would happen in the
> other order: if we took two locks, we'd probably take the stronger
> lock in the higher-level code and then the weaker lock in the
> dependency code.

Yeah, I agree.

> That shouldn't break anything; it's just a bit
> inefficient.

Yeah, the second lock is useless in that case (like in the example up-thread).

> My concern was really more about the maintainability of
> the code. I fear that if we add code that takes heavyweight locks in
> surprising places, we might later find the behavior difficult to
> reason about.
>

I think I understand your concern about code maintainability but I'm not sure
that adding locks while recording a dependency is that surprising.

> Tom, what is your thought about that concern?

+1

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-22 Thread Bertrand Drouvot
Hi,

On Tue, May 21, 2024 at 08:53:06AM -0400, Robert Haas wrote:
> On Wed, May 15, 2024 at 6:31 AM Bertrand Drouvot
>  wrote:
> > Please find attached v6 (only diff with v5 is moving the tests as suggested
> > above).
> 
> I don't immediately know what to think about this patch.

Thanks for looking at it!

> I've known about this issue for a long time, but I didn't think this was how 
> we
> would fix it.

I initially started with [1], but it was focusing on the function-schema case only.

Then I proposed [2], making use of a dirty snapshot when recording the dependency.
But this approach appeared to be "scary" and it was still failing to close
some race conditions.

Then, Tom proposed another approach in [2], which is that "creation DDL will have
to take a lock on each referenced object that'd conflict with a lock taken by DROP".
This is the one the current patch is trying to implement.

> What you've done here instead is add locking at a much
> lower level - whenever we are adding a dependency on an object, we
> lock the object. The advantage of that approach is that we definitely
> won't miss anything.

Right, as there are many more cases than the ones related to schemas, for example:
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper

to name a few.

> The disadvantage of that approach is that it
> means we have some very low-level code taking locks, which means it's
> not obvious which operations are taking what locks.

Right, but the new operations are "only" the ones leading to recording or 
altering
a dependency.

> Maybe it could
> even result in some redundancy, like the higher-level code taking a
> lock also (possibly in a different mode) and then this code taking
> another one.

The one that is added here is in AccessShareLock mode. It could conflict with
the ones taken in AccessExclusiveLock mode, meaning (if I'm not missing any):

- AcquireDeletionLock(): which is exactly what we want
- get_object_address()
- get_object_address_rv()
- ExecAlterObjectDependsStmt()
- ExecRenameStmt()
- ExecAlterObjectDependsStmt()
- ExecAlterOwnerStmt()
- RemoveObjects()
- AlterPublication()

I think there are 2 cases here:

First case: the "conflicting" lock mode is for one of our own locks; then
LockCheckConflicts() would report this as a NON conflict.

Second case: the "conflicting" lock mode is NOT for one of our own locks; then
LockCheckConflicts() would report a conflict. But I have the feeling that the
existing code would already block those sessions.

One example where it would be the case:

Session 1: doing "BEGIN; ALTER FUNCTION noschemas.foo2() SET SCHEMA alterschema"
would acquire the lock in AccessExclusiveLock mode during
ExecAlterObjectSchemaStmt()->get_object_address()->LockDatabaseObject()
(in the existing code, and before the new call that would occur through
changeDependencyFor()->depLockAndCheckObject() with the patch in place).

Then, session 2: doing "alter function noschemas.foo2() owner to newrole;"
would be blocked in the existing code while doing
ExecAlterOwnerStmt()->get_object_address()->LockDatabaseObject().

This means that in this example, the new lock this patch is introducing would not
be responsible for session 2 being blocked.
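
For what it's worth, while session 2 is waiting in that example, the object lock
involved can be observed from a third session with a query like this (a minimal
sketch; the granted=false row would be session 2):

SELECT pid, classid::regclass, objid, mode, granted
FROM pg_locks
WHERE locktype = 'object'
  AND classid = 'pg_proc'::regclass;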

Was your concern about "First case" or "Second case" or both?

[1]: 
https://www.postgresql.org/message-id/flat/5a9daaae-5538-b209-6279-e903c3ea2157%40amazon.com
[2]: 
https://www.postgresql.org/message-id/flat/8369ff70-0e31-f194-2954-787f4d9e21dd%40amazon.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-21 Thread Bertrand Drouvot
Hi,

On Sun, May 19, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> Hello Bertrand,
> 
> Probably, it's worth to avoid ERRCODE_INTERNAL_ERROR here in light of [1]
> and [2], as these errors are not that abnormal (not Assert-like).
> 
> [1] https://www.postgresql.org/message-id/Zic_GNgos5sMxKoa%40paquier.xyz
> [2] https://commitfest.postgresql.org/48/4735/
> 

Thanks for mentioning the above examples; I agree that it's worth avoiding
ERRCODE_INTERNAL_ERROR here. Please find attached v7, which makes use of a new
errcode: ERRCODE_DEPENDENT_OBJECTS_DOES_NOT_EXIST.

I chose this name as it is close to the already existing
"ERRCODE_DEPENDENT_OBJECTS_STILL_EXIST", but I'm open to suggestions too.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 79809f88311ef1cf4e8f3250e1508fe0b7d86602 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v7] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  58 
 src/backend/catalog/objectaddress.c   |  70 ++
 src/backend/catalog/pg_depend.c   |  12 ++
 src/backend/utils/errcodes.txt|   1 +
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 .../expected/test_dependencies_locks.out  | 129 ++
 src/test/isolation/isolation_schedule |   1 +
 .../specs/test_dependencies_locks.spec|  89 
 9 files changed, 362 insertions(+)
  24.6% src/backend/catalog/
  45.1% src/test/isolation/expected/
  28.1% src/test/isolation/specs/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..89b2f18f98 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,64 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */
+	LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
+
+	/* check if object still exists */
+	if (!ObjectByIdExist(object, false))
+	{
+		/*
+		 * It might be possible that we are creating it (for example creating
+		 * a composite type while creating a relation), so bypass the syscache
+		 * lookup and use a dirty snaphot instead to cover this scenario.
+		 */
+		if (!ObjectByIdExist(object, true))
+		{
+			/*
+			 * If the object has been dropped before we get a chance to get
+			 * its description, then emit a generic error message. That looks
+			 * like a good compromise over extra complexity.
+			 */
+			if (object_description)
+ereport(ERROR,
+		(errcode(ERRCODE_DEPE

Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-20 Thread Bertrand Drouvot
Hi,

On Sun, May 19, 2024 at 05:10:10PM -0400, Jonathan S. Katz wrote:
> On 5/16/24 1:15 AM, Bertrand Drouvot wrote:
> > Hi,
> > 
> > On Wed, May 15, 2024 at 09:45:35PM -0400, Jonathan S. Katz wrote:
> > > Hi,
> > > 
> > > Attached is a copy of the PostgreSQL 17 Beta 1 release announcement draft.
> > 
> > Thanks for working on it!
> > 
> > I've one comment:
> > 
> > > PostgreSQL 17 also introduces a new view, 
> > > [`pg_wait_events`](https://www.postgresql.org/docs/17/view-pg-wait-events.html),
> > >  which provides descriptions about wait events and can be combined with 
> > > `pg_stat_activity` to give more insight into an operation.
> > 
> > Instead of "to give more insight into an operation", what about "to give 
> > more
> > insight about what a session is waiting for (should it be active)"?
> 
> I put:
> 
> "to give more in insight into why a session is blocked."

Thanks!

> 
> Does that work?
> 

I think using "waiting" is better (as the view is "pg_wait_events" and the
join with pg_stat_activity would be on the "wait_event_type" and "wait_event"
columns).

The reason I mentioned "should it be active" is that wait_event and wait_event_type
could be non-empty in pg_stat_activity while the session is no longer in an active
state (and therefore not actually waiting).

A correct query would look like the one in [1]:

"
SELECT a.pid, a.wait_event, w.description
  FROM pg_stat_activity a JOIN
   pg_wait_events w ON (a.wait_event_type = w.type AND
a.wait_event = w.name)
  WHERE a.wait_event is NOT NULL and a.state = 'active';
"

That means filtering on the "active" state too, and that's what the wording I
proposed was trying to highlight.
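
For example, a session that is simply idle still shows wait_event_type = 'Client'
and wait_event = 'ClientRead' in pg_stat_activity, so a query like this one
usually returns rows for sessions that are not active:

SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
  AND state IS DISTINCT FROM 'active';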

[1]: https://www.postgresql.org/docs/devel/monitoring-stats.html

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-15 Thread Bertrand Drouvot
Hi,

On Wed, May 15, 2024 at 09:45:35PM -0400, Jonathan S. Katz wrote:
> Hi,
> 
> Attached is a copy of the PostgreSQL 17 Beta 1 release announcement draft.

Thanks for working on it!

I've one comment:

> PostgreSQL 17 also introduces a new view, 
> [`pg_wait_events`](https://www.postgresql.org/docs/17/view-pg-wait-events.html),
>  which provides descriptions about wait events and can be combined with 
> `pg_stat_activity` to give more insight into an operation.

Instead of "to give more insight into an operation", what about "to give more
insight about what a session is waiting for (should it be active)"?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-15 Thread Bertrand Drouvot
Hi,

On Wed, May 15, 2024 at 08:31:43AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote:
> > +++ 
> > b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec
> >  
> > 
> > This introduces a module with only one single spec.  I could get
> > behind an extra module if splitting the tests into more specs makes
> > sense or if there is a restriction on installcheck.  However, for 
> > one spec file filed with a bunch of objects, and note that I'm OK to
> > let this single spec grow more for this range of tests, it seems to me
> > that this would be better under src/test/isolation/.
> 
> Yeah, I was not sure about this one (the location is from take 2 mentioned
> up-thread). I'll look at moving the tests to src/test/isolation/.

Please find attached v6 (only diff with v5 is moving the tests as suggested
above).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 59f7befd34b8aa4ba0483a100e85bacbc1a76707 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v6] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  54 
 src/backend/catalog/objectaddress.c   |  70 ++
 src/backend/catalog/pg_depend.c   |  12 ++
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 .../expected/test_dependencies_locks.out  | 129 ++
 src/test/isolation/isolation_schedule |   1 +
 .../specs/test_dependencies_locks.spec|  89 
 8 files changed, 357 insertions(+)
  24.1% src/backend/catalog/
  45.9% src/test/isolation/expected/
  28.6% src/test/isolation/specs/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..a49357bbe2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,60 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */
+	LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
+
+	/* check if object still exists */
+	if (!ObjectByIdExist(object, false))
+	{
+		/*
+		 * It might be possible that we are creating it (for example creating
+		 * a composite type while creating a relation), so bypass the syscache
+		 * lookup and use a dirty snaphot instead to cover this scenario.
+		 */
+		if (!ObjectByIdExist(object, true))
+		{
+			/*
+			 * If the object has been dropped before we get a chance to get
+			 * its description, then

Re: Avoid orphaned objects dependencies, take 3

2024-05-15 Thread Bertrand Drouvot
Hi,

On Tue, May 14, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> 09.05.2024 15:20, Bertrand Drouvot wrote:
> > Oh I see, your test updates an existing dependency. v4 took care about 
> > brand new
> > dependencies creation (recordMultipleDependencies()) but forgot to take care
> > about changing an existing dependency (which is done in another code path:
> > changeDependencyFor()).
> > 
> > Please find attached v5 that adds:
> > 
> > - a call to the new depLockAndCheckObject() function in 
> > changeDependencyFor().
> > - a test when altering an existing dependency.
> > 
> > With v5 applied, I don't see the issue anymore.
> 
> Me too. Thank you for the improved version!
> I will test the patch in the background, but for now I see no other
> issues with it.

Thanks for confirming!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-05-15 Thread Bertrand Drouvot
Hi,

On Wed, May 15, 2024 at 10:14:09AM +0900, Michael Paquier wrote:
> On Thu, May 09, 2024 at 12:20:51PM +0000, Bertrand Drouvot wrote:
> > Oh I see, your test updates an existing dependency. v4 took care about 
> > brand new 
> > dependencies creation (recordMultipleDependencies()) but forgot to take care
> > about changing an existing dependency (which is done in another code path:
> > changeDependencyFor()).
> > 
> > Please find attached v5 that adds:
> > 
> > - a call to the new depLockAndCheckObject() function in 
> > changeDependencyFor().
> > - a test when altering an existing dependency.
> > 
> > With v5 applied, I don't see the issue anymore.
> 
> +if (object_description)
> +ereport(ERROR, errmsg("%s does not exist", 
> object_description));
> +else
> +ereport(ERROR, errmsg("a dependent object does not ex
> 
> This generates an internal error code.  Is that intended?

Thanks for looking at it!

Yes, it's similar to when, say, you want to create an object in a schema that does
not exist (see get_namespace_oid()).
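
For comparison, here is what the existing behaviour looks like when creating an
object in a schema that does not exist (the schema name below is made up):

postgres=# CREATE FUNCTION nosuchschema.f() RETURNS int LANGUAGE SQL AS 'SELECT 1';
ERROR:  schema "nosuchschema" does not exist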

> --- /dev/null
> +++ 
> b/src/test/modules/test_dependencies_locks/specs/test_dependencies_locks.spec 
> 
> This introduces a module with only one single spec.  I could get
> behind an extra module if splitting the tests into more specs makes
> sense or if there is a restriction on installcheck.  However, for 
> one spec file filed with a bunch of objects, and note that I'm OK to
> let this single spec grow more for this range of tests, it seems to me
> that this would be better under src/test/isolation/.

Yeah, I was not sure about this one (the location is from take 2 mentioned
up-thread). I'll look at moving the tests to src/test/isolation/.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Log details for stats dropped more than once

2024-05-15 Thread Bertrand Drouvot
Hi,

On Wed, May 15, 2024 at 02:47:29PM +0900, Michael Paquier wrote:
> On Tue, May 14, 2024 at 10:07:14AM +0000, Bertrand Drouvot wrote:
> > While resuming the work on refilenode stats (mentioned in [1] but did not 
> > share
> > the patch yet), I realized that my current POC patch is buggy enough to 
> > produce
> > things like:
> > 
> > 024-05-14 09:51:14.783 UTC [1788714] FATAL:  can only drop stats once
> > 
> > While the CONTEXT provides the list of dropped stats:
> > 
> > 2024-05-14 09:51:14.783 UTC [1788714] CONTEXT:  WAL redo at 0/D75F478 for 
> > Transaction/ABORT: 2024-05-14 09:51:14.782223+00; dropped stats: 
> > 2/16384/27512/0 2/16384/27515/0 2/16384/27516/0
> 
> Can refcount be useful to know in this errcontext?

Thanks for looking at it!

Do you mean as part of the added errdetail_internal()? If so, yeah I think it's
a good idea (done that way in v2 attached).

> > Attached a tiny patch to report the stat that generates the error. The 
> > patch uses
> > errdetail_internal() as the extra details don't seem to be useful to average
> > users.
> 
> I think that's fine.  Overall that looks like useful information for
> debugging, so no objections from here.

Thanks! BTW, I just realized that adding more details for this error has already
been mentioned in [1] (and Andres did propose a slightly different version).

[1]: 
https://www.postgresql.org/message-id/20240505160915.6boysum4f34siqct%40awork3.anarazel.de

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 0866e8252f2038f3fdbd9cfabb214461689210df Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 14 May 2024 09:10:50 +
Subject: [PATCH v2] Log details for stats dropped more than once

Adding errdetail_internal() in pgstat_drop_entry_internal() to display the
PgStat_HashKey and the entry's refcount when the "can only drop stats once"
error is reported.
---
 src/backend/utils/activity/pgstat_shmem.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 91591da395..0595c08d6e 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -785,7 +785,12 @@ pgstat_drop_entry_internal(PgStatShared_HashEntry *shent,
 	 * backends to release their references.
 	 */
 	if (shent->dropped)
-		elog(ERROR, "can only drop stats once");
+		ereport(ERROR,
+errmsg("can only drop stats once"),
+errdetail_internal("Stats kind=%s dboid=%u objoid=%u refcount=%u.",
+   pgstat_get_kind_info(shent->key.kind)->name,
+   shent->key.dboid, shent->key.objoid,
+   pg_atomic_read_u32(&shent->refcount)));
 	shent->dropped = true;
 
 	/* release refcount marking entry as not dropped */
-- 
2.34.1



Log details for stats dropped more than once

2024-05-14 Thread Bertrand Drouvot
Hi hackers,

While resuming the work on relfilenode stats (mentioned in [1], but I did not share
the patch yet), I realized that my current POC patch is buggy enough to produce
things like:

2024-05-14 09:51:14.783 UTC [1788714] FATAL:  can only drop stats once

While the CONTEXT provides the list of dropped stats:

2024-05-14 09:51:14.783 UTC [1788714] CONTEXT:  WAL redo at 0/D75F478 for 
Transaction/ABORT: 2024-05-14 09:51:14.782223+00; dropped stats: 
2/16384/27512/0 2/16384/27515/0 2/16384/27516/0

It's not clear which one generates the error (don't pay attention to the actual
values; the issue comes from the new relfilenode stats that I removed from the
output).

Attached is a tiny patch to report the stat that generates the error. The patch
uses errdetail_internal(), as the extra details don't seem to be useful to average
users.

[1]: 
https://www.postgresql.org/message-id/ZbIdgTjR2QcFJ2mE%40ip-10-97-1-34.eu-west-3.compute.internal

Looking forward to your feedback,

Regards

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From f296d4092f5e87ed63a323db3f1bc9cfa807e2e8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Tue, 14 May 2024 09:10:50 +
Subject: [PATCH v1] Log details for stats dropped more than once

Adding errdetail_internal() in pgstat_drop_entry_internal() to display the
PgStat_HashKey when the "can only drop stats once" error is reported.
---
 src/backend/utils/activity/pgstat_shmem.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index 91591da395..1308c4439a 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -785,7 +785,11 @@ pgstat_drop_entry_internal(PgStatShared_HashEntry *shent,
 	 * backends to release their references.
 	 */
 	if (shent->dropped)
-		elog(ERROR, "can only drop stats once");
+		ereport(ERROR,
+errmsg("can only drop stats once"),
+errdetail_internal("Stats kind=%s dboid=%u objoid=%u.",
+   pgstat_get_kind_info(shent->key.kind)->name,
+   shent->key.dboid, shent->key.objoid));
 	shent->dropped = true;
 
 	/* release refcount marking entry as not dropped */
-- 
2.34.1



Re: Avoid orphaned objects dependencies, take 3

2024-05-09 Thread Bertrand Drouvot
Hi,

On Tue, Apr 30, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> But I've discovered yet another possibility to get a broken dependency.

Thanks for the testing and the report!

> Please try this script:
> res=0
> numclients=20
> for ((i=1;i<=100;i++)); do
> for ((c=1;c<=numclients;c++)); do
>   echo "
> CREATE SCHEMA s_$c;
> CREATE CONVERSION myconv_$c FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
> ALTER CONVERSION myconv_$c SET SCHEMA s_$c;
>   " | psql >psql1-$c.log 2>&1 &
>   echo "DROP SCHEMA s_$c RESTRICT;" | psql >psql2-$c.log 2>&1 &
> done
> wait
> pg_dump -f db.dump || { echo "on iteration $i"; res=1; break; }
> for ((c=1;c<=numclients;c++)); do
>   echo "DROP SCHEMA s_$c CASCADE;" | psql >psql3-$c.log 2>&1
> done
> done
> psql -c "SELECT * FROM pg_conversion WHERE connamespace NOT IN (SELECT oid 
> FROM pg_namespace);"
> 
> It fails for me (with the v4 patch applied) as follows:
> pg_dump: error: schema with OID 16392 does not exist
> on iteration 1
>   oid  | conname  | connamespace | conowner | conforencoding | contoencoding 
> |  conproc  | condefault
> ---+--+--+--++---+---+
>  16396 | myconv_6 |    16392 |   10 |  8 | 6 
> | iso8859_1_to_utf8 | f
> 

Thanks for sharing the test, I'm able to reproduce the issue with v4.

Oh I see, your test updates an existing dependency. v4 took care of brand new
dependency creation (recordMultipleDependencies()) but forgot to take care of
changing an existing dependency (which is done in another code path:
changeDependencyFor()).

Please find attached v5 that adds:

- a call to the new depLockAndCheckObject() function in changeDependencyFor().
- a test when altering an existing dependency.

With v5 applied, I don't see the issue anymore.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 6fc24a798210f730ab04833fa58074f142be968e Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v5] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- alter a dependency (function and schema)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  54 
 src/backend/catalog/objectaddress.c   |  70 ++
 src/backend/catalog/pg_depend.c   |  12 ++
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../test_dependencies_locks/.gitignore|   3 +
 .../modules/test_dependencies_locks/Makefile  |  14 ++
 .../expected/test_dependencies_locks.out  | 129 ++
 .../test_dependencies_locks/meson.build   |  12 ++
 .../specs/test_dependencies_locks.spec|  89 
 12 files changed, 387 insertions(+)
  22.9% src/backend/catalog/
  43.7% src/test/modules/test_dependencies_locks/expected/
  27.2% src/test/modules/test_dependencies_locks/specs/
   4.5% src/test/modules/test_dependencies_locks/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..a49357bbe2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -15

Re: First draft of PG 17 release notes

2024-05-08 Thread Bertrand Drouvot
Hi,

On Thu, May 09, 2024 at 12:03:50AM -0400, Bruce Momjian wrote:
> I have committed the first draft of the PG 17 release notes;  you can
> see the results here:
> 
>   https://momjian.us/pgsql_docs/release-17.html

Thanks for working on that!
 
> I welcome feedback.

> Add system view pg_wait_events that reports wait event types (Michael 
> Paquier) 

Michael is the committer of 1e68e43d3f; the author is me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-05-08 Thread Bertrand Drouvot
Hi,

On Mon, Apr 29, 2024 at 11:58:09AM +, Zhijie Hou (Fujitsu) wrote:
> On Monday, April 29, 2024 5:11 PM shveta malik  wrote:
> > 
> > On Mon, Apr 29, 2024 at 11:38 AM shveta malik 
> > wrote:
> > >
> > > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot
> >  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Thu, Mar 14, 2024 at 02:22:44AM +, Zhijie Hou (Fujitsu) wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Since the standby_slot_names patch has been committed, I am
> > > > > > attaching the last doc patch for review.
> > > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > 1 ===
> > > > >
> > > > > +   continue subscribing to publications now on the new primary
> > > > > + server
> > > > > without
> > > > > +   any data loss.
> > > > >
> > > > > I think "without any data loss" should be re-worded in this
> > > > > context. Data loss in the sense "data committed on the primary and
> > > > > not visible on the subscriber in case of failover" can still occurs 
> > > > > (in case
> > synchronous replication is not used).
> > > > >
> > > > > 2 ===
> > > > >
> > > > > +   If the result (failover_ready) of both above 
> > > > > steps is
> > > > > +   true, existing subscriptions will be able to continue without data
> > loss.
> > > > > +  
> > > > >
> > > > > I don't think that's true if synchronous replication is not used.
> > > > > Say,
> > > > >
> > > > > - synchronous replication is not used
> > > > > - primary is not able to reach the standby anymore and
> > > > > standby_slot_names is set
> > > > > - new data is inserted into the primary
> > > > > - then not replicated to subscriber (due to standby_slot_names)
> > > > >
> > > > > Then I think the both above steps will return true but data would
> > > > > be lost in case of failover.
> > > >
> > > > Thanks for the comments, attach the new version patch which reworded
> > > > the above places.

Thanks!

> Here is the V3 doc patch.

Thanks! A few comments:

1 ===

+   losing any data that has been flushed to the new primary server.

It might be worth adding a few words about possible data loss, something like:

Please note that in case synchronous replication is not used and 
standby_slot_names
is set correctly, it might be possible to lose data that would have been 
committed
on the old primary server (in case the standby was not reachable during that 
time 
for example).

2 ===

+test_sub=# SELECT
+   array_agg(slotname) AS slots
+   FROM
+   ((
+   SELECT r.srsubid AS subid, CONCAT('pg_', srsubid, '_sync_', 
srrelid, '_', ctl.system_identifier) AS slotname
+   FROM pg_control_system() ctl, pg_subscription_rel r, 
pg_subscription s
+   WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover
+   ) UNION (

I guess this format comes from ReplicationSlotNameForTablesync(). What about
creating a SQL-callable function on top of it and making use of it in the query
above? (That would ensure the doc stays up to date even if the format changes
in ReplicationSlotNameForTablesync().)
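
To illustrate the idea, the first part of the query could then look like the
following (pg_tablesync_slot_name() below is a made-up name for such a
SQL-callable wrapper; it does not exist today):

SELECT r.srsubid AS subid,
       -- hypothetical wrapper around ReplicationSlotNameForTablesync()
       pg_tablesync_slot_name(r.srsubid, r.srrelid) AS slotname
FROM pg_subscription_rel r
JOIN pg_subscription s ON s.oid = r.srsubid
WHERE r.srsubstate = 'f' AND s.subfailover;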

3 ===

+test_sub=# SELECT
+   MAX(remote_lsn) AS remote_lsn_on_subscriber
+   FROM
+   ((
+   SELECT (CASE WHEN r.srsubstate = 'f' THEN 
pg_replication_origin_progress(CONCAT('pg_', r.srsubid, '_', r.srrelid), false)
+   WHEN r.srsubstate IN ('s', 'r') THEN r.srsublsn 
END) AS remote_lsn
+   FROM pg_subscription_rel r, pg_subscription s
+   WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid = r.srsubid AND 
s.subfailover
+   ) UNION (
+   SELECT pg_replication_origin_progress(CONCAT('pg_', s.oid), 
false) AS remote_lsn
+   FROM pg_subscription s
+   WHERE s.subfailover
+   ));

What about adding a join to pg_replication_origin to get rid of the "hardcoded"
format "CONCAT('pg_', r.srsubid, '_', r.srrelid)" and "CONCAT('pg_', s.oid)"?

The idea behind 2 === and 3 === is to have the queries as generic as possible and
not rely on a hardcoded format (which would be more difficult to maintain should
those formats change in the future).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-04-25 Thread Bertrand Drouvot
Hi,

On Thu, Apr 25, 2024 at 09:00:00AM +0300, Alexander Lakhin wrote:
> Hi,
> 
> 25.04.2024 08:00, Bertrand Drouvot wrote:
> > 
> > > though concurrent create/drop operations can still produce
> > > the "cache lookup failed" error, which is probably okay, except that it is
> > > an INTERNAL_ERROR, which assumed to be not easily triggered by users.
> > I did not see any of those "cache lookup failed" during my testing 
> > with/without
> > your script. During which test(s) did you see those with v3 applied?
> 
> You can try, for example, table-trigger, or other tests that check for
> "cache lookup failed" psql output only (maybe you'll need to increase the
> iteration count). For instance, I've got (with v4 applied):
> 2024-04-25 05:48:08.102 UTC [3638763] ERROR:  cache lookup failed for 
> function 20893
> 2024-04-25 05:48:08.102 UTC [3638763] STATEMENT:  CREATE TRIGGER modified_c1 
> BEFORE UPDATE OF c ON t
>     FOR EACH ROW WHEN (OLD.c <> NEW.c) EXECUTE PROCEDURE 
> trigger_func('modified_c');
> 
> Or with function-function:
> 2024-04-25 05:52:31.390 UTC [3711897] ERROR:  cache lookup failed for 
> function 32190 at character 54
> 2024-04-25 05:52:31.390 UTC [3711897] STATEMENT:  CREATE FUNCTION f1() 
> RETURNS int LANGUAGE SQL RETURN f() + 1;
> --
> 2024-04-25 05:52:37.639 UTC [3720011] ERROR:  cache lookup failed for 
> function 34465 at character 54
> 2024-04-25 05:52:37.639 UTC [3720011] STATEMENT:  CREATE FUNCTION f1() 
> RETURNS int LANGUAGE SQL RETURN f() + 1;

I see, so this is during object creation.

It's easy to reproduce this kind of issue with gdb. For example, set a breakpoint
on SearchSysCache1() while creating function f1(), and once it breaks on:

#0  SearchSysCache1 (cacheId=45, key1=16400) at syscache.c:221
#1  0x5ad305beacd6 in func_get_detail (funcname=0x5ad308204d50, fargs=0x0, 
fargnames=0x0, nargs=0, argtypes=0x72ff9cc0, expand_variadic=true, 
expand_defaults=true, include_out_arguments=false, funcid=0x72ff9ba0, 
rettype=0x72ff9b9c, retset=0x72ff9b94, nvargs=0x72ff9ba4,
vatype=0x72ff9ba8, true_typeids=0x72ff9bd8, 
argdefaults=0x72ff9be0) at parse_func.c:1622
#2  0x5ad305be7dd0 in ParseFuncOrColumn (pstate=0x5ad30823be98, 
funcname=0x5ad308204d50, fargs=0x0, last_srf=0x0, fn=0x5ad308204da0, 
proc_call=false, location=55) at parse_func.c:266
#3  0x5ad305bdffb0 in transformFuncCall (pstate=0x5ad30823be98, 
fn=0x5ad308204da0) at parse_expr.c:1474
#4  0x5ad305bdd2ee in transformExprRecurse (pstate=0x5ad30823be98, 
expr=0x5ad308204da0) at parse_expr.c:230
#5  0x5ad305bdec34 in transformAExprOp (pstate=0x5ad30823be98, 
a=0x5ad308204e20) at parse_expr.c:990
#6  0x5ad305bdd1a0 in transformExprRecurse (pstate=0x5ad30823be98, 
expr=0x5ad308204e20) at parse_expr.c:187
#7  0x5ad305bdd00b in transformExpr (pstate=0x5ad30823be98, 
expr=0x5ad308204e20, exprKind=EXPR_KIND_SELECT_TARGET) at parse_expr.c:131
#8  0x5ad305b96b7e in transformReturnStmt (pstate=0x5ad30823be98, 
stmt=0x5ad308204ee0) at analyze.c:2395
#9  0x5ad305b92213 in transformStmt (pstate=0x5ad30823be98, 
parseTree=0x5ad308204ee0) at analyze.c:375
#10 0x5ad305c6321a in interpret_AS_clause (languageOid=14, 
languageName=0x5ad308204c40 "sql", funcname=0x5ad308204ad8 "f100", as=0x0, 
sql_body_in=0x5ad308204ee0, parameterTypes=0x0, inParameterNames=0x0, 
prosrc_str_p=0x72ffa208, probin_str_p=0x72ffa200, 
sql_body_out=0x72ffa210,
queryString=0x5ad3082040b0 "CREATE FUNCTION f100() RETURNS int LANGUAGE SQL 
RETURN f() + 1;") at functioncmds.c:953
#11 0x5ad305c63c93 in CreateFunction (pstate=0x5ad308186310, 
stmt=0x5ad308204f00) at functioncmds.c:1221

then drop function f() in another session. The CREATE FUNCTION f1() would then fail with:

postgres=# CREATE FUNCTION f() RETURNS int LANGUAGE SQL RETURN f() + 1;
ERROR:  cache lookup failed for function 16400

This error appears before we get a chance to call the new depLockAndCheckObject()
function.

I think this is what Tom was referring to in [1]:

"
So the only real fix for this would be to make every object lookup in the entire
system do the sort of dance that's done in RangeVarGetRelidExtended.
"

The fact that those kinds of errors appear also somehow ensures that no orphaned
dependencies can be created.

The patch does ensure that no orphaned dependencies can occur after those "initial"
lookups are done (should the dependent object be dropped).

I'm tempted not to add extra complexity to avoid those kinds of errors and to keep
the patch as it is. All of this serves the same goal: no orphaned dependencies are
created.

[1]: https://www.postgresql.org/message-id/2872252.1630851337%40sss.pgh.pa.us

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-04-24 Thread Bertrand Drouvot
Hi,

On Wed, Apr 24, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> 24.04.2024 11:38, Bertrand Drouvot wrote:
> > I gave more thought to v2 and the approach seems reasonable to me. 
> > Basically what
> > it does is that in case the object is already dropped before we take the 
> > new lock
> > (while recording the dependency) then the error message is a "generic" one 
> > (means
> > it does not provide the description of the "already" dropped object). I 
> > think it
> > makes sense to write the patch that way by abandoning the patch's ambition 
> > to
> > tell the description of the dropped object in all the cases.
> > 
> > Of course, I would be happy to hear others thought about it.
> > 
> > Please find v3 attached (which is v2 with more comments).
> 
> Thank you for the improved version!
> 
> I can confirm that it fixes that case.

Great, thanks for the test!

> I've also tested other cases that fail on master (most of them also fail
> with v1), please try/look at the attached script.

Thanks for all those tests!

> (There could be other broken-dependency cases, of course, but I think I've
> covered all the representative ones.)

Agreed. Furthermore, the way the patch is written should be agnostic to the
kinds of objects that are part of the dependency. Having said that, it does not
hurt to add more tests to this patch, so the attached v4 adds some of your tests
(which would fail on the master branch without the patch applied).

The way the tests are written in the patch is less "racy" than when triggered
with your script. Indeed, I think that in the patch the dependent object cannot
be removed before the new lock is taken when recording the dependency. We may
want to bring injection points into the game if we feel the need.

> All tested cases work correctly with v3 applied —

Yeah, same on my side, I did run them too and did not find any issues.

> I couldn't get broken
> dependencies,

Same here.

> though concurrent create/drop operations can still produce
> the "cache lookup failed" error, which is probably okay, except that it is
> an INTERNAL_ERROR, which assumed to be not easily triggered by users.

I did not see any of those "cache lookup failed" during my testing with/without
your script. During which test(s) did you see those with v3 applied?

Attached v4, simply adding more tests to v3.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From a9b34955fab0351b7b5a7ba6eb36f199f5a5822c Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v4] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after the new lock attempt, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

If the object is dropped before the new lock attempt is triggered then the patch
would also report an error (but with less details).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- function and arg type
- function and return type
- function and function
- domain and domain
- table and type
- server and foreign data wrapper
---
 src/backend/catalog/dependency.c  |  54 +
 src/backend/catalog/objectaddress.c   |  70 +++
 src/backend/catalog/pg_depend.c   |   6 +
 src/include/catalog/dependency.h  |   1 +
 src/include/catalog/objectaddress.h   |   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../test_dependencies_locks/.gitignore|   3 +
 .../modules/test_dependencies_locks/Makefile  |  14 +++
 .../expected/test_dependencies_locks.out  | 113 ++
 .../test_dependencies_locks/meson.build   |  12 ++
 .../specs/test_dependencies_locks.spec|  78 
 12 files changed

Re: Avoid orphaned objects dependencies, take 3

2024-04-24 Thread Bertrand Drouvot
Hi,

On Tue, Apr 23, 2024 at 04:20:46PM +, Bertrand Drouvot wrote:
> Please find attached v2 that should not produce the issue anymore (I launched 
> a
> lot of attempts without any issues). v1 was not strong enough as it was not
> always checking for the dependent object existence. v2 now always checks if 
> the
> object still exists after the additional lock acquisition attempt while 
> recording
> the dependency.
> 
> I still need to think about v2 but in the meantime could you please also give
> v2 a try on you side?

I gave more thought to v2 and the approach seems reasonable to me. Basically, what
it does is that, in case the object is already dropped before we take the new lock
(while recording the dependency), the error message is a "generic" one (meaning
it does not provide the description of the already-dropped object). I think it
makes sense to write the patch that way, abandoning the patch's ambition to
report the description of the dropped object in all cases.

Of course, I would be happy to hear others' thoughts about it.

Please find v3 attached (which is v2 with more comments).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 32945912ceddad6b171fd8b345adefa4432af357 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v3] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after locking the object, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- function and type
- table and type
---
 src/backend/catalog/dependency.c  | 54 ++
 src/backend/catalog/objectaddress.c   | 70 +++
 src/backend/catalog/pg_depend.c   |  6 ++
 src/include/catalog/dependency.h  |  1 +
 src/include/catalog/objectaddress.h   |  1 +
 src/test/modules/Makefile |  1 +
 src/test/modules/meson.build  |  1 +
 .../test_dependencies_locks/.gitignore|  3 +
 .../modules/test_dependencies_locks/Makefile  | 14 
 .../expected/test_dependencies_locks.out  | 49 +
 .../test_dependencies_locks/meson.build   | 12 
 .../specs/test_dependencies_locks.spec| 39 +++
 12 files changed, 251 insertions(+)
  40.5% src/backend/catalog/
  29.6% src/test/modules/test_dependencies_locks/expected/
  18.7% src/test/modules/test_dependencies_locks/specs/
   8.3% src/test/modules/test_dependencies_locks/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..a49357bbe2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,60 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */
+	LockDatabaseObject(object->classId, object->objectId, 0, AccessShareLock);
+
+	/* check if object still exists */
+	if (!ObjectByIdExist(object, false))
+	{
+		/*
+		 * It might be possible that we are creating it (for example creating
+		 * a composite type while creating a relation), so bypass the syscache

Re: Avoid orphaned objects dependencies, take 3

2024-04-23 Thread Bertrand Drouvot
Hi,

On Tue, Apr 23, 2024 at 04:59:09AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Apr 22, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> > 22.04.2024 13:52, Bertrand Drouvot wrote:
> > > 
> > > That's weird, I just launched it several times with the patch applied and 
> > > I'm not
> > > able to see the seg fault (while I can see it constently failing on the 
> > > master
> > > branch).
> > > 
> > > Are you 100% sure you tested it against a binary with the patch applied?
> > > 
> > 
> > Yes, at least I can't see what I'm doing wrong. Please try my
> > self-contained script attached.
> 
> Thanks for sharing your script!
> 
> Yeah your script ensures the patch is applied before the repro is executed.
> 
> I do confirm that I can also see the issue with the patch applied (but I had 
> to
> launch multiple attempts, while on master one attempt is enough).
> 
> I'll have a look.

Please find attached v2, which should not produce the issue anymore (I launched a
lot of attempts without any issues). v1 was not strong enough as it was not
always checking for the dependent object's existence. v2 now always checks if the
object still exists after the additional lock acquisition attempt while recording
the dependency.

I still need to think about v2, but in the meantime could you please also give
v2 a try on your side?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From f80fdfc32e791463555aa318f26ff5e7363ac3ac Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v2] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after locking the object, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 committs (that would not be the case should session 1 abort
the transaction).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- function and type
- table and type
---
 src/backend/catalog/dependency.c  | 48 +
 src/backend/catalog/objectaddress.c   | 70 +++
 src/backend/catalog/pg_depend.c   |  6 ++
 src/include/catalog/dependency.h  |  1 +
 src/include/catalog/objectaddress.h   |  1 +
 src/test/modules/Makefile |  1 +
 src/test/modules/meson.build  |  1 +
 .../test_dependencies_locks/.gitignore|  3 +
 .../modules/test_dependencies_locks/Makefile  | 14 
 .../expected/test_dependencies_locks.out  | 49 +
 .../test_dependencies_locks/meson.build   | 12 
 .../specs/test_dependencies_locks.spec| 39 +++
 12 files changed, 245 insertions(+)
  38.1% src/backend/catalog/
  30.7% src/test/modules/test_dependencies_locks/expected/
  19.5% src/test/modules/test_dependencies_locks/specs/
   8.7% src/test/modules/test_dependencies_locks/

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index d4b5b2ade1..e3b66025dd 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1517,6 +1517,54 @@ AcquireDeletionLock(const ObjectAddress *object, int flags)
 	}
 }
 
+/*
+ * depLockAndCheckObject
+ *
+ * Lock the object that we are about to record a dependency on.
+ * After it's locked, verify that it hasn't been dropped while we
+ * weren't looking.  If the object has been dropped, this function
+ * does not return!
+ */
+void
+depLockAndCheckObject(const ObjectAddress *object)
+{
+	char	   *object_description;
+
+	/*
+	 * Those don't rely on LockDatabaseObject() when being dropped (see
+	 * AcquireDeletionLock()). Also it looks like they can not produce
+	 * orphaned dependent objects when being dropped.
+	 */
+	if (object->classId == RelationRelationId || object->classId == AuthMemRelationId)
+		return;
+
+	object_description = getObjectDescription(object, true);
+
+	/* assume we should lock the whole object not a sub-object */

Re: Avoid orphaned objects dependencies, take 3

2024-04-22 Thread Bertrand Drouvot
Hi,

On Mon, Apr 22, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> 22.04.2024 13:52, Bertrand Drouvot wrote:
> > 
> > That's weird, I just launched it several times with the patch applied and 
> > I'm not
> > able to see the seg fault (while I can see it constently failing on the 
> > master
> > branch).
> > 
> > Are you 100% sure you tested it against a binary with the patch applied?
> > 
> 
> Yes, at least I can't see what I'm doing wrong. Please try my
> self-contained script attached.

Thanks for sharing your script!

Yeah your script ensures the patch is applied before the repro is executed.

I do confirm that I can also see the issue with the patch applied (but I had to
launch multiple attempts, while on master one attempt is enough).

I'll have a look.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Avoid orphaned objects dependencies, take 3

2024-04-22 Thread Bertrand Drouvot
Hi,

On Mon, Apr 22, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> 22.04.2024 11:45, Bertrand Drouvot wrote:
> > Hi,
> > 
> > This new thread is a follow-up of [1] and [2].
> > 
> > Problem description:
> > 
> > We have occasionally observed objects having an orphaned dependency, the
> > most common case we have seen is functions not linked to any namespaces.
> > 
> > ...
> > 
> > Looking forward to your feedback,
> 
> This have reminded me of bug #17182 [1].

Thanks for the link to the bug!

> Unfortunately, with the patch applied, the following script:
> 
> for ((i=1;i<=100;i++)); do
>   ( { for ((n=1;n<=20;n++)); do echo "DROP SCHEMA s;"; done } | psql ) 
> >psql1.log 2>&1 &
>   echo "
> CREATE SCHEMA s;
> CREATE FUNCTION s.func1() RETURNS int LANGUAGE SQL AS 'SELECT 1;';
> CREATE FUNCTION s.func2() RETURNS int LANGUAGE SQL AS 'SELECT 2;';
> CREATE FUNCTION s.func3() RETURNS int LANGUAGE SQL AS 'SELECT 3;';
> CREATE FUNCTION s.func4() RETURNS int LANGUAGE SQL AS 'SELECT 4;';
> CREATE FUNCTION s.func5() RETURNS int LANGUAGE SQL AS 'SELECT 5;';
>   "  | psql >psql2.log 2>&1 &
>   wait
>   psql -c "DROP SCHEMA s CASCADE" >psql3.log
> done
> echo "
> SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid FROM 
> pg_proc pp
>   LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS NULL" 
> | psql
> 
> still ends with:
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> 
> 2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|LOG:  server process (PID
> 1388378) was terminated by signal 11: Segmentation fault
> 2024-04-22 09:54:39.171 UTC|||662633dc.152bbc|DETAIL:  Failed process was
> running: SELECT pg_identify_object('pg_proc'::regclass, pp.oid, 0), pp.oid
> FROM pg_proc pp
>   LEFT JOIN pg_namespace pn ON pp.pronamespace = pn.oid WHERE pn.oid IS 
> NULL
> 

Thanks for sharing the script.

That's weird: I just launched it several times with the patch applied and I'm not
able to see the seg fault (while I can see it consistently failing on the master
branch).

Are you 100% sure you tested it against a binary with the patch applied?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Disallow changing slot's failover option in transaction block

2024-04-22 Thread Bertrand Drouvot
Hi,

On Mon, Apr 22, 2024 at 11:32:14AM +0530, shveta malik wrote:
> On Mon, Apr 22, 2024 at 5:57 AM Zhijie Hou (Fujitsu)
>  wrote:
> > Attach the V3 patch which also addressed Shveta[1] and Bertrand[2]'s 
> > comments.

Thanks!

>  Tested the patch, works well.

Same here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Patch to avoid orphaned dependencies

2024-04-22 Thread Bertrand Drouvot
Hi,

On Wed, Mar 23, 2022 at 12:49:06PM -0400, Tom Lane wrote:
> Realistically, if we want to prevent this type of problem, then all
> creation DDL will have to take a lock on each referenced object that'd
> conflict with a lock taken by DROP.  This might not be out of reach:
> I think we do already take such locks while dropping objects.  The
> reference-side lock could be taken by the recordDependency mechanism
> itself, ensuring that we don't miss anything; and that would also
> allow us to not bother taking such a lock on pinned objects, which'd
> greatly cut the cost (though not to zero).

Thanks for the idea (and sorry for the delay in replying to it)! I had a look at it
and just created a new thread [1] based on your proposal.

[1]: 
https://www.postgresql.org/message-id/flat/ZiYjn0eVc7pxVY45%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Avoid orphaned objects dependencies, take 3

2024-04-22 Thread Bertrand Drouvot
Hi,

This new thread is a follow-up of [1] and [2].

Problem description:

We have occasionally observed objects having an orphaned dependency; the
most common case we have seen is functions not linked to any namespace.

Examples to produce such orphaned dependencies:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a non
existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a non
existing schema.

A patch was initially proposed to fix this particular
(function-to-namespace) dependency (see [1]), but there could be many
more scenarios (like the function-to-datatype one highlighted by Gilles
in [1], which could lead to a function having an invalid parameter datatype).

As Tom said, there are dozens more cases that would need to be
considered, and a global approach to avoiding those race conditions
should be taken instead.

A first global approach was proposed in [2], making use of a dirty
snapshot when recording the dependency. But that approach appeared to be "scary"
and still failed to close some race conditions (see [2] for details).

Then, Tom proposed another approach in [2] which is that "creation DDL will have
to take a lock on each referenced object that'd conflict with a lock taken by
DROP".

This is what the attached patch is trying to achieve.

It does the following:

1) A new lock (that conflicts with a lock taken by DROP) has been put in place
when the dependencies are being recorded.

Thanks to it, the drop schema in scenario 2 would be locked (resulting in an
error should session 1 commit).

2) After locking the object while recording the dependency, the patch checks
that the object still exists.

Thanks to it, session 2 in scenario 1 would be locked and would report an error
once session 1 commits (that would not be the case should session 1 abort the
transaction).
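
To give an idea of the pattern, here is a rough, hand-written sketch for the
namespace case only (this is not the code from the attached patch; nspid stands
for the referenced schema's OID, and the lock mode and error message are
assumptions on my side):

/*
 * Sketch only: take a lock on the referenced schema that conflicts with the
 * AccessExclusiveLock taken by DROP, then re-check that the schema still
 * exists before the dependency gets recorded.
 */
LockDatabaseObject(NamespaceRelationId, nspid, 0, AccessShareLock);

if (!SearchSysCacheExists1(NAMESPACEOID, ObjectIdGetDatum(nspid)))
    ereport(ERROR,
            errcode(ERRCODE_UNDEFINED_SCHEMA),
            errmsg("schema with OID %u does not exist", nspid));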

The patch also adds a few tests for some dependency cases (that would currently
produce orphaned objects):

- schema and function (as the above scenarios)
- function and type
- table and type (which is, I think, problematic enough, as it brings a table
into the game, to justify fixing this as a whole).

[1]: 
https://www.postgresql.org/message-id/flat/a4f55089-7cbd-fe5d-a9bb-19adc6418...@darold.net#9af5cdaa9e80879beb1def3604c976e8
[2]: 
https://www.postgresql.org/message-id/flat/8369ff70-0e31-f194-2954-787f4d9e21dd%40amazon.com

Please note that I'm not familiar with this area of the code, so the patch
might not implement the option proposed by Tom the "right" way.

Adding the patch to the July CF.

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 6798e85b0679dfccfa665007b06d9f1a0e39e0c6 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 29 Mar 2024 15:43:26 +
Subject: [PATCH v1] Avoid orphaned objects dependencies

It's currently possible to create orphaned objects dependencies, for example:

Scenario 1:

session 1: begin; drop schema schem;
session 2: create a function in the schema schem
session 1: commit;

With the above, the function created in session 2 would be linked to a
non-existing schema.

Scenario 2:

session 1: begin; create a function in the schema schem
session 2: drop schema schem;
session 1: commit;

With the above, the function created in session 1 would be linked to a
non-existing schema.

To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP)
has been put in place when the dependencies are being recorded. With this in
place, the drop schema in scenario 2 would be locked.

Also, after locking the object, the patch checks that the object still exists:
with this in place session 2 in scenario 1 would be locked and would report an
error once session 1 commits (that would not be the case should session 1 abort
the transaction).

The patch adds a few tests for some dependency cases (that would currently produce
orphaned objects):

- schema and function (as the above scenarios)
- function and type
- table and type
---
 src/backend/catalog/dependency.c  | 42 ++
 src/backend/catalog/objectaddress.c   | 57 +++
 src/backend/catalog/pg_depend.c   |  6 ++
 src/include/catalog/dependency.h  |  1 +
 src/include/catalog/objectaddress.h   |  1 +
 src/test/modules/Makefile |  1 +
 src/test/modules/meson.build  |  1 +
 .../test_dependencies_locks/.gitignore|  3 +
 .../modules/test_dependencies_locks/Makefile  | 14 +
 .../expected/test_dependencies_locks.out  | 49 
 .../te

Re: Disallow changing slot's failover option in transaction block

2024-04-19 Thread Bertrand Drouvot
Hi,

On Fri, Apr 19, 2024 at 12:39:40AM +, Zhijie Hou (Fujitsu) wrote:
> Here is V2 patch which includes the changes for CREATE SUBSCRIPTION as
> suggested. Since we don't connect pub to alter slot when (create_slot=false)
> anymore, the restriction that disallows failover=true when connect=false is
> also removed.

Thanks!

+  specified in the subscription. When creating the slot, ensure the 
slot
+  property failover matches the counterpart 
parameter
+  of the subscription.

The slot could be created before or after the subscription is created, so I think
it needs a bit of rewording (as here it sounds like the sub is already created),
something like:

"Always ensure the slot property failover matches the
counterpart parameter of the subscription and vice versa."

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-18 Thread Bertrand Drouvot
Hi,

On Thu, Apr 18, 2024 at 05:36:05PM +0530, shveta malik wrote:
> Please find v8 attached. Changes are:

Thanks!

A few comments:

1 ===

@@ -1440,7 +1461,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 * slotsync_worker_onexit() but that will need the connection to be made
 * global and we want to avoid introducing global for this purpose.
 */
-   before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn));
+   before_shmem_exit(slotsync_worker_disconnect, PointerGetDatum(wrconn));

The comment above this change still states "Register the failure callback once
we have the connection". I think it has to be reworded a bit now that v8 is
making use of slotsync_worker_disconnect().

2 ===

+* Register slotsync_worker_onexit() before we register
+* ReplicationSlotShmemExit() in BaseInit(), to ensure that during exit of
+* slot sync worker, ReplicationSlotShmemExit() is called first, followed
+* by slotsync_worker_onexit(). Startup process during promotion waits for

Worth mentioning in shmem_exit() (where it does "while (--before_shmem_exit_index >= 0)",
or before the shmem_exit() definition) that ReplSlotSyncWorkerMain() relies on
this LIFO behavior? (Not sure if there is another "strong" LIFO requirement in
other parts of the code.)

3 ===

+* Startup process during promotion waits for slot sync to finish and it
+* does that by checking the 'syncing' flag.

Worth mentioning ShutDownSlotSync()?

4 ===

I did a few tests manually (launching ShutDownSlotSync() through gdb / with and
without sync worker and with / without pg_sync_replication_slots() running
concurrently) and it looks like it works as designed.

Having said that, the logic that is in place to take care of the corner cases
described up-thread seems reasonable to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote:
> On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
>  wrote:
> > I personally feel adding the additional check for sync_replication_slots may
> > not improve the situation here. Because the GUC sync_replication_slots can
> > change at any point, the GUC could be false when performing this addition 
> > check
> > and is set to true immediately after the check, so It could not simplify 
> > the logic
> > anyway.
> 
> +1.
> I feel doc and "cannot synchronize replication slots concurrently"
> check should suffice.
> 
> In the scenario which Hou-San pointed out,  if after performing the
> GUC check in SQL function, this GUC is enabled immediately and say
> worker is started sooner than the function could get chance to sync,
> in that case as well, SQL function will ultimately get error "cannot
> synchronize replication slots concurrently", even though GUC is
> enabled.  Thus, I feel we should stick with samer error in all
> scenarios.

Okay, fine by me, let's forget about checking sync_replication_slots then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Disallow changing slot's failover option in transaction block

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 06:32:11AM +, Zhijie Hou (Fujitsu) wrote:
> Hi,
> 
> Kuroda-San reported an issue off-list that:
> 
> If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> and rollback, only the subscription option change can be rolled back, while 
> the
> replication slot's failover change is preserved.

Nice catch, thanks!

> To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> (failover) inside a txn block, which is also consistent to the ALTER
> SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> patch to address this.

Agree. The patch looks pretty straightforward to me. Worth adding this
case to the doc? (Where we already mention that "Commands ALTER SUBSCRIPTION ...
REFRESH PUBLICATION and ALTER SUBSCRIPTION ... {SET|ADD|DROP} PUBLICATION ...
with refresh option as true cannot be executed inside a transaction block".)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 10:00:04AM +0530, shveta malik wrote:
> Please find v5 addressing above comments.

Thanks!

@@ -1634,9 +1677,14 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 {
PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, 
PointerGetDatum(wrconn));
{
+   check_flags_and_set_sync_info(InvalidPid);
+

Given that if the sync worker is running it won't be possible to trigger
a manual sync with pg_sync_replication_slots(), what about also checking the
"sync_replication_slots" value at the start of SyncReplicationSlots() and
emitting an error if sync_replication_slots is set to on? (The message could
explicitly state that it's not possible to use the function if
sync_replication_slots is set to on.)
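
Something like this at the top of SyncReplicationSlots() is what I have in mind
(just a sketch; the errcode and the exact wording of the message are open):

if (sync_replication_slots)
    ereport(ERROR,
            errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
            errmsg("cannot use pg_sync_replication_slots() when \"sync_replication_slots\" is enabled"));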

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Bertrand Drouvot
Hi,

On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
> On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot
>  wrote:
> >
> > On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
> > > On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
> > >  wrote:
> > > >
> > > > On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> > > > > Now both worker and SQL
> > > > > function set 'syncing' when they start and reset it when they exit.
> > > >
> > > > It means that it's not possible anymore to trigger a manual sync if
> > > > sync_replication_slots is on. Indeed that would trigger:
> > > >
> > > > postgres=# select pg_sync_replication_slots();
> > > > ERROR:  cannot synchronize replication slots concurrently
> > > >
> > > > That looks like an issue to me, thoughts?
> > > >
> > >
> > > This is intentional as of now for the sake of keeping
> > > implementation/code simple. It is not difficult to allow them but I am
> > > not sure whether we want to add another set of conditions allowing
> > > them in parallel.
> >
> > I think that the ability to launch a manual sync before a switchover would 
> > be
> > missed. Except for this case I don't think that's an issue to prevent them 
> > to
> > run in parallel.
> >
> 
> I think if the slotsync worker is available, it can do that as well.

Right, but one has no control over when the sync is triggered.

> There is no clear use case for allowing them in parallel and I feel it
> would add more confusion when it can work sometimes but not other
> times. However, if we receive some report from the field where there
> is a real demand for such a thing, it should be easy to achieve. For
> example, I can imagine that we can have sync_state that has values
> 'started', 'in_progress' , and 'finished'. This should allow us to
> achieve what the current proposed patch is doing along with allowing
> the API to work in parallel when the sync_state is not 'in_progress'.
> 
> I think for now let's restrict their usage in parallel and make the
> promotion behavior consistent both for worker and API.

Okay, let's do it that way. Is it worth adding a few words to the doc for
pg_sync_replication_slots() though? (To mention it cannot be used if the slot
sync worker is running.)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread Bertrand Drouvot
Hi,

On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
> On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
>  wrote:
> >
> > On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> > > Now both worker and SQL
> > > function set 'syncing' when they start and reset it when they exit.
> >
> > It means that it's not possible anymore to trigger a manual sync if
> > sync_replication_slots is on. Indeed that would trigger:
> >
> > postgres=# select pg_sync_replication_slots();
> > ERROR:  cannot synchronize replication slots concurrently
> >
> > That looks like an issue to me, thoughts?
> >
> 
> This is intentional as of now for the sake of keeping
> implementation/code simple. It is not difficult to allow them but I am
> not sure whether we want to add another set of conditions allowing
> them in parallel.

I think that the ability to launch a manual sync before a switchover would be
missed. Except for this case, I don't think it's an issue to prevent them from
running in parallel.

> And that too in an unpredictable way as the API will
> work only for the time slot sync worker is not performing the sync.

Yeah but then at least you would know that there is "really" a sync in progress
(which is not the case currently with v4, as the sync worker being started is
enough to prevent a manual sync even if a sync is not in progress).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread Bertrand Drouvot
Hi,

On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila  wrote:
> >
> > On Fri, Apr 12, 2024 at 5:25 PM shveta malik  wrote:
> > >
> > > Please find v3 addressing race-condition and one other comment.
> > >
> > > Up-thread it was suggested that,  probably, checking
> > > SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
> > > re-thinking, it might not be. Slot sync worker sets and resets
> > > 'syncing' with each sync-cycle, and thus we need to rely on worker's
> > > pid in ShutDownSlotSync(), as there could be a window where promotion
> > > is triggered and 'syncing' is not set for worker, while the worker is
> > > still running. This implementation of setting and resetting syncing
> > > with each sync-cycle looks better as compared to setting syncing
> > > during the entire life-cycle of the worker. So, I did not change it.
> > >
> >
> > To retain this we need to have different handling for 'syncing' for
> > workers and function which seems like more maintenance burden than the
> > value it provides. Moreover, in SyncReplicationSlots(), we are calling
> > a function after acquiring spinlock which is not our usual coding
> > practice.
> 
> Okay.  Changed it to consistent handling.

Thanks for the patch!

> Now both worker and SQL
> function set 'syncing' when they start and reset it when they exit.

It means that it's not possible anymore to trigger a manual sync if 
sync_replication_slots is on. Indeed that would trigger:

postgres=# select pg_sync_replication_slots();
ERROR:  cannot synchronize replication slots concurrently

That looks like an issue to me, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-06 Thread Bertrand Drouvot
Hi,

On Sat, Apr 06, 2024 at 10:13:00AM +0530, Amit Kapila wrote:
> On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
>  wrote:
> 
> I think the new LSN can be visible only when the corresponding WAL is
> written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can
> make it visible. In your experiment below, isn't it possible that in
> the meantime WAL writer has written that WAL due to which you are
> seeing an updated location?

What I did is:

session 1:  select pg_current_wal_lsn();\watch 1
session 2:  select pg_backend_pid();

terminal 1: tail -f logfile | grep -i snap 
terminal 2 : gdb -p ) at standby.c:1346
1346{
(gdb) n
1350
 
Then next, next until the DEBUG message is emitted (confirmed in terminal 1).

At this stage the DEBUG message shows the new LSN while session 1 still displays
the previous LSN.

Then once XLogSetAsyncXactLSN() is done in the gdb session (terminal 2) then
session 1 displays the new LSN.

This is reproducible as desired.

With more debugging I can see that when the spinlock is released in
XLogSetAsyncXactLSN(), then XLogWrite() does its job, and then session 1 sees
the new value (that happens in this order, and as you said that's expected).

My point is that while the DEBUG message is emitted, session 1 still sees the
old LSN (until the new LSN is visible). I think that we should emit the DEBUG
message once session 1 can see the new value (if not, I think the timestamp of
the DEBUG message can be misleading for debugging purposes).

> I think I am missing how exactly moving DEBUG2 can confirm the above theory.

I meant to say that instead of seeing:

2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:  snapshot 
of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 
739 next xid 740)
2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:  statement: 
SELECT '0/360' <= replay_lsn AND state = 'streaming'

We would probably see something like:

2024-04-05 04:37:05. UTC [3866475][client backend][2/4:0] LOG:  
statement: SELECT '0/360' <= replay_lsn AND state = 'streaming'
2024-04-05 04:37:05.+xx UTC [3854278][background writer][:0] DEBUG:  
snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest 
complete 739 next xid 740)

And then it would be clear that the query ran before the new LSN was visible.

> > If the theory is proven then I'm not sure we need the extra complexity of
> > injection point here, maybe just relying on the slots created via SQL API 
> > could
> > be enough.
> >
> 
> Yeah, that could be the first step. We can probably add an injection
> point to control the bgwrite behavior and then add tests involving
> walsender performing the decoding. But I think it is important to have
> sufficient tests in this area as I see they are quite helpful in
> uncovering the issues.
>

Yeah agree. 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-05 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 02:35:42PM +, Bertrand Drouvot wrote:
> I think that maybe as a first step we should move the "elog(DEBUG2," message 
> as
> proposed above to help debugging (that could help to confirm the above 
> theory).

If you agree and think that makes sense, please find attached a tiny patch doing
so.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From a414e81a4c5c88963b2412d1641f3de1262c15c0 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 5 Apr 2024 14:49:51 +
Subject: [PATCH v1] Move DEBUG message in LogCurrentRunningXacts()

Indeed the new LSN is visible to other sessions (say through pg_current_wal_lsn())
only after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is released.

So moving the DEBUG message after the XLogSetAsyncXactLSN() call seems more
appropriate for debugging purposes.
---
 src/backend/storage/ipc/standby.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)
 100.0% src/backend/storage/ipc/

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 87b04e51b3..ba549acf50 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1366,6 +1366,17 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
+	/*
+	 * Ensure running_xacts information is synced to disk not too far in the
+	 * future. We don't want to stall anything though (i.e. use XLogFlush()),
+	 * so we let the wal writer do it during normal operation.
+	 * XLogSetAsyncXactLSN() conveniently will mark the LSN as to-be-synced
+	 * and nudge the WALWriter into action if sleeping. Check
+	 * XLogBackgroundFlush() for details why a record might not be flushed
+	 * without it.
+	 */
+	XLogSetAsyncXactLSN(recptr);
+
 	if (CurrRunningXacts->subxid_overflow)
 		elog(DEBUG2,
 			 "snapshot of %d running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
@@ -1383,17 +1394,6 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 			 CurrRunningXacts->latestCompletedXid,
 			 CurrRunningXacts->nextXid);
 
-	/*
-	 * Ensure running_xacts information is synced to disk not too far in the
-	 * future. We don't want to stall anything though (i.e. use XLogFlush()),
-	 * so we let the wal writer do it during normal operation.
-	 * XLogSetAsyncXactLSN() conveniently will mark the LSN as to-be-synced
-	 * and nudge the WALWriter into action if sleeping. Check
-	 * XLogBackgroundFlush() for details why a record might not be flushed
-	 * without it.
-	 */
-	XLogSetAsyncXactLSN(recptr);
-
 	return recptr;
 }
 
-- 
2.34.1



Re: Synchronizing slots from primary to standby

2024-04-05 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote:
> On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila  wrote:
> Thinking more on this, it doesn't seem related to
> c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't change
> any locking or something like that which impacts write positions.

Agree.

> I think what has happened here is that running_xact record written by
> the background writer [1] is not written to the kernel or disk (see
> LogStandbySnapshot()), before pg_current_wal_lsn() checks the
> current_lsn to be compared with replayed LSN.

Agree, I think it's not visible through pg_current_wal_lsn() yet.

Also I think that the DEBUG message in LogCurrentRunningXacts() 

"
elog(DEBUG2,
 "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid 
%u latest complete %u next xid %u)",
 CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
 LSN_FORMAT_ARGS(recptr),
 CurrRunningXacts->oldestRunningXid,
 CurrRunningXacts->latestCompletedXid,
 CurrRunningXacts->nextXid);
"

should be located after the XLogSetAsyncXactLSN() call. Indeed, the new LSN is
visible after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is
released, see:

\watch on Session 1 provides: 

 pg_current_wal_lsn

 0/87D110
(1 row)

Until:

Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at xlog.c:2579 
2579XLogRecPtr  WriteRqstPtr = asyncXactLSN;
(gdb) n
2581boolwakeup = false;
(gdb)
2584SpinLockAcquire(&XLogCtl->info_lck);
(gdb)
2585RefreshXLogWriteResult(LogwrtResult);
(gdb)
2586sleeping = XLogCtl->WalWriterSleeping;
(gdb)
2587prevAsyncXactLSN = XLogCtl->asyncXactLSN;
(gdb)
2588if (XLogCtl->asyncXactLSN < asyncXactLSN)
(gdb)
2589XLogCtl->asyncXactLSN = asyncXactLSN;
(gdb)
2590SpinLockRelease(&XLogCtl->info_lck);
(gdb) p/x (uint32) XLogCtl->asyncXactLSN
$1 = 0x87d148

Then session 1 provides:

 pg_current_wal_lsn

 0/87D148
(1 row)

So, when we see in the log:

2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:  snapshot 
of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 
739 next xid 740)
2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:  statement: 
SELECT '0/360' <= replay_lsn AND state = 'streaming'

It's indeed possible that the new LSN was not visible yet (spinlock not released?)
before the query began (because we cannot rely on the time at which the DEBUG
message was emitted).

> Note that the reason why
> walsender has picked the running_xact written by background writer is
> because it has checked after pg_current_wal_lsn() query, see LOGs [2].
> I think we can probably try to reproduce manually via debugger.
> 
> If this theory is correct

I think it is.

> then I think we will need to use injection
> points to control the behavior of bgwriter or use the slots created
> via SQL API for syncing in tests.
> 
> Thoughts?

I think that maybe as a first step we should move the "elog(DEBUG2," message as
proposed above to help debugging (that could help to confirm the above theory).

If the theory is proven then I'm not sure we need the extra complexity of an
injection point here; maybe just relying on the slots created via the SQL API
could be enough.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-05 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 04:09:01PM +0530, shveta malik wrote:
> On Fri, Apr 5, 2024 at 10:09 AM Bertrand Drouvot
>  wrote:
> >
> > What about something like?
> >
> > ereport(LOG,
> > errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs 
> > from remote slot",
> > remote_slot->name),
> > errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.",
> > LSN_FORMAT_ARGS(remote_slot->restart_lsn),
> > LSN_FORMAT_ARGS(slot->data.restart_lsn));
> >
> > Regards,
> 
> +1. Better than earlier. I will update and post the patch.
> 

Thanks!

BTW, I just realized that the LSNs I used in the LSN_FORMAT_ARGS() calls in my
example are not the right ones.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-05 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 11:21:43AM +0530, Bharath Rupireddy wrote:
> On Wed, Apr 3, 2024 at 9:57 PM Bertrand Drouvot
>  wrote:
> Please find the attached v36 patch.

Thanks!

A few comments:

1 ===

+   
+The timeout is measured from the time since the slot has become
+inactive (known from its
+inactive_since value) until it gets
+used (i.e., its active is set to true).
+   

That's right except when it's invalidated during a checkpoint (as the slot
is not acquired in CheckPointReplicationSlots()).

So, what about adding "or a checkpoint occurs"? That would also explain that
the invalidation can occur during a checkpoint.

2 ===

+   /* If the slot has been invalidated, recalculate the resource limits */
+   if (invalidated)
+   {

/If the slot/If a slot/?

3 ===

+ * NB - this function also runs as part of checkpoint, so avoid raising errors

s/NB - this/NB: This function/? (that looks more consistent with other comments
in the code)

4 ===

+ * Note that having a new function for RS_INVAL_INACTIVE_TIMEOUT cause instead

I understand it's "the RS_INVAL_INACTIVE_TIMEOUT cause" but reading "cause 
instead"
looks weird to me. Maybe it would make sense to reword this a bit.

5 ===

+* considered not active as they don't actually perform logical 
decoding.

Not sure that's 100% accurate, as we switched to fast-forward logical decoding
in 2ec005b4e2.

"as they perform only fast forward logical decoding (or not at all)", maybe?

6 ===

+   if (RecoveryInProgress() && slot->data.synced)
+   return false;
+
+   if (replication_slot_inactive_timeout == 0)
+   return false;

What about just using one if? It's more a matter of taste, but it also probably
reduces the object file size a bit for a non-optimized build.
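
For instance (sketch only, keeping the same semantics as the two separate checks):

if ((RecoveryInProgress() && slot->data.synced) ||
    replication_slot_inactive_timeout == 0)
    return false;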

7 ===

+   /*
+* Do not invalidate the slots which are currently being synced 
from
+* the primary to the standby.
+*/
+   if (RecoveryInProgress() && slot->data.synced)
+   return false;

I think we don't need this check as the exact same one is done just before.

8 ===

+sub check_for_slot_invalidation_in_server_log
+{
+   my ($node, $slot_name, $offset) = @_;
+   my $invalidated = 0;
+
+   for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; 
$i++)
+   {
+   $node->safe_psql('postgres', "CHECKPOINT");

Wouldn't it be better to wait for the replication_slot_inactive_timeout time
instead of triggering all those checkpoints? (It could be passed as an extra arg
to wait_for_slot_invalidation().)

9 ===

# Synced slot mustn't get invalidated on the standby, it must sync invalidation
# from the primary. So, we must not see the slot's invalidation message in 
server
# log.
ok( !$standby1->log_contains(
"invalidating obsolete replication slot \"lsub1_sync_slot\"",
$standby1_logstart),
'check that syned slot has not been invalidated on the standby');

Would it make sense to trigger a checkpoint on the standby before this test?
I mean I think that without a checkpoint on the standby we should not see the
invalidation in the log anyway.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-04 Thread Bertrand Drouvot
Hi,

On Fri, Apr 05, 2024 at 09:43:35AM +0530, shveta malik wrote:
> On Fri, Apr 5, 2024 at 9:22 AM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote:
> > > On Thu, Apr 4, 2024 at 2:59 PM shveta malik  
> > > wrote:
> > 2 ===
> >
> > +   if (slot->data.confirmed_flush != 
> > remote_slot->confirmed_lsn)
> > +   elog(LOG,
> > +"could not synchronize local slot 
> > \"%s\" LSN(%X/%X)"
> > +" to remote slot's LSN(%X/%X) ",
> > +remote_slot->name,
> > +
> > LSN_FORMAT_ARGS(slot->data.confirmed_flush),
> > +
> > LSN_FORMAT_ARGS(remote_slot->confirmed_lsn));
> >
> > I don't think that the message is correct here. Unless I am missing 
> > something
> > there is nothing in the following code path that would prevent the slot to 
> > be
> > sync during this cycle.
> 
> This is a sanity check,  I will put a comment to indicate the same.

Thanks!

> We
> want to ensure if anything changes in future, we get correct logs to
> indicate that.

Right, understood that way.

> If required, the LOG msg can be changed. Kindly suggest if you have
> anything better in mind.
> 

What about something like?

ereport(LOG,
errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs from 
remote slot",
remote_slot->name),
errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.",
LSN_FORMAT_ARGS(remote_slot->restart_lsn),
LSN_FORMAT_ARGS(slot->data.restart_lsn));

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-04 Thread Bertrand Drouvot
Hi,

On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote:
> On Thu, Apr 4, 2024 at 2:59 PM shveta malik  wrote:
> >
> >
> > Prior to commit 2ec005b, this check was okay, as we did not expect
> > restart_lsn of the synced slot to be ahead of remote since we were
> > directly copying the lsns. But now when we use 'advance' to do logical
> > decoding on standby, there is a possibility that restart lsn of the
> > synced slot is ahead of remote slot, if there are running txns records
> > found after reaching consistent-point while consuming WALs from
> > restart_lsn till confirmed_lsn. In such a case, slot-sync's advance
> > may end up serializing snapshots and setting restart_lsn to the
> > serialized snapshot point, ahead of remote one.
> >
> > Fix:
> > The sanity check needs to be corrected. Attached a patch to address the 
> > issue.
> 

Thanks for reporting, explaining the issue and providing a patch.

Regarding the patch:

1 ===

+* Attempt to sync lsns and xmins only if remote slot is ahead of local

s/lsns/LSNs/?

2 ===

+   if (slot->data.confirmed_flush != 
remote_slot->confirmed_lsn)
+   elog(LOG,
+"could not synchronize local slot 
\"%s\" LSN(%X/%X)"
+" to remote slot's LSN(%X/%X) ",
+remote_slot->name,
+
LSN_FORMAT_ARGS(slot->data.confirmed_flush),
+
LSN_FORMAT_ARGS(remote_slot->confirmed_lsn));

I don't think that the message is correct here. Unless I am missing something,
there is nothing in the following code path that would prevent the slot from
being synced during this cycle.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Autogenerate some wait events code and documentation

2024-04-04 Thread Bertrand Drouvot
Hi,

On Thu, Apr 04, 2024 at 07:14:47PM +0900, Michael Paquier wrote:
> On Thu, Apr 04, 2024 at 09:28:36AM +0000, Bertrand Drouvot wrote:
> > +# No "Backpatch" region here as code is generated automatically.
> > 
> > What about "region here as has its own C code" (that would be more 
> > consistent
> > with the comment in the "header" for the file). Done that way in v4.
> 
> I'd add a "as -this section- has its own C code", for clarity.  This
> just looked a bit strange here.

Sounds good, done that way in v5 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 1e24cdb02360147363495122fbfe79c2758ae4e8 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 4 Apr 2024 15:34:37 +0900
Subject: [PATCH v5] Add "ABI_compatibility" regions in wait_event_names.txt

When backpatching, adding an event will renumber others, which can make an
extension report the wrong event until recompiled. This is due to the fact that
generate-wait_event_types.pl sorts events to position them. The "ABI_compatibility"
region added here ensures no ordering is done for the wait events found after
this delimiter.
---
 .../activity/generate-wait_event_types.pl | 27 ++-
 .../utils/activity/wait_event_names.txt   | 17 
 2 files changed, 43 insertions(+), 1 deletion(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/generate-wait_event_types.pl b/src/backend/utils/activity/generate-wait_event_types.pl
index f1adf0e8e7..9768406db5 100644
--- a/src/backend/utils/activity/generate-wait_event_types.pl
+++ b/src/backend/utils/activity/generate-wait_event_types.pl
@@ -38,7 +38,9 @@ die "Not possible to specify --docs and --code simultaneously"
 
 open my $wait_event_names, '<', $ARGV[0] or die;
 
+my @abi_compatibility_lines;
 my @lines;
+my $abi_compatibility = 0;
 my $section_name;
 my $note;
 my $note_name;
@@ -59,10 +61,27 @@ while (<$wait_event_names>)
 	{
 		$section_name = $_;
 		$section_name =~ s/^.*- //;
+		$abi_compatibility = 0;
 		next;
 	}
 
-	push(@lines, $section_name . "\t" . $_);
+	# ABI_compatibility region, preserving ABI compatibility of the code
+	# generated.  Any wait events listed in this part of the file
+	# will not be sorted during the code generation.
+	if (/^ABI_compatibility:$/)
+	{
+		$abi_compatibility = 1;
+		next;
+	}
+
+	if ($gen_code && $abi_compatibility)
+	{
+		push(@abi_compatibility_lines, $section_name . "\t" . $_);
+	}
+	else
+	{
+		push(@lines, $section_name . "\t" . $_);
+	}
 }
 
 # Sort the lines based on the second column.
@@ -70,6 +89,12 @@ while (<$wait_event_names>)
 my @lines_sorted =
   sort { uc((split(/\t/, $a))[1]) cmp uc((split(/\t/, $b))[1]) } @lines;
 
+# If we are generating code, concat @lines_sorted and then @abi_compatibility_lines.
+if ($gen_code)
+{
+	push(@lines_sorted, @abi_compatibility_lines);
+}
+
 # Read the sorted lines and populate the hash table
 foreach my $line (@lines_sorted)
 {
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0d288d6b3d..5169be238c 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -26,6 +26,12 @@
 # When adding a new wait event, make sure it is placed in the appropriate
 # ClassName section.
 #
+# Wait events added in stable branches should be appended to the lists in
+# the "ABI_compatibility:" region of their related ClassName section to preserve
+# ABI compatibility of the C code generated from this file's data, respecting
+# the order of any wait event already listed there.  The "ABI_compatibility:"
+# regions should remain empty on the master branch and on unreleased branches.
+#
 # WaitEventLWLock and WaitEventLock have their own C code for their wait event
 # enums and function names.  Hence, for these, only the SGML documentation is
 # generated.
@@ -61,6 +67,7 @@ WAL_SENDER_MAIN	"Waiting in main loop of WAL sender process."
 WAL_SUMMARIZER_WAL	"Waiting in WAL summarizer for more WAL to be generated."
 WAL_WRITER_MAIN	"Waiting in main loop of WAL writer process."
 
+ABI_compatibility:
 
 #
 # Wait Events - Client
@@ -83,6 +90,7 @@ WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
+ABI_compatibility:
 
 #
 # Wait Events - IPC
@@ -150,6 +158,7 @@ WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for st
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to b
